Design
Current State
The reference RS already has the core blob substrate:
POST /v1/blobsstores content-addressed bytes for owner-authorized ingestion.GET /v1/blobs/{blob_id}serves bytes only when the caller can see a record that references the blob through a visibleblob_ref.- Read responses decorate visible
blob_refobjects withfetch_url. - Query tests already prove grant-visible
blob_refenforcement and Gmail attachment expansion for synthetic hydrated records.
The missing layer is connector hydration. Real first-party connectors still often emit records such as Gmail attachments, Slack files, statements, or receipts as metadata without a populated blob_ref, so the transport exists but clients cannot reach actual bytes.
Principles
- Use the existing blob contract. Do not add
/attachments/:id/content,/download, or stream-specific byte routes. - Connector output remains metadata-first. A file-like record can exist without bytes; hydration status tells clients whether bytes are available.
- Hydration is grant-safe by construction. Bytes are fetchable only through a record whose
blob_reffield is visible under the bearer. - Hydration is idempotent. Re-running a connector should not duplicate stored bytes for the same content.
- Failures are records, not surprises. If byte collection is blocked, too large, expired, unavailable, or intentionally skipped, preserve a metadata record with a machine-readable hydration status/reason.
- No secrets in blobs or timelines. Connector logs and run timelines must not include byte contents or signed source URLs.
Candidate Streams
The first implementation pass should audit all shipped first-party manifests and classify each stream:
- Hydrate now: bytes are source-available, useful, bounded, and safe to fetch in normal sync.
- Metadata only: bytes may exist but are not available through the current connector auth/path, are too expensive, or need separate design.
- Not binary: no collectible byte payload.
Known candidates:
- Gmail
attachments: first vertical slice; stream already models metadata,blob_ref,content_sha256, andhydration_status. - Slack file/message attachment surfaces: likely useful, but distinguish Slack message attachments from actual uploaded files and remote files.
- Chase/USAA/Amazon statements, receipts, invoices, or exports: high assistant value, but may require browser download handling and retention-size policy.
- ChatGPT/Claude/Codex artifacts or uploaded files: audit first; many records may be text metadata or local filesystem references rather than collectible source bytes.
- GitHub gists/release assets/files: audit whether these should be records, blobs, or both.
Hydration Model
For a hydrated file-like record, connector code should:
- Fetch bytes from the source while operating under the owner-authorized connector session.
- Upload/store bytes through the reference blob seam.
- Emit a record whose
data.blob_refcontains:blob_idmime_typesize_bytessha256
- Emit a connector-specific integrity field such as
content_sha256where the manifest already declares it. - Emit
hydration_status: "hydrated".
For a non-hydrated file-like record, connector code should emit metadata plus hydration_status such as:
deferredunavailableblockedtoo_largefailed
If a stream needs finer diagnostics, add an optional non-secret hydration_error_code or equivalent field through the manifest rather than overloading logs.
Size And Retention
This change should start conservative:
- Set a per-connector default max blob size.
- Allow operator override via environment/config only after documenting the consequence.
- Keep blob storage local to the reference deployment.
- Avoid automatic OCR/text extraction in this change; extracted text can be a later capability.
Client Discovery
Clients should learn byte availability from:
/v1/schemastream fields and field capabilities,blob_reffields present in returned records,blob_ref.fetch_urlinjected by the RS at read time,hydration_statuson file-like records.
Docs and examples should explicitly say that /content, /download, or stream-specific byte URL guessing is not the PDPP contract.
First Vertical Slice
Implement Gmail attachment hydration first because:
- The stream and relationship already exist.
- Real user testing has confirmed metadata exists but
blob_refis absent. - Existing synthetic query-contract tests already define the desired grant-safe behavior.
- It unlocks high-value assistant questions involving leases, invoices, receipts, PDFs, and documents.
The Gmail slice should prove:
- A real sync can populate attachment
blob_reffor at least one attachment. expand=attachmentsexposesfetch_urlonly when the grant includesattachments.blob_ref.- Fetching
fetch_urlreturns the original bytes. - Metadata-only fallback remains valid when download fails or is skipped.
Non-Goals
- No new public blob endpoint.
- No resource-specific attachment content routes.
- No automatic OCR/PDF/docx text extraction.
- No hosted blob storage.
- No cross-connector deduplication semantics beyond content-addressed storage.
- No change to the PDPP grant model.