Add Statement Content Fingerprint
Why
The chase/statements and usaa/statements retained histories churn on every re-download even when the owner-visible statement is unchanged. The statement PDFs are content-addressed by pdf_sha256 = sha256(raw bytes), but the raw bytes are not the content: Chase statement PDFs are RC4-encrypted and the source regenerates the per-download encryption key material and embedded generation timestamps on every fetch, so pdf_sha256 (and the pdf_path/document_url that embed it) moves with zero change to the decrypted text or page count. Read-only evidence (tmp/workstreams/ri-version-rationality-evidence-v1-report.md) proved the decrypted text sha and page count are invariant across this churn for every comparable Chase blob pair, and that USAA's own PDF-derived transactions are byte-identical content across the same pdf_sha256 churn.
The current compaction policy and connector fingerprint for both streams exclude only the run-clock fetched_at, which makes every blob-identity field a fingerprint boundary. As a result these histories cannot be made rational by canonical compaction: there is no positive content signal that proves a re-download is a no-op, so excluding the blob fields would be lossy (a genuinely re-issued statement for the same key would become invisible). They must therefore stay non-collapsible, and the version-churn surface keeps re-alarming on acquisition noise.
The fix is connector-independent: emit a positive, owner-visible content fingerprint derived from the PDF content (pdf_text_sha256, pdf_page_count) and make the canonical statement fingerprint exclude the blob/acquisition-identity fields only when those positive content fields are present. With a content fingerprint in the record, excluding pdf_sha256/pdf_path/document_url becomes provably lossless, and both statement streams become canonical-compaction-eligible the same way chase/transactions already is.
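As a minimal sketch of how the two positive content fields could be derived: the normalization rules below (Unicode NFC, unified line endings, collapsed whitespace) are assumptions, since the proposal only says the extracted text is "normalized". The names normalizePdfText and contentFingerprint are hypothetical, and the extracted text and page count are assumed to come from whatever extraction path the connector already has.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape for the two owner-visible content-fingerprint fields.
interface StatementContentFingerprint {
  pdf_text_sha256: string;
  pdf_page_count: number;
}

// Assumed normalization (not specified by the proposal): NFC, CRLF/CR -> LF,
// collapse runs of spaces/tabs, drop spaces around newlines, collapse blank
// lines, and trim. The goal is that extraction jitter does not move the hash.
function normalizePdfText(text: string): string {
  return text
    .normalize("NFC")
    .replace(/\r\n?/g, "\n")
    .replace(/[ \t]+/g, " ")
    .replace(/ ?\n ?/g, "\n")
    .replace(/\n{2,}/g, "\n")
    .trim();
}

// Hash the normalized decrypted text rather than the raw (encrypted) bytes,
// so per-download key material and timestamps cannot affect the result.
function contentFingerprint(
  extractedText: string,
  pageCount: number,
): StatementContentFingerprint {
  const digest = createHash("sha256")
    .update(normalizePdfText(extractedText), "utf8")
    .digest("hex");
  return { pdf_text_sha256: digest, pdf_page_count: pageCount };
}
```

Under these assumptions, two downloads whose extracted text differs only in line endings or spacing yield the same pdf_text_sha256, while any change to the visible text yields a different one.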
What Changes
- Add two owner-visible content-fingerprint fields to the chase/statements and usaa/statements record bodies: pdf_text_sha256 (sha256 of the extracted PDF text, normalized) and pdf_page_count (integer page count).
- USAA already extracts statement PDF text via pdf-parse; it reuses that extraction to populate the fields. Chase has no text-extraction path today and gains one for statements.
- Change the canonical statement fingerprint definition (connector no-op suppression and compaction policy, bound to the same rule) to exclude the blob/acquisition-identity fields (pdf_sha256, pdf_path, document_url, fetched_at) only when both positive content fields are present; when they are absent the fingerprint falls back to excluding only fetched_at, so legacy/index-only versions are never silently collapsed.
- Keep account_id and account_reference inside the USAA statement fingerprint so the proven null → resolved FK backfill (9 null→value, 0 regressions in the evidence) remains a version boundary.
- Make both statement streams canonical-compaction-eligible (changeModel: "immutable_semantic", representativePolicy: "current"), gated on the positive content fields, reusing the canonical mode added by canonicalize-retained-record-history.
- Validate the policy on a copied/narrowed database with dry-run, apply, and idempotence checks. No live apply.
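The content-gated exclusion rule in the list above can be sketched roughly as follows. This is an illustration only, not the code in fingerprint-cursor.ts or compact-record-history.mjs: the function names, the flat-record assumption, and the sorted-key JSON serialization are all hypothetical choices made for the example.

```typescript
import { createHash } from "node:crypto";

// Assumption: statement records are flat JSON objects.
type StatementRecord = Record<string, unknown>;

// Excluded only when both positive content fields are present.
const BLOB_IDENTITY_FIELDS = ["pdf_sha256", "pdf_path", "document_url", "fetched_at"];
// Fallback exclusion for legacy/index-only versions.
const RUN_CLOCK_FIELDS = ["fetched_at"];

// The gate: both pdf_text_sha256 and pdf_page_count must be present.
function hasContentFingerprint(record: StatementRecord): boolean {
  const sha = record["pdf_text_sha256"];
  const pages = record["pdf_page_count"];
  return (
    typeof sha === "string" &&
    sha.length > 0 &&
    typeof pages === "number" &&
    Number.isInteger(pages)
  );
}

// Hypothetical canonical fingerprint: hash the remaining fields in sorted key
// order. account_id and account_reference are never excluded, so a
// null -> resolved backfill still changes the fingerprint.
function canonicalStatementFingerprint(record: StatementRecord): string {
  const excluded = new Set(
    hasContentFingerprint(record) ? BLOB_IDENTITY_FIELDS : RUN_CLOCK_FIELDS,
  );
  const entries = Object.keys(record)
    .filter((key) => !excluded.has(key))
    .sort()
    .map((key) => [key, record[key]] as const);
  return createHash("sha256").update(JSON.stringify(entries)).digest("hex");
}
```

With this shape, a re-download that changes only pdf_sha256, pdf_path, document_url, and fetched_at collapses to the same fingerprint when the content fields are present, whereas a record without them still treats a pdf_sha256 change as a version boundary.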
Capabilities
New Capabilities
- None.
Modified Capabilities
- reference-implementation-architecture: add the statement content-fingerprint record fields and the content-gated canonical fingerprint/exclusion rule for chase/statements and usaa/statements; update the compaction-policy requirement so these two streams are no longer described as blob-identity-bounded run-clock-only policies.
Removed Capabilities
- None.
Impact
- packages/polyfill-connectors/connectors/chase/index.ts: emit pdf_text_sha256/pdf_page_count on statements; new text-extraction path.
- packages/polyfill-connectors/connectors/usaa/index.ts and connectors/usaa/statement-pdfs.ts: emit the two fields from the existing pdf-parse extraction.
- packages/polyfill-connectors/src/fingerprint-cursor.ts consumers: content-gated exclusion list for the statements cursors.
- Connector manifests for Chase and USAA: declare the two new stream fields.
- reference-implementation/scripts/compact-record-history.mjs: update chase/statements and usaa/statements policies (content-gated exclusion, changeModel/representativePolicy).
- reference-implementation/test/compact-record-history-*.test.* and connector fingerprint/parity tests.
- Operator evidence artifacts under tmp/workstreams/ for copied-database validation.