Add polyfill connector system
In progress — 8 connectors pristine (951k records), 3 open-question notes on partial-run honesty mechanism, Chase scaffolded (2026-04-21)
Artifacts
Official change artifacts tracked under openspec/.
The reference implementation today has sample polyfill connectors (Spotify, GitHub, Reddit) backed by seed fixtures. It does not yet have living polyfill connectors against real platforms for a real user, running on a real schedule, with a real human-in-the-loop interaction channel.
Supersession note (2026-04-25): browser-backed connector details in this early design are historical where they mention a shared ~/.pdpp/browser-profile/, browser bootstrap/probe CLI, or browser daemon. Current implementation uses per-connector isolated Patchright profiles via packages/polyfill-connectors/src/browser-launch.ts; see openspec/changes/retire-browser-daemon.
Affected capabilities
Capability specs this change proposes to modify.
Project notes
Change-local notes that support this workstream but have not been promoted into the official change artifacts.
Add polyfill connector system
55 notes · 24 open questions · 1 plan · 2 strategy notes · 2 audits · 1 research note · 14 connector notes · 11 working notes · updated
Status: researching Owner: owner/runtime Created: 2026-04-19 Updated: 2026-04-25 Related: add-polyfill-connector-system; add-reference-runtime-spec; credential-bootstrap-automation-open-question.md; raw-provenance-capture-open-question.md; external-tool-dependencies-open-question.md; connector-configuration-open-question.md
"We are unable to complete your request. Our system is currently unavailable. Please try again later."
If the answer is "yes, with constraints," PDPP could support a new lane between:
| | Activity streams | Authored artifacts |
|---|---|---|
| Volume | High (thousands → millions) | Low (tens → hundreds) |
| Mutation | Append-only (sometimes tombstones) | Mutable; edited over time |
| Cursor | Timestamp / monotonic | Revision or content-hash |
| Consent weight | Sensitive by volume (bulk access) | Sensitive by leverage (encodes user strategy) |
| Disclosure framing | "Your Gmail messages" | "Your custom ChatGPT prompts" |
| Restoration | Server retains source of truth; re-collectable | Lost forever if not preserved |
Status: sprint-needed Owner: project owner Created: 2026-04-19 Updated: 2026-04-24 Related: openspec/changes/add-polyfill-connector-system/tasks.md (Gmail attachment blob collection), openspec/changes/add-polyfill-connector-system/design-notes/layer-2-coverage-gmail-ynab-usaa-github.md, pdpp-trust-model-framing.md
Status: sprint-needed Owner: project owner Created: 2026-04-19 Updated: 2026-04-24 Related: design-notes/source-instances-and-multi-account-configurations-2026-04-24.md (repo root)
Layer 2 coverage audits (see layer-2-coverage-gmail-ynab-usaa-github.md, layer-2-coverage-chatgpt-claude-codex.md) surfaced identity/social data in every connector inspected:
Status: sprint-needed Owner: project owner Created: 2026-04-20 Updated: 2026-04-24 Related: openspec/changes/add-polyfill-connector-system/design-notes/partial-run-semantics-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/gap-recovery-execution-open-question.md, pdpp-trust-model-framing.md
Several connectors in today's fleet (Gmail, ChatGPT, USAA, Slack) have each cost hours of debugging that this kind of infrastructure could have cut to minutes. That pattern will repeat every time we fix a bug or add a connector.
| Class | Example | Spec'd today? |
|---|---|---|
| 1. Runtime bindings | network, filesystem, interactive | ✅ yes — runtime_requirements.bindings |
| 2. Language-level deps | npm packages, Go modules, Python imports | ❌ no — implicit in the connector package |
| 3. External tool binaries | slackdump, osxphotos, ffmpeg, pandoc, playwright browsers | ❌ no — invisible to spec |
Status: sprint-needed Owner: project owner Created: 2026-04-20 Updated: 2026-04-24 Related: openspec/changes/add-polyfill-connector-system/design-notes/partial-run-semantics-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/cursor-finality-and-gap-awareness-open-question.md, pdpp-trust-model-framing.md
The short-term answer to "human-attended browser access for connectors in Docker" is specifically host-browser control, not noVNC, not browser streaming, not a connector-worker protocol. The follow-up work should be scoped around a deliberately configured local bridge:
Catches broken extractors, undeclared columns, null-where-required, declared fields that never populate. Recent run surfaced ChatGPT dropping ~67% of messages, Gmail snippet/references/content_type broken, USAA manifest drift. The conformance harness already runs this.
One row per setting: {id, key, category, value, value_type: "string" | "number" | "boolean" | "json" | "enum", value_enum_options, description, last_modified, source_connector}. Pro: queryable across sources; portable; renders cleanly in disclosure UI. Con: nested/collection settings (Gmail filters, muted-channel lists) collapse into opaque JSON, losing typing where it matters most.
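A minimal sketch of the one-row-per-setting shape, with underscored field names assumed from the note and all values invented for illustration. The second row shows the con in practice: a collection setting collapses into an opaque JSON string.

```javascript
// One-row-per-setting shape from the note; values are illustrative.
const settingRow = {
  id: 'gmail-setting-0001',            // hypothetical id scheme
  key: 'forwarding.enabled',
  category: 'mail_handling',
  value: 'false',
  value_type: 'boolean',               // "string" | "number" | "boolean" | "json" | "enum"
  value_enum_options: null,
  description: 'Whether mail forwarding is enabled',
  last_modified: '2026-04-20T00:00:00Z',
  source_connector: 'gmail',
};

// The con from the note: a nested/collection setting (e.g. Gmail filters)
// flattens to a JSON string; typing inside the array is lost.
const nestedSettingRow = {
  ...settingRow,
  id: 'gmail-setting-0002',
  key: 'filters',
  value: JSON.stringify([{ match: 'from:billing@example.com', action: 'archive' }]),
  value_type: 'json',
};
```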
reference-implementation/server/index.js:1054:
spec-core.md Tier 1 RS requirement #12:
Status: sprint-needed Owner: project owner Created: 2026-04-20 Updated: 2026-04-24 Related: openspec/changes/add-polyfill-connector-system/design-notes/cursor-finality-and-gap-awareness-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/gap-recovery-execution-open-question.md, pdpp-trust-model-framing.md
Most consumer platforms expose at least one of the following surfaces for personal data:
Re-extraction cost, audit fidelity, and self-export completeness all pivot on whether the RS holds the upstream artifact or only the parsed record. Owners who self-export raw receive something qualitatively different from owners who receive only the extractor's output: raw is auditable against the source and re-parseable when the extractor improves; the parsed record is frozen at whatever shape the extractor happened to produce on ingest day.
GET /v1/streams without a connector_id query parameter returns:
The question: is one-DB-per-owner a PDPP spec requirement, a reference-implementation convention, or an incidental choice?
Meanwhile, anyone who builds on PDPP will re-embed the same 800k records. That is a staggering amount of duplicated work across implementations, and it is exactly the forcing function the blob-hydration note already named for binaries ("don't make every consumer re-derive expensive things that are identical across consumers"). Embeddings and BM25 indexes fall in the same class.
Many platforms expose a three-step web flow for obtaining durable API credentials:
Status: sprint-needed Owner: project owner Created: 2026-04-19 Updated: 2026-04-24 Related: openspec/changes/add-polyfill-connector-system/design-notes/credential-bootstrap-automation-open-question.md, pdpp-trust-model-framing.md
Status: captured Owner: project owner Created: 2026-04-20 Updated: 2026-04-24 Related: openspec/changes/add-polyfill-connector-system/design-notes/partial-run-semantics-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/cursor-finality-and-gap-awareness-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/gap-recovery-execution-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/blob-hydration-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/credential-storage-open-question.md
The open question note establishes that PDPP could plausibly support a lane for:
Status: captured Owner: connector worker Created: 2026-04-24 Updated: 2026-04-24 Related: openspec/changes/add-polyfill-connector-system
Status: audit complete; reverified by query-api-gap-audit Original branch: audit-query-api-readiness Verification branch: query-api-gap-audit Verification worktree: /home/tnunamak/code/pdpp-query-api-gap-audit Scope: read-only audit of reference-implementation/server, query docs/specs, OpenSpec artifacts, and packages/polyfill-connectors/manifests.
Year-freezing. A year's orders don't change once closed. Once we've scraped year Y and the count matches for 2 consecutive runs, Y is "frozen" — skip it on future runs. Only the current year + last 60 days of the prior year need re-scraping each run.
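The year-freezing rule above can be sketched as two small helpers (names and shapes are ours, not from the connector):

```javascript
// A year freezes once its scraped order count matches for 2 consecutive runs.
function updateFrozen(frozen, year, prevCount, currCount) {
  if (prevCount !== null && prevCount === currCount) frozen.add(year);
  return frozen;
}

// Each run re-scrapes the current year, plus the prior year while we are
// still within 60 days of the year boundary; other frozen years are skipped.
function yearsToScrape(now, firstYear, frozen) {
  const current = now.getUTCFullYear();
  const daysIntoYear = (now - Date.UTC(current, 0, 1)) / 86_400_000;
  const years = [];
  for (let y = firstYear; y <= current; y++) {
    const mustRescrape =
      y === current || (y === current - 1 && daysIntoYear < 60);
    if (mustRescrape || !frozen.has(y)) years.push(y);
  }
  return years;
}
```

For example, in mid-June only unfrozen years plus the current year are scraped; in mid-January the prior year is re-scraped even if frozen, because it sits inside the 60-day window.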
The common assumption — and what our initial research pointed at — was that Chase uses Akamai Bot Manager Premier which detects headless Chromium at login. Four findings disprove that for this specific flow:
Financial transactions are a high-value polyfill stream for the same reason USAA and YNAB are: reconciliation, life-history analysis, owner self-export, audit. Chase is the largest US retail bank by active checking accounts and is a natural parallel to USAA for demonstrating that PDPP polyfill connectors generalize across institutions.
All fetches to /backend-api/ MUST go through page.evaluate(fetch) inside the browser context. Node.js fetch will be 403'd by Cloudflare. Non-negotiable.
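A sketch of that rule as a small wrapper (the helper name is hypothetical; `page` is a Playwright/Patchright Page): the fetch executes via page.evaluate inside the browser context, with the page's cookies and Cloudflare clearance, never in Node.

```javascript
// Route every /backend-api/ call through the page's own fetch.
async function backendApiFetch(page, path, init = {}) {
  return page.evaluate(
    async ({ path, init }) => {
      const res = await fetch(path, { credentials: 'include', ...init });
      if (!res.ok) throw new Error(`backend-api ${res.status}`);
      return res.json();
    },
    { path, init },
  );
}
```

Usage would look like `await backendApiFetch(page, '/backend-api/conversations?limit=20')`; because the callback runs in the browser, same-origin cookies and anti-bot tokens ride along automatically.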
Three streams each, following Claude Code's shape for consistency where possible:
Library: imapflow (Node). Handles CONDSTORE natively, tolerates Gmail's lack of QRESYNC, clean async/await API, maintained.
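A sketch of the incremental-fetch shape under CONDSTORE, assuming imapflow's fetch(range, query, options) call pattern; `lastHighestModseq` would come from the connector's stored cursor.

```javascript
// Build the arguments for an incremental IMAP fetch: with CONDSTORE,
// only messages whose MODSEQ exceeds the stored cursor are returned.
function buildCondstoreFetch(lastHighestModseq) {
  return {
    range: '1:*',
    query: { uid: true, envelope: true, flags: true },
    options: { changedSince: lastHighestModseq }, // CONDSTORE delta
  };
}
```

Under that assumption, usage would be roughly `const { range, query, options } = buildCondstoreFetch(cursor); for await (const msg of client.fetch(range, query, options)) { … }`, with the new HIGHESTMODSEQ persisted after the run.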
Node.js v24+ readline.createInterface() treats U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) as line terminators. This tracks ECMA-262's definition of line terminators.
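The ECMA-262 LineTerminator set the note refers to can be reproduced directly; a splitter built on that set divides on U+2028/U+2029 the same way the newer readline behavior does:

```javascript
// ECMA-262 LineTerminator: LF, CR (and CRLF as a pair), U+2028 (LINE
// SEPARATOR), U+2029 (PARAGRAPH SEPARATOR).
const LINE_TERMINATORS = /\r\n|[\n\r\u2028\u2029]/;

const text = 'alpha\u2028beta\u2029gamma\ndelta';
const lines = text.split(LINE_TERMINATORS);
// → ['alpha', 'beta', 'gamma', 'delta']
```

The practical consequence for connectors: record bodies containing U+2028/U+2029 will be split into extra "lines" by such readers, so line counts from readline are not a safe proxy for record counts.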
---
USAA deprecated OFX/QFX in mid-2023 but retains CSV export via the UI. Flow per account:
1. Navigate to /my/accounts
2. Click the account name
3. Click the "I want to" menu (upper-left of the account detail page)
4. Select "Export"
5. Choose CSV and a date range, then download
Evidence gathered from live session recon during the overnight run.
USAA's CSV export UI hard-caps at ~18 months. Empirically on 2026-04-19: "10/19/2024 accepted, 04/19/2024 rejected." Requesting an older range leaves the form in the "Fix From Date" state and the submit button never enables. This is documented in the connector at packages/polyfill-connectors/connectors/usaa/index.js around line 350.
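A guard derived from that empirical cap (the 18-month figure and helper name are ours, not USAA's): clamp the requested From date before filling the form, so the export never lands in the dead "Fix From Date" state.

```javascript
// Clamp a requested start date to USAA's empirical ~18-month export window.
function clampFromDate(requested, now = new Date()) {
  const earliest = new Date(now);
  earliest.setUTCMonth(earliest.getUTCMonth() - 18); // rolls the year back too
  return requested < earliest ? earliest : requested;
}
```

With "now" as 2026-04-19, a requested From of 2024-04-19 clamps to 2024-10-19, which matches the observed accepted/rejected boundary.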
Built scaffolds for 13 additional connectors beyond the original five MVP connectors (YNAB, Gmail, ChatGPT, USAA, Amazon). All have full manifests; implementations range from complete (API-based, credentials available) to scaffolded pending wiring (browser-based, needing a live session).
Rationale: valuable for reconciliation — GPS at which a payee was last used can match a bank-statement merchant to a specific Amazon/Uber location.
Status: open Owner: Tim Created: 2026-04-24 Updated: 2026-04-24 Related: add-polyfill-connector-system; reference runtime controller; protocol-violation diagnostics
Status: captured Owner: connector-live-smoke-triage worker Created: 2026-04-24 Updated: 2026-04-24 Related: openspec/changes/add-polyfill-connector-system
Tim's list (verbatim, then expanded):
Slackdump's -chan-types public,private,im,mpim lets the operator say "give me public channels and DMs but skip group DMs." This is not a resources filter (that's by ID) and not a streams filter (that's by record kind). It's a sub-type within a stream.
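The three filter axes can be kept distinct in connector configuration; a sketch with invented field names (not from the manifest schema):

```javascript
// Three independent filter axes for a Slack-style connector:
//   resources     — by ID
//   streams       — by record kind
//   channel_types — sub-type within the channels stream
//                   (slackdump's -chan-types equivalent)
const slackFilter = {
  resources: { exclude_ids: ['C0EXAMPLE'] },   // hypothetical channel ID
  streams: ['channels', 'messages'],
  channel_types: ['public', 'private', 'im'],  // skip 'mpim' (group DMs)
};

// A channel is collected only if it passes both the sub-type axis and
// the resource-ID axis.
function channelSelected(channel, filter) {
  return (
    filter.channel_types.includes(channel.type) &&
    !filter.resources.exclude_ids.includes(channel.id)
  );
}
```

Keeping the axes separate means "skip group DMs" never has to be encoded as a brittle list of resource IDs.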
49,173 real records across 20 streams from 4 platforms, all your actual data, ingested into the PDPP RS. YNAB + Gmail + ChatGPT from last night; USAA added today (5 streams: accounts, transactions, statements, inbox_messages, creditcard_billing).
The A++ follow-up audit found that several connectors with obvious parent/child stream relationships were historically child-first:
Several auto-login helpers use page.waitForTimeout(ms) as a synchronization primitive. This is an explicit Playwright anti-pattern — the docs call it out as "strongly discouraged" because it makes tests flaky, slow, and brittle to timing variance. It's in our code because the helpers were adapted from pre-existing scrapers that used the pattern, and we preserved working behavior while extending them.
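A condition-polling helper of the kind that could replace those fixed sleeps. This is a generic sketch, not Playwright's API; inside Playwright itself the idiomatic replacements are locator.waitFor() and web-first assertions, but the same shape applies to any helper that currently sleeps and hopes.

```javascript
// Poll a predicate until it holds or the deadline passes, instead of
// sleeping for a fixed, hopeful number of milliseconds.
async function waitUntil(predicate, { timeoutMs = 10_000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await predicate()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`waitUntil: condition not met within ${timeoutMs}ms`);
}
```

The win is that runs finish as soon as the condition holds and fail loudly when it never does, instead of silently proceeding after an arbitrary sleep.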
generated/private pilot.
Honest audit turned up 5 classes of gap across the connector fleet. This document tracks the fix-all pass.
Current config (reference-implementation/server/db.js):
Binding on every connector past, present, and future.