Add polyfill connector system

In progress — 8 connectors pristine (951k records), 3 open-question notes on partial-run honesty mechanism, Chase scaffolded (2026-04-21)

In progress · tasks 85/90 · add-polyfill-connector-system
openspec/changes/add-polyfill-connector-system/

Artifacts

Official change artifacts tracked under openspec/.

Affected capabilities

Capability specs this change proposes to modify.

Project notes

Change-local notes that support this workstream but have not been promoted into the official change artifacts.

55 notes · 24 open questions · 1 plan · 2 strategy notes · 2 audits · 1 research note · 14 connector notes · 11 working notes

Open questions
Browser Automation And Agent Tooling Landscape

Status: researching · Owner: owner/runtime · Created: 2026-04-19 · Updated: 2026-04-25 · Related: add-polyfill-connector-system; add-reference-runtime-spec; credential-bootstrap-automation-open-question.md; raw-provenance-capture-open-question.md; external-tool-dependencies-open-question.md; connector-configuration-open-question.md

Open question: account risk from repeated automation — how does the protocol protect the owner's account from being locked out of their own bank?

"We are unable to complete your request. Our system is currently unavailable. Please try again later."

Open question: agent-generated custom connectors with PDPP conformance

If the answer is "yes, with constraints," PDPP could support a new lane between:

Open question: authored artifacts vs activity streams

| | Activity streams | Authored artifacts |
|---|---|---|
| Volume | High (thousands → millions) | Low (tens → hundreds) |
| Mutation | Append-only (sometimes tombstones) | Mutable; edited over time |
| Cursor | Timestamp / monotonic | Revision or content-hash |
| Consent weight | Sensitive by volume (bulk access) | Sensitive by leverage (encodes user strategy) |
| Disclosure framing | "Your Gmail messages" | "Your custom ChatGPT prompts" |
| Restoration | Server retains source of truth; re-collectable | Lost forever if not preserved |

Open question: blob-hydration as a protocol primitive

Status: sprint-needed · Owner: project owner · Created: 2026-04-19 · Updated: 2026-04-24 · Related: openspec/changes/add-polyfill-connector-system/tasks.md (Gmail attachment blob collection), openspec/changes/add-polyfill-connector-system/design-notes/layer-2-coverage-gmail-ynab-usaa-github.md, pdpp-trust-model-framing.md

Open question: connector configuration surface

Status: sprint-needed · Owner: project owner · Created: 2026-04-19 · Updated: 2026-04-24 · Related: design-notes/source-instances-and-multi-account-configurations-2026-04-24.md (repo root)

Open question: cross-connector identity graph

Layer 2 coverage audits (see layer-2-coverage-gmail-ynab-usaa-github.md, layer-2-coverage-chatgpt-claude-codex.md) surfaced identity/social data in every connector inspected:

Open question: cursor finality & gap-awareness — what does a connector mean when it says "I have up to X"?

Status: sprint-needed · Owner: project owner · Created: 2026-04-20 · Updated: 2026-04-24 · Related: openspec/changes/add-polyfill-connector-system/design-notes/partial-run-semantics-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/gap-recovery-execution-open-question.md, pdpp-trust-model-framing.md

Open question: debugging-leverage infrastructure for connector development

Several connectors in today's fleet (Gmail, ChatGPT, USAA, Slack) have had hours of debugging that similar infrastructure could have cut to minutes. This pattern will repeat every time we fix a bug or add a connector.

Open question: external-tool dependencies (subprocess binaries)

| Class | Example | Spec'd today? |
|---|---|---|
| 1. Runtime bindings | network, filesystem, interactive | ✅ yes, via runtime_requirements.bindings |
| 2. Language-level deps | npm packages, Go modules, Python imports | ❌ no, implicit in the connector package |
| 3. External tool binaries | slackdump, osxphotos, ffmpeg, pandoc, playwright browsers | ❌ no, invisible to spec |

Open question: gap-recovery execution — who runs the retries, and what does the connector owe them?

Status: sprint-needed · Owner: project owner · Created: 2026-04-20 · Updated: 2026-04-24 · Related: openspec/changes/add-polyfill-connector-system/design-notes/partial-run-semantics-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/cursor-finality-and-gap-awareness-open-question.md, pdpp-trust-model-framing.md

Open question: host-browser bridge for connectors running in containers

The short-term answer to "human-attended browser access for connectors in Docker" is specifically host-browser control, not noVNC, not browser streaming, not a connector-worker protocol. The follow-up work should be scoped around a deliberately configured local bridge:

Open question: manifest completeness vs the source's actual surface (Layer 2)

Catches broken extractors, undeclared columns, null-where-required, and declared fields that never populate. A recent run surfaced ChatGPT dropping ~67% of messages, broken Gmail snippet/references/content_type fields, and USAA manifest drift. The conformance harness already runs this.

Open question: normalized settings/preferences stream convention

One row per setting: {id, key, category, value, value_type: "string" | "number" | "boolean" | "json" | "enum", value_enum_options, description, last_modified, source_connector}. Pro: queryable across sources; portable; renders cleanly in disclosure UI. Con: nested/collection settings (Gmail filters, muted-channel lists) collapse into opaque JSON, losing typing where it matters most.
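
A minimal sketch of the one-row-per-setting shape, to make the trade-off concrete. The helper names (`valueTypeOf`, `toSettingRows`), the `id` format, the snake_case field spellings, and the sample Gmail settings are all assumptions for illustration, not part of any spec. Note how the nested `filters` setting falls through to opaque JSON, which is the con named above.

```javascript
// Hypothetical helper: infer the row's value_type from a JavaScript value.
// Anything that is not a scalar collapses to "json" (arrays, objects, null).
function valueTypeOf(value) {
  if (typeof value === "string") return "string";
  if (typeof value === "number") return "number";
  if (typeof value === "boolean") return "boolean";
  return "json";
}

// Hypothetical helper: flatten one connector's settings object into rows.
// Enum detection and descriptions are left null in this sketch.
function toSettingRows(sourceConnector, category, settings, lastModified) {
  return Object.entries(settings).map(([key, value]) => {
    const valueType = valueTypeOf(value);
    return {
      id: `${sourceConnector}:${category}:${key}`, // assumed id scheme
      key,
      category,
      value: valueType === "json" ? JSON.stringify(value) : value,
      value_type: valueType,
      value_enum_options: null,
      description: null,
      last_modified: lastModified,
      source_connector: sourceConnector,
    };
  });
}

// Made-up sample settings; the third one is a nested collection.
const rows = toSettingRows(
  "gmail",
  "inbox",
  {
    conversation_view: true,
    page_size: 50,
    filters: [{ from: "billing@example.com", label: "Receipts" }],
  },
  "2026-04-19T00:00:00Z"
);
```

The scalar settings stay typed and queryable, while the filter list survives only as a JSON string that a consumer has to re-parse.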

Open question: owner authentication at device-approve time

reference-implementation/server/index.js:1054:

Open question: owner self-export needs a connector-list endpoint

spec-core.md Tier 1 RS requirement #12:

Open question: partial-run semantics — what does it mean for a run to "succeed" when not every record made it?

Status: sprint-needed · Owner: project owner · Created: 2026-04-20 · Updated: 2026-04-24 · Related: openspec/changes/add-polyfill-connector-system/design-notes/cursor-finality-and-gap-awareness-open-question.md, openspec/changes/add-polyfill-connector-system/design-notes/gap-recovery-execution-open-question.md, pdpp-trust-model-framing.md

Open question: platform archive-request flows as a data source

Most consumer platforms expose at least one of the following surfaces for personal data:

Open question: raw-provenance capture — should connectors preserve re-extractable artifacts?

Re-extraction cost, audit fidelity, and self-export completeness all pivot on whether the RS holds the upstream artifact or only the parsed record. Owners who self-export raw receive something qualitatively different from owners who receive only the extractor's output: raw is auditable against the source and re-parseable when the extractor improves; the parsed record is frozen at whatever shape the extractor happened to produce on ingest day.

Open question: RS API surface — discoverability, error-shape, and query semantics

GET /v1/streams without a connector_id query parameter returns:

Open question: RS storage topology — one DB or many?

The question: is one-DB-per-owner a PDPP spec requirement, a reference-implementation convention, or an incidental choice?

Open question: semantic retrieval surface — what search primitives does the RS expose, and who owns the ranker?

Meanwhile, anyone who builds on PDPP will re-embed the same 800k records. That is a staggering amount of duplicated work across implementations, and it is the same forcing function the blob-hydration note already named for binaries ("don't make every consumer re-derive expensive things that are identical across consumers"). Embeddings and BM25 indexes fall in the same class.

Open question: systematizing credential-bootstrap automation across connectors

Many platforms expose a three-step web flow for obtaining durable API credentials:

Open question: where do connector credentials belong?

Status: sprint-needed · Owner: project owner · Created: 2026-04-19 · Updated: 2026-04-24 · Related: openspec/changes/add-polyfill-connector-system/design-notes/credential-bootstrap-automation-open-question.md, pdpp-trust-model-framing.md

Plans
Strategy & framing
Audits & reviews
Research
Connector notes
Amazon connector — design notes

Year-freezing. A year's orders don't change once closed. Once we've scraped year Y and the count matches for 2 consecutive runs, Y is "frozen" — skip it on future runs. Only the current year + last 60 days of the prior year need re-scraping each run.
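
The year-freezing rule can be sketched as a small planning function. This is one reading of the rule under stated assumptions: `history` (order counts per year from past runs, oldest first), `planScrape`, and the `scope` labels are hypothetical names, and the sample counts are made up.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// A year is "frozen" once its order count matched on the last 2 runs.
function isFrozen(counts) {
  return (
    counts.length >= 2 &&
    counts[counts.length - 1] === counts[counts.length - 2]
  );
}

// history: { [year]: number[] } of order counts seen on previous runs.
// Returns which years to scrape on this run, and how much of each.
function planScrape(history, today) {
  const currentYear = today.getUTCFullYear();
  const plan = [];
  for (const [yearKey, counts] of Object.entries(history)) {
    const year = Number(yearKey);
    if (year >= currentYear) continue; // current year is added below
    if (!isFrozen(counts)) {
      plan.push({ year, scope: "full" }); // count not yet stable
    } else if (year === currentYear - 1) {
      // Frozen prior year: still re-scrape its last 60 days (inclusive).
      const from = new Date(Date.UTC(year, 11, 31) - 59 * DAY_MS);
      plan.push({ year, scope: "tail", from: from.toISOString().slice(0, 10) });
    }
    // Older frozen years are skipped entirely.
  }
  plan.push({ year: currentYear, scope: "full" });
  return plan;
}

const plan = planScrape(
  { 2023: [41, 41], 2024: [57, 58], 2025: [112, 112] },
  new Date(Date.UTC(2026, 3, 19))
);
```

With these sample counts, 2023 is skipped, 2024 is re-scraped in full because its count is still moving, 2025 is re-scraped only from 2025-11-02 onward, and 2026 is scraped in full.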

Chase anti-bot investigation (2026-04-21)

The common assumption — and what our initial research pointed at — was that Chase uses Akamai Bot Manager Premier which detects headless Chromium at login. Four findings disprove that for this specific flow:

Chase connector — scope and strategy

Financial transactions are a high-value polyfill stream for the same reason USAA and YNAB are: reconciliation, life-history analysis, owner self-export, audit. Chase is the largest US retail bank by active checking accounts and is a natural parallel to USAA for demonstrating that PDPP polyfill connectors generalize across institutions.

ChatGPT connector — design notes

All fetches to /backend-api/ MUST go through page.evaluate(fetch) inside the browser context. Node.js fetch will be 403'd by Cloudflare. Non-negotiable.
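
A minimal sketch of the rule above, assuming `page` is a Playwright `Page` that is already logged in. `fetchBackendApi` is a hypothetical helper name, and any authorization headers a given endpoint needs are out of scope here.

```javascript
// Run the request inside the browser context rather than from Node.js.
// page.evaluate(fn, arg) serializes fn, executes it in the page, and
// returns the (JSON-serializable) result to Node.
async function fetchBackendApi(page, path) {
  return page.evaluate(async (url) => {
    // This fetch is the browser's own, so it carries the page's cookies.
    const res = await fetch(url, { credentials: "include" });
    if (!res.ok) {
      throw new Error(`backend-api request failed: ${res.status} ${url}`);
    }
    return res.json();
  }, path);
}
```

A connector would call it as `await fetchBackendApi(page, "/backend-api/...")`, keeping every request on the same origin and session as the logged-in tab.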

Claude Code + Codex connectors

Three streams each, following Claude Code's shape for consistency where possible:

Gmail connector — design notes

Library: imapflow (Node). Handles CONDSTORE natively, tolerates Gmail's lack of QRESYNC, clean async/await API, maintained.

JSONL truncation bug — ROOT CAUSE FOUND: U+2028 in Node 24+ readline

Node.js v24+ readline.createInterface() treats U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) as line terminators. This tracks ECMA-262's definition of line terminators.
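
One way to sidestep this class of bug is to split JSONL on `\n` only, since JSONL records are newline-delimited and `JSON.stringify` leaves U+2028 and U+2029 unescaped inside strings. The sketch below is illustrative, not the fix that was actually shipped; `parseJsonl` is a hypothetical name and the sample records are made up.

```javascript
// Split on "\n" only (tolerating a trailing "\r"), never on U+2028/U+2029.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => (line.endsWith("\r") ? line.slice(0, -1) : line))
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// A record whose string value contains a raw LINE SEPARATOR.
const record = { id: 1, body: "first paragraph\u2028second paragraph" };
const jsonl =
  JSON.stringify(record) + "\n" + JSON.stringify({ id: 2, body: "ok" }) + "\n";

// Newline-only splitting keeps both records intact.
const parsed = parseJsonl(jsonl);

// Splitting on Unicode separators too cuts the first record in half,
// which is the truncation symptom described above.
const naivePieces = jsonl.split(/\n|\u2028|\u2029/).filter(Boolean);
```

Here `parsed` holds two complete records, while `naivePieces` holds three fragments, the first two of which are not valid JSON on their own.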

Layer 2 coverage audit: ChatGPT, Claude Code, Codex

Layer 2 coverage audit: Gmail, YNAB, USAA, GitHub

USAA connector — design notes

USAA deprecated OFX/QFX in mid-2023 but retains CSV export via UI. Flow per account:

1. Navigate to /my/accounts
2. Click account name
3. Click "I want to" menu (upper-left of account detail page)
4. Select "Export"
5. Choose CSV, date range, download

USAA extra-streams wiring plan

Evidence gathered from live session recon during the overnight run.

USAA historical-coverage gap (CSV cap vs PDF statement archive)

USAA's CSV export UI hard-caps at ~18 months. Empirically on 2026-04-19: "10/19/2024 accepted, 04/19/2024 rejected." Requesting older ranges leaves the form in a "Fix From Date" state, and the submit button never enables. This is documented in the connector at packages/polyfill-connectors/connectors/usaa/index.js around line 350.
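
A connector can avoid the stuck form by clamping the requested From date before submitting. The sketch below assumes the cap is exactly 18 calendar months, which is only the empirical observation above and may drift; `earliestExportDate` and `clampFromDate` are hypothetical names, not the code in the connector.

```javascript
// Earliest From date the export form is assumed to accept:
// the same day of the month, monthsBack months before `today` (UTC).
function earliestExportDate(today, monthsBack = 18) {
  const year = today.getUTCFullYear();
  const month = today.getUTCMonth() - monthsBack; // may be negative; Date.UTC normalizes
  const day = today.getUTCDate();
  // Day 0 of the following month is the last day of the target month,
  // which guards against overflow (for example the 31st into a 30-day month).
  const lastDay = new Date(Date.UTC(year, month + 1, 0)).getUTCDate();
  return new Date(Date.UTC(year, month, Math.min(day, lastDay)));
}

// Never ask the form for anything older than the assumed cap.
function clampFromDate(requested, today) {
  const earliest = earliestExportDate(today);
  return requested < earliest ? earliest : requested;
}

const today = new Date(Date.UTC(2026, 3, 19)); // 2026-04-19
const earliest = earliestExportDate(today);
const clamped = clampFromDate(new Date(Date.UTC(2024, 3, 19)), today);
```

For the 2026-04-19 observation this yields 2024-10-19 as the earliest date, and a request for 2024-04-19 is clamped forward to it. Anything older has to come from another source.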

Wide-build batch — overnight expansion

Built scaffolds for 13 additional connectors beyond the original 5-MVP (YNAB, Gmail, ChatGPT, USAA, Amazon). All have full manifests; implementations vary from complete (API-based with available creds) to scaffolded-pending-wiring (browser-based needing live session).

YNAB connector — design notes

Rationale: valuable for reconciliation. The GPS location at which a payee was last used can match a bank-statement merchant to a specific Amazon/Uber location.

Working notes
Connector Fixture Binding Shadows Polyfill Connectors

Status: open · Owner: Tim · Created: 2026-04-24 · Updated: 2026-04-24 · Related: add-polyfill-connector-system; reference runtime controller; protocol-violation diagnostics

Connector Live-Smoke Status Matrix

Status: captured · Owner: connector-live-smoke-triage worker · Created: 2026-04-24 · Updated: 2026-04-24 · Related: openspec/changes/add-polyfill-connector-system

Dashboard design upgrade — deferred until data is pristine

Tim's list (verbatim, then expanded):

Design gaps surfaced by slackdump

Slackdump's -chan-types public,private,im,mpim lets the operator say "give me public channels and DMs but skip group DMs." This is not a resources filter (that's by ID) and not a streams filter (that's by record kind). It's a sub-type within a stream.
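
To illustrate what a sub-type filter within a single stream might look like, here is a small sketch. The `channel_type` record field, the helper names, and the sample channels are assumptions for illustration; only the four type values come from the slackdump flag quoted above.

```javascript
const CHANNEL_TYPES = ["public", "private", "im", "mpim"];

// Parse a comma-separated value such as "public,im" into a validated set.
function parseChanTypes(value) {
  const types = value
    .split(",")
    .map((t) => t.trim())
    .filter(Boolean);
  const unknown = types.filter((t) => !CHANNEL_TYPES.includes(t));
  if (unknown.length > 0) {
    throw new Error(`unknown channel types: ${unknown.join(", ")}`);
  }
  return new Set(types);
}

// Filter records of ONE stream by sub-type: not by resource ID, not by stream.
function filterChannels(channels, allowed) {
  return channels.filter((c) => allowed.has(c.channel_type));
}

// Made-up records from a hypothetical "channels" stream.
const channels = [
  { id: "C1", channel_type: "public" },
  { id: "G1", channel_type: "private" },
  { id: "D1", channel_type: "im" },
  { id: "M1", channel_type: "mpim" },
];

// "Give me public channels and DMs but skip group DMs."
const kept = filterChannels(channels, parseChanTypes("public,im"));
```

The filter operates on a field of the record rather than on its ID or stream name, which is why neither existing filter kind can express it.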

Overnight summary — Tim, read this first when you wake up

49,173 real records across 20 streams from 4 platforms, all your actual data, ingested into PDPP RS. YNAB + Gmail + ChatGPT from last night, USAA added today (5 streams: accounts, transactions, statements, inbox_messages, credit_card_billing).

Parent-first emit order decision — 2026-04-23

The A++ follow-up audit found that several connectors with obvious parent/child stream relationships were historically child-first:

Playwright hygiene — known tech debt

Several auto-login helpers use page.waitForTimeout(ms) as a synchronization primitive. This is an explicit Playwright anti-pattern — the docs call it out as "strongly discouraged" because it makes tests flaky, slow, and brittle to timing variance. It's in our code because the helpers were adapted from pre-existing scrapers that used the pattern, and we preserved working behavior while extending them.
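
A sketch of the usual replacement, assuming `page` is a Playwright `Page`: wait on the condition the helper actually depends on instead of sleeping for a fixed time. `waitForLoginForm`, the selector, and the default timeout are hypothetical and not taken from the existing helpers.

```javascript
// Instead of `await page.waitForTimeout(3000)` followed by hoping the
// form has rendered, wait for the element itself to become visible.
async function waitForLoginForm(page, selector, timeoutMs = 15000) {
  const field = page.locator(selector);
  // Resolves as soon as the element is visible; rejects after timeoutMs.
  await field.waitFor({ state: "visible", timeout: timeoutMs });
  return field;
}
```

A helper would then do `const user = await waitForLoginForm(page, "#username")` and fill the field. A fast page proceeds immediately, and a slow one fails with a clear timeout error instead of a silent race.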

Private Generated Connector Pilot Source Selection — 2026-04-23

generated/private pilot.

Spec-conformance upgrade pass — 2026-04-19 overnight

Honest audit turned up 5 classes of gap across the connector fleet. This document tracks the fix-all pass.

SQLite performance recommendations for the reference runtime

Current config (reference-implementation/server/db.js):

Unattended-operation principle

Binding on every connector past, present, and future.
