Claude Code + Codex connectors

add-polyfill-connector-systemProject noteConnector notes
Created openspec/changes/add-polyfill-connector-system/design-notes/claude-code-codex-connectors.mdView on GitHub →

Status: implemented 2026-04-19. Both ingests currently in flight against real data. Motivation: capture the user's coding-agent session history as first-class, query-able personal data. These are the richest source of "what have I been working on" for an engineer, and they already live on-disk in a parseable form.

What each source looks like

Claude Code (~/.claude/projects/<encoded-path>/*.jsonl)

  • One jsonl per session, named by session UUID.
  • Directory is projects/ with one subdir per cwd (the slashes are replaced with dashes).
  • Lines have type{user, assistant, attachment, file-history-snapshot, permission-mode, last-prompt} and UUIDs with parentUuid for threading.
  • Session metadata (cwd, gitBranch, version, userType, entrypoint) is discovered from any line that carries it.
  • User's install had 2.2 GB across 46 project directories at the time of implementation.

Codex (~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl)

  • One jsonl per session, named rollout-<iso-timestamp>-<uuid>.jsonl, organized by date.
  • First line is {type: "session_meta", payload: {id, cwd, originator, cli_version, model_provider, git: {commit_hash, branch, repository_url}, ...}}.
  • Remaining lines are {type: "response_item", payload: {type, ...}} where inner type is one of message | function_call | function_call_output | reasoning.
  • Reasoning items have encrypted_content (opaque); we record their existence via session stats but skip the payload.
  • User's install had 751 MB across 191 rollouts.

Schema decisions

Three streams each, following Claude Code's shape for consistency where possible:

Claude Code: sessions, messages, attachments. Codex: sessions, messages, function_calls.

Why not unify under one stream set? Because the semantics diverge:

  • Claude Code's attachments stream covers permission-mode changes, file-history snapshots, hook outputs. Codex doesn't have these as first-class items — the equivalent is tool calls, which are clearer as their own stream with paired arguments/output.
  • Codex's function_calls stream pairs function_call with function_call_output by call_id, producing one row per tool invocation. That's a better query surface than a flat attachment log.
  • If we ever want a cross-tool normalized view, it belongs in a downstream transformation, not in the connector schema. Keep platform-native.

Content cap: 5 KB

Originally 20 KB. Dropped because:

  • Long tool outputs dominated raw jsonl byte counts, but most are either redundant diffs or large command output that isn't useful at rest.
  • 5 KB preserves every prompt I looked at in test data with room to spare.
  • Halves DB size projections (~500 MB – 1.5 GB for Claude Code).

Downstream users who want full text can still go to the source files — mtime state lets them seek.

Incremental sync: file-mtime

No per-record cursor. Each connector remembers {file_path: mtime} in state; re-runs skip any file whose mtime is unchanged. Trade-offs:

  • ✅ Cheap — no per-line tracking, no need to de-dupe UUIDs.
  • ✅ Resilient to file-system tricks (copy, edit, truncate) — any of those change mtime.
  • ⚠️ If a file is appended to, we re-parse the whole file (mtime changed). We rely on downstream upsert-by-key to deduplicate. For Claude Code that's cheap because keys are UUIDs. For Codex the key is session_id + line_index, which is stable for the existing prefix as long as the session wasn't compacted. Good enough for append-only tools.

Runtime change: filesystem binding

The reference runtime's buildAvailableBindings() originally exposed only network + (optionally) interactive. File-based connectors require filesystem: {}. Added to the default bindings. No capability gate — the sandboxing story is "these connectors run in the user's trust domain because their data was already on the user's disk." This is orthogonal to how network-bound connectors constrain outbound calls; the Collection Profile spec should probably formalize that filesystem is an on-device-only binding with no remote analog.

Configuration tension (cross-reference)

Claude Code grew three env vars in one day (CLAUDE_CODE_PROJECTS_DIR, ..._INCLUDE, ..._EXCLUDE). This was the trigger for the open question documented in connector-configuration-open-question.md. Manifest-declared options would be a better home, but it's a spec-surface decision and shouldn't be unilaterally taken.

What we did NOT do (yet)

  • Unified "coding-agent" view across Claude Code + Codex. Deferred; would be a downstream transformation or view, not a connector.
  • Extract tool-use payloads from Claude Code's tool_use/tool_result blocks into first-class rows. Current code synthesizes [tool_use: name] markers in the content string. Would be a richer schema but requires more rationale per field.
  • Codex memories, prompts, rules, skills directories. ~/.codex/memories/, ~/.codex/prompts/, ~/.codex/rules/, ~/.codex/skills/ carry first-class data; currently ignored. Follow-up stream candidates.
  • Claude Code ~/.claude/skills/ or ~/.claude/memories/. Same — follow-up.
  • Encrypted reasoning decryption. Codex's reasoning items use a server-held key; we intentionally don't try.

Verification

After each run completes, query:

sqlite3 'file:packages/polyfill-connectors/.pdpp-data/polyfill.sqlite?mode=ro' \
  "SELECT connector_id, stream, COUNT(*) FROM records WHERE connector_id LIKE '%claude%' GROUP BY 1,2"

sqlite3 'file:packages/polyfill-connectors/.pdpp-data/codex.sqlite?mode=ro' \
  "SELECT connector_id, stream, COUNT(*) FROM records GROUP BY 1,2"