ChatGPT connector — design notes


Status: design captured 2026-04-19 overnight. Source: ChatGPT backend-api audit subagent 2026-04-19; prior art at ~/code/data-connectors/openai/chatgpt-playwright.js.

Auth

  • Shared Playwright persistent profile. Cookie-driven session; user logged in during bootstrap.
  • Extract bearer token at run start from #client-bootstrap JSON in page DOM.
  • Extract device ID from oai-did cookie.
  • Token lifetime ~30 days (verify via JWT exp claim). If expiry < now+5min, stop and emit INTERACTION kind=manual_action with a link to chatgpt.com.
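A minimal sketch of the expiry guard above. The function name is hypothetical; it assumes the bearer token is a standard JWT and decodes the payload without signature verification, since only the exp claim is needed:

```javascript
// Decode the JWT payload (middle segment) and check whether the token
// expires within `marginSeconds` (default 5 minutes, per the policy above).
function tokenExpiresSoon(jwt, marginSeconds = 300, now = Date.now()) {
  const payload = JSON.parse(
    Buffer.from(jwt.split(".")[1], "base64url").toString("utf8")
  );
  // exp is seconds since epoch; require at least marginSeconds of life left.
  return payload.exp * 1000 < now + marginSeconds * 1000;
}
```

If this returns true, the run stops and emits the INTERACTION event rather than proceeding with a token that may die mid-sync.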

Critical: TLS fingerprint preservation

All fetches to /backend-api/ MUST go through page.evaluate(fetch) inside the browser context. Node.js fetch presents a non-browser TLS fingerprint and gets 403'd by Cloudflare. Non-negotiable.
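A hedged sketch of the browser-context wrapper. Only page.evaluate and the browser's own fetch are load-bearing here; the function name, header names, and error shape are assumptions, not verified against the prior art:

```javascript
// `page` is a Playwright Page from the shared persistent profile;
// `bearer` and `deviceId` come from the bootstrap extraction above.
// The fetch runs inside the browser, so Cloudflare sees the browser's
// TLS fingerprint and the session cookies ride along automatically.
async function backendFetch(page, path, bearer, deviceId) {
  return page.evaluate(
    async ({ path, bearer, deviceId }) => {
      const res = await fetch(`https://chatgpt.com/backend-api${path}`, {
        headers: {
          Authorization: `Bearer ${bearer}`,
          "Oai-Device-Id": deviceId, // header name is an assumption
        },
        credentials: "include", // send session cookies alongside the token
      });
      if (!res.ok) throw new Error(`http ${res.status}`);
      return res.json();
    },
    { path, bearer, deviceId }
  );
}
```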

Streams

conversations (mutable_state, primary_key ["id"], consent_time_field "create_time")

  • id (UUID)
  • title
  • create_time (ISO 8601)
  • update_time (ISO 8601; cursor field)
  • is_archived (boolean)
  • is_starred (boolean)
  • workspace_id (nullable; enterprise)
  • current_node (the "tip" of the conversation tree)
  • message_count_on_current_branch
  • gizmo_id (nullable, foreign key to gizmos)

messages (append_only, primary_key ["id"], consent_time_field "create_time")

Note: semantics = append_only. A message node in ChatGPT's tree doesn't mutate once written; new user prompts create new children. However, the path from root to current_node can change if the user regenerates. For v1 we emit only messages on the current branch.

  • id (message node UUID)
  • conversation_id (foreign key)
  • parent_id (nullable; allows tree reconstruction even though we emit only current branch)
  • children_ids (array; informational — which siblings exist on alt branches)
  • role (user / assistant / system / tool)
  • content (text; joined from content.parts[])
  • content_type (text / multimodal_text)
  • model_slug (nullable; e.g. gpt-4o, gpt-5.4)
  • create_time (ISO 8601)
  • finish_reason (nullable; stop / tool_calls / length)
  • citations (array of {url, index_start, index_end, text}; nullable)
  • tool_calls (array of {name, arguments}; nullable)
  • attachment_ids (array of file IDs; nullable)
  • on_current_branch (boolean; true for all v1-emitted messages)

memories (mutable_state, primary_key ["id"], consent_time_field "created_at")

  • id
  • content
  • created_at
  • updated_at
  • type (default "memory")

Gotcha: the API doesn't emit deletion events. The connector compares against previous-run state to detect deleted memories and emits them as tombstones.

gizmos (mutable_state, primary_key ["id"], consent_time_field "created_at")

  • id
  • name
  • description
  • access (private / shared_link / public)
  • instructions (system prompt)
  • tools (array of tool specs)
  • created_at, updated_at

Captures custom GPTs the user has created.

files (append_only, primary_key ["id"], consent_time_field "created_at")

  • id
  • name
  • size
  • mime_type
  • created_at
  • origin_conversation_id (nullable)
  • origin_message_id (nullable)

Metadata only. Byte download deferred.

models (mutable_state, primary_key ["id"])

  • id (model slug)
  • name (display)
  • type (chat / gpt4 / custom_gpt)
  • available_to_tier
  • context_window
  • knowledge_cutoff (nullable)

Reference data. Refreshed once per day.

Tree flattening policy (autonomous 2026-04-19)

Emit only messages on the current branch (root → current_node path per conversation.mapping). This matches the prior art and the user's visible UI. Store parent_id and children_ids so a consumer can reconstruct the full tree if they ever want to. If audit later shows alt branches matter, flip to emitting all nodes and use on_current_branch flag to distinguish. Easy reversal.
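The branch walk is a parent-pointer climb over conversation.mapping. A sketch, assuming mapping is keyed by node ID with a parent field (the function name is hypothetical):

```javascript
// Walk from current_node up to the root via parent pointers, then
// reverse so messages are emitted in root -> current_node order.
function currentBranch(mapping, currentNode) {
  const path = [];
  for (let id = currentNode; id; id = mapping[id].parent) {
    path.push(id);
  }
  return path.reverse();
}
```

Nodes not on the returned path are the alt-branch siblings; their IDs still surface via children_ids on emitted messages.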

Incremental sync

  • Cursor: last_seen_update_time, tracked per conversation; the newest value across all conversations drives the listing stop condition in step 1.
  • Flow:
    1. GET /backend-api/conversations?offset=0&limit=100&order=updated — iterate until the first conversation with update_time <= last_seen_update_time.
    2. For each new/updated conversation, GET /backend-api/conversation/{id} and walk the tree.
    3. Update per-conversation cursor: { conversation_id: { update_time, current_node } }.
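The stop condition in step 1 can be isolated as a pure function. A sketch, assuming the listing returns items newest-first by update_time (ISO 8601 strings, so lexicographic comparison works); the caller also stops paging when a response comes back shorter than limit:

```javascript
// Given one page of conversations and the stored cursor, return the
// ones that need a detail fetch and whether pagination can stop.
function selectUpdated(pageItems, lastSeenUpdateTime) {
  const updated = [];
  for (const conv of pageItems) {
    // Listing is newest-first, so the first stale item ends the scan.
    if (conv.update_time <= lastSeenUpdateTime) {
      return { updated, done: true };
    }
    updated.push(conv);
  }
  return { updated, done: pageItems.length === 0 };
}
```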

Rate limiting

  • 5 concurrent conversation detail fetches via Promise.all.
  • 200 ms delay between batches.
  • 30 s timeout per detail fetch.
  • On 429 or Cloudflare challenge: exponential backoff (2, 4, 8 s, max 16 s), abort conversation after 3 retries.
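The policy above as a sketch. fetchDetail stands in for the page.evaluate-based detail fetch; the names are hypothetical, and the 30 s per-fetch timeout is left to the fetch layer:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Exponential backoff on failure: 2 s, 4 s, 8 s (cap 16 s), then abort
// this conversation after 3 retries by rethrowing.
async function fetchWithBackoff(fetchDetail, id) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fetchDetail(id);
    } catch (err) {
      if (attempt >= 3) throw err;
      await sleep(Math.min(2000 * 2 ** attempt, 16000));
    }
  }
}

// 5 concurrent detail fetches via Promise.all, 200 ms between batches.
async function fetchAllDetails(fetchDetail, ids, batchSize = 5) {
  const results = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    const batch = ids.slice(i, i + batchSize);
    results.push(
      ...(await Promise.all(batch.map((id) => fetchWithBackoff(fetchDetail, id))))
    );
    if (i + batchSize < ids.length) await sleep(200); // inter-batch delay
  }
  return results;
}
```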

Memory tombstone strategy

Memories don't have a delete event. Connector reads prior-run memory IDs from state, computes diff, emits tombstones for IDs no longer present.
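The diff itself is a set subtraction. A sketch (the tombstone record shape here is illustrative, not the spine's actual schema):

```javascript
// IDs present in the previous run but missing now become tombstones.
function memoryTombstones(previousIds, currentIds) {
  const current = new Set(currentIds);
  return previousIds
    .filter((id) => !current.has(id))
    .map((id) => ({ id, deleted: true }));
}
```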

Explicit non-goals v1

  • File byte downloads (metadata only).
  • Generated image bytes.
  • Real-time streaming events (only post-completion state).
  • Custom instructions endpoint (unverified; punt to v2 if endpoint confirmed).
  • Workspace member lists (enterprise-only).
  • Usage/billing data.
  • Plugin/action execution logs.

Risks / open questions

  • [?] Exact endpoint for custom instructions — unconfirmed. May require DOM scraping /settings.
  • [?] Archived conversations: does /conversations default include them or do we need ?is_archived=true?
  • [?] Shared conversations: does the API expose share metadata or only the content?
  • [?] Token refresh — does cookie-driven refresh work silently or does expiry always force re-auth?

Known gaps awaiting the partial-run-semantics mechanism

As of 2026-04-20, spine_events contains 4,188 run.stream_skipped events for this connector, each naming a specific conversation that 429'd. Example:

{ "stream": "messages", "reason": "http_error",
  "message": "conversation 69d71fbf-0ba8-8327-88c7-3ed3ec02058c http 429" }

These are Category 1 (transient upstream failure — see gap-recovery-execution-open-question.md). They are retriable as-is, with no connector change, given (a) a mechanism that remembers the gap across runs and (b) an invocation path that passes the 4,188 IDs back to the connector as a pre-filter.

Neither exists yet. The data is durably logged; the retry is not yet implementable at protocol level. Resolving the three-part open question (partial-run-semantics + cursor-finality-and-gap-awareness + gap-recovery-execution) is what unblocks recovery here.

What is deliberately not being done as a workaround:

  • No one-shot recovery script for the 4,188 IDs. Such a script would become a workaround masquerading as a solution and would have to be removed once the protocol mechanism lands.
  • No connector-internal retry loop. Retrying within a single run would re-hit the same rate limit and doesn't address the broader mechanism question.

The ChatGPT connector does support the scope.streams[].resources filter today, but only as a post-filter (it fetches everything, then drops non-matches). For the recovery mechanism to be economically viable on 4,188 IDs, the connector would need to fetch via /backend-api/conversation/{id} directly (a pre-filter path). This is a future connector change gated on the protocol decision.

Testing tonight will answer these empirically; answers get folded back here.