Open question: cross-connector identity graph

add-polyfill-connector-systemProject noteOpen questions
Created openspec/changes/add-polyfill-connector-system/design-notes/identity-graph-open-question.mdView on GitHub →

Status: open Raised: 2026-04-19 Trigger: Layer 2 audits found every connector has identity/social data, but each expresses it in connector-native terms — no cross-connector queryability. This may be the most ambitious proposal in the open-question set; flagging it now so we can decide whether to scope it in or consciously defer.

The shape of the data we have

Layer 2 coverage audits (see layer-2-coverage-gmail-ynab-usaa-github.md, layer-2-coverage-chatgpt-claude-codex.md) surfaced identity/social data in every connector inspected:

  • GitHub — followers, following, organization members, collaborators
  • Gmail — contacts (implicit from message headers: From/To/Cc)
  • Slack — workspace users, channel memberships, DM relationships
  • USAA — joint account holders
  • Notion — workspace members, shared page permissions
  • Instagram (meta) — followers, following, blocked users
  • ChatGPT — shared conversation recipients (if they accepted)

Today each connector declares its own users / members / contacts stream in manifest-native terms. A consumer asking "who do I interact with across all my data" must write cross-connector joins and dedup heuristics themselves.

Use cases this would unlock

  • "Show me everyone I've interacted with in the last 30 days across Gmail, Slack, and GitHub"
  • "Reconcile this email thread with this Slack conversation with this GitHub mention — same humans?"
  • "Deduplicate Jane Smith (jsmith@co.com) with Jane Smith (Slack U123) with Jane Smith (GitHub @janey)"
  • "Find the common contacts between two people for warm-intro matching"

This is plausibly the highest-value query surface PDPP enables that no single connector can deliver alone: unified relationship intelligence derived from the user's data exhaust.

What a primitive might look like

A. Manifest convention: every identity-bearing connector declares a contacts stream with normalized fields

Connectors with identity data emit an identity / contacts stream with a shared field set: id, display_name, email, handle, source_connector, source_id, plus relationship_type (member, follower, contact, etc.). Consumers run dedup on email + handle heuristics. Spec effort is small; intelligence stays consumer-side.

B. A cross-connector index maintained by the RS

The RS computes an identity_graph virtual stream from all declared contacts-style streams. Consumer queries it once; RS does dedup and returns unified entities with per-source provenance.

C. A new top-level concept: "entity streams"

Beyond "streams within a connector," the spec allows cross-connector entity streams with their own identity-resolution rules, consent semantics, and query API. Most invasive; most powerful.

D. Punt — keep identity local per connector

Document the pattern, let consumers build the graph if they need it. No spec change.

Technical complications

  • Identity resolution is hard. Same person can be alex@example.com, alex.rivera@personal.example, arivera on GitHub, U01234567AB on Slack. No canonical cross-internet handle exists (ATProto DIDs, Keybase proofs are promising but not universal).
  • Consent surface grows. "This grant includes identity-graph access" has to be expressible separately from per-connector consent — the graph can reveal things individual streams don't (meta-information attack surface).
  • Privacy asymmetry. The people in your graph didn't consent to being aggregated across your connectors. A contacts stream already has this property; a cross-connector graph amplifies it.
  • Failure modes of dedup. False merges ("two Jane Smiths become one") are worse than false splits in most use cases, but the spec would have to take a position on precision-vs-recall defaults.

Connection to existing open questions

  • layer-2-completeness-open-question.md — identity data is systematically undercovered in today's manifests; if Layer 2 forces us to declare coverage, identity is where the gap is largest.
  • Authored-artifacts-vs-activity split — most identity data is activity-side (headers, memberships), but "custom nicknames" or "VIP markings" are authored and should ride with the owner.
  • slackdump-design-gaps.md Gap 7 (multi-instance providers) — the same person's Gmail account appearing in two workspaces is already a dedup problem at the connector level; identity graph generalizes it.

Action items

  • Decide whether identity is a spec concern or purely a consumer concern.
  • If spec concern: pick A / B / C.
  • If chosen, define the normalized identity field set and its consent semantics.
  • Revisit after Layer 2 audits complete — coverage variance may force our hand.