Open question: agent-generated custom connectors with PDPP conformance
Status: open
Raised: 2026-04-23
Trigger: Tim, after reviewing Kahtaf's stagehand-connectors and OpenSteer, asked: could a user hand generic connector tools to an agent, let it generate a user-specific schema and extractor, and still get enough PDPP conformance to support incremental collection plus generic downstream browsing/search/filter/page semantics?
Why this matters
If the answer is "yes, with constraints," PDPP could support a new lane between:
- fully hand-authored reusable connectors, and
- ad hoc one-off scripts that are useful only to their author.
That lane would be:
- private
- user-specific
- agent-assisted
- strong on protocol conformance
- weaker on semantic guarantees
This is attractive because the long tail of personal data sources is too large for hand-authored reviewed connectors alone.
The key distinction: there are three different guarantees
1. Protocol conformance
Can the artifact behave like a PDPP connector at all?
This is the strongest guarantee and the one PDPP can realistically make hard:
- valid manifest / stream metadata
- valid schemas
- stable record identity
- valid STATE / checkpoint behavior
- valid DONE / partial-run / failure signaling
- no unauthorized fields or secret leakage
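A minimal sketch of what a wire-conformance check could look like. PDPP's actual wire format is not pinned down in this note, so this assumes a Singer-like JSON-lines stream of RECORD / STATE / DONE messages; every field name here is illustrative, not the real protocol.

```python
import json

# Hypothetical message shapes, assuming a Singer-like JSON-lines wire format.
ALLOWED_TYPES = {"RECORD", "STATE", "DONE"}
REQUIRED_FIELDS = {
    "RECORD": {"stream", "record"},
    "STATE": {"checkpoint"},
    "DONE": {"status"},  # e.g. "complete" vs "partial"
}

def check_message(line: str) -> list[str]:
    """Return a list of conformance problems for one wire message."""
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    mtype = msg.get("type")
    if mtype not in ALLOWED_TYPES:
        return [f"unknown message type: {mtype!r}"]
    problems = []
    missing = REQUIRED_FIELDS[mtype] - msg.keys()
    if missing:
        problems.append(f"{mtype} missing fields: {sorted(missing)}")
    # "No unauthorized fields" means anything outside the declared shape fails.
    extra = msg.keys() - REQUIRED_FIELDS[mtype] - {"type"}
    if extra:
        problems.append(f"{mtype} has unauthorized fields: {sorted(extra)}")
    return problems
```

The point of the sketch: this entire category of check is mechanical and source-agnostic, which is why it is the guarantee PDPP can make hard.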
2. Incremental collection correctness
Can later runs continue honestly rather than re-scraping blindly or silently missing changes?
This is harder, but still testable:
- stable cursor or bookmark semantics
- explicit append-only vs mutable-state behavior
- deletes / tombstones if claimed
- deterministic behavior under replay from a prior checkpoint
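The last bullet, determinism under replay, can be sketched as a black-box test: run the connector twice from the same checkpoint and require identical records in the same order. The `connector` callable and the integer cursor are assumptions for illustration, not PDPP API.

```python
def records_from(connector, checkpoint):
    """Collect (stream, id) pairs emitted when resuming from `checkpoint`."""
    return [(r["stream"], r["id"]) for r in connector(checkpoint)]

def replay_is_deterministic(connector, checkpoint) -> bool:
    # Two runs from the same prior checkpoint must agree exactly.
    return records_from(connector, checkpoint) == records_from(connector, checkpoint)

# Append-only fake source with a numeric cursor, for demonstration only.
def fake_connector(checkpoint):
    data = [{"stream": "posts", "id": i} for i in range(5)]
    return [r for r in data if r["id"] > checkpoint]
```

A connector that re-scrapes blindly or depends on wall-clock state would fail this check even while passing all wire-level validation.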
3. Semantic extraction correctness
Did the agent actually understand the site well enough to build the right schema and fill it correctly?
This is the weakest guarantee. It can be evaluated, sampled, and monitored, but not "proven" in the same way protocol conformance can be.
The important consequence: PDPP should not treat these three guarantees as one thing.
What the external artifacts suggest
OpenSteer: discovery-first browser automation yields durable local artifacts
OpenSteer's model is "discover, persist useful descriptors, then codify into plain TypeScript." It persists:
- browser workspaces
- saved extraction descriptors
- saved DOM targets / persist keys
- request plans
- captured network traffic
That is highly relevant because it shows a plausible artifact bundle for an agent-assisted connector authoring workflow. What it does not provide is a portable data contract, incremental sync guarantee, or PDPP-style conformance harness for personal-data records.
What is additionally useful:
- it exposes the same core semantics through CLI, SDK, protocol, cloud API, and agent skills
- it ships a dedicated conformance package for that stable automation surface
So OpenSteer is strongest as prior art for:
- multi-surface agent tooling over one core runtime
- persisted authoring artifacts
- black-box conformance cases for a stable public surface
It is still stronger on exploration and authoring than on data-contract truthfulness or incremental collection semantics.
stagehand-connectors: useful verifier pattern, weak conformance story
Kahtaf's repo is the strongest nearby precedent for "agent handles auth and navigation; deterministic code handles extraction." The most useful ideas are:
- per-scope typed outputs
- a two-gate _verify-login hook:
  - preflight can skip warm-session login
  - post-login is a hard correctness gate before extraction starts
- coarse capability composition between site skills
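The two-gate pattern can be sketched as follows. The function and session shapes are hypothetical stand-ins, not the stagehand-connectors API; the structure is what matters: a cheap preflight that may skip login, and a hard post-login gate that blocks extraction on failure.

```python
class LoginNotVerified(Exception):
    pass

def run_extraction(session, verify_login, do_login, extract):
    # Gate 1 (preflight): a warm session may skip the login flow entirely.
    if not verify_login(session):
        do_login(session)
        # Gate 2 (post-login): hard correctness gate; never extract on failure.
        if not verify_login(session):
            raise LoginNotVerified("post-login verification failed")
    return extract(session)
```

The asymmetry is deliberate: gate 1 is an optimization, gate 2 is a correctness guarantee.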
What it does not currently provide:
- canonical exported schemas/versioning
- resumable incremental checkpoints
- a live-site conformance suite
- a stable public connector contract beyond its own local JSON envelope
That makes it a strong pattern for private extraction workflows, but not yet, on its own, for "this is a trustworthy PDPP connector."
Singer / Meltano, Airbyte, Fivetran: protocol/runtime guarantees can be strong while semantic guarantees stay weaker
These systems converge on the same lesson:
- the runtime can strongly enforce message shape, schemas, state, and sync boundaries
- the platform can help with connector authoring
- but the semantic correctness of a custom connector still depends on the connector logic and source understanding
This is the clearest precedent for a PDPP answer that says:
private generated connectors may be fully protocol-conformant without being semantically equivalent to reviewed published connectors
What PDPP could plausibly support
Tier A — private generated connector
Generated by the user or their agent for one owner's use.
Properties:
- custom schema is allowed
- generic consumers can still inspect, page, filter, and search it if metadata is truthful
- no assumption of cross-user interoperability
- trust level is lower than a reviewed reusable connector
This is the easiest tier to justify first.
Tier B — published reusable connector
Reusable by other owners and clients.
Additional bar:
- stable schema and versioning
- stronger completeness / coverage claims
- broader conformance evidence
- better expectations around upgrade and compatibility
PDPP should not assume a private generated connector automatically graduates to this tier.
Minimum artifact bundle for a private generated connector
At minimum, the generated artifact likely needs:
- A manifest-like declaration:
  - connector id / provenance
  - stream list
  - schema per stream
  - auth / options shape if applicable
- Explicit record identity rules:
  - primary key / stable id strategy
  - whether keys are source-native or synthesized
- Explicit incremental semantics:
  - append-only vs mutable-state
  - checkpoint / cursor shape
  - delete / tombstone semantics if any
- Extraction logic or replay plan:
  - generated code
  - saved request plans
  - saved descriptors
  - or some other durable, inspectable artifact
- Verifier hooks:
  - auth/session verifier
  - optionally coverage / completeness checks
- Evidence:
  - fixtures, captured traces, sampled outputs, or replayable source material
  - enough to rerun the conformance harness and inspect failures
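One hypothetical concrete shape for that bundle, to make the list above tangible. Every field name and value here is illustrative; PDPP does not yet define this format.

```python
# Hypothetical minimum artifact bundle for a private generated connector.
bundle = {
    "manifest": {
        "connector_id": "me.example.bookstore-orders",
        "provenance": {"generated_by": "agent", "owner_scoped": True},
        "streams": [{
            "name": "orders",
            "schema": {"type": "object", "properties": {
                "id": {"type": "string"},
                "placed_at": {"type": "string", "format": "date-time"},
            }},
        }],
        "auth": {"kind": "browser-session"},
    },
    # Record identity: declared explicitly, never inferred by consumers.
    "identity": {"orders": {"primary_key": ["id"], "key_origin": "source-native"}},
    # Incremental semantics: append-only with a cursor field, no deletes claimed.
    "incremental": {"orders": {"mode": "append-only", "cursor": "placed_at", "deletes": "none"}},
    "extraction": {"kind": "generated-code", "entrypoint": "extract.py"},
    "verifiers": ["auth-session"],
    "evidence": {"fixtures": ["fixtures/orders.json"], "traces": []},
}
```

Even as a sketch, this shows the separation the note argues for: identity and incremental semantics are declared claims the harness can test, while the extraction artifact remains inspectable but unproven.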
What the conformance harness would need to grow
The existing connector-system work already points at the right seams:
- Layer 1 manifest-vs-data correctness
- Layer 2 manifest-vs-source coverage / completeness
- partial-run honesty
- configuration schema
For generated private connectors, the harness likely needs explicit categories:
Harness category 1 — wire / runtime conformance
- valid protocol messages
- valid manifests / schemas
- no forbidden fields
Harness category 2 — incremental semantics
- replay from prior state
- append behavior
- mutable-state behavior if claimed
- delete handling if claimed
Harness category 3 — verifier correctness
- auth/session verifier is present if needed
- extraction does not proceed after verifier failure
Harness category 4 — completeness honesty
Not "is this source perfectly covered?" but:
- does the connector declare itself as curated subset vs broader coverage?
- are obvious omissions named honestly?
Without this, a generated connector can be wire-valid while silently under-collecting.
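A minimal sketch of how the four categories could be driven as separate gates rather than one pass/fail verdict, which is the structural point: a connector can pass category 1 while failing category 4. Check callables are placeholders.

```python
def run_harness(checks: dict[str, list]) -> dict[str, bool]:
    """Run named categories of zero-arg checks; report per-category results.

    Keeping categories separate lets a report say "wire-valid but not
    incrementally correct" instead of collapsing everything into one bit.
    """
    return {category: all(check() for check in fns)
            for category, fns in checks.items()}
```

For example, `run_harness({"wire": [...], "incremental": [...], "verifier": [...], "completeness": [...]})` would return a per-category map a trust ladder could key off.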
What generic downstream consumers could rely on
Even with custom per-user schemas, a generic human or agent consumer could still:
- list streams
- inspect schemas
- page records
- filter declared fields
- use lexical retrieval over declared searchable fields
What they could not safely assume:
- another user's generated connector exposes the same schema
- fields with similar names carry the same semantics
- "complete coverage" unless the connector claims it explicitly
So the value proposition is real, but it is not "automatic interoperability." It is "generic operability over honest custom data."
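The "generic operability" claim above can be sketched directly: each operation needs only declared metadata, never source-specific knowledge. The record shape and the `searchable_fields` marker are assumptions for illustration.

```python
def page(records, offset=0, limit=50):
    # Paging needs no schema knowledge at all.
    return records[offset:offset + limit]

def filter_eq(records, field, value):
    # Filtering needs only a declared field name.
    return [r for r in records if r.get(field) == value]

def lexical_search(records, searchable_fields, query):
    # Lexical retrieval needs only the fields declared searchable.
    q = query.lower()
    return [r for r in records
            if any(q in str(r.get(f, "")).lower() for f in searchable_fields)]
```

None of these operations assume the schema means anything in particular, which is exactly the boundary between operability and interoperability.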
Candidate position
The most plausible first stance is:
- PDPP supports private generated connectors as a valid class of connector artifact.
- PDPP distinguishes:
- protocol-conformant
- incrementally-correct
- semantically-reviewed
- Private generated connectors are owner-scoped by default and do not inherit the trust posture of reviewed reusable connectors.
- The first exploration should focus on append-only or snapshot-like streams, then graduate to stronger mutable-state semantics later.
Open questions
- Should PDPP name a formal trust/status ladder for connectors: generated/private, reviewed/private, published/reusable, etc.?
- Should generated custom schemas live under an explicit namespace or marker so downstream clients can distinguish them from reviewed stable schemas?
- What minimum evidence bundle is required before a generated connector may claim incremental correctness?
- How should coverage/completeness claims be expressed for generated connectors: silence, a simple curated-subset marker, or a fuller coverage-statement model?
- Should the "agent-generated" part be standardized at all, or should PDPP only standardize the artifact and harness requirements?
Low-regret exploration path
- Prototype one private generated connector lane only.
- Restrict first experiments to:
- append-only or snapshot-like streams
- explicit verifier hooks
- mandatory evidence capture
- Reuse the existing conformance harness as much as possible.
- Evaluate whether generic consumers can successfully browse/query/search the resulting custom data without source-specific knowledge.
- Only later ask whether any of this should become publishable or standardized beyond the private lane.