lexical-retrieval Specification
Purpose
Define PDPP's optional lexical retrieval extension: a discoverable, grant-safe, text-query search surface at GET /v1/search with stream-declared searchable fields and portable result-shape guarantees.
Requirements
Requirement: Lexical retrieval is an optional, advertised, named extension
PDPP SHALL define a named optional extension lexical-retrieval that implementations MAY expose. The extension SHALL NOT be assumed by clients to exist on any server unless the server explicitly advertises it via the resource-server metadata surface defined below. Core PDPP SHALL NOT require this extension. The extension SHALL NOT be exposed silently as ambient reference behavior.
Scenario: A client encounters a server that does not advertise the extension
- WHEN a client reads resource-server metadata and
capabilities.lexical_retrieval.supportedis absent orfalse - THEN the client SHALL NOT assume
GET /v1/searchis available - AND the server MAY return
404ornot_found_errorif the endpoint is requested
Scenario: A client encounters a server that advertises the extension
- WHEN resource-server metadata reports
capabilities.lexical_retrieval.supported: true - THEN the client MAY rely on
GET /v1/searchbeing available at the advertisedendpointpath - AND the client MAY rely on the
cross_stream,snippets,default_limit, andmax_limitfields when shaping requests
Scenario: The extension is not silently delivered through /_ref/search
- WHEN an implementation chooses to expose this extension
- THEN the public surface SHALL be the advertised
/v1/searchendpoint - AND the implementation SHALL NOT advertise
/_ref/search(or any other reference-only surface) as the public lexical retrieval endpoint - AND
/_ref/searchSHALL remain reference-only artifact/id-jump behavior with no public interoperability claim
Requirement: The extension SHALL expose GET /v1/search with a constrained query surface
When advertised, the extension SHALL be reachable as GET /v1/search. The endpoint SHALL accept a required q parameter and the optional parameters limit, cursor, repeated streams[], and stream-scoped filter[...] parameters. In this tranche, any request that includes filter[...] SHALL include exactly one streams[] value. The endpoint SHALL NOT accept expansion parameters, sort parameters, semantic/vector parameters, ranking parameters, connector-specific parameters, generic predicate DSL parameters, or arbitrary field filters outside the stream-scoped filter rules below.
filter[field]=value SHALL use the same exact-filter semantics as record listing for the named stream: the field SHALL be an authorized top-level scalar schema field for the caller and stream. filter[field][gte|gt|lte|lt]=value SHALL use the same declared range-filter semantics as record listing: the field and operator SHALL be declared in the stream metadata's query.range_filters. Filters SHALL constrain the candidate records that may contribute lexical matches, ranking, matched fields, and snippets.
Scenario: A request omits q
- WHEN a client calls
GET /v1/searchwithoutq - THEN the server SHALL return an
invalid_request_error - AND the response SHALL NOT include any candidate results
Scenario: A request includes only allowed unfiltered parameters
- WHEN a client calls
GET /v1/search?q=overdraft&limit=10&streams[]=messages - THEN the server SHALL accept the request
Scenario: A request includes an allowed single-stream filter
- WHEN a client calls
GET /v1/search?q=invoice&streams[]=messages&filter[received_at][gte]=2026-04-01T00:00:00Z - AND stream
messagesdeclaresquery.range_filters.received_atwith operatorgte - AND the caller is authorized to read
received_at - THEN the server SHALL accept the request
- AND every returned result SHALL identify a record whose visible
received_atsatisfies the filter
Scenario: A filtered request omits streams
- WHEN a client calls
GET /v1/search?q=invoice&filter[received_at][gte]=2026-04-01T00:00:00Z - THEN the server SHALL return an
invalid_request_error - AND the server SHALL NOT search every stream and apply the filter opportunistically
Scenario: A filtered request names multiple streams
- WHEN a client calls
GET /v1/search?q=invoice&streams[]=messages&streams[]=attachments&filter[received_at][gte]=2026-04-01T00:00:00Z - THEN the server SHALL return an
invalid_request_error - AND the server SHALL NOT silently apply the filter to only one of the streams
Scenario: A request includes an undeclared range filter
- WHEN a client calls
GET /v1/search?q=invoice&streams[]=messages&filter[size_bytes][gte]=1000 - AND stream
messagesdoes not declarequery.range_filters.size_bytes.gte - THEN the server SHALL return an
invalid_request_errororpermission_errorconsistent with record-list filter validation - AND the response SHALL NOT include partial results
Scenario: A request includes a disallowed v1 parameter
- WHEN a client calls
GET /v1/search?q=overdraft&rank=recency - THEN the server SHALL return an
invalid_request_error - AND the error SHALL identify the rejected parameter
Scenario: A client-token request names a stream the caller is not authorized to read
- WHEN a client-token caller calls
GET /v1/search?q=overdraft&streams[]=private_journaland the grant does not includeprivate_journal - THEN the server SHALL return a
permission_errorwith codegrant_stream_not_allowed - AND the unauthorized stream SHALL NOT contribute hits to any other request shape
Scenario: Cross-stream search when the server does not support it
- WHEN a client calls
GET /v1/search?q=overdraft(nostreams[]) on a server whose advertisement reportscross_stream: false - THEN the server SHALL return an
invalid_request_errorrequiring at least onestreams[]value
Requirement: The extension SHALL return candidate references, not hydrated records
GET /v1/search SHALL return a list envelope whose data[] entries are search_result objects. Each search_result SHALL identify a candidate record by stream, record_key, and the originating connector via connector_id. Each search_result SHALL NOT include the full record payload. A portable numeric relevance score SHALL NOT be exposed in v1. The record_url field is OPTIONAL: implementations MAY include it to give clients a ready-made canonical single-record read URL, and MAY omit it without changing the rest of the response shape.
connector_id is the identifier of the connector whose records contributed the hit. It is required on every result so that callers can hydrate each candidate against the correct per-connector scope. For client-token callers, connector_id mirrors the connector identity already encoded in the caller's grant for that stream. For owner-token callers, connector_id identifies which owner-visible connector the hit came from, since owner reads of records and stream metadata are scoped per connector on the resource server.
Scenario: A successful search returns candidate references
- WHEN the server returns matching results for a search query
- THEN each entry in
data[]SHALL haveobject: "search_result" - AND each entry SHALL include
stream,record_key,emitted_at, andconnector_id - AND no entry SHALL include a portable numeric relevance score field
Scenario: record_url is optional
- WHEN an implementation chooses not to emit
record_urlon a result - THEN the result SHALL still be valid as long as
stream,record_key,emitted_at, andconnector_idare present - AND the client SHALL be able to reconstruct the canonical single-record read URL from
stream,record_key, and (for owner-token callers)connector_idusing the existing record-listing convention
Scenario: record_url, when present, points to the canonical single-record read endpoint
- WHEN an implementation emits
record_urlon a result - THEN that URL SHALL resolve to the canonical
GET /v1/streams/{stream}/records/{record_key}endpoint for the same stream andrecord_key - AND when the caller is an owner-token caller and the resource server scopes owner record reads per connector, the URL SHALL include the canonical owner-mode
connector_idquery parameter for that connector - AND the URL SHALL NOT point to a different retrieval surface
Scenario: Matched fields list which declared searchable fields matched
- WHEN a result is returned for a stream whose declared
lexical_fieldsare["text", "subject"]and the caller's grant authorizes both - THEN the result's
matched_fieldsSHALL be a non-empty subset of["text", "subject"] - AND
matched_fieldsSHALL NOT include any field outside the declared searchable set
Scenario: Snippets are optional and grant-safe
- WHEN a server includes a
snippeton a result - THEN the snippet SHALL reference one entry from
matched_fields - AND the snippet text SHALL contain only substrings drawn from fields the caller is authorized to read AND that the stream has declared searchable
- AND the server MAY omit the
snippetfor any individual result without changing the rest of the response shape
Requirement: The extension SHALL enforce grant safety on every search
The extension SHALL search only over (stream, field) pairs where the stream is in the caller's grant, the field is readable under the grant's effective field projection for that stream, and the stream has declared the field in query.search.lexical_fields. Fields outside that intersection SHALL NOT contribute to matching, ranking, or snippets. Implementations SHALL NOT search over unauthorized fields and filter results afterward.
Scenario: A field is searchable but not authorized
- WHEN stream
messagesdeclareslexical_fields: ["text", "subject"]and the caller's grant authorizes onlytext - THEN matches SHALL be limited to the
textfield - AND snippets SHALL NOT include text drawn from
subject - AND
matched_fieldsSHALL NOT includesubject
Scenario: A field is authorized but not declared searchable
- WHEN stream
messagesdeclareslexical_fields: ["text"]and the grant authorizes bothtextandsubject - THEN the search SHALL NOT match the
subjectfield - AND
matched_fieldsSHALL NOT includesubject
Scenario: A stream contributes no searchable+authorized fields
- WHEN a stream is in the grant but has zero
lexical_fieldsdeclared, OR all declaredlexical_fieldsare outside the grant projection - THEN that stream SHALL contribute zero hits
- AND the response SHALL NOT signal a per-stream error for this case
Scenario: Filter-later enforcement is prohibited
- WHEN an implementation cannot compute matches without first matching against fields outside the searchable+authorized intersection
- THEN the implementation SHALL restructure its search path so unauthorized fields are not considered
- AND the implementation SHALL NOT post-filter unauthorized hits out of the result list as its enforcement strategy
Requirement: Owner-token callers SHALL search across all owner-visible connectors with no public connector-scope parameter
When the caller is an owner-token caller (the resource owner performing self-export rather than a grant-bound third-party client), GET /v1/search SHALL search across every connector the owner can read on this resource server. The endpoint SHALL NOT expose a public connector_id query parameter for owner callers in v1; the search request shape is identical for owner-token and client-token callers. Each search_result SHALL identify the originating connector via connector_id so that callers can hydrate each hit against the correct per-connector owner read scope.
The grant-safety, declared-searchable-field, and snippet-safety invariants apply identically: for each owner-visible connector, the server SHALL search only over (stream, field) pairs the owner can read AND that the connector's stream has declared in query.search.lexical_fields. Connectors with zero searchable streams contribute zero hits.
Scenario: Owner-token caller searches without naming a connector
- WHEN an owner-token caller calls
GET /v1/search?q=overdraftwithoutstreams[] - THEN the server SHALL search across every connector the owner can read on this resource server
- AND SHALL NOT require a
connector_idquery parameter
Scenario: Owner-token caller narrows by stream
- WHEN an owner-token caller calls
GET /v1/search?q=overdraft&streams[]=transactions - THEN the server SHALL search the
transactionsstream of every owner-visible connector that exposes that stream and declares searchable fields on it - AND SHALL NOT silently scope the search to a single connector
Scenario: Owner-token search results identify the originating connector
- WHEN an owner-token caller receives a
search_resultfor a hit from connectorCand streamS - THEN the
search_resultSHALL includeconnector_id: "C"andstream: "S" - AND the caller SHALL be able to use that
connector_idto hydrate the record through the owner-mode single-record read endpoint
Scenario: A connector_id query parameter is rejected on the public surface
- WHEN any caller passes
connector_id=...toGET /v1/searchin v1 - THEN the server SHALL return an
invalid_request_error - AND the parameter SHALL NOT be silently honored
Requirement: Streams that participate in lexical retrieval SHALL declare searchable fields in their stream metadata
A stream that participates in lexical retrieval SHALL declare its searchable fields under query.search.lexical_fields in its existing per-stream metadata. The declaration SHALL accept only top-level scalar string fields defined by the stream's schema in v1. Nested paths, arrays, blob content, and connector-specific search semantics SHALL NOT be expressible through lexical_fields in v1. A stream that does not participate in lexical retrieval SHALL omit query.search entirely.
Scenario: A participating stream emits the declaration
- WHEN a client reads
GET /v1/streams/messagesfor a stream that participates in lexical retrieval - THEN the response SHALL include
query.search.lexical_fields - AND every entry in that array SHALL refer to a top-level scalar string field present in the stream's schema
Scenario: A non-participating stream omits the declaration
- WHEN a stream does not participate in lexical retrieval
- THEN the stream's metadata SHALL omit the
query.searchobject - AND searches that include this stream SHALL contribute zero hits from it
Scenario: A stream attempts to declare an unsupported lexical_field shape
- WHEN an implementation would otherwise expose
query.search.lexical_fieldscontaining a nested path, an array field, a blob reference, or a name not present in the stream's schema - THEN the implementation SHALL omit that entry from the declaration in v1
- AND SHALL NOT attempt to match against that field from the public extension surface
Requirement: The resource server SHALL advertise the extension through its existing metadata document
Implementations that expose this extension SHALL publish the advertisement as a capabilities.lexical_retrieval object inside the existing resource-server metadata document (the same document already used by the resource server to publish OAuth-shaped metadata). The advertisement SHALL describe only global facts about the extension. The advertisement SHALL include the keys supported, endpoint, cross_stream, snippets, default_limit, and max_limit. The advertisement SHALL NOT enumerate per-stream lexical_fields. It SHALL NOT grow into a generalized capability-statement document.
Scenario: A server that exposes the extension publishes the advertisement
- WHEN an implementation exposes the extension on a resource server
- THEN that resource server's metadata document SHALL include a
capabilities.lexical_retrievalobject - AND the object SHALL include the keys
supported(set totrue),endpoint,cross_stream,snippets,default_limit, andmax_limit - AND
endpointSHALL be a path resolvable on the same resource server, and SHALL be/v1/searchunless the resource server is mounted under a path prefix, in which case the prefix SHALL be reflected
Scenario: A server that does not expose the extension does not publish a positive advertisement
- WHEN a server does not implement the extension
- THEN the server SHALL either omit
capabilities.lexical_retrievalfrom its resource-server metadata, OR include it withsupported: false - AND in either case clients SHALL treat the extension as unavailable on that server
Scenario: The advertisement does not duplicate per-stream declarations
- WHEN a server advertises the extension
- THEN the advertisement SHALL NOT enumerate per-stream
lexical_fields - AND clients SHALL discover per-stream searchable fields through existing per-stream metadata at
GET /v1/streams/{stream}
Scenario: The advertisement is discoverable without a grant
- WHEN an unauthenticated client requests the resource-server metadata document
- THEN the
capabilities.lexical_retrievaladvertisement, if present, SHALL be returned without requiring a bearer token - AND the advertisement SHALL NOT include grant-bound or caller-specific fields
Requirement: Search results SHALL paginate via opaque cursors that are independent of record-list and changes_since cursors
Pagination on GET /v1/search SHALL use an opaque next_cursor that clients pass back verbatim as cursor. Search cursors SHALL NOT be reused as record-list cursors and SHALL NOT be reused as changes_since values. Within a single search session (same q, same streams[], same grant), pagination SHALL progress stably enough to avoid obvious duplication and infinite loops, but SHALL NOT promise monotonic timestamp ordering, durability across server restarts, or stability across grant changes or index rebuilds.
Scenario: A client paginates a search
- WHEN a
GET /v1/searchresponse includeshas_more: trueand anext_cursor - THEN the client SHALL pass
next_cursorback ascursorto retrieve the next page - AND the server SHALL treat the cursor as opaque
Scenario: A client tries to reuse a search cursor as a record-list cursor
- WHEN a client passes a
next_cursorfrom/v1/searchtoGET /v1/streams/{stream}/records?cursor=... - THEN the record-list endpoint SHALL be free to reject it as
invalid_cursor - AND the search and record-list pagination grammars SHALL NOT be interchangeable
Scenario: A search cursor expires
- WHEN a client returns to a stale search cursor across server restart, index rebuild, grant change, or vendor-defined cursor expiry
- THEN the server MAY return
invalid_cursor - AND the client SHALL recover by issuing a fresh search
Requirement: Ranking SHALL be relevance-oriented but free of portable numeric score commitments
Search results SHALL be returned in relevance-oriented order. Higher-ranked results SHOULD generally be more relevant than lower-ranked results. The extension SHALL NOT define a portable numeric score (BM25 or otherwise) in v1, SHALL NOT define semantic reranking, and SHALL NOT define recency blending or per-connector custom weighting as portable contract.
Scenario: A client expects relevance-ordered results
- WHEN a client receives
data[]from/v1/search - THEN higher-positioned results SHALL be intended to be at least as relevant to
qas lower-positioned results - AND no entry SHALL include a portable numeric relevance score
Scenario: A client tries to influence ranking via a parameter
- WHEN a client passes a
rank=...,boost=..., or similar ranking parameter to/v1/searchin v1 - THEN the server SHALL return an
invalid_request_error - AND the parameter SHALL NOT be silently honored
Requirement: The extension SHALL be lexical-only in v1 and SHALL NOT widen into semantic retrieval, generic predicate DSL, or connector-specific search
In v1 the extension SHALL match only by lexical means over declared searchable text fields. It SHALL NOT accept embedding parameters, vector queries, semantic similarity parameters, generic boolean/predicate query algebra, or connector-specific search semantics. Implementations that wish to offer richer retrieval SHALL do so as a separate, explicitly named extension or a future change to this one — not by silently broadening GET /v1/search.
Scenario: A request includes a semantic/vector parameter
- WHEN a client passes
embedding=...,vector=...,semantic=..., or any vector/semantic-shaped parameter - THEN the server SHALL return an
invalid_request_error - AND the parameter SHALL NOT be silently mapped to lexical behavior
Scenario: A request attempts a connector-specific search semantic
- WHEN a client passes a parameter that only makes sense for a specific connector (for example, a vendor-specific search operator name) to
/v1/search - THEN the server SHALL return an
invalid_request_error - AND the public lexical retrieval surface SHALL NOT branch its behavior on connector identity
Scenario: A future richer surface is added
- WHEN an implementation later wants to offer body-DSL or semantic retrieval
- THEN that surface SHALL be introduced as a separately-named extension or a separately-versioned future revision of this one
- AND the v1
GET /v1/searchcontract SHALL remain unbroken
Requirement: Lexical retrieval SHALL advertise score support before emitting scores
If the reference implementation emits lexical search scores, it SHALL advertise score support in capabilities.lexical_retrieval before clients query /v1/search. The advertisement SHALL identify the score kind and whether higher or lower values sort better.
Scenario: Server emits lexical scores
- WHEN protected-resource metadata advertises lexical score support
- AND a client queries
/v1/search - THEN each lexical hit SHALL include a typed score object
- AND the score object SHALL identify the score kind and ordering direction
Scenario: Server does not advertise lexical scores
- WHEN protected-resource metadata omits lexical score support
- THEN clients SHALL NOT assume
/v1/searchresponses include score fields
Requirement: Lexical scores SHALL be grant-safe and implementation-relative
Lexical scores SHALL be computed only from fields visible under the active grant and SHALL be documented as implementation-relative unless a later change defines portable score calibration.
Scenario: Hidden fields are outside score computation
- WHEN a record contains a lexical-search field outside the caller's grant projection
- THEN the returned lexical score SHALL NOT include contribution from that hidden field
- AND no score explanation SHALL disclose that hidden field