Tasks — add-provider-budget-run-control

Tasks: 42/53
Source: openspec/changes/add-provider-budget-run-control/tasks.md

This change began as a proposal lane and now tracks implementation tranches. Keep the spec, design, tasks, and implementation in lockstep until archive.

1. Spec delta (this lane — proposal only)

  • Write proposal.md — change rationale, scope, capability targets.
  • Write design.md — design decisions, tradeoffs, acceptance checks.
  • Write tasks.md — this file.
  • Write specs/polyfill-runtime/spec.md — normative requirement deltas.
  • openspec validate add-provider-budget-run-control --strict.
  • openspec validate --all --strict.

2. Implementation (future lanes)

The following tasks are stubs for implementation lanes. Each should be a separate lane or tranche with its own acceptance checks.

2.1 Per-provider token-bucket pacing

  • Implement a per-provider token bucket in the polyfill-runtime base.
    • Fill rate and burst depth configurable per connector; unset → conservative default.
    • AIMD adaptive fill-rate adjustment: additive increase on success, multiplicative decrease on 429/503/elevated-latency.
    • One-way ratchet: error responses may only increase delay, never decrease.
    • Conservative starting delay before first response signal.
    • Per-provider isolation: a slow or rate-limited provider does not stall other providers.
  • Unit-test the token bucket with an injectable clock (no live provider required).
  • Verify that the generic primitive can run unbounded when no budget is configured.
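
The pacing tasks above can be sketched as follows. This is a minimal illustration, not the polyfill-runtime implementation: the class name `AimdTokenBucket`, its fields, and every default value are assumptions made for the example. It shows an injectable clock, a conservative empty start, additive increase on success, and multiplicative decrease on pressure (429/503/elevated latency). Unbounded mode is left out for brevity.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AimdTokenBucket:
    """Hypothetical per-provider token bucket with AIMD fill-rate adjustment."""

    clock: Callable[[], float] = time.monotonic  # injectable for tests
    fill_rate: float = 1.0   # tokens per second (assumed conservative default)
    burst: float = 1.0       # maximum stored tokens
    min_rate: float = 0.1
    max_rate: float = 10.0
    increase: float = 0.1    # additive step applied on success
    decrease: float = 0.5    # multiplicative factor applied on pressure
    tokens: float = 0.0      # start empty, so the first request waits
    _last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self._last = self.clock()

    def _refill(self) -> None:
        now = self.clock()
        earned = (now - self._last) * self.fill_rate
        self.tokens = min(self.burst, self.tokens + earned)
        self._last = now

    def delay_until_ready(self) -> float:
        """Seconds to wait before the next request may start (0.0 if ready)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.fill_rate

    def take(self) -> bool:
        """Consume one token if available; never blocks."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_success(self) -> None:
        # Additive increase: only success signals may speed pacing up.
        self._refill()
        self.fill_rate = min(self.max_rate, self.fill_rate + self.increase)

    def on_pressure(self) -> None:
        # Multiplicative decrease: error signals only ever lengthen the delay.
        self._refill()
        self.fill_rate = max(self.min_rate, self.fill_rate * self.decrease)
```

Keeping one bucket per provider, for example in a dictionary keyed by provider name, gives the isolation property: slowing one provider's bucket leaves the others untouched.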

2.2 Retry budget (ratio-based token bucket)

  • Implement a run-scoped retry budget token bucket.
    • Capacity ≈ 20% of per-run request cap (or a configurable minimum).
    • Tokens consumed on retry; refilled proportionally to successes.
    • Full jitter backoff: sleep = random(0, min(cap, base × 2^attempt)).
    • Retry only on 429, 408, 5xx. Non-retryable 4xx logs and skips (no budget consumed).
    • When the bucket is empty: defer the run as a resumable gap whose reason is not in the source-pressure reason set.
  • Unit-test retry budget exhaustion path.
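
A sketch of the retry budget and full-jitter backoff described above. The names `RetryBudget`, `decide`, and `full_jitter_delay`, the decision strings, and the default values are hypothetical. The 20% ratio, the retryable status set, and the jitter formula come from the task list.

```python
import random

_RNG = random.Random()


def is_retryable(status: int) -> bool:
    """Retry only on 429, 408, and 5xx."""
    return status in (408, 429) or 500 <= status <= 599


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                      rng: random.Random = _RNG) -> float:
    """sleep = random(0, min(cap, base * 2^attempt))."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


class RetryBudget:
    """Hypothetical run-scoped retry budget (ratio-based token bucket)."""

    def __init__(self, request_cap: int, ratio: float = 0.2,
                 minimum: int = 3, refill_per_success: float = 0.2) -> None:
        # Capacity is roughly 20% of the per-run request cap, with a floor.
        self.capacity = max(minimum, int(request_cap * ratio))
        self.tokens = float(self.capacity)
        self.refill_per_success = refill_per_success

    def decide(self, status: int) -> str:
        """Return 'retry', 'skip' (non-retryable), or 'defer' (budget empty)."""
        if not is_retryable(status):
            return "skip"    # logged and skipped; no budget consumed
        if self.tokens < 1.0:
            return "defer"   # caller records a resumable gap
        self.tokens -= 1.0
        return "retry"

    def on_success(self) -> None:
        # Refill proportionally to successes, never above capacity.
        self.tokens = min(float(self.capacity),
                          self.tokens + self.refill_per_success)
```

A caller would sleep for `full_jitter_delay(attempt)` only after `decide` returns `"retry"`, and would stop the run with a resumable gap on `"defer"`.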

2.3 Circuit breaker integration

  • Integrate a circuit breaker into the provider-budget admission path.
    • Composition order in the implemented ChatGPT path: provider-budget admission gates each real API/retry attempt before the provider request.
    • Closed → Open on failure-rate threshold over a sliding window.
    • Open → Half-Open after configurable reset timeout.
    • Half-Open: probe request; success → Closed; failure → Open.
    • Minimum-throughput guard: breaker cannot open before a minimum request count.
    • When Open: propagate planned deferral immediately, without launching the provider request.
  • Expose circuit breaker state transitions as structured run progress evidence from the generic provider-budget primitive.
  • Unit-test every transition among the three states (Closed, Open, Half-Open).
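
The state machine above might look like the sketch below. It assumes a count-based sliding window (the last N results) and single-threaded use, neither of which the task list specifies. `CircuitBreaker`, `allow`, `record`, and the defaults are illustrative names and values. The `transitions` list stands in for structured run progress evidence.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Tuple

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"


class CircuitBreaker:
    """Hypothetical breaker consulted before each provider request."""

    def __init__(self, clock: Callable[[], float] = time.monotonic,
                 window: int = 20, failure_rate: float = 0.5,
                 min_requests: int = 5, reset_timeout: float = 30.0) -> None:
        self.clock = clock
        self.failures: Deque[bool] = deque(maxlen=window)  # True = failure
        self.failure_rate = failure_rate
        self.min_requests = min_requests
        self.reset_timeout = reset_timeout
        self.state = CLOSED
        self.opened_at = 0.0
        self.transitions: List[Tuple[str, str]] = []

    def _move(self, new_state: str) -> None:
        self.transitions.append((self.state, new_state))
        self.state = new_state
        if new_state == OPEN:
            self.opened_at = self.clock()
        elif new_state == CLOSED:
            self.failures.clear()

    def allow(self) -> bool:
        """False means: defer immediately, do not launch the request."""
        if (self.state == OPEN
                and self.clock() - self.opened_at >= self.reset_timeout):
            self._move(HALF_OPEN)  # the next request acts as the probe
        return self.state != OPEN

    def record(self, success: bool) -> None:
        if self.state == OPEN:
            return  # nothing should be in flight while Open
        if self.state == HALF_OPEN:
            self._move(CLOSED if success else OPEN)
            return
        self.failures.append(not success)
        # Minimum-throughput guard: do not open on too few samples.
        if len(self.failures) < self.min_requests:
            return
        if sum(self.failures) / len(self.failures) >= self.failure_rate:
            self._move(OPEN)
```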

2.4 Run budget envelope (request cap + wall-clock deadline)

  • Implement run-scoped request cap and wall-clock deadline.
    • Generic primitive supports unbounded mode; ChatGPT defaults to adaptive pacing/retry protection, with fixed caps only as explicit envelopes.
    • Wall-clock checked between fetch attempts, never mid-fetch.
    • On exhaustion: emit resumable gap record; checkpoint reflects last durable write only.
    • Gap reason is not in source-pressure reason set.
    • Does not arm source-pressure cooldown governor.
  • Unit-test with injectable clock (request cap trip, wall-clock trip, unbounded mode).
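
A minimal sketch of the run budget envelope, assuming hypothetical names throughout: `RunBudget`, the two reason strings, and the contents of `SOURCE_PRESSURE_REASONS` are placeholders, not the project's real identifiers. Leaving both limits unset gives the unbounded mode, and `check()` is meant to be called between fetch attempts only.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

# Placeholder reason names; the real reason sets live in the spec.
SOURCE_PRESSURE_REASONS = frozenset({"provider_rate_limited"})
REQUEST_CAP_EXHAUSTED = "run_request_cap_exhausted"
DEADLINE_EXHAUSTED = "run_deadline_exhausted"


@dataclass
class RunBudget:
    """Hypothetical run-scoped request cap plus wall-clock deadline."""

    clock: Callable[[], float] = time.monotonic  # injectable for tests
    request_cap: Optional[int] = None            # None = unbounded
    deadline_seconds: Optional[float] = None     # None = unbounded
    requests: int = 0
    started: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.started = self.clock()

    def note_request(self) -> None:
        """Count one fetch attempt against the cap."""
        self.requests += 1

    def check(self) -> Optional[str]:
        """Return a gap reason if the envelope is exhausted, else None.

        Intended to run between fetch attempts, never mid-fetch.
        """
        if self.request_cap is not None and self.requests >= self.request_cap:
            return REQUEST_CAP_EXHAUSTED
        if (self.deadline_seconds is not None
                and self.clock() - self.started >= self.deadline_seconds):
            return DEADLINE_EXHAUSTED
        return None
```

When `check()` returns a reason, the caller would emit a resumable gap record with that reason and leave the checkpoint at the last durable write.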

2.5 Commit-gated monotonic checkpoint

  • Audit all connectors: checkpoint advancement must follow, not precede, durable write confirmation.
  • Enforce opaque cursor storage: no reconstructed offset cursors in any first-party connector.
  • Add a CI assertion or test that fails when a connector advances its checkpoint before durable write.
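
One way to make the ordering rule testable is a small guard that refuses to advance until the write has been confirmed. The sketch below is an assumption about how such a guard could look. `CommitGatedCheckpoint` and `CheckpointOrderError` are hypothetical names. The cursor is stored verbatim as an opaque string and never parsed or reconstructed.

```python
from typing import Dict, Optional


class CheckpointOrderError(RuntimeError):
    """Raised when a checkpoint would move ahead of durable storage."""


class CommitGatedCheckpoint:
    """Hypothetical guard: advance only after durable write confirmation."""

    def __init__(self) -> None:
        self.cursor: Optional[str] = None      # opaque, stored as received
        self._confirmed: Dict[str, int] = {}   # cursor -> confirmation order
        self._next_seq = 0
        self._current_seq = -1

    def confirm_durable(self, cursor: str) -> None:
        """Record that the data up to this cursor is durably written."""
        if cursor not in self._confirmed:
            self._confirmed[cursor] = self._next_seq
            self._next_seq += 1

    def advance(self, cursor: str) -> None:
        """Move the checkpoint forward; refuse unconfirmed or older cursors."""
        seq = self._confirmed.get(cursor)
        if seq is None:
            raise CheckpointOrderError(
                "checkpoint advanced before durable write was confirmed")
        if seq <= self._current_seq:
            raise CheckpointOrderError("checkpoint may not move backwards")
        self.cursor = cursor
        self._current_seq = seq
```

A connector test can drive the connector against a fake store and fail if `advance` raises, which is one possible shape for the CI assertion.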

2.6 Catch-up vs. steady-state separation (where applicable)

  • For connectors with a historical backfill phase, implement separate bookmarks for catch-up and steady-state modes.
  • Verify that catch-up runs do not advance the steady-state incremental cursor.
  • Document the mode-switching predicate (when to shift from catch-up windows to steady-state incremental).
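
The bookmark separation can be illustrated with two independent cursor fields. Everything named here is hypothetical, including `Bookmarks`, `record_progress`, and the `backfill_complete` flag used as the mode-switching predicate. The real predicate is still to be documented per connector.

```python
from dataclasses import dataclass, replace
from typing import Optional

CATCH_UP, STEADY_STATE = "catch_up", "steady_state"


@dataclass(frozen=True)
class Bookmarks:
    """Hypothetical per-connector bookmarks, one per mode."""

    catch_up: Optional[str] = None
    steady_state: Optional[str] = None
    backfill_complete: bool = False


def select_mode(bookmarks: Bookmarks) -> str:
    # Placeholder predicate: switch once the backfill reports completion.
    return STEADY_STATE if bookmarks.backfill_complete else CATCH_UP


def record_progress(bookmarks: Bookmarks, mode: str, cursor: str,
                    backfill_complete: bool = False) -> Bookmarks:
    """Return updated bookmarks; each mode writes only its own cursor."""
    if mode == CATCH_UP:
        return replace(
            bookmarks,
            catch_up=cursor,
            backfill_complete=bookmarks.backfill_complete or backfill_complete,
        )
    if mode == STEADY_STATE:
        return replace(bookmarks, steady_state=cursor)
    raise ValueError(f"unknown mode: {mode}")
```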

2.7 Operator progress/visibility

  • Emit run-scoped structured circuit state changes from connector runtime progress.
  • Project circuit breaker state (Closed/Open/Half-Open) in the connector health view.
  • Distinguish budget-exhaustion deferrals from source-pressure deferrals in display copy.
  • Ensure run-progress reporting distinguishes: pages fetched this run, pages deferred, retry events, circuit breaker state changes.

2.8 Detail-gap recovery drain loop

  • Replace the single-page START.detail_gaps recovery behavior with a reference-only page request/response loop. The loop drains all eligible pending detail gaps in one logical run and ends only when storage is drained or an adaptive provider/run safety stop fires.
  • Bound internal detail-gap pages by serialized payload byte budget, adapting candidate row size from observed payload size. The remaining SQL row candidate cap is a storage safety fallback only and cannot cap recovery progress because the connector keeps requesting pages until drained or stopped.
  • Preserve connector-local provider budget, retry, and circuit-breaker state across pages by keeping one connector process alive for the whole drain.
  • Unit-test recovery beyond 100 pending gaps in one run, adaptive-stop behavior, and byte-budget paging for large gap payloads.
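
The drain loop above can be sketched as follows. This is a simplification under stated assumptions: it measures each row's serialized size directly rather than adapting a candidate row count from observed payload sizes, and `next_page`, `drain`, `recover`, `should_stop`, and the default budgets are hypothetical names and values.

```python
import json
from typing import Callable, List


def next_page(pending: List[dict], byte_budget: int,
              row_cap: int = 500) -> List[dict]:
    """Select leading gaps whose serialized size fits the byte budget.

    row_cap is only a safety fallback; at least one row is always
    returned so an oversized gap cannot stall the drain.
    """
    page: List[dict] = []
    used = 0
    for gap in pending[:row_cap]:
        size = len(json.dumps(gap).encode("utf-8"))
        if page and used + size > byte_budget:
            break
        page.append(gap)
        used += size
    return page


def drain(pending: List[dict],
          recover: Callable[[List[dict]], None],
          should_stop: Callable[[], bool],
          byte_budget: int = 64_000) -> int:
    """Request pages until storage is drained or a safety stop fires.

    Runs inside one process, so budget, retry, and breaker state held
    by the caller persists across pages. Returns the number recovered.
    """
    recovered = 0
    while pending and not should_stop():
        page = next_page(pending, byte_budget)
        recover(page)
        del pending[:len(page)]
        recovered += len(page)
    return recovered
```

Because the loop keeps requesting pages, neither the byte budget nor the row cap limits total progress in a run. Both only bound the size of a single page.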

3. Owner closeout

  • Per-provider live calibration: run at least one connector under the new control model against a real provider, confirm pacing converges, retry budget is not exhausted on a healthy run, and wall-clock/request-cap deferrals do not arm source-pressure cooldown.
  • Archive this change once all §2 implementation tranches land and the per-provider calibration is recorded.

Acceptance Checks (proposal validation)

openspec validate add-provider-budget-run-control --strict
openspec validate --all --strict
git diff --check