Lab #379: S3 datastore: dirty sidecar, partitioned index, content hashing, and scoped sync

Problem

The @swamp/s3-datastore extension has four efficiency problems:

Push walks everything: core calls markDirty(relPath) before every cache write, but the extension ignores relPath and flips a single localDirty: boolean. Every push then walks all 15 subdirectories to find what changed.
Pull fetches the full index: .datastore-index.json is a monolithic JSON file mapping every path in the datastore. At scale this is megabytes fetched and parsed on every sync, even when only one model changed.
No content hashing: change detection uses stat.size + stat.mtime, which is unreliable across machines with clock skew and cannot deduplicate identical content.
Low transfer concurrency: transfers run in batches of 10 concurrent HTTP requests, while S3 supports thousands of requests per second per prefix.

The community @keeb/swamp-mongodb-datastore solves all four: persistent per-path dirty sidecar for push, cursor-based incremental pull, SHA-256 content hashing, and bulk operations in batches of 500.

Proposed Solution

A. Persistent dirty sidecar (efficient push)

Track dirtyPaths: Set<string> and bulkInvalidated: boolean in memory and persist to .datastore-sync-state.json (version 2 schema). On push: if bulkInvalidated → full walk (safe fallback); if dirtyPaths.size === 0 → return 0; otherwise → push only dirty paths.
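
The three push cases above can be sketched as a small planning function. This is a minimal sketch, not the extension's code: only dirtyPaths and bulkInvalidated come from the proposal, while DirtyState, PushPlan, and planPush are hypothetical names used for illustration.

```typescript
// Hypothetical in-memory shape; only the two field names come from the proposal.
interface DirtyState {
  dirtyPaths: Set<string>;
  bulkInvalidated: boolean;
}

// Hypothetical result type describing what a push has to do.
type PushPlan =
  | { kind: "full-walk" }
  | { kind: "nothing" }
  | { kind: "paths"; paths: string[] };

// Record the specific path instead of flipping a single boolean.
function markDirty(state: DirtyState, relPath: string): void {
  state.dirtyPaths.add(relPath);
}

// Follow the three cases in order: bulk invalidation wins, then the empty set,
// then a push limited to the recorded paths.
function planPush(state: DirtyState): PushPlan {
  if (state.bulkInvalidated) return { kind: "full-walk" };
  if (state.dirtyPaths.size === 0) return { kind: "nothing" };
  return { kind: "paths", paths: Array.from(state.dirtyPaths).sort() };
}
```

Sorting the paths is a choice made here only to keep the plan deterministic; the proposal does not require any ordering.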

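The version 2 sidecar is compatible in both directions. A new client reading a v1 sidecar finds no dirtyPaths and does a full walk; because an empty dirtyPaths set on its own would skip the push, the reader presumably has to treat a v1 sidecar as bulkInvalidated. An old client reading a v2 sidecar sees a version mismatch, treats the state as null, and does a full walk.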
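
One way the new reader could handle both sidecar versions is sketched below. The on-disk field layout, the SIDECAR_VERSION constant, and both function names are assumptions; the proposal only fixes the file name, the version number, and the two tracked fields. The sketch also assumes that anything it cannot read as v2 (a v1 file, corrupt JSON, a missing dirtyPaths array) should degrade to a full walk.

```typescript
// Hypothetical in-memory shape; only the two field names come from the proposal.
interface DirtyState {
  dirtyPaths: Set<string>;
  bulkInvalidated: boolean;
}

const SIDECAR_VERSION = 2;

// Serialize the state for .datastore-sync-state.json (layout is an assumption).
function serializeSidecar(state: DirtyState): string {
  return JSON.stringify({
    version: SIDECAR_VERSION,
    dirtyPaths: Array.from(state.dirtyPaths).sort(),
    bulkInvalidated: state.bulkInvalidated,
  });
}

// Parse a sidecar; unreadable or non-v2 content yields a state that forces a full walk.
function parseSidecar(text: string): DirtyState {
  const fullWalk: DirtyState = { dirtyPaths: new Set<string>(), bulkInvalidated: true };
  let raw: unknown;
  try {
    raw = JSON.parse(text);
  } catch {
    return fullWalk;
  }
  if (typeof raw !== "object" || raw === null) return fullWalk;
  const obj = raw as { version?: unknown; dirtyPaths?: unknown; bulkInvalidated?: unknown };
  if (obj.version !== SIDECAR_VERSION || !Array.isArray(obj.dirtyPaths)) return fullWalk;
  const paths = obj.dirtyPaths.filter((p): p is string => typeof p === "string");
  return { dirtyPaths: new Set<string>(paths), bulkInvalidated: obj.bulkInvalidated === true };
}
```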

B. Partitioned index with dual-write (efficient pull)

Write partitioned index files alongside the monolithic .datastore-index.json:

_index/
  _meta.json                          # { version: 1, partitions: [...] }
  data--aws--ec2--vpc--abc-123.json   # entries for that model

Dual-write: on push, write BOTH the monolithic index and the partitioned files. Old clients read the monolithic index (always written). New clients with context.models (from #378) read only the relevant partition: one small GET instead of the full index.
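
The split that a dual-write needs can be sketched as follows. Several things here are assumptions: that a partition file name is the model's path prefix with "/" replaced by "--" (inferred from the example file name above), that index keys are slash-separated relative paths, and the names IndexEntry, partitionFileFor, and splitIndex. Writing _meta.json is left out because its partitions list is not specified in detail.

```typescript
// Hypothetical index shapes for illustration.
interface IndexEntry {
  size: number;
  mtime: number;
  sha256?: string;
}
type Index = Record<string, IndexEntry>;

// "data/aws/ec2/vpc/abc-123" -> "data--aws--ec2--vpc--abc-123.json" (assumed naming rule).
function partitionFileFor(modelPrefix: string): string {
  return modelPrefix.split("/").filter((s) => s.length > 0).join("--") + ".json";
}

// Group index entries by the model prefix they fall under. Paths that match no
// prefix are skipped here and remain only in the monolithic index.
function splitIndex(index: Index, modelPrefixes: string[]): Map<string, Index> {
  const partitions = new Map<string, Index>();
  for (const [path, entry] of Object.entries(index)) {
    const prefix = modelPrefixes.find((p) => path === p || path.startsWith(p + "/"));
    if (prefix === undefined) continue;
    const file = partitionFileFor(prefix);
    const part = partitions.get(file) ?? {};
    part[path] = entry;
    partitions.set(file, part);
  }
  return partitions;
}
```

A push would then upload .datastore-index.json as before and additionally upload each map entry under _index/.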

C. Content hashing (SHA-256)

Add an optional sha256 field to index entries. Old readers ignore the unknown field. New readers use the hash when it is present and fall back to size comparison when it is absent.
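
The comparison rule for new readers might look like the sketch below. It uses Node's createHash from node:crypto; the names IndexEntry, sha256Hex, and hasChanged are hypothetical, and the entry shape is reduced to the two fields the rule needs.

```typescript
import { createHash } from "node:crypto";

// Reduced, hypothetical index entry: sha256 is optional so old entries still parse.
interface IndexEntry {
  size: number;
  sha256?: string;
}

// Hex-encoded SHA-256 of a file's content.
function sha256Hex(content: Uint8Array | string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Prefer the hash when the remote entry carries one; otherwise fall back to size.
function hasChanged(
  local: { size: number; sha256: string },
  remote: IndexEntry | undefined,
): boolean {
  if (remote === undefined) return true; // not in the index yet
  if (remote.sha256 !== undefined) return remote.sha256 !== local.sha256;
  return remote.size !== local.size; // legacy entry without a hash
}
```

With the hash present, two files of equal size but different content are detected as changed, which the size fallback cannot do.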

D. Higher transfer concurrency

Raise concurrency from 10 to 50 for pull and 25 for push. Make both configurable via extension config (pullConcurrency, pushConcurrency).
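
A sketch of how the two settings could be resolved and applied is shown below. Only pullConcurrency, pushConcurrency, and the 50/25 defaults come from the proposal; TransferConfig, resolveConcurrency, and mapWithConcurrency are hypothetical. Note that this uses a worker pool rather than fixed batches, which is a substitution made here: a pool starts the next transfer as soon as one finishes instead of waiting for the slowest request in a batch.

```typescript
// Hypothetical config shape; the two field names come from the proposal.
interface TransferConfig {
  pullConcurrency?: number;
  pushConcurrency?: number;
}

const DEFAULT_PULL_CONCURRENCY = 50;
const DEFAULT_PUSH_CONCURRENCY = 25;

// Use the configured value when it is a positive integer, else the default.
function resolveConcurrency(config: TransferConfig): { pull: number; push: number } {
  const pick = (value: number | undefined, fallback: number): number =>
    value !== undefined && Number.isInteger(value) && value > 0 ? value : fallback;
  return {
    pull: pick(config.pullConcurrency, DEFAULT_PULL_CONCURRENCY),
    push: pick(config.pushConcurrency, DEFAULT_PUSH_CONCURRENCY),
  };
}

// Run task over items with at most `limit` tasks in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await task(items[i]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```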

E. Advertise `scopedSync` capability

Implement capabilities() returning { scopedSync: true }. The framework wiring from #378 then activates automatically: the extension translates context.models to partition keys for scoped pull and uses the dirty sidecar for efficient push.
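
A minimal sketch of the capability and the model-to-partition translation follows. The SyncCapabilities and SyncContext interfaces below are stand-ins for the #378 contracts, whose real shapes may differ; the sketch assumes context.models is a list of slash-separated model paths and that partition keys live under _index/ with "/" replaced by "--". S3DatastoreSketch and partitionsToPull are hypothetical names.

```typescript
// Hypothetical stand-ins for the #378 framework contracts.
interface SyncCapabilities {
  scopedSync: boolean;
}
interface SyncContext {
  models?: string[];
}

class S3DatastoreSketch {
  // Advertise scoped sync so core passes context.models on pull/push.
  capabilities(): SyncCapabilities {
    return { scopedSync: true };
  }

  // Translate the requested models to partition object keys.
  // Returns null when no scope is given, meaning "read the full index".
  partitionsToPull(context: SyncContext): string[] | null {
    if (context.models === undefined || context.models.length === 0) return null;
    return context.models.map((model) => "_index/" + model.split("/").join("--") + ".json");
  }
}
```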

Backward Compatibility

Old clients: read .datastore-index.json (always written). Never see _index/.
New clients reading old data: _index/ missing → fall back to monolithic.
Mixed old+new writers: monolithic stays consistent via dual-write.
sha256 field: old clients ignore it (JSON forward compat).
v1 sidecars: no dirtyPaths → full walk (today's behavior).
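
The read-side fallback in the list above can be sketched as follows. GetObject is a hypothetical fetcher that resolves to null when the key does not exist; loadIndex and the reduced Index type are also illustrative, not the extension's API.

```typescript
// Reduced, hypothetical index shape.
type Index = Record<string, { size: number; sha256?: string }>;

// Hypothetical fetcher: resolves to the object body, or null when the key is missing.
type GetObject = (key: string) => Promise<string | null>;

// Try the partition first; if it is missing (old data without _index/),
// fall back to the monolithic index that every writer still produces.
async function loadIndex(
  get: GetObject,
  partitionKey: string | null,
): Promise<{ index: Index; source: "partition" | "monolithic" }> {
  if (partitionKey !== null) {
    const body = await get(partitionKey);
    if (body !== null) return { index: JSON.parse(body) as Index, source: "partition" };
  }
  const mono = await get(".datastore-index.json");
  return { index: mono === null ? {} : (JSON.parse(mono) as Index), source: "monolithic" };
}
```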

Dependencies

Depends on #378 (framework contracts) for the SyncCapabilities type, the capabilities() method, and core passing context.models on pull/push.

This is Phase 2 of a 3-phase datastore efficiency overhaul.
