01 Issue
Feature · Shipped · Swamp CLI
Assignees: stack72

#354 Per-path dirty tracking in S3/GCS datastore extensions (Phase 2)

Opened by stack72 · 5/14/2026 · Shipped 5/14/2026

Problem

The S3 and GCS datastore extensions ignore the relPath parameter on markDirty — they flip a single localDirty: boolean flag. When pushChanged runs and localDirty is true, the slow path walks the entire local cache (walk(cachePath)) comparing every file against the index. For a repo with thousands of data artifacts, this is O(all-files) even when only one or two files changed.

The markDirty(relPath) contract (8 rules documented in design/datastores.md and datastore_sync_service.ts) was designed specifically to enable per-path dirty tracking. The contract has existed since swamp-club#232. The S3 and GCS extensions just haven't implemented it yet.

Proposed Solution

Add sidecar-based dirty-path tracking to both @swamp/s3-datastore and @swamp/gcs-datastore, following the pattern already proven by @keeb/mongodb-datastore's Sidecar class.

Changes per extension

markDirty implementation:

  • When relPath is provided: add it to dirtyPaths, a string[] kept deduplicated so it behaves as a set
  • When relPath is undefined: set bulkInvalidated: true (rule 3)
  • Persist both to the existing .datastore-sync-state.json sidecar (add two fields)
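
The markDirty rules above can be sketched as a pure state update. This is a minimal illustration, not the extension's code: the SyncState interface and the free-standing markDirty function are hypothetical names, the sketch assumes localDirty is still set on every call, and persisting the result to the sidecar is left out.

```typescript
// Hypothetical in-memory view of the sidecar fields discussed in this issue.
interface SyncState {
  localDirty: boolean;
  dirtyPaths: string[];
  bulkInvalidated: boolean;
}

// Sketch of markDirty: a path joins the dirty set, no path invalidates everything.
function markDirty(state: SyncState, relPath?: string): SyncState {
  if (relPath === undefined) {
    // Rule 3: without a path, any file may have changed.
    return { ...state, localDirty: true, bulkInvalidated: true };
  }
  // Keep the array deduplicated so it behaves like a set but serializes to JSON.
  const dirtyPaths = state.dirtyPaths.includes(relPath)
    ? state.dirtyPaths
    : [...state.dirtyPaths, relPath];
  // Assumption: localDirty keeps being set so the existing fast-path check still works.
  return { ...state, localDirty: true, dirtyPaths };
}

const cleanState: SyncState = { localDirty: false, dirtyPaths: [], bulkInvalidated: false };
const markedOnce = markDirty(cleanState, "data/aws-ec2/i-123/latest");
const markedTwice = markDirty(markedOnce, "data/aws-ec2/i-123/latest");
const markedBulk = markDirty(markedTwice);
```

Marking the same path twice leaves a single entry, and a call without a path sets bulkInvalidated while leaving the recorded paths in place.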

pushChanged slow path:

  • When bulkInvalidated is true: full cache walk (current behavior, rule 8)
  • When dirtyPaths is non-empty and bulkInvalidated is false: only stat/upload files in the dirty set, skip the walk(cachePath) entirely
  • When dirtyPaths is empty and localDirty is false: fast path (current behavior)
  • On cold start (missing/corrupt sidecar): full walk (rule 4)
  • After successful push: clear dirtyPaths and bulkInvalidated
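
One way to read the branch order above is as a small planning function that runs before any I/O. The sketch below is illustrative only: PushPlan, planPush, and afterSuccessfulPush are hypothetical names, a null state stands in for a missing or corrupt sidecar, and two behaviours are assumptions rather than statements from this issue (falling back to a full walk when localDirty is true with no recorded paths, and resetting localDirty after a push).

```typescript
// Hypothetical in-memory view of the sidecar fields discussed in this issue.
interface SyncState {
  localDirty: boolean;
  dirtyPaths: string[];
  bulkInvalidated: boolean;
}

// The three strategies pushChanged can choose between.
type PushPlan =
  | { kind: "fast-path" }
  | { kind: "full-walk" }
  | { kind: "dirty-set"; paths: string[] };

// null models a missing or corrupt sidecar (cold start).
function planPush(state: SyncState | null): PushPlan {
  if (state === null) return { kind: "full-walk" }; // rule 4: cold start
  if (state.bulkInvalidated) return { kind: "full-walk" }; // rule 8: bulk invalidation
  if (state.dirtyPaths.length > 0) {
    // Only these files are stat'ed and uploaded; the cache walk is skipped.
    return { kind: "dirty-set", paths: [...state.dirtyPaths] };
  }
  if (!state.localDirty) return { kind: "fast-path" };
  // Assumption: dirty with no recorded paths is not covered above, so stay safe.
  return { kind: "full-walk" };
}

// After a successful push the dirty set and the bulk flag are cleared.
// Assumption: localDirty is reset as well.
function afterSuccessfulPush(state: SyncState): SyncState {
  return { ...state, localDirty: false, dirtyPaths: [], bulkInvalidated: false };
}

const scopedPlan = planPush({
  localDirty: true,
  dirtyPaths: ["data/aws-ec2/i-123/latest", "data/aws-ec2/i-123/3/raw"],
  bulkInvalidated: false,
});
const coldStartPlan = planPush(null);
```

Checking bulkInvalidated before dirtyPaths means a bulk invalidation always wins, even if individual paths were also recorded.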

Sidecar schema change

Current .datastore-sync-state.json:

{ "version": 1, "remoteIndexETag": "...", "lastVerifiedAt": "...", "localDirty": true }

After:

{ "version": 2, "remoteIndexETag": "...", "lastVerifiedAt": "...", "localDirty": true, "dirtyPaths": ["data/aws-ec2/i-123/latest", "data/aws-ec2/i-123/3/raw"], "bulkInvalidated": false }

Version bump to 2. Old sidecars (version 1) are treated as bulkInvalidated: true on first read, which triggers a safe full walk and then rewrites the sidecar in the new format.
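
The version 1 to 2 migration can be sketched as a parser over the raw sidecar text. This is a sketch under stated assumptions, not the shipped code: readSidecar and SidecarV2 are hypothetical names, null is used to signal "missing or corrupt, do a full walk", and the strictness of the field checks is illustrative.

```typescript
// Version 2 sidecar shape, using the field names from the example above.
interface SidecarV2 {
  version: 2;
  remoteIndexETag: string;
  lastVerifiedAt: string;
  localDirty: boolean;
  dirtyPaths: string[];
  bulkInvalidated: boolean;
}

// raw is the sidecar file content, or null when the file does not exist.
// Returns null for a missing or corrupt sidecar so the caller does a full walk.
function readSidecar(raw: string | null): SidecarV2 | null {
  if (raw === null) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // corrupt JSON is treated as a cold start
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const fields = parsed as Record<string, unknown>;
  const etag = fields["remoteIndexETag"];
  const verifiedAt = fields["lastVerifiedAt"];
  if (typeof etag !== "string" || typeof verifiedAt !== "string") return null;
  const base = {
    version: 2 as const,
    remoteIndexETag: etag,
    lastVerifiedAt: verifiedAt,
    localDirty: fields["localDirty"] === true,
  };
  if (fields["version"] === 1) {
    // Old format: no per-path data exists, so force one safe full walk.
    return { ...base, dirtyPaths: [], bulkInvalidated: true };
  }
  const paths = fields["dirtyPaths"];
  if (fields["version"] === 2 && Array.isArray(paths) && paths.every((p) => typeof p === "string")) {
    return { ...base, dirtyPaths: paths as string[], bulkInvalidated: fields["bulkInvalidated"] === true };
  }
  return null; // unknown version or malformed fields
}

const migratedSidecar = readSidecar(
  '{ "version": 1, "remoteIndexETag": "abc", "lastVerifiedAt": "2026-05-14T00:00:00Z", "localDirty": true }',
);
const corruptSidecar = readSidecar("{not json");
```

Returning the migrated value in the version 2 shape lets the caller write it straight back, which matches the "full walk, then rewrite" behaviour described above.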

Reference implementation

@keeb/mongodb-datastore sidecar.ts at https://github.com/keeb/swamp-mongodb-datastore implements exactly this pattern with recordDirty(relPath), clearDirty(), and a bulkInvalidated flag. The S3/GCS implementation can follow the same structure.

Files to change

In systeminit/swamp-extensions:

  • datastore/s3/extensions/datastores/_lib/s3_cache_sync.ts — markDirty, pushChanged slow path, sidecar schema
  • datastore/s3/extensions/datastores/_lib/s3_cache_sync_test.ts — tests for dirty-set tracking
  • datastore/gcs/extensions/datastores/_lib/gcs_cache_sync.ts — same changes, mirrored
  • datastore/gcs/extensions/datastores/_lib/gcs_cache_sync_test.ts — same tests, mirrored

Backward compatibility

  • Sidecar version 1 → 2 migration is automatic and safe (old version = full walk + rewrite)
  • No changes to the remote index format
  • No changes to the domain contracts (this implements the existing markDirty contract)
  • The full-walk fallback is always available — per-path tracking is a performance optimization, not a correctness change

Validation

  • Unit tests: mark specific paths dirty → push only uploads those paths
  • Unit tests: mark dirty with no relPath → push does full walk
  • Unit tests: corrupt/missing sidecar → push does full walk
  • Unit tests: version 1 sidecar → treated as bulkInvalidated
  • Manual test: run one model method against S3/Minio, verify push uploads only that model's files (not the entire cache)

Relationship to other phases

This is Phase 2 of the scoped sync plan documented in design/datastores.md. It does not depend on Phase 1 (issue #350) — it implements the existing markDirty contract, not the new scope/capabilities contract. Phase 3 (partitioned index) depends on this phase.

Impact: converts most per-model pushes from O(all-files) to O(changed-files). The most common case — run one model method, write one data artifact — pushes 2-3 files instead of walking thousands.

02 Bog Flow

Shipped

5/14/2026, 8:19:51 PM

03 Sludge Pulse
stack72 assigned stack72 · 5/14/2026, 5:43:51 PM
