01 Issue
Feature · Open · Extensions
Assignees: None

#379 S3 datastore: dirty sidecar, partitioned index, content hashing, and scoped sync

Opened by stack72 · 5/19/2026

Problem

The @swamp/s3-datastore extension has four efficiency problems:

  1. Push walks everything: markDirty(relPath) is called by core before every cache write, but the extension ignores relPath and flips a single localDirty: boolean. Every push walks all 15 subdirectories to find what changed.
  2. Pull fetches the full index: .datastore-index.json is a monolithic JSON file mapping every path in the datastore. At scale this is megabytes fetched and parsed on every sync, even when only one model changed.
  3. No content hashing: change detection uses stat.size + stat.mtime, which is unreliable across machines with clock skew and can't deduplicate.
  4. Low transfer concurrency: transfers run in batches of 10 concurrent HTTP requests, while S3 supports thousands of requests per second per prefix.

The community @keeb/swamp-mongodb-datastore solves all four: persistent per-path dirty sidecar for push, cursor-based incremental pull, SHA-256 content hashing, and bulk operations in batches of 500.

Proposed Solution

A. Persistent dirty sidecar (efficient push)

Track dirtyPaths: Set<string> and bulkInvalidated: boolean in memory and persist to .datastore-sync-state.json (version 2 schema). On push: if bulkInvalidated → full walk (safe fallback); if dirtyPaths.size === 0 → return 0; otherwise → push only dirty paths.

The version 2 sidecar is compatible in both directions. A v1 sidecar still reads correctly (no dirtyPaths → empty set → full walk). A v2 sidecar read by an old client fails the version check (version mismatch → null → full walk).
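The push decision and the sidecar fallback above can be sketched roughly as follows. This is a minimal illustration, not the extension's code: the `DirtyTracker` class, the `PushPlan` type, the `toSidecar`/`fromSidecar` helpers, and the exact JSON field layout are all hypothetical. It also assumes that a v1 or unreadable sidecar gets its "full walk" by setting bulkInvalidated on load, which the proposal does not spell out.

```typescript
// Hypothetical v2 sidecar layout; only dirtyPaths, bulkInvalidated and the
// version number come from the proposal.
interface SyncStateV2 {
  version: 2;
  dirtyPaths: string[];
  bulkInvalidated: boolean;
}

type PushPlan =
  | { kind: "noop" }
  | { kind: "full-walk" }
  | { kind: "paths"; paths: string[] };

class DirtyTracker {
  private dirtyPaths = new Set<string>();
  private bulkInvalidated = false;

  // Record the exact path instead of flipping a single boolean.
  markDirty(relPath: string): void {
    this.dirtyPaths.add(relPath);
  }

  // Decide what a push has to do, in the order given in the proposal.
  planPush(): PushPlan {
    if (this.bulkInvalidated) return { kind: "full-walk" };
    if (this.dirtyPaths.size === 0) return { kind: "noop" };
    return { kind: "paths", paths: Array.from(this.dirtyPaths).sort() };
  }

  // Serializable form to persist in .datastore-sync-state.json.
  toSidecar(): SyncStateV2 {
    return {
      version: 2,
      dirtyPaths: Array.from(this.dirtyPaths).sort(),
      bulkInvalidated: this.bulkInvalidated,
    };
  }

  // Assumption: anything that is not a valid v2 sidecar (missing file, bad
  // JSON, v1 schema) is treated as bulk-invalidated so the next push walks.
  static fromSidecar(raw: string | null): DirtyTracker {
    const tracker = new DirtyTracker();
    let parsed: unknown = null;
    try {
      parsed = raw === null ? null : JSON.parse(raw);
    } catch {
      parsed = null;
    }
    const state = parsed as Partial<SyncStateV2> | null;
    const paths = state?.dirtyPaths;
    if (state?.version !== 2 || !Array.isArray(paths)) {
      tracker.bulkInvalidated = true;
      return tracker;
    }
    for (const p of paths) tracker.dirtyPaths.add(String(p));
    tracker.bulkInvalidated = state?.bulkInvalidated === true;
    return tracker;
  }
}
```

Treating every unreadable sidecar as bulk-invalidated keeps the failure mode at "slow but correct", which matches the safe-fallback intent described above.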

B. Partitioned index with dual-write (efficient pull)

Write partitioned index files alongside the monolithic .datastore-index.json:

_index/
  _meta.json                          # { version: 1, partitions: [...] }
  data--aws--ec2--vpc--abc-123.json   # entries for that model

Dual-write: on push, write both the monolithic index and the partitioned index. Old clients read the monolithic file, which is always written. New clients with context.models (from #378) read only the relevant partition: one small GET instead of the full index.
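A rough sketch of how a new client might pick which index objects to fetch. Several things here are assumptions rather than part of the proposal: that a model is identified by a slash-separated path such as data/aws/ec2/vpc/abc-123, that the partition key is that path with "/" replaced by "--" (inferred from the example file name above), that partitions in _meta.json lists keys without the .json suffix, and that a missing partition falls back to the monolithic index. The function names are hypothetical.

```typescript
// Assumed shape of _index/_meta.json, based on the comment in the layout above.
interface IndexMeta {
  version: number;
  partitions: string[];
}

const MONOLITHIC_INDEX = ".datastore-index.json";

// Assumption: "data/aws/ec2/vpc/abc-123" -> "data--aws--ec2--vpc--abc-123".
function partitionKey(modelPath: string): string {
  return modelPath
    .split("/")
    .filter((segment) => segment.length > 0)
    .join("--");
}

// Return the index objects a pull should GET.
// meta === null models the case where _index/ does not exist (old data).
function indexObjectsToFetch(
  meta: IndexMeta | null,
  modelPaths: string[] | undefined,
): string[] {
  if (meta === null || modelPaths === undefined || modelPaths.length === 0) {
    return [MONOLITHIC_INDEX];
  }
  const known = new Set(meta.partitions);
  const keys = Array.from(new Set(modelPaths.map(partitionKey)));
  // Assumption: if any requested partition is unknown, prefer the full index
  // over returning a partial view.
  if (!keys.every((key) => known.has(key))) {
    return [MONOLITHIC_INDEX];
  }
  return keys.map((key) => `_index/${key}.json`);
}
```

For a single model this yields exactly one partition object, which is the "one small GET" case described above.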

C. Content hashing (SHA-256)

Add an optional sha256 field to index entries. Old readers ignore the unknown field. New readers compare hashes when the field is present and fall back to size comparison when it is absent.
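The reader-side fallback could look something like the sketch below. The `IndexEntry` shape and the `hasChanged` helper are hypothetical; only the optional sha256 field and the size fallback come from the proposal. Hashing uses `createHash` from Node's built-in crypto module, which is also available in Deno through the `node:` specifier.

```typescript
import { createHash } from "node:crypto";

// Hypothetical index entry; sha256 is optional so old entries still parse.
interface IndexEntry {
  size: number;
  sha256?: string;
}

// Hex-encoded SHA-256 of the file content.
function sha256Hex(content: Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// True when the local content differs from what the index entry describes.
function hasChanged(local: Uint8Array, remote: IndexEntry): boolean {
  // A size difference is conclusive and avoids hashing.
  if (local.byteLength !== remote.size) return true;
  // Same size: use the hash when the writer provided one.
  if (remote.sha256 !== undefined) {
    return sha256Hex(local) !== remote.sha256;
  }
  // Entry written by an old client: size comparison is all we have.
  return false;
}
```

Unlike stat.mtime, the hash depends only on the bytes, so two machines with skewed clocks agree on whether a file changed.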

D. Higher transfer concurrency

Raise the transfer concurrency from 10 to 50 for pull and 25 for push. Make both limits configurable via extension config (pullConcurrency, pushConcurrency).
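One possible way to wire the configurable limits, shown as a sketch. The pullConcurrency and pushConcurrency names and the 50/25 defaults come from the proposal; the `TransferConfig` type, `resolveConcurrency`, and `mapWithConcurrency` are hypothetical helpers. The worker pool keeps a fixed number of transfers in flight rather than waiting for each batch to drain, which is a design choice of this sketch and not something the proposal specifies.

```typescript
// Hypothetical slice of the extension config.
interface TransferConfig {
  pullConcurrency?: number;
  pushConcurrency?: number;
}

const DEFAULT_PULL_CONCURRENCY = 50;
const DEFAULT_PUSH_CONCURRENCY = 25;

// Use the configured value when it is a positive integer, else the default.
function resolveConcurrency(config: TransferConfig): { pull: number; push: number } {
  const pick = (value: number | undefined, fallback: number): number =>
    value !== undefined && Number.isInteger(value) && value > 0 ? value : fallback;
  return {
    pull: pick(config.pullConcurrency, DEFAULT_PULL_CONCURRENCY),
    push: pick(config.pushConcurrency, DEFAULT_PUSH_CONCURRENCY),
  };
}

// Run worker over items with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;
  // Each lane pulls the next unclaimed index until the list is exhausted.
  const runLane = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  };
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A pull would then call something like mapWithConcurrency(keys, resolveConcurrency(config).pull, downloadOne), where downloadOne stands in for whatever function performs a single GET.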

E. Advertise scopedSync capability

Implement capabilities() returning { scopedSync: true }. The framework wiring from #378 activates automatically — the extension translates context.models to partition keys for scoped pull, and uses the dirty sidecar for efficient push.
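To make the capability handshake concrete, here is a hedged sketch. The real SyncCapabilities type and the way core passes context.models are defined in #378 and are not shown in this issue, so every type and function below (`SyncCapabilities`, `SyncContext`, `Datastore`, `frameworkPull`) is a stand-in invented for illustration. Only capabilities() returning { scopedSync: true } comes from the proposal.

```typescript
// Stand-in for the #378 contract; the real type may carry more fields.
interface SyncCapabilities {
  scopedSync: boolean;
}

// Stand-in for the sync context; only `models` is mentioned in the proposal.
interface SyncContext {
  models?: string[];
}

// Stand-in datastore interface. capabilities() is optional so that
// extensions that predate #378 still satisfy it.
interface Datastore {
  capabilities?(): SyncCapabilities;
  pull(context: SyncContext): string;
}

// Sketch of the S3 extension side: advertise scoped sync and honor models.
const s3Datastore: Datastore = {
  capabilities: () => ({ scopedSync: true }),
  pull: (context) =>
    context.models !== undefined && context.models.length > 0
      ? `scoped:${context.models.join(",")}`
      : "full",
};

// Sketch of the framework side: only pass models to stores that opt in.
function frameworkPull(store: Datastore, models: string[]): string {
  const scoped = store.capabilities?.().scopedSync === true;
  return store.pull(scoped ? { models } : {});
}
```

A datastore without capabilities() keeps receiving an empty context, so it continues to do full syncs.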

Backward Compatibility

  • Old clients: read .datastore-index.json (always written). Never see _index/.
  • New clients reading old data: _index/ missing → fall back to monolithic.
  • Mixed old+new writers: monolithic stays consistent via dual-write.
  • sha256 field: old clients ignore it (JSON forward compat).
  • v1 sidecars: no dirtyPaths → full walk (today's behavior).

Dependencies

Depends on #378 (framework contracts) for the SyncCapabilities type, the capabilities() method, and core passing context.models on pull/push.

This is Phase 2 of a 3-phase datastore efficiency overhaul.

02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED

Open

5/19/2026, 8:58:20 PM
