01 Issue
Bug · Closed · Swamp CLI
Assignees: None

datastore sync --push runs pushChanged() twice per invocation (coordinator dedup)

Opened by stack72 · 4/8/2026

Summary

swamp datastore sync --push invokes S3CacheSyncService.pushChanged() twice on every invocation — once from the command action and again from flushDatastoreSync() at runCli teardown. Both walks scan the full cache and both rounds do S3 PUTs. This is pure waste on the success path, and on the failure path it produces misleading log ordering where "Pushed 1 file(s)" prints before the original error surfaces.

This is a separable follow-up to swamp-club issue #29. Issue #29 covers the primary bug (_catalog.db-wal fails because the catalog lives inside the sync cache) and has its own fix path. This issue is specifically about the coordinator/command-level push duplication and should be fixed independently because the risk profile is different — getting the signalling wrong can cause silent push no-ops, which is worse than the current double-push waste.

Code path

  1. src/cli/commands/datastore_sync.ts:68-71 — --push mode uses requireInitializedRepo (not the read-only variant).
  2. src/cli/repo_context.ts:325-329 — requireInitializedRepo calls registerDatastoreSync({ service, lock, label }) for S3 datastores, registering the S3CacheSyncService with the global sync coordinator.
  3. src/cli/commands/datastore_sync.ts:76-79 — the command action then calls datastoreSync(ctx, deps, { mode: 'push' }).
  4. src/libswamp/datastores/sync.ts:172-177 — datastoreSync with mode: 'push' calls deps.pushSync() which calls syncService.pushChanged(). First push.
  5. src/cli/mod.ts:939 — after cli.parse returns, runCli calls flushDatastoreSync().
  6. src/infrastructure/persistence/datastore_sync_coordinator.ts:196-247 — flushDatastoreSync() iterates all registered entries and calls entry.service.pushChanged(). Second push.
  7. src/cli/mod.ts:991 — on error paths, the catch block calls flushDatastoreSync() again, compounding the confusion.
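The flow above can be condensed into a small, synchronous sketch. Every name here (`registerDatastoreSync`, `flushDatastoreSync`, the counting service) is a simplified stand-in for the real modules referenced in the steps, which are async and carry more state:

```typescript
// Simplified, synchronous sketch of the double-push call flow.
interface SyncService {
  pushChanged(): number; // returns number of files pushed
}

const registered: SyncService[] = [];

// Stand-in for registerDatastoreSync() called from requireInitializedRepo
function registerDatastoreSync(service: SyncService): void {
  registered.push(service);
}

// Stand-in for flushDatastoreSync() at runCli teardown
function flushDatastoreSync(): void {
  for (const service of registered) service.pushChanged();
}

// Counting stand-in for the S3 sync service
let pushCalls = 0;
const service: SyncService = {
  pushChanged() {
    pushCalls++;
    return 1;
  },
};

// One `datastore sync --push` invocation:
registerDatastoreSync(service); // step 2: init registers the service
service.pushChanged();          // step 4: explicit push from the command
flushDatastoreSync();           // steps 5-6: teardown flush pushes again

console.log(pushCalls); // → 2: one invocation, two full pushes
```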

Evidence that the maintainer is already aware

src/cli/commands/datastore_sync.ts:61-62 explicitly comments:

```
// Pull-only mode uses read-only init to avoid acquiring the global
// lock and triggering a coordinator push on flush.
```

…and uses `requireInitializedRepoReadOnly` for the pull path specifically to dodge this. The push path hasn't been similarly adjusted.

Impact

  • Every S3 `--push` invocation does 2× the S3 PUT cost and 2× the cache walks: doubled latency, doubled API cost, and a doubled race surface on any mutable file in the cache. That race surface is what makes the `_catalog.db-wal` failure in issue #29 so confusing — the first push hits the WAL mid-write and throws, while the second push runs after SQLite has checkpointed the WAL and succeeds, so the logs show "Pushed 1 file(s)" before the original error surfaces as fatal.
  • Log ordering is misleading on failures. The second push runs after the first has thrown, so the success message for the second push appears above the fatal error from the first. Anyone debugging a failed push will see conflicting evidence in the log trace.

Suggested fix approaches

Two viable directions, each with a failure mode to watch:

A. Command skips the explicit push, relies on coordinator flush. Remove the `deps.pushSync()` call from `datastoreSync({ mode: 'push' })` and let `flushDatastoreSync()` at `runCli` teardown do the push. This mirrors how normal commands (`model create`, `workflow run`) already work — the coordinator flush is the canonical push point. Risk: the coordinator flush at `datastore_sync_coordinator.ts:235-246` logs failures as `warn` rather than propagating them, so the CLI could exit 0 on a broken push. Any fix in this direction must also make flush errors propagate so the command exits non-zero on push failure.
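A hypothetical reshaping of the coordinator flush for approach A might look like the following. The entry shape and signature are assumptions, not the real `flushDatastoreSync` API — the point is only where the propagation would replace the current `warn` log:

```typescript
// Hypothetical approach-A flush: collect per-entry push failures and
// rethrow instead of logging warn, so runCli can map a failed
// teardown push to a non-zero exit.
interface SyncEntry {
  label: string;
  service: { pushChanged(): number };
}

function flushDatastoreSync(entries: SyncEntry[]): void {
  const failures: string[] = [];
  for (const entry of entries) {
    try {
      entry.service.pushChanged();
    } catch (err) {
      failures.push(`${entry.label}: ${(err as Error).message}`);
    }
  }
  if (failures.length > 0) {
    // Propagation point: today this is a warn log; approach A needs it fatal.
    throw new Error(`datastore sync push failed: ${failures.join("; ")}`);
  }
}
```

Collecting failures before throwing keeps the existing behavior of attempting every registered entry, rather than aborting the flush at the first broken datastore.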

B. Coordinator flush skips its push when an explicit push has already run. Add a flag on the coordinator entry (e.g. `pushedExplicitly: true`) that `datastoreSync` sets after a successful `pushSync()`, and that `flushDatastoreSyncNamed` checks before calling `pushChanged`. The lock still gets released on flush. Risk: state has to survive the error path — the catch at `src/cli/mod.ts:991` calls flush even when the explicit push threw, and in that case the flag is `false`, so flush would retry the push. That might actually be desirable on transient network errors, but it reintroduces the exact double-push the fix is trying to eliminate.
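In sketch form, the approach-B handshake would look like this. The field name `pushedExplicitly` comes from the suggestion above; everything else is a hypothetical simplification of the coordinator entry:

```typescript
// Hypothetical approach-B handshake: the explicit push marks the entry,
// and the teardown flush skips entries already pushed.
interface SyncEntry {
  service: { pushChanged(): number };
  pushedExplicitly: boolean;
}

function explicitPush(entry: SyncEntry): number {
  const pushed = entry.service.pushChanged();
  entry.pushedExplicitly = true; // set only after success — a throw leaves
                                 // it false, so the error-path flush retries
  return pushed;
}

function flushEntry(entry: SyncEntry): void {
  if (!entry.pushedExplicitly) {
    entry.service.pushChanged();
  }
  // lock release happens on flush either way
}
```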

Approach A is cleaner architecturally — one push point, consistent with how other commands work. Approach B is more surgical but has more moving parts.

Testing requirements

Whatever approach is taken, regression tests must assert:

  1. Exactly one `pushChanged` call on the success path of `swamp datastore sync --push`.
  2. Exactly one `pushChanged` attempt on the failure path (no duplicate attempts after a throw).
  3. The CLI exits non-zero when the push fails (current behavior only exits non-zero because the first push throws; if the first push is removed, the flush-time push must still produce a non-zero exit on failure).
  4. `swamp datastore sync --pull` and full `swamp datastore sync` behavior is unchanged (the pull path already uses read-only init and shouldn't regress).

Integration tests should exercise this with a mock `DatastoreSyncService` that counts `pushChanged()` calls — `src/cli/repo_context_test.ts` already has the lock lifecycle harness.

Environment

  • swamp CLI (any recent version — this is structural, present since `requireInitializedRepo` started registering sync services with the coordinator)
  • Any `@swamp/s3-datastore`-backed repo
  • Affects all `swamp datastore sync --push` invocations, not just failing ones
  • swamp-club issue #29: "datastore sync --push fails on _catalog.db-wal: catalog SQLite DB lives inside the S3 sync cache" — the primary bug. This issue (double-push) is called out in that report as a "secondary bug" but scoped out of the main fix because the risk profile differs.
02 Bog Flow

Closed

4/8/2026, 8:38:53 PM

No activity in this phase yet.

03 Sludge Pulse

stack72 commented 4/8/2026, 8:38:50 PM

Triage note. Reproduced the double-push end-to-end using a counting SyncableService wired through the real coordinator and the real datastoreSync generator: exactly 2 pushChanged() calls per --push invocation. All the code paths in the original write-up check out, and the maintainer-awareness comment at datastore_sync.ts:61-62 plus the pull-side fix in #983 confirm this is a symmetric gap that was left open when the pull path was fixed.

One extra finding from the repro: registerDatastoreSync also calls pullChanged() at registration time whenever a service is registered (coordinator.ts:154-186), so --push today is actually pull → push → push, not push → push. That raises a question about localMtime drift — the exact failure mode #983's commit message called out for the pull path may be biting --push here too, though I haven't verified it bites in practice.
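The registration-time pull means one --push run is three remote operations, not two. Extending the same simplified stand-ins (all names hypothetical) to include the pull:

```typescript
// Simplified stand-ins showing pull → push → push for one --push run.
let pulls = 0;
let pushes = 0;

const service = {
  pullChanged(): void { pulls++; },
  pushChanged(): void { pushes++; },
};

function registerDatastoreSync(s: typeof service): void {
  s.pullChanged(); // coordinator pulls at registration time
}

function flushDatastoreSync(s: typeof service): void {
  s.pushChanged(); // teardown flush pushes again
}

registerDatastoreSync(service); // requireInitializedRepo side effect
service.pushChanged();          // explicit command push
flushDatastoreSync(service);    // runCli teardown

console.log(`${pulls} pull, ${pushes} pushes`); // → "1 pull, 2 pushes"
```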

Leaning toward deferring this rather than fixing it now. Honest cost/benefit:

Real user impact today is low.

  • swamp datastore sync --push is a manual command, not a hot path — doubled PUT cost and latency are pennies and seconds
  • The misleading log ordering only matters during a failed push, which is rare
  • The _catalog.db-wal race surface goes away once #29 lands
  • No user reports of slowness, wrong data, or S3 bills attached to this

Fix cost is non-trivial for any of the three viable options:

  1. Remove explicit push, let coordinator flush do it — cleanest in theory, but requires (a) making flushDatastoreSyncNamed propagate push failures instead of logging warn (structural change affecting every command that uses the coordinator), and (b) accepting that the DatastoreSyncEvent.completed event loses filesPushed (JSON output contract break). Wider blast radius than it looks.
  2. Flag on coordinator entry (pushedExplicitly) — surgical, but introduces an error-path policy decision: on explicit-push throw, the runCli catch-block flush would either retry (reintroducing the double-push on every failure) or skip (masking a transient recovery opportunity). Either choice is a future argument.
  3. New requireInitializedRepoForPush init variant — follows the exact pattern #983 used for --pull: acquires the lock but doesn't register a sync service with the coordinator. Narrowest change, no coordinator API churn, no JSON break, and as a side effect it also kills the unnecessary pull-on-register. Preferred shape if we do fix it.
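Option 3 could be sketched as follows. The variant name comes from the list above; the context shape and lock mechanics are assumptions about the real `repo_context.ts` internals:

```typescript
// Hypothetical push-only init variant: takes the global lock but never
// registers a sync service, so the teardown flush has nothing to
// re-push and the registration-time pull never happens.
interface RepoContext {
  lockHeld: boolean;
  coordinatorEntries: string[];
}

function requireInitializedRepoForPush(): RepoContext {
  const ctx: RepoContext = { lockHeld: false, coordinatorEntries: [] };
  ctx.lockHeld = true; // --push still needs the global lock
  // Deliberately no registerDatastoreSync(...): the command's own
  // pushChanged() call becomes the only push of the invocation.
  return ctx;
}
```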
