Labels: Bug, Shipped, Swamp Club
Assignees: none

discord-bot poller double-processes events with >1 replica

Opened by stack72 · 4/8/2026 · Shipped 4/9/2026

Summary

The discord-bot service runs 2 replicas in production. Its poller at services/discord-bot/lib/poller.ts has a race condition: both replicas query { status: "pending" } with a non-atomic .find(), call the event handler (which posts to Discord), and then .deleteOne() the document. With two replicas polling in parallel, both can see the same pending docs and both can fire the handler before either deletion lands — producing duplicate Discord posts.

Location

services/discord-bot/lib/poller.ts:52-95

Current code (abbreviated)

```typescript
const docs = await db.collection(EVENTS_COLLECTION)
  .find({
    status: "pending",
    event: { $in: events },
  })
  .sort({ _id: 1 })
  .limit(config.batchSize)
  .toArray();

for (const doc of docs) {
  // ...
  await handler(doc, discord, config.channelId); // posts to Discord
  await db.collection(EVENTS_COLLECTION).deleteOne({ _id: doc._id });
}
```

Neither the read nor the handler path takes ownership of the doc. Both replicas see the same `pending` docs, both call the handler, both delete (the second delete is a no-op) — but the Discord POST has already happened twice.
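The interleaving is reproducible without MongoDB. A minimal in-memory sketch (all names here are illustrative stand-ins, not the real poller's code) where two pollers share one pending doc:

```typescript
// In-memory model of the race: find -> (latency gap) -> handle -> delete.
// Names are illustrative; the real poller talks to MongoDB.
type EventDoc = { _id: number; status: string };

const collection: EventDoc[] = [{ _id: 1, status: "pending" }];
let discordPosts = 0;

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function pollOnce(): Promise<void> {
  // Non-atomic read: no ownership is taken here.
  const docs = collection.filter((d) => d.status === "pending");
  await sleep(10); // handler / network latency window
  for (const doc of docs) {
    discordPosts += 1; // stand-in for handler() posting to Discord
    const i = collection.findIndex((d) => d._id === doc._id);
    if (i !== -1) collection.splice(i, 1); // second replica's delete is a no-op
  }
}

// Two "replicas" poll in parallel; both read the doc before either deletes it.
const raceDone = Promise.all([pollOnce(), pollOnce()]).then(() => discordPosts);
// raceDone resolves to 2: one event, two Discord posts
```

Both pollers run their synchronous `filter` before either hits the latency gap, so each claims nothing and each fires the handler once.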

Impact

Every production event can be posted to the configured Discord channel twice (once per replica). Frequency depends on how often the two replicas' poll windows overlap on the same pending docs. Under low event volume this is probably rare-but-nonzero; under any burst it becomes systematic.

Discovered during a Railway → DigitalOcean migration audit on 2026-04-08. Not reproduced in production yet — flagging based on code review.

Suggested fix

Atomically claim each document before processing:

```typescript
const doc = await db.collection(EVENTS_COLLECTION).findOneAndUpdate(
  { status: "pending", event: { $in: events } },
  { $set: { status: "processing", claimedAt: new Date() } },
  { sort: { _id: 1 }, returnDocument: "after" },
);
```

Then handle and delete on success; reset to `pending` (or mark `failed`) on error. This makes the poller correct for N replicas.
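A sketch of the full loop under that scheme, assuming the same `db` / `handler` / `config` shapes as the current poller (the `processing` status, `claimedAt` field, and error path are suggestions, not existing code):

```typescript
async function pollOnce(): Promise<void> {
  for (let i = 0; i < config.batchSize; i++) {
    // Atomically flip one pending doc to "processing" — only one replica wins.
    const doc = await db.collection(EVENTS_COLLECTION).findOneAndUpdate(
      { status: "pending", event: { $in: events } },
      { $set: { status: "processing", claimedAt: new Date() } },
      { sort: { _id: 1 }, returnDocument: "after" },
    );
    if (!doc) break; // nothing left to claim

    try {
      await handler(doc, discord, config.channelId);
      await db.collection(EVENTS_COLLECTION).deleteOne({ _id: doc._id });
    } catch (err) {
      // Release the claim so another poll (or replica) can retry.
      await db.collection(EVENTS_COLLECTION).updateOne(
        { _id: doc._id },
        { $set: { status: "pending" }, $unset: { claimedAt: "" } },
      );
      throw err;
    }
  }
}
```

A stale-claim sweep (e.g. resetting `processing` docs whose `claimedAt` is older than some timeout) would cover the case where a replica dies mid-handler.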

Alternative (lower-effort interim)

This worker doesn't need HA to cover brief deploy-window outages. Drop the bot to `numReplicas: 1` until the atomic-claim refactor ships: a single replica has no race at all, and redeploys are fast enough that a few seconds of paused event handling is acceptable.

The `cursor.ts` module (`loadCursor` / `saveCursor`) appears to be vestigial — the current poller filters by `status: "pending"` rather than by cursor position. Worth deleting or wiring back in while we're in there.
