discord-bot poller double-processes events with >1 replica
Opened by stack72 · 4/8/2026 · Shipped 4/9/2026
Summary
The discord-bot service runs 2 replicas in production. Its poller at services/discord-bot/lib/poller.ts has a race condition: both replicas query { status: "pending" } with a non-atomic .find(), call the event handler (which posts to Discord), and then .deleteOne() the document. With two replicas polling in parallel, both can see the same pending docs and both can fire the handler before either deletion lands — producing duplicate Discord posts.
Location
services/discord-bot/lib/poller.ts:52-95
Current code (abbreviated)
```typescript
const docs = await db.collection(EVENTS_COLLECTION)
  .find({
    status: "pending",
    event: { $in: events },
  })
  .sort({ _id: 1 })
  .limit(config.batchSize)
  .toArray();

for (const doc of docs) {
  // ...
  await handler(doc, discord, config.channelId); // posts to Discord
  await db.collection(EVENTS_COLLECTION).deleteOne({ _id: doc._id });
}
```

Neither the read nor the handler path takes ownership of the doc. Both replicas see the same `pending` rows, both call the handler, and both delete (the second delete is a no-op), but by then the Discord POST has already happened twice.
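The overlapping window is easy to see in a deterministic, in-memory sketch. Everything below is illustrative (a `Map` stands in for the events collection; two synchronous snapshots stand in for the replicas' overlapping `.find()` calls, and `claim` for an atomic check-and-set):

```typescript
type Doc = { id: number; status: "pending" | "processing" };

// Non-atomic read: both replicas snapshot the pending docs before either deletes.
function buggyRun(): number[] {
  const store = new Map<number, Doc>([[1, { id: 1, status: "pending" }]]);
  const posts: number[] = [];
  const snapA = [...store.values()].filter((d) => d.status === "pending"); // replica A's .find()
  const snapB = [...store.values()].filter((d) => d.status === "pending"); // replica B's .find()
  for (const d of snapA) { posts.push(d.id); store.delete(d.id); }
  for (const d of snapB) { posts.push(d.id); store.delete(d.id); } // duplicate post; the delete is a no-op
  return posts;
}

// Atomic claim: check-and-set in one step, the in-memory analogue of findOneAndUpdate.
function claim(store: Map<number, Doc>): Doc | null {
  for (const d of store.values())
    if (d.status === "pending") { d.status = "processing"; return d; }
  return null;
}

function fixedRun(): number[] {
  const store = new Map<number, Doc>([[1, { id: 1, status: "pending" }]]);
  const posts: number[] = [];
  const a = claim(store); // replica A wins the claim
  const b = claim(store); // replica B sees nothing pending
  for (const d of [a, b]) if (d) { posts.push(d.id); store.delete(d.id); }
  return posts;
}

console.log(buggyRun()); // [ 1, 1 ]
console.log(fixedRun()); // [ 1 ]
```

The same interleaving plays out in production whenever replica B's `.find()` lands between replica A's `.find()` and its `deleteOne`.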
Impact
Every production event can be posted to the configured Discord channel twice (once per replica). Frequency depends on how often the two replicas' poll windows overlap on the same pending docs. Under low event volume this is probably rare-but-nonzero; under any burst it becomes systematic.
Discovered during a Railway → DigitalOcean migration audit on 2026-04-08. Not reproduced in production yet — flagging based on code review.
Suggested fix
Atomically claim each document before processing:
```typescript
const doc = await db.collection(EVENTS_COLLECTION).findOneAndUpdate(
  { status: "pending", event: { $in: events } },
  { $set: { status: "processing", claimedAt: new Date() } },
  { sort: { _id: 1 }, returnDocument: "after" },
);
```
Then handle and delete on success; reset to `pending` (or mark `failed`) on error. This makes the poller correct for N replicas.
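Sketched end to end, the claim/handle/delete loop could look like the following. This is written against a minimal interface so it runs without a live MongoDB; `claimPending` stands in for the `findOneAndUpdate` above, `release` for an `updateOne` that resets the status, and names like `drainOnce` are assumptions, not the real poller's API:

```typescript
type EventDoc = { _id: number; event: string; status: string; claimedAt?: Date };

// Minimal slice of the collection operations the fixed poller needs.
interface Events {
  claimPending(): Promise<EventDoc | null>; // atomic pending -> processing
  deleteOne(id: number): Promise<void>;     // remove after successful handling
  release(id: number): Promise<void>;       // reset to pending on failure
}

// Claim one doc at a time until nothing is pending; returns the handled count.
async function drainOnce(
  events: Events,
  handler: (d: EventDoc) => Promise<void>,
): Promise<number> {
  let handled = 0;
  for (;;) {
    const doc = await events.claimPending();
    if (!doc) return handled;
    try {
      await handler(doc);              // posts to Discord in the real poller
      await events.deleteOne(doc._id); // delete only after the handler succeeds
      handled++;
    } catch {
      await events.release(doc._id);   // put it back for retry (or mark failed)
    }
  }
}
```

Because a doc is only ever `pending` or owned by exactly one claimer, N replicas running this loop each handle a disjoint set of events.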
Alternative (lower-effort interim)
These workers don't need HA to ride out brief deploy-window outages. Drop the bot to `numReplicas: 1` until the atomic-claim refactor ships: a single replica cannot race with itself, and redeploys are fast enough that a few seconds of paused event handling is fine.
Related
The `cursor.ts` module (`loadCursor` / `saveCursor`) appears to be vestigial — the current poller filters by `status: "pending"` rather than by cursor position. Worth deleting or wiring back in while we're in there.