Fire and forget: how we decide what to make async

When we started building Reliable Menders, the first architectural decision wasn’t which messaging technology to use. It was simpler than that: for every workflow we designed, we asked one question.

Does the user need to see the result of this action right now?

If the answer was yes, it was a synchronous call. If the answer was no, it became an event.

Why this question matters more than the technology

Most teams reach for async messaging because it sounds modern, or because they read about it being used at a scale they haven’t reached. That’s the wrong reason. Async messaging adds real complexity: you need idempotent consumers, you need to handle retries, you need to reason about ordering. None of that is free.
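
To make the idempotency cost concrete, here is a minimal sketch of a consumer that tolerates redelivery. The names (Event, event_id, NotificationConsumer) are hypothetical and not taken from the Reliable Menders codebase, and the in-memory set stands in for whatever durable store a real consumer would need.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str  # hypothetical: a unique ID assigned by the publisher
    job_id: str
    kind: str


@dataclass
class NotificationConsumer:
    """Skips events it has already processed, so retries are harmless."""
    seen: set = field(default_factory=set)    # assumption: durable in production
    sent: list = field(default_factory=list)  # stands in for real email sends

    def handle(self, event: Event) -> bool:
        # A redelivered event carries the same event_id, so skip it.
        if event.event_id in self.seen:
            return False
        self.sent.append(f"{event.kind} email for job {event.job_id}")
        # Record the ID only after the side effect succeeds, so a crash
        # mid-send leads to a retry rather than a silently dropped email.
        self.seen.add(event.event_id)
        return True


consumer = NotificationConsumer()
evt = Event(event_id="evt-1", job_id="job-42", kind="job_accepted")
consumer.handle(evt)  # first delivery: sends the email
consumer.handle(evt)  # redelivery after a retry: ignored
```

Even this toy version needs a place to remember what it has seen, which is exactly the kind of extra moving part a synchronous call never requires.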

The right reason to go async is simpler: some work genuinely doesn’t need to happen before the user gets a response. Sending a notification email after a job is accepted doesn’t need to block the API response. Running a KYC background check doesn’t need to happen inside the user’s document upload request. These things can happen reliably in the background, and making them synchronous only creates coupling and latency.

The two failure modes

Get this wrong in one direction: everything is synchronous. Your API is now coupled to your notification service, your background check provider, your analytics pipeline. If any of them slows down or fails, so does your API. You’ve built a system that is, at best, as reliable as its least reliable dependency.
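
The coupling cost compounds. If a request has to wait on every dependency, its availability is the product of theirs, which is never better than the weakest one and usually a little worse. A quick sketch of the arithmetic, using made-up availability figures and assuming the dependencies fail independently:

```python
from math import prod

# Hypothetical availabilities, chosen only for illustration.
dependencies = {
    "notification_service": 0.999,
    "background_check_provider": 0.995,
    "analytics_pipeline": 0.99,
}


def chained_availability(availabilities):
    """Availability of a request that needs every dependency to succeed,
    assuming the dependencies fail independently of each other."""
    return prod(availabilities)


combined = chained_availability(dependencies.values())
weakest = min(dependencies.values())
print(f"weakest dependency: {weakest:.4f}, whole request: {combined:.4f}")
```

With these numbers the whole request comes out at roughly 98.4%, below the 99% of the weakest dependency on its own.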

Get it wrong in the other direction: everything is async. Simple read operations become event subscriptions. Things that users genuinely need to see immediately become eventually consistent. You’ve added complexity without solving a real problem, and now you’re explaining to users why their action “worked” but nothing changed on screen.

What we actually made async on Reliable Menders

These went async:

Sending notifications (job accepted, artisan matched, dispute opened)
KYC verification after document upload
Analytics event recording
Dispute routing to resolution handlers

These stayed synchronous:

Creating a job (user needs confirmation immediately)
Artisan authentication
Fetching job status and history
Payment initiation (user needs to see the result)

The line isn’t “user-facing vs background.” It’s “does the user need this result before we respond?” Sometimes user-facing operations are still async in terms of their side effects. Creating a job is synchronous for the user, but the notification to nearby artisans is an event.
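
A minimal sketch of that split, with in-memory stand-ins for the database and the event bus. The function names, the "job.created" topic, and the outbox list are all assumptions made for illustration, not the actual Reliable Menders implementation.

```python
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-ins for the database and the event bus.
jobs = {}
subscribers = defaultdict(list)
outbox = []


def publish(topic, payload):
    """Queue an event; nothing downstream runs inside the request."""
    outbox.append((topic, payload))


def create_job(customer_id, description):
    """Synchronous path: persist the job and return its confirmation."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {
        "customer_id": customer_id,
        "description": description,
        "status": "open",
    }
    publish("job.created", {"job_id": job_id})
    return {"job_id": job_id, "status": "open"}


def drain_outbox():
    """Background path: deliver queued events to whoever subscribed."""
    while outbox:
        topic, payload = outbox.pop(0)
        for handler in subscribers[topic]:
            handler(payload)


notified = []
subscribers["job.created"].append(lambda e: notified.append(e["job_id"]))

response = create_job("cust-1", "Fix leaking tap")
# The user already has their confirmation; artisans are notified afterwards.
drain_outbox()
```

The response is built before any subscriber runs, so a slow or failing notifier cannot delay the confirmation the user is waiting for.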

The messaging substrate choice

Once you know what’s async, you still need to choose how. We use two things:

MQTT for event distribution where the consumer set may grow. New notification channels, new analytics pipelines, new downstream services: they subscribe to the bus without touching the publisher. The job posting service doesn’t know or care how many things consume its events.
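
What lets new consumers attach without touching the publisher is MQTT’s topic filters: a subscriber names a pattern, and the broker delivers every matching message. The sketch below reimplements the matching rule for the two wildcards ("+" for one level, "#" for everything below) purely to show the idea. In practice the broker does this, the sketch ignores the special handling of topics that start with "$", and the topic layout ("jobs/<id>/<event>") is a hypothetical one, not the scheme we use.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a topic name.

    '+' matches exactly one level. '#' matches the rest of the topic,
    including zero further levels, and must be the filter's last level.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


# Hypothetical consumers and the filters they subscribed with.
subscriptions = {
    "email_notifier": "jobs/+/accepted",
    "analytics": "jobs/#",
}


def consumers_for(topic):
    """List the consumers whose filter matches a published topic."""
    return sorted(
        name for name, flt in subscriptions.items() if topic_matches(flt, topic)
    )


print(consumers_for("jobs/42/accepted"))
print(consumers_for("jobs/42/disputed"))
```

Adding a third consumer means adding one more subscription; the code that publishes to "jobs/42/accepted" stays exactly as it was.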

Redis pub/sub and streams for coordination between service processes where latency matters more than durability. It is lightweight and fast, and it fits cases where losing an event on restart is acceptable because the state is recoverable from the database.
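
“Recoverable from the database” can be as simple as reloading current state on startup instead of replaying the events that were missed. A minimal sketch, assuming a hypothetical in-memory job status cache that is normally kept current by events; the dictionary stands in for the real database.

```python
# Hypothetical stand-in for the database, which remains the source of truth.
database = {"job-1": "open", "job-2": "matched", "job-3": "completed"}


class JobStatusCache:
    """In-memory view normally kept current by low-latency events."""

    def __init__(self):
        self.status = {}

    def on_event(self, job_id, new_status):
        # Fast path: apply each status change as its event arrives.
        self.status[job_id] = new_status

    def rebuild(self, db):
        # After a restart, events sent while we were down are gone,
        # so reload the current state instead of trying to replay them.
        self.status = dict(db)


cache = JobStatusCache()
cache.on_event("job-1", "open")

# Simulate a restart during which later events were missed.
cache = JobStatusCache()
cache.rebuild(database)
```

If a consumer cannot be rebuilt this way, its events belong on the durable path instead.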

These are different tools for different event classes. Treating them the same would mean either over-engineering the low-latency cases or under-engineering the durable ones.

The rule of thumb

If you’re not sure whether something should be async: make it synchronous first. Synchronous code is simpler to reason about, simpler to debug, and simpler to change. Add async only when you have a clear reason: the user doesn’t need the result immediately, or the work is slow enough that making them wait is worse than eventual consistency.

Most systems need less async than engineers assume. A few workflows genuinely need it. Knowing the difference is the architectural decision.

If you’re working through this decision on a system you’re building, book a 30-minute intro call and let’s talk through the flows. Getting the async boundary wrong in either direction creates expensive rework.
