
Background Job Queue

A reference about the technology class that dispatches discrete units of work — "jobs" — to worker processes for asynchronous execution. Exemplified by Sidekiq (Ruby + Redis), Celery (Python + Redis/RabbitMQ), BullMQ (Node.js + Redis), Resque (older Ruby + Redis), Que and GoodJob (Ruby + Postgres), river (Go + Postgres), AWS SQS (Simple Queue Service), Google Cloud Tasks, Azure Service Bus Queue, and Hangfire (.NET + SQL Server). The reader should finish understanding the design space deeply enough to apply it to commerce checkouts, media transcoding pipelines, payment retry budgets, notification fan-out, ML (Machine Learning) batch inference, and webhook delivery — and defend a pick against the obvious alternatives.


§1. What background job queues ARE

A background job queue is a substrate for dispatching work-to-be-performed asynchronously: a producer (typically a web request handler or another job) enqueues a job descriptor, a pool of worker processes pulls jobs off the queue and executes them, and the broker tracks the lifecycle of each job from enqueue through completion or terminal failure. The unit of currency is a job — a typed descriptor of work to do, with arguments — not a generic event or a stream record. The broker's job is to deliver each job to exactly one worker (best effort), make sure crashed workers don't lose work, and let jobs schedule, retry, and dead-letter cleanly. Mental model: an HTTP server with deferred execution — the request becomes durable, executes minutes or hours later, and the result lands in a database.

The category overlaps in popular vocabulary with two adjacent primitives. Disambiguating them is the entire first half of choosing the right tool.

Job queue vs message queue / distributed log. Both move bytes from one place to another asynchronously. The distinction is who owns delivery state. A message queue / log (Kafka, Pulsar, Kinesis) transports events — facts that happened — for many consumers, replayable, retained for days. Each consumer maintains its own read position; the broker doesn't track per-consumer per-message state. The producer publishes once, ten different downstream pipelines each get their own copy via offset tracking. A job queue (Sidekiq, Celery, SQS) transports work-to-be-performed — instructions to execute — for one worker, non-replayable, deleted on completion. The broker tracks per-job state: enqueued, in-flight (claimed by a worker), completed, failed, retrying, dead-lettered. There is no concept of "fan-out to N consumer groups" or "replay last 7 days of jobs" — once a job runs successfully and is acked, it is gone. The mental separation: events are nouns (the past), jobs are verbs (the future). A Kafka topic of order_created events is a log of facts. A Sidekiq queue of SendOrderConfirmation jobs is a list of work to do. The same business operation often produces both — the order service writes order_created to Kafka (audit, analytics, downstream consumers) and enqueues SendOrderConfirmation to Sidekiq (one-time side effect with retries). Use both for what they are.

Job queue vs workflow engine. Both handle "do something later, retry on failure." The distinction is the shape of the work. A workflow engine (Temporal, Cadence, AWS Step Functions, Airflow) owns multi-step, long-running, stateful processes — a 30-day KYC (Know Your Customer) onboarding workflow that calls four services, waits for human approval, branches on results, and durably remembers what it has done. State is persisted between steps; replay reconstructs progress. A job queue owns short, atomic, stateless units of work — "send this email," "resize this image," "retry this webhook." A job is a one-shot function call with retries on failure; it doesn't remember intermediate state. If your "job" calls three services with rollback logic between them, you've outgrown a job queue and you want a workflow engine. The crossover comes in volume: workflow engines run ~10k-100k workflows/sec at the high end; job queues run millions of jobs/sec. For high-volume short atomic work, job queues win on throughput. For low-volume multi-step stateful work, workflow engines win on correctness. The boundary is around "does the work fit in one function under a few minutes."

Job queue vs cron. Cron is scheduling — "run this at midnight" — with no concept of distribution, retry, durability, or backlog. A cron task is a process invocation from a clock daemon. A job queue is dispatching and execution — the queue holds pending work, workers consume, the broker tracks state. Most modern systems combine: cron-style triggers (Sidekiq Cron, Celery beat, Hangfire RecurringJob) enqueue jobs into the job queue, and the queue handles the actual execution and retry. You almost never want raw cron in a multi-server backend — cron has no way to ensure exactly one of your three application servers runs the task, and it has no retry. Cron triggers + a job queue is the canonical pattern.

Job queue vs synchronous request execution. This is the comparison the job queue is built to win — see §9.

Where the job queue sits in the stack:

   ┌──────────────────────────────────────────────────────┐
   │   Web / API tier (request handlers)                   │
   │   Enqueue jobs, return immediately                    │
   └──────────────────────────────────────────────────────┘
                              │
                              ▼ enqueue
   ┌──────────────────────────────────────────────────────┐
   │  ★ BACKGROUND JOB QUEUE (this layer)                  │
   │     - durable job storage                             │
   │     - delayed / scheduled jobs                        │
   │     - priority queues, DLQ (Dead-Letter Queue)        │
   │     - visibility timeout / heartbeat                  │
   │     - retry-with-backoff machinery                    │
   └──────────────────────────────────────────────────────┘
                              │
                              ▼ dequeue
   ┌──────────────────────────────────────────────────────┐
   │   Worker fleet (executes job bodies)                  │
   │   Sidekiq workers, Celery workers, SQS consumers, ... │
   └──────────────────────────────────────────────────────┘
                              │
                              ▼ side effects
   External systems: Postgres, Redis, Stripe API, S3, SES,
                     SMTP, Twilio, Push Notification gateways

What job queues are NOT good for. (1) Strict global ordering at high throughput — FIFO (First-In-First-Out) queues sacrifice throughput for order; per-group ordering caps at ~300 TPS (Transactions Per Second) per group in SQS FIFO. (2) Sub-millisecond latency dispatch — a job queue adds ~1-5 ms enqueue + dequeue overhead, plus polling intervals; not suitable for "respond in 10 ms" paths. (3) Replayable history of past work — once a job is acked, it is deleted; you cannot ask "show me all jobs that ran yesterday" without a separate audit log. (4) Long-running stateful workflows — see workflow-engine distinction above. (5) Streaming continuous data — a Kafka log is the right shape for a click-stream firehose; a job queue is wrong because jobs aren't replayable and per-message state machine overhead doesn't pay off at sustained millions/sec. (6) Exactly-once external side effects without idempotency — see §2.


§2. Inherent guarantees — what's provided vs what must be layered on

Provided by design.

  1. At-least-once delivery. Every accepted job runs at least once. If the worker crashes mid-job, the broker returns the job to the queue (via visibility timeout or heartbeat loss); another worker re-runs it. The contract is: your job body must be idempotent. This is non-negotiable. Sidekiq, Celery, BullMQ, SQS, Cloud Tasks, Hangfire — all of them at-least-once by default.

  2. Priority and ordering within a queue. Most job queues support multiple named queues with worker-side priority weights (Sidekiq's -q critical,5 -q default,3 -q low,1 flags weight the pulls). Within a single queue, default ordering is FIFO (a Redis LIST behaves as a FIFO queue when producers LPUSH at one end and workers BRPOP from the other).

  3. Delayed execution. Schedule a job to run at a future timestamp. "Send reminder in 24h," "retry webhook in 30 seconds." Implemented under the hood as a sorted set keyed by execute-at timestamp (Sidekiq, BullMQ) or as a row with run_at (Que, river, Hangfire), with a scheduler that polls for due jobs.

  4. Retry with backoff. When a job raises, the broker re-enqueues it with a delay computed from an exponential backoff curve (Sidekiq default: roughly (retry_count^4) + 15 + (random_jitter * 30) seconds, with a default budget of 25 retries spread over ~21 days). Configurable per-job.

  5. Dead-letter queue (DLQ) after N attempts. Jobs that exhaust their retry budget land in a DLQ for manual inspection — preventing a single poison-pill job from looping forever and starving the queue.

NOT provided. Must be layered on.

  1. Exactly-once delivery. No job queue gives you exactly-once across worker crashes and broker failures. Even SQS FIFO's "exactly-once messages within 5 minutes" applies only to deduplication of the enqueue event, not the processing — if a worker claims a FIFO job, partially processes it, then crashes before acking, the visibility timeout returns it and another worker re-processes. You must layer idempotency at the job-execution level: an idempotency key checked against a Postgres table, a conditional UPDATE that no-ops on retry, or an upsert with deterministic primary key.

  2. Strict total ordering across queues. Within one queue and one consumer, FIFO holds. Across multiple queues, across multiple workers per queue, or across retries (a retry of job 2 may execute after job 3 was already processed), there is no global order. Ordering requirements must be encoded into a partition key + single-consumer-per-key model (SQS FIFO's MessageGroupId).

  3. Transactional execution with external state. When a job updates Postgres AND sends an email, those two side effects are not atomic. If the email send fails after the DB update, the retry will update the DB twice (unless idempotent) and the user gets two emails. The transactional outbox pattern (§13) is the standard fix: write to DB + write to outbox table in one transaction, separate worker tails the outbox.

  4. Exactly-once side effects to external APIs. Calling Stripe in a job + worker crash + retry = potentially two charges. Stripe's Idempotency-Key header (the canonical example, §15) is how you bridge this: the client (your job) generates a deterministic key per job-attempt-group; Stripe dedupes server-side. Without this, no job queue can give you "charge exactly once."

  5. Backpressure into the producer. When workers fall behind, the queue grows unboundedly. Most job queues don't push back on the producer; the producer just keeps enqueueing. If you don't alert on queue depth and limit producer rate, the queue can swallow gigabytes of jobs and the broker degrades. Backpressure is the system designer's job.

  6. Job-level observability beyond basic metrics. Most queues track in-flight, retry count, DLQ size. End-to-end job tracing — "this job for user 42 was enqueued at T0, picked up at T1, completed at T2, called Stripe at T1.5" — must be added via OpenTelemetry, custom metrics, or structured logging.

The contract: the broker owns durability, retry machinery, scheduling, and DLQ; the application owns job-body idempotency, external-side-effect dedup, and queue-depth backpressure.
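
A minimal sketch of the idempotency-key layer described above, assuming an in-memory SQLite database as a stand-in for the application's Postgres. The table names (processed_jobs, emails_sent) and the function send_confirmation are hypothetical. The side effect here is a database row, so it can share a transaction with the key; an external call such as an email or a charge would instead need the provider's own dedup key.

```python
import sqlite3


def make_db():
    """In-memory database standing in for the application's Postgres."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_jobs (idempotency_key TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE emails_sent (order_id INTEGER)")
    db.commit()
    return db


def send_confirmation(db, order_id):
    """Run the job body at most once per key; return True if work was done."""
    key = "send_confirmation:%d" % order_id  # deterministic per logical job
    try:
        with db:  # one transaction: claim the key and record the side effect
            db.execute("INSERT INTO processed_jobs VALUES (?)", (key,))
            db.execute("INSERT INTO emails_sent VALUES (?)", (order_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the key already exists, so no-op


db = make_db()
first = send_confirmation(db, 42)   # original delivery
second = send_confirmation(db, 42)  # redelivery after a simulated crash
sent = db.execute("SELECT COUNT(*) FROM emails_sent").fetchone()[0]
```

The second call hits the primary-key constraint and does nothing, which is the behavior an at-least-once broker requires of every job body.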


§3. The design space

Four implementation backends differentiate the named systems, with single-region vs multi-region deployment as a further axis.

Backend A — Redis-backed broker (Sidekiq, Celery on Redis, BullMQ, Resque). Jobs are JSON blobs in Redis data structures. The pending queue is a Redis LIST (LPUSH to enqueue, BRPOP to claim — blocking pop is what makes worker polling efficient). Delayed and scheduled jobs are in a sorted set (ZADD with score = unix epoch of execute-at; ZRANGEBYSCORE polled by a scheduler thread). Workers are external processes (Ruby/Python/Node) that connect to Redis and run business logic. Strengths: sub-millisecond enqueue + dequeue latency (Redis is in-memory), millions of jobs/sec sustainable on one well-tuned Redis, mature ecosystems (Sidekiq has ~13 years of patterns). Costs: Redis is your durability story — Redis persistence (RDB snapshots + AOF) is good but not as bulletproof as a transactional DB; Redis memory caps total queue depth; single-instance Redis is a single point of failure (Sentinel or Cluster mitigates). For most rails-shaped apps, this is the default and correct choice.

Backend B — Database-backed broker (Que, GoodJob, river, Hangfire SQL Server, Skiff). Jobs are rows in a database table (typically Postgres or MySQL 8). Workers claim rows atomically using SELECT ... FOR UPDATE SKIP LOCKED LIMIT N. Strengths: transactional consistency with your application data — you can write to Postgres + enqueue job in one transaction (the holy grail of dual-write avoidance, §13); no separate broker to operate; no separate durability story. Costs: lower peak throughput (~10k-50k jobs/sec on Postgres, vs ~1M/sec on Redis), more load on the application database. This approach became competitive once Postgres 9.5 (released January 2016) added SKIP LOCKED. River (Go, 2023+) is the modern flag-bearer; GoodJob (Ruby, 2020) is the Rails-friendly variant; Que (Ruby, 2014) was first.

Backend C — Managed broker (SQS, Cloud Tasks, Azure Service Bus Queue). Cloud-provider-managed queue with no servers to run. Producers POST messages; workers poll. Broker handles durability, scaling, retries, DLQ. Strengths: zero operational overhead, transparent multi-region (SQS is regional but the abstraction is clean), pay-per-use ($0.40 per million SQS requests as of 2024), virtually unlimited throughput in Standard mode. Costs: vendor lock-in, ~20-100 ms minimum latency (vs ~1 ms Redis), per-job cost adds up at scale. AWS-heavy shops default here.

Backend D — In-memory (BullMQ in-memory mode for dev, NATS). Pure RAM, no persistence. Suitable only for dev or for genuinely ephemeral work where loss is acceptable.

Comparison table

Dimension | Sidekiq (Redis) | Celery (Redis/RMQ) | BullMQ (Redis) | Que / GoodJob / river (PG) | AWS SQS | Cloud Tasks | Hangfire (SQL Server)
----------|-----------------|--------------------|----------------|----------------------------|---------|-------------|----------------------
Language ecosystem | Ruby | Python | Node.js / TypeScript | Ruby (Que/GoodJob), Go (river) | Any (HTTP / SDK) | Any (HTTP) | .NET
Broker storage | Redis LIST + Sorted Set | Redis / RabbitMQ / SQS | Redis Streams + ZSET | Postgres rows | Managed | Managed | SQL Server rows
Persistence | Redis RDB + AOF | Broker-dependent | Redis | DB durability | 11-nines | 11-nines | SQL Server
Throughput / instance | ~10k-50k jobs/sec | ~5k-20k jobs/sec | ~10k-50k jobs/sec | ~5k-20k jobs/sec | ~100k+ Standard (per acct) | ~500 / sec / queue | ~5k-20k jobs/sec
Enqueue latency | < 1 ms | 1-2 ms | < 1 ms | 1-5 ms | 20-50 ms | 50-100 ms | 1-5 ms
Delayed jobs | Sorted set scheduler | beat (separate process) | Sorted set | run_at column | DelaySeconds (≤ 15 min) | scheduleTime (any future) | Native
Recurring / cron | Sidekiq Cron (gem) | Celery beat | bullmq Repeat | Quartz-style schedulers | EventBridge → SQS | Cloud Scheduler → Tasks | Native RecurringJob
Priority | Per-queue weights | Per-queue weights | Per-queue priority | Per-queue / priority column | Multiple queues | Multiple queues | Multiple queues
DLQ | Sidekiq Pro (or manual) | Standard | Standard (failed events) | Manual table | RedrivePolicy native | Native | Native
FIFO | No (Sidekiq Enterprise has Strict Order) | No | No | Yes (single worker per group) | FIFO queue variant | No | No
Visibility timeout | Heartbeat + reservation | Reservation | Lock with TTL | locked_at + heartbeat | 30s default, configurable | 10 min default, ≤ 30 min | Heartbeat
Best fit | Rails apps | Python apps | Node apps | "I already use Postgres" | AWS-native, no-server | GCP-native, HTTP targets | .NET apps

The instinct is not "Sidekiq always wins." Pick by ecosystem (Ruby? Sidekiq. Python? Celery. .NET? Hangfire.) + scale + durability requirements + operational appetite. A Rails shop running 10k jobs/sec wants Sidekiq on Redis. A Postgres-centric Go service running 5k jobs/sec wants river. A team that doesn't want to run any infrastructure wants SQS or Cloud Tasks. A team running monthly cron-style data exports wants Hangfire's recurring jobs.


§4. Byte-level mechanics — the broker on disk and in memory

This is the section where the depth lives. Three different storage models give three different byte-level realizations. We cover all three: Redis-backed (Sidekiq is canonical), database-backed (Postgres SKIP LOCKED pattern), and SQS internals.

4a. Sidekiq + Redis — the LIST and sorted set choreography

Sidekiq's broker is Redis. A Sidekiq deployment with three queues (critical, default, low) maps to Redis keys like this:

Redis key                          | Type         | Contents
-----------------------------------|--------------|---------------------------------
queue:critical                     | LIST         | JSON job blobs, FIFO order
queue:default                      | LIST         | JSON job blobs
queue:low                          | LIST         | JSON job blobs
queues                             | SET          | { "critical", "default", "low" }
schedule                           | ZSET         | (score=execute_at_unix, member=JSON job)
retry                              | ZSET         | (score=retry_at_unix, member=JSON job)
dead                               | ZSET         | (score=died_at_unix, member=JSON job)
processes                          | SET          | { "hostname:pid:nonce" }
hostname:pid:nonce                 | HASH         | worker process heartbeat metadata
hostname:pid:nonce:workers         | HASH         | jid -> JSON of currently running job
stat:processed                     | STRING       | counter
stat:failed                        | STRING       | counter

A job blob is a small JSON document:

{
  "class": "SendOrderConfirmationEmail",
  "args": [12345, "user@example.com"],
  "queue": "default",
  "jid": "b4a577edbccf1d805744efa9",
  "retry": 25,
  "created_at": 1716393600.5,
  "enqueued_at": 1716393600.6
}

Enqueue. Sidekiq::Client.push(class: "...", args: [...], queue: "default") becomes a single Redis LPUSH queue:default '{...}', which takes ~50 microseconds inside Redis. The list grows at the head; workers pop from the tail. Redis's default maxmemory is 0 (no limit), so an unmonitored backlog keeps growing until the OOM (Out Of Memory) killer arrives.

Dequeue with blocking pop. The worker process runs a loop:

while (running) {
  job_json = BRPOP queue:critical queue:default queue:low 2
  if (job_json) {
    record_in_progress(job_json)
    execute(job_json)
    remove_from_in_progress(job_json)
    INCR stat:processed
  }
}

BRPOP queue1 queue2 queue3 2 is a Redis primitive: "block for up to 2 seconds; return the first available element from the first listed non-empty queue." It scans queues left-to-right (so critical is checked before default), pops one item, returns it. If all queues are empty, it blocks until one gets a LPUSH, then immediately returns the new item. No polling — the broker is push-driven. This is the entire reason a Redis-backed queue can sustain millions of jobs/sec with low CPU: workers are not spinning on poll loops; they are parked in a blocking Redis call.
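
The left-to-right scan and the head/tail FIFO behavior can be modeled without a Redis server. This is a sketch only: FakeRedisLists and rpop_first are hypothetical names, there is no blocking or networking, and a deque stands in for a Redis LIST.

```python
from collections import deque


class FakeRedisLists:
    """Tiny in-memory stand-in for Redis LISTs (no blocking, no network)."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # LPUSH adds at the head (left end) of the list.
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpop_first(self, *keys):
        # Like BRPOP minus the blocking: scan keys left to right and pop
        # from the tail (right end) of the first non-empty list.
        for key in keys:
            items = self.lists.get(key)
            if items:
                return key, items.pop()
        return None


r = FakeRedisLists()
r.lpush("queue:default", "job-1")
r.lpush("queue:default", "job-2")
r.lpush("queue:critical", "job-3")

order = [
    r.rpop_first("queue:critical", "queue:default", "queue:low")
    for _ in range(4)
]
```

job-3 comes out first even though it was enqueued last, because queue:critical is listed first; job-1 and job-2 then come out in enqueue order, and the fourth call finds every list empty.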

Priority across queues. Strict priority (always drain critical before touching default) can starve lower queues if critical is always non-empty. Sidekiq's mitigation is weighted random pull: -q critical,5 -q default,3 -q low,1 means that at each pull a queue is picked at random in proportion to its weight, so over time critical gets 5/9 of the pulls, default gets 3/9, and low gets 1/9. Strict priority is opt-in.
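
One way to realize weighted pulls, sketched under assumptions: repeat each queue name by its weight, shuffle, and de-duplicate to get the order in which queues are checked for a single fetch. The function fetch_order is hypothetical and is not claimed to be Sidekiq's exact implementation; the seed exists only to make the sketch repeatable.

```python
import random


def fetch_order(weights, rng):
    """Return the queue check order for one fetch, weighted by queue weight."""
    # Repeat each queue name `weight` times, shuffle, keep first sightings.
    expanded = [name for name, w in weights.items() for _ in range(w)]
    rng.shuffle(expanded)
    return list(dict.fromkeys(expanded))  # de-duplicate, preserving order


weights = {"critical": 5, "default": 3, "low": 1}
rng = random.Random(0)
trials = 9000

first_picks = [fetch_order(weights, rng)[0] for _ in range(trials)]
critical_share = first_picks.count("critical") / trials  # expected near 5/9
low_share = first_picks.count("low") / trials            # expected near 1/9
```

Every queue still appears in each check order, so a low-weight queue is never starved outright; it is simply checked first less often.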

Delayed jobs and scheduled jobs. A job to run "in 30 seconds" is added to the schedule sorted set with score = Time.now.to_f + 30. Redis sorted sets are skip-list backed (since Redis 2.0); ZADD schedule 1716393630.0 '{...}' is O(log N). Sidekiq has a background Poller thread in each worker process that runs every ~5 seconds:

loop every 5 seconds {
  due = ZRANGEBYSCORE schedule -inf now LIMIT 0 100
  for each job in due:
    LPUSH queue:default job
    ZREM schedule job
}

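(The real implementation has to stop two pollers from promoting the same job. Sidekiq pushes a job onto its queue only when the calling poller is the one that actually removed it from the sorted set, and recent versions perform that removal atomically in a Lua script, so a due job is not enqueued twice.) The sorted set + poller is the canonical delayed-job implementation. Same structure for retry (retry-after-N-seconds) and for cron-triggered jobs.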
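
The poller loop can be sketched in memory. Note the swapped-in data structure: a binary heap (heapq) plays the role of the Redis sorted set here, which preserves the "smallest score first" behavior but is not a skip list. DelayedJobs, schedule_at, and promote_due are hypothetical names, and the clock is passed in explicitly so the example is deterministic.

```python
import heapq


class DelayedJobs:
    """In-memory model of the schedule set plus one ready queue."""

    def __init__(self):
        self.schedule = []  # min-heap of (run_at, job)
        self.ready = []     # stands in for queue:default

    def schedule_at(self, run_at, job):
        # Plays the role of ZADD schedule <run_at> <job>.
        heapq.heappush(self.schedule, (run_at, job))

    def promote_due(self, now, limit=100):
        """Move up to `limit` jobs with run_at <= now onto the ready queue."""
        moved = 0
        while self.schedule and self.schedule[0][0] <= now and moved < limit:
            _, job = heapq.heappop(self.schedule)
            self.ready.append(job)
            moved += 1
        return moved


jobs = DelayedJobs()
jobs.schedule_at(130.0, "reminder")
jobs.schedule_at(100.0, "retry-webhook")

moved_early = jobs.promote_due(now=105.0)  # only retry-webhook is due
ready_early = list(jobs.ready)
moved_late = jobs.promote_due(now=200.0)   # reminder becomes due
```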

Why a sorted set, not a sorted list? A sorted list (Redis LIST sorted by score) doesn't exist as a primitive; you'd insert in order via LINSERT, which is O(N) per insert. A sorted set is backed by a skip list with O(log N) insertions and O(log N + M) range queries (M = elements in range). At 100k scheduled jobs, an insert costs on the order of log2(100,000) ≈ 17 comparisons, and the range query for "due now" is essentially instant.

Retry with backoff. When a job raises an exception, the worker catches it, increments the retry counter in the JSON, computes the backoff delay ((count^4) + 15 + jitter), and does:

ZADD retry (now + backoff) '{...updated job json with retry_count++ ...}'

The retry sorted set is polled the same way as schedule. After ~25 attempts (~21 days of backoff), Sidekiq moves the job to the dead set — the DLQ — via:

ZADD dead now '{...}'

Operators inspect dead via the Sidekiq Web UI, fix the root cause, and either delete or replay the job.
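
A worked check of the "~21 days" figure, using the backoff formula as stated in the text. The jitter term is an assumption here (uniform in [0, 30) seconds); the total is computed without jitter, so the real figure lands slightly higher.

```python
import random


def retry_delay(retry_count, rng):
    """Delay in seconds before the next attempt, per the formula in the text."""
    jitter = rng.random() * 30  # assumption: jitter is uniform in [0, 30)
    return retry_count ** 4 + 15 + jitter


def base_total(max_retries=25):
    """Total backoff across all retries, ignoring jitter."""
    return sum(n ** 4 + 15 for n in range(max_retries))


first_delay = retry_delay(0, random.Random(1))  # between 15 and 45 seconds
total_seconds = base_total()                    # 1,763,395 seconds
total_days = total_seconds / 86400              # about 20.4 days
```

The fourth-power term dominates quickly: the last retry alone waits 24^4 = 331,776 seconds, close to four days.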

In-progress tracking — "I claimed this job, I'm working on it." When a worker pops a job, it immediately writes to its own heartbeat hash:

HSET hostname:pid:nonce:workers <jid> '{"payload": <job>, "run_at": <ts>}'

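The worker process also updates hostname:pid:nonce with a heartbeat timestamp every 5 seconds. If a worker crashes (no heartbeat for >60 seconds), a reliable-fetch reaper detects the dead worker and re-enqueues its in-flight jobs back to their respective queues. This is the at-least-once recovery mechanism. Without it, a crashed worker silently swallows its in-flight job, and that is exactly the behavior of open-source Sidekiq's basic BRPOP fetch: the job leaves Redis at claim time and a hard crash loses it. Crash recovery of in-flight jobs requires a reliable fetch strategy such as Sidekiq Pro's super_fetch.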

Memory cost. A job blob is typically ~200-500 bytes JSON. A queue depth of 1 million pending jobs occupies ~500 MB of Redis. The schedule sorted set scales the same. Sidekiq's recommended Redis sizing rule: provision for peak queue depth × 3 (working set + retry + scheduled).

Failure mode: Redis is down. Sidekiq client raises Redis::CannotConnectError on enqueue. The application MUST handle this — either fail the request, buffer to a local fallback, or accept loss. Production setups run Redis Sentinel (auto-failover from primary to replica) or Redis Cluster. The single point of failure of the Redis instance is the dominant operational risk of Redis-backed queues.

4b. Que / river / GoodJob — Postgres SKIP LOCKED

The DB-backed approach treats jobs as rows.

CREATE TABLE jobs (
  id           bigserial PRIMARY KEY,
  queue        text NOT NULL,
  job_class    text NOT NULL,
  args         jsonb,
  priority     smallint DEFAULT 100,
  run_at       timestamptz DEFAULT now(),
  attempts     smallint DEFAULT 0,
  max_attempts smallint DEFAULT 25,
  last_error   text,
  locked_at    timestamptz,
  locked_by    text,
  created_at   timestamptz DEFAULT now()
);

CREATE INDEX idx_jobs_claim
  ON jobs (queue, priority, run_at)
  WHERE locked_at IS NULL;

Enqueue. INSERT INTO jobs (queue, job_class, args, run_at) VALUES (...) RETURNING id; — a single SQL statement. In the same transaction as your business INSERT/UPDATE if needed (this is the killer feature — see outbox pattern §13).
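
A sketch of the same-transaction enqueue, using an in-memory SQLite database in place of Postgres so it runs anywhere. The orders table, the trimmed-down jobs table, and the create_order function are illustrative assumptions. The point is that a failure before commit removes the order and its job together.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER)")
db.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, queue TEXT,"
    " job_class TEXT, args TEXT)"
)
db.commit()


def create_order(order_id, total_cents, fail=False):
    """Insert the order and its follow-up job in a single transaction."""
    with db:  # commits on success, rolls back if the body raises
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        db.execute(
            "INSERT INTO jobs (queue, job_class, args) VALUES (?, ?, ?)",
            ("default", "SendOrderConfirmationEmail", json.dumps([order_id])),
        )
        if fail:
            raise RuntimeError("simulated failure before commit")


create_order(1, 4999)
try:
    create_order(2, 1500, fail=True)
except RuntimeError:
    pass  # both inserts for order 2 were rolled back together

order_ids = [row[0] for row in db.execute("SELECT id FROM orders")]
job_args = [json.loads(row[0]) for row in db.execute("SELECT args FROM jobs")]
```

With a Redis-backed broker the equivalent code has two separate writes, and a crash between them leaves either an order with no job or a job with no order.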

Claim. A worker pops a job using SELECT ... FOR UPDATE SKIP LOCKED:

WITH job AS (
  SELECT id
  FROM jobs
  WHERE queue = 'default'
    AND run_at <= now()
    AND locked_at IS NULL
  ORDER BY priority, run_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs SET
  locked_at = now(),
  locked_by = 'worker-host-1234'
FROM job
WHERE jobs.id = job.id
RETURNING jobs.*;

The FOR UPDATE acquires a row-level lock on the matching row. The SKIP LOCKED is the magic: if N workers run this query concurrently, each one gets a different row — workers don't block on each other, they just skip rows that another worker has already claimed. Without SKIP LOCKED, the workers would serialize on the lock and throughput would collapse.
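
The skip-instead-of-wait behavior can be modeled with threads and non-blocking lock acquisition. This is an analogy, not Postgres: JobTable and claim_skip_locked are hypothetical names, each row's threading.Lock stands in for a row lock, and locks are deliberately never released so that "locked" also means "claimed".

```python
import threading


class JobTable:
    """In-memory rows, each guarded by its own lock (a stand-in for row locks)."""

    def __init__(self, job_ids):
        self.rows = [(job_id, threading.Lock()) for job_id in job_ids]

    def claim_skip_locked(self):
        # Try each row in order; a row whose lock is held is skipped,
        # never waited on. That is the SKIP LOCKED idea.
        for job_id, lock in self.rows:
            if lock.acquire(blocking=False):
                return job_id
        return None  # every row is currently claimed


table = JobTable(range(8))
claimed = []
claimed_guard = threading.Lock()


def worker():
    job_id = table.claim_skip_locked()
    with claimed_guard:
        claimed.append(job_id)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight concurrent workers end up with eight distinct jobs and none of them ever blocks. Replacing acquire(blocking=False) with a blocking acquire reproduces the plain FOR UPDATE behavior, where every worker queues up behind the first row.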

Postgres SKIP LOCKED deserves a moment. Added in Postgres 9.5 (January 2016), it's the feature that made DB-backed queues competitive. Before 9.5, claiming a job required one of:

  • A row-level lock with FOR UPDATE (workers block on each other; throughput is single-worker).
  • An advisory lock with pg_try_advisory_xact_lock (works but awkward: you have to map row IDs to lock IDs).
  • A "claim by UPDATE" pattern using UPDATE jobs SET locked_at = now() WHERE id = (SELECT id FROM jobs WHERE locked_at IS NULL LIMIT 1) AND locked_at IS NULL RETURNING *; (works, but under contention most UPDATEs claim nothing because another worker just won the race).

SKIP LOCKED collapses all of that into one clean primitive. MySQL 8.0 (2018) added equivalent semantics. Que (Ruby, 2014) initially used advisory locks because it predated SKIP LOCKED; modern systems use SKIP LOCKED. River (Go, 2023) is the cleanest current realization.

Why is DB-backed competitive? A Postgres instance on NVMe (Non-Volatile Memory Express) can sustain ~30k-50k transactions/sec on commodity hardware. Each job-claim is one transaction (one SELECT-FOR-UPDATE + one UPDATE). Throughput of 30k-50k jobs/sec is enough for the vast majority of apps. Read replicas do not raise this ceiling, because every claim is a write on the primary; going higher means claiming several rows per transaction (LIMIT N) or partitioning queues across databases. What you get in exchange for the throughput cap: all the durability and transactional guarantees of Postgres, no separate broker to run, and the killer-feature transactional outbox.

Ack on completion. When a job succeeds:

DELETE FROM jobs WHERE id = $1;

When it fails:

UPDATE jobs
SET attempts = attempts + 1,
    last_error = $2,                            -- error text from the failed run
    run_at = now() + $3 * interval '1 second',  -- backoff delay in seconds
    locked_at = NULL,
    locked_by = NULL
WHERE id = $1;

When it exceeds max_attempts, it's moved to a failed_jobs table or left in the table with a sentinel state. The DLQ is just another table.

Visibility timeout / orphan reclamation. If a worker crashes mid-job, locked_at remains set forever. A background reaper periodically:

UPDATE jobs
SET locked_at = NULL, locked_by = NULL
WHERE locked_at IS NOT NULL
  AND locked_at < now() - interval '5 minutes';

The 5-minute threshold is the visibility timeout; long-running jobs must heartbeat (update locked_at = now() every minute) to avoid reclamation.
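
The interaction between the reaper and the heartbeat, sketched in memory with an explicit clock. Job, heartbeat, and reap_orphans are hypothetical names; the timeout constant mirrors the 5-minute interval in the SQL above.

```python
from dataclasses import dataclass
from typing import Optional

VISIBILITY_TIMEOUT = 300.0  # seconds, matching the 5-minute threshold above


@dataclass
class Job:
    id: int
    locked_at: Optional[float] = None
    locked_by: Optional[str] = None


def heartbeat(job, now):
    """Called by a live worker to push the reclamation deadline forward."""
    job.locked_at = now


def reap_orphans(jobs, now):
    """Unlock jobs whose lock is older than the timeout; return their ids."""
    reclaimed = []
    for job in jobs:
        if job.locked_at is not None and job.locked_at < now - VISIBILITY_TIMEOUT:
            job.locked_at = None
            job.locked_by = None
            reclaimed.append(job.id)
    return reclaimed


jobs = [
    Job(1, locked_at=0.0, locked_by="worker-a"),  # worker-a crashed at t=0
    Job(2, locked_at=0.0, locked_by="worker-b"),  # worker-b is alive but slow
    Job(3),                                       # never claimed
]
heartbeat(jobs[1], now=240.0)           # worker-b refreshes its lock
reclaimed = reap_orphans(jobs, now=400.0)
```

Only job 1 is reclaimed. Job 2 survives because its heartbeat at t=240 is newer than the cutoff of t=100; a long job that never heartbeats would be handed to a second worker while the first is still running it.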

4c. SQS internals — visibility timeout, MessageGroupId, FIFO vs Standard

SQS is the canonical managed job queue. The internals are not public but the contract is precise.

Standard queue. Unlimited throughput (well, "unlimited" — practically billions of messages/day per queue). At-least-once delivery; best-effort ordering (no FIFO guarantee). Each message has:

  • MessageId — globally unique UUID, generated by SQS at send.
  • ReceiptHandle — opaque token specific to one receive of one message. Used to delete or extend visibility.
  • MessageBody — up to 256 KB string payload (typically JSON).
  • MessageAttributes — typed metadata (string, number, binary).
  • VisibilityTimeout — default 30 seconds.

Send + receive lifecycle:

T0: Producer calls SendMessage(QueueUrl, MessageBody, DelaySeconds=0).
    SQS replicates to multiple AZs (Availability Zones) within the region, returns MessageId. ~20-50 ms.
T1: Worker calls ReceiveMessage(QueueUrl, MaxNumberOfMessages=10, WaitTimeSeconds=20).
    Long poll: SQS holds the connection open up to 20 sec waiting for messages.
T2: Worker receives the message + ReceiptHandle. Visibility timeout starts ticking (30 s default).
    During the visibility window, the message is INVISIBLE to other workers.
T3a: Worker processes successfully, calls DeleteMessage(QueueUrl, ReceiptHandle).
     SQS deletes. Job acked.
T3b: Worker crashes. Visibility timeout (30 s) expires. SQS makes the message visible again.
     Another worker receives it via ReceiveMessage. Re-processing. At-least-once.
T3c: Worker is still working but the 30-s window is about to expire.
     Worker calls ChangeMessageVisibility(QueueUrl, ReceiptHandle, VisibilityTimeout=60)
     to extend the window. Heartbeat pattern for long jobs.

Visibility timeout is the at-least-once mechanism. It is conceptually identical to the locked_at + reaper pattern in §4b and the heartbeat pattern in §4a. The broker doesn't trust the worker is alive; it makes the message visible to others after a timeout unless the worker explicitly extends or deletes.
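
The receive / extend / delete lifecycle can be modeled in a few lines. This is an in-memory sketch of the contract, not the SQS API: FakeStandardQueue and its methods are hypothetical, and the clock is passed in as `now` so the timeline is deterministic.

```python
import itertools


class FakeStandardQueue:
    """In-memory model of receive / delete / visibility with an explicit clock."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> {"body", "visible_at", "receipt"}
        self._ids = itertools.count(1)

    def send(self, body):
        message_id = "m%d" % next(self._ids)
        self.messages[message_id] = {
            "body": body, "visible_at": 0.0, "receipt": None,
        }
        return message_id

    def receive(self, now):
        """Return (body, receipt_handle) for one visible message, or None."""
        for message_id, m in self.messages.items():
            if m["visible_at"] <= now:
                m["visible_at"] = now + self.visibility_timeout
                # A fresh receipt handle is issued on every receive.
                m["receipt"] = "%s-r%d" % (message_id, next(self._ids))
                return m["body"], m["receipt"]
        return None

    def change_visibility(self, receipt, now, timeout):
        for m in self.messages.values():
            if m["receipt"] == receipt:
                m["visible_at"] = now + timeout

    def delete(self, receipt):
        self.messages = {
            k: m for k, m in self.messages.items() if m["receipt"] != receipt
        }


q = FakeStandardQueue()
q.send("resize-image-7")
body, r1 = q.receive(now=0.0)                    # worker A claims the message
hidden = q.receive(now=10.0)                     # inside the 30 s window
q.change_visibility(r1, now=25.0, timeout=60.0)  # heartbeat: hidden until t=85
still_hidden = q.receive(now=40.0)
redelivered = q.receive(now=90.0)                # worker A never deleted it
q.delete(redelivered[1])
remaining = len(q.messages)
```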

FIFO queue. Strict per-MessageGroupId ordering + exactly-once messaging within a 5-minute deduplication window. The cost: a throughput cap of 300 API calls per second per action (3,000 messages per second with 10-message batches) unless high-throughput mode is enabled. Each message has a MessageGroupId (the partition key, such as user_id or order_id) and a MessageDeduplicationId (the dedup key). SQS guarantees:

  • Messages with the same MessageGroupId are delivered in the order they were sent and processed by exactly one consumer at a time.
  • Within 5 minutes, a second SendMessage with the same MessageDeduplicationId is silently deduped (returns success but does not enqueue).

FIFO under the hood (inferred from contract). Each MessageGroupId has an internal lock — only one in-flight message per group at a time. Other workers receiving from the queue skip messages whose group has an in-flight receive. This is what enforces ordering. When the in-flight message is deleted (success) or visibility expires (failure), the next message in the group becomes eligible. For ordered processing of N independent groups (e.g., N users), FIFO scales horizontally by group count, not by worker count.
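
The inferred group lock and the deduplication window can be sketched together. As with the text above, this models the contract rather than SQS internals: FakeFifoQueue is hypothetical, and its send returns False for a dropped duplicate purely so the example can observe it (real SQS reports success).

```python
DEDUP_WINDOW = 300.0  # seconds: the 5-minute deduplication window


class FakeFifoQueue:
    """In-memory model: per-group ordering, one in-flight message per group."""

    def __init__(self):
        self.pending = []       # (group_id, body) in send order
        self.in_flight = set()  # groups that currently have a claimed message
        self.seen = {}          # dedup id -> time it was first sent

    def send(self, group_id, dedup_id, body, now):
        """Return False when the message is dropped as a duplicate."""
        first_sent = self.seen.get(dedup_id)
        if first_sent is not None and now - first_sent < DEDUP_WINDOW:
            return False
        self.seen[dedup_id] = now
        self.pending.append((group_id, body))
        return True

    def receive(self):
        """Claim the oldest message whose group has nothing in flight."""
        for i, (group_id, body) in enumerate(self.pending):
            if group_id not in self.in_flight:
                self.in_flight.add(group_id)
                del self.pending[i]
                return group_id, body
        return None

    def delete(self, group_id):
        """Ack the in-flight message, unblocking the next one in the group."""
        self.in_flight.discard(group_id)


q = FakeFifoQueue()
q.send("user-1", "a", "u1-step1", now=0.0)
q.send("user-1", "b", "u1-step2", now=1.0)
q.send("user-2", "c", "u2-step1", now=2.0)
duplicate_accepted = q.send("user-1", "a", "u1-step1", now=60.0)

first = q.receive()   # oldest message overall
second = q.receive()  # user-1 is blocked, so user-2 goes next
third = q.receive()   # both groups now have a message in flight
q.delete("user-1")
fourth = q.receive()  # user-1's next message becomes eligible
```

Two workers make progress in parallel only because there are two groups; a third worker gets nothing until one group's in-flight message is acked.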

DLQ via RedrivePolicy. The queue is configured with RedrivePolicy = {"deadLetterTargetArn": "arn:...DLQ", "maxReceiveCount": 5}. After 5 failed receives (visibility expired without delete = 1 receive), SQS moves the message to the DLQ. Operators monitor DLQ depth and inspect / replay.

Long polling vs short polling. Short polling (WaitTimeSeconds=0) returns immediately even if no messages — generates lots of empty responses, burns money. Long polling (up to 20 seconds) holds the connection open until a message arrives or timeout expires — efficient, what you always want. Empty receives count against your bill. Long polling is the cost-optimization switch.

AWS Lambda + SQS event source. A Lambda function configured with an SQS event source is auto-polled by AWS (your Lambda receives batches of 1-10 messages per invocation). Lambda concurrency scales automatically — empty queue → 0 invocations → no cost; busy queue → up to your account concurrency limit. No worker fleet to run. This is the serverless job queue pattern, dominant in AWS-heavy shops.

4d. One job lifecycle, end-to-end (Sidekiq canonical)

Concrete example: user submits a checkout; the order service writes the order to Postgres and enqueues a SendOrderConfirmationEmail job; a worker picks it up, calls SendGrid, marks the order as confirmation_sent=true.

T0 (enqueue): Order handler in Rails calls SendOrderConfirmationEmail.perform_async(42). Sidekiq client serializes to JSON, calls LPUSH queue:default '{"class":"SendOrderConfirmationEmail","args":[42],"jid":"b4a5...","queue":"default",...}' on Redis. Request returns to user in ~5 ms. Job is durable in Redis.

T1 (claim): Worker process is parked in BRPOP queue:critical queue:default queue:low 2. Redis hands it the JSON blob. Worker spawns a thread, hands the JSON to the dispatcher.

T2 (in-progress record): Worker writes HSET hostname:pid:nonce:workers b4a5... '{...}'. This is the visibility marker.

T3 (execute): Worker calls SendGrid::Mail.deliver(...). Network call takes ~200 ms.

T4a (success): SendGrid returns 200. Worker:

  • UPDATE orders SET confirmation_sent = true WHERE id = 42 (Postgres update).
  • HDEL hostname:pid:nonce:workers b4a5... (clear in-progress).
  • INCR stat:processed.
  • Returns to BRPOP.

T4b (worker crashes mid-execute): Process is killed by OOM or the host loses power. The in-progress hash entry remains until another worker's heartbeat reaper sees hostname:pid:nonce's heartbeat is stale (>60 sec). The reaper:

  • Reads hostname:pid:nonce:workers hash for orphaned jobs.
  • For each orphan, LPUSH back to its source queue.
  • Removes hostname:pid:nonce and its sub-hashes.

The job is now visible to the next worker; the email may be sent twice (worker called SendGrid, crashed before HDEL, retry calls SendGrid again). This is why job bodies must be idempotent. For email specifically, an occasional duplicate is often an accepted cost: sending two confirmation emails is annoying but not catastrophic. For payments, idempotency is non-negotiable (§7).
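The crash path in T4b can be sketched with a toy in-memory broker. This is only an illustration of the claim / in-progress / reaper mechanics under simplifying assumptions: MiniBroker, its method names, and the sent list are hypothetical stand-ins, not Sidekiq or Redis APIs.

```python
from collections import deque


class MiniBroker:
    """Toy in-memory stand-in for the Redis structures described above."""

    def __init__(self, stale_after=60.0):
        self.queue = deque()      # plays the role of queue:default
        self.in_progress = {}     # worker_id -> job currently being executed
        self.heartbeats = {}      # worker_id -> last heartbeat timestamp
        self.stale_after = stale_after

    def enqueue(self, job):
        self.queue.appendleft(job)             # LPUSH

    def claim(self, worker_id, now):
        if not self.queue:
            return None
        job = self.queue.pop()                 # BRPOP takes from the other end
        self.in_progress[worker_id] = job      # in-progress record
        self.heartbeats[worker_id] = now
        return job

    def ack(self, worker_id):
        self.in_progress.pop(worker_id, None)  # HDEL on success

    def reap(self, now):
        """Re-enqueue jobs held by workers whose heartbeat is stale."""
        requeued = []
        for worker_id, last_seen in list(self.heartbeats.items()):
            if now - last_seen > self.stale_after:
                job = self.in_progress.pop(worker_id, None)
                if job is not None:
                    self.queue.appendleft(job)
                    requeued.append(job)
                del self.heartbeats[worker_id]
        return requeued


broker = MiniBroker(stale_after=60.0)
broker.enqueue({"jid": "b4a5", "class": "SendOrderConfirmationEmail", "args": [42]})

sent = []                                   # stands in for SendGrid deliveries
job = broker.claim("worker-a", now=0.0)
sent.append(job["args"][0])                 # the side effect happens...
# ...and worker-a dies here, before ack(). Its heartbeat goes stale.

requeued = broker.reap(now=61.0)
retry = broker.claim("worker-b", now=61.0)
sent.append(retry["args"][0])               # second delivery of the same email
broker.ack("worker-b")
```

After the run, sent holds order 42 twice: the reaper preserved the job, and the price was a duplicate side effect.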

T4c (worker raises an exception): Worker catches it, increments retry count, computes backoff, ZADD retry (now + backoff) '{...updated...}'. Removes from in-progress. Another worker's poller will move it back to queue:default at now + backoff.
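The retry set in T4c behaves like a time-ordered collection that a poller drains once entries come due. A minimal sketch of that idea, using a heap in place of a Redis sorted set; the DelayedSet class and the job names are hypothetical, and a real poller would move due jobs back onto a queue rather than return them.

```python
import heapq


class DelayedSet:
    """Time-ordered holding area, analogous to a sorted set scored by run time."""

    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker so jobs with equal timestamps never compare

    def add(self, run_at, job):
        heapq.heappush(self._heap, (run_at, self._seq, job))  # ZADD
        self._seq += 1

    def pop_due(self, now):
        """Return every job whose run_at has passed, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job = heapq.heappop(self._heap)
            due.append(job)
        return due


retry_set = DelayedSet()
retry_set.add(run_at=86_400, job="send_reminder")
retry_set.add(run_at=60, job="retry_webhook")

early = retry_set.pop_due(now=30)
after_a_minute = retry_set.pop_due(now=60)
next_day = retry_set.pop_due(now=90_000)
```

The poller's cost is one cheap "anything due?" check per tick, which is why the same structure serves both retries and scheduled jobs.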

T4d (worker exhausts retry budget): After 25 attempts, Sidekiq moves to the dead ZSET. Manual inspection / replay required.
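To see how 25 attempts stretch over weeks, here is a rough calculation assuming a polynomial backoff of retry_count^4 + 15 seconds. That shape is modeled on what Sidekiq documents, but the exact formula and its random jitter vary by version, so treat the function below as an assumption rather than the library's implementation.

```python
def retry_delay_seconds(retry_count, jitter=0):
    # Assumed backoff shape: count^4 + 15 seconds, plus optional jitter.
    return retry_count ** 4 + 15 + jitter


# Delay before each of the 25 retries (retry_count runs 0..24), jitter ignored.
delays = [retry_delay_seconds(n) for n in range(25)]
total_seconds = sum(delays)
total_days = total_seconds / 86_400
```

Under this assumption the first retry waits 15 seconds, the last waits almost four days, and the whole budget spans a little over 20 days before the job reaches the dead set.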

End-to-end: broker sees the job atomically enqueued, atomically dequeued, atomically removed. No two workers ever own the same job at once except briefly during a crash-recovery race. Durability point: Redis AOF (Append-Only File) flush, default appendfsync everysec — up to 1 second of writes can be lost on Redis crash. For stronger durability, set appendfsync always (per-write fsync) at ~10x throughput cost, or run a replicated Redis (Sentinel / Cluster) so a primary loss falls over to a replica with sub-second RPO (Recovery Point Objective).


§5. Capacity envelope

Background job queues cover a huge range. Quantifying by deployment scale:

Small. A single-server Rails app, single Redis, single Sidekiq process with 25 threads. 1,000-5,000 jobs/sec sustained. Plenty for most early-stage SaaS, internal tools, mid-volume marketplaces. Operational burden: one Redis to keep alive, occasional spot-check of the Sidekiq Web UI. Total memory: ~2-4 GB Redis. This is where most apps live for years before needing to scale.

Mid. A typical Rails monolith at the 100-engineer scale — GitHub circa 2012, Shopify circa 2015, mid-stage SaaS today. Redis Cluster or Redis Sentinel for HA (High Availability), 20-50 worker processes across a worker fleet, 3-5 named queues (critical, default, low, mailers, webhooks). 10,000-50,000 jobs/sec sustained, peaks to 100k. Total Redis ~10-30 GB. Operational burden moderate: queue-depth alerts, worker scaling automation, DLQ monitoring.

Large. Shopify's Sidekiq Enterprise deployment is the canonical large-scale Sidekiq example, reported in their engineering blogs as running thousands of worker pods across pod-isolated worker fleets (pods of workers per merchant tier), with peak job rates in the hundreds of thousands per second across the fleet. During Black Friday Cyber Monday peaks, Shopify reportedly processes hundreds of millions of jobs in a 24-hour window. Total Redis is sharded across many clusters. They use Sidekiq Enterprise features (Periodic, Rate Limiting, Encryption, Unique Jobs) and have built internal tooling on top.

Giant. Major e-commerce platforms or game backends that process event-driven side effects at the scale of the platform's user actions. Amazon SQS reportedly handles billions of messages per day across all customers. A single high-volume customer (consumer game backend, news media aggregation, log delivery) can hit a few billion messages/month on one SQS account. The architecture splits into many queues by purpose (one per microservice tier, often), with autoscaling worker fleets behind each. At this scale the cost shifts from "broker capacity" to "worker compute" — workers are the dominant infrastructure spend, and queue sharding becomes a multi-team operational concern.

Where the next bottleneck appears at each scale:

  • Small (1k jobs/sec): Redis CPU is fine, worker concurrency is the limit. Add threads or workers.
  • Mid (50k jobs/sec): Redis CPU starts to climb (Redis is single-threaded for command execution). Solutions: split into multiple Redis instances by queue, or use Redis Cluster.
  • Large (hundreds of thousands/sec): Per-queue Redis is saturated. Shard by tenant / region / queue purpose. Move ultra-high-volume queues to Kafka-as-job-queue (using a topic + consumer group as the job stream; niche but valid).
  • Giant: Cost optimization dominates. Move to S3-backed durable streams (warpstream-like) or to managed serverless (SQS + Lambda).


§6. Architecture in context

The canonical pattern — fire and forget — where a request handler enqueues a job and returns immediately:

   User Browser / Mobile Client
              │
              │ POST /orders
              ▼
   ┌──────────────────────────────┐
   │   Web tier (Rails / Django / │
   │   Express / Spring / ASP.NET)│
   │                              │
   │   begin_db_transaction       │
   │   INSERT order INTO db       │
   │   enqueue SendConfirmation   │
   │   enqueue ChargePayment      │
   │   enqueue ReindexSearch      │
   │   commit_db_transaction      │
   │                              │
   │   return 200 to user (~5 ms) │
   └──────────────────────────────┘
                │
                │ JSON job blobs
                ▼
   ┌──────────────────────────────┐
   │  ★ JOB QUEUE BROKER          │
   │   Redis / Postgres / SQS     │
   │                              │
   │   queue:default   [j1,j2,j3] │
   │   queue:critical  [...]      │
   │   queue:low       [...]      │
   │   schedule (ZSET) {...}      │
   │   retry (ZSET)    {...}      │
   │   dead (ZSET / DLQ)          │
   └──────────────────────────────┘
                │
                │ BRPOP / SELECT FOR UPDATE / ReceiveMessage
                ▼
   ┌──────────────────────────────────────────────────────┐
   │   Worker fleet (autoscaled)                          │
   │                                                       │
   │   Worker 1: SendConfirmation -> SendGrid API         │
   │   Worker 2: ChargePayment   -> Stripe API            │
   │   Worker 3: ReindexSearch    -> Elasticsearch        │
   │   Worker N: ...                                       │
   └──────────────────────────────────────────────────────┘
                │            │            │
                ▼            ▼            ▼
            SendGrid       Stripe      Elasticsearch
              (SMTP)      (Payment)     (Search)

What this pattern buys: the user's POST returns in 5 ms instead of 5 seconds (SendGrid + Stripe + ES would have been synchronous). The user doesn't wait for the network calls. The browser doesn't time out if Stripe is slow. The worker can retry transient failures without the user knowing. If the web tier crashes after enqueue, the jobs are still in the queue — they will run when workers come back.

Partition key / sharding placement. Most job queues don't shard at the broker (Redis is one logical instance). Sharding happens at the queue level: name your queues by tenant ID modulo N, route enqueues to the right Redis cluster, route workers to the right queue. This is the manual sharding model. SQS Standard hides this: the service partitions messages across its internal storage transparently, with no shard configuration exposed to the user.

The replication boundary. Redis Sentinel or Cluster: one primary + N replicas, async replication. Postgres job tables: streaming replication, async or sync depending on durability target. SQS: multi-AZ replication is part of the contract, no operator config needed.


§7. Hard problems inherent to this technology

The fundamental challenges anyone using a job queue will encounter, in roughly the order they bite at scale.

7a. Idempotency — retries with the same arguments must be safe

Naïve solution: "I'll just retry on failure. The job is small, retries are fine."

How it breaks: Worker pulls ChargePayment(order_id: 42, amount: 100). Calls Stripe; Stripe charges the card; the worker crashes before acking. Visibility timeout expires. Another worker pulls the same job. Calls Stripe again. Customer charged $200 for a $100 order. Now the customer service team has to issue a refund, but in the meantime the customer has filed a chargeback, and the chargeback adds a $15 dispute fee on top of the duplicate. Multiplied by 0.01% of jobs that hit this race, at 1M payment jobs/day, you have 100 incidents/day.

The actual fix: make the job idempotent by encoding a deterministic key into the external call. For Stripe:

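def perform(order_id, amount)
  # Assumption: the order row stores the card token captured at checkout.
  token = Order.find(order_id).payment_token
  # Deterministic key: every retry of this job sends the same value.
  idempotency_key = "charge:order:#{order_id}"
  Stripe::Charge.create(
    { amount: amount, currency: "usd", source: token },
    # stripe-ruby takes per-request options, including the idempotency key,
    # in a second hash rather than among the charge parameters.
    { idempotency_key: idempotency_key }
  )
end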

Stripe's server checks whether it has seen this Idempotency-Key in the last 24 hours; if so, it returns the cached response of the original request. Two attempts → one charge. The idempotency key is deterministic (the same job arguments give the same key), so retries hit it. Many APIs (Stripe, PayPal, Shopify, and others) support an idempotency key or request ID for this exact reason.

For internal database updates, idempotency comes from:

  • Upserts with deterministic primary keys: INSERT ... ON CONFLICT DO NOTHING or ON CONFLICT DO UPDATE.
  • Conditional updates: UPDATE orders SET confirmation_sent = true WHERE id = 42 AND confirmation_sent = false. The second attempt no-ops because the row is already updated.
  • Idempotency table: a separate table mapping each processed job_id → result_hash. The job checks the table before doing work; the first attempt records the result.
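The conditional-update and idempotency-table techniques can be tried end to end with SQLite. This is a sketch under assumptions: the table layouts and helper names are illustrative, and SQLite's INSERT OR IGNORE stands in for Postgres's INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        confirmation_sent INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE processed_jobs (
        job_id TEXT PRIMARY KEY,
        result_hash TEXT NOT NULL
    );
    INSERT INTO orders (id) VALUES (42);
""")


def mark_confirmation_sent(order_id):
    """Conditional update: True only for the attempt that flips the flag."""
    cur = conn.execute(
        "UPDATE orders SET confirmation_sent = 1 "
        "WHERE id = ? AND confirmation_sent = 0",
        (order_id,),
    )
    return cur.rowcount == 1


def record_once(job_id, result_hash):
    """Idempotency table: the primary key rejects a second insert for a job."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_jobs (job_id, result_hash) VALUES (?, ?)",
        (job_id, result_hash),
    )
    return cur.rowcount == 1


first_attempt = mark_confirmation_sent(42)
retry_attempt = mark_confirmation_sent(42)
recorded = [record_once("b4a5", "ok"), record_once("b4a5", "ok")]
```

In both helpers the row count tells the job whether it was the first to get there, so a retry can skip its side effects instead of repeating them.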

Discord-style insight: for high-volume notification systems (Discord uses Celery), idempotency is often "best-effort dedupe" rather than absolute. A 0.01% duplicate notification rate is acceptable for product messages; a 0% duplicate charge rate is required for payments. The strictness of idempotency scales with the cost of duplication.

7b. Long-running jobs and the visibility timeout race

Naïve solution: "I'll set the visibility timeout to 1 hour to be safe."

How it breaks: Now a real worker crash causes the job to sit invisible for an hour before retry. Meanwhile, the queue grows, the user is waiting, and your SLO (Service Level Objective) is blown. Also: if the job is truly a 1-hour job (large file transcoding), the visibility timeout might still expire and a second worker starts processing — now two workers are running the same job, fighting over the output S3 path or doing duplicate API calls.

The actual fix: heartbeat-based liveness. The worker periodically extends the visibility timeout (or updates locked_at in DB-backed):

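# Celery / SQS pattern
import threading

def process_long_job(self, video_id):
    stop = threading.Event()
    visibility_extender = threading.Thread(
        target=self.heartbeat_every, args=(60, stop), daemon=True
    )
    visibility_extender.start()
    try:
        transcode_video(video_id)
    finally:
        # Threads have no stop(); signal the loop to exit, then wait for it.
        stop.set()
        visibility_extender.join()

def heartbeat_every(self, secs, stop):
    # Event.wait returns False on timeout (keep extending) and True once set.
    while not stop.wait(secs):
        sqs_client.change_message_visibility(
            QueueUrl=self.queue_url,
            ReceiptHandle=self.receipt,
            VisibilityTimeout=120,  # reset the timeout to 2 minutes from now
        )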

Set visibility timeout short (~2 minutes), but extend every minute. If the worker truly dies, the next heartbeat doesn't happen, the timeout expires after ~2 minutes, and another worker picks up. Total downtime: ~2 minutes, not 1 hour.

For DB-backed queues, the equivalent is the worker updating locked_at = now() every minute. The reaper only reclaims rows where locked_at < now() - 5 minutes.

The deeper insight: the visibility timeout is the system's heartbeat budget. A worker that doesn't extend within the timeout is presumed dead. This is the same pattern as Kafka's session.timeout.ms for consumer group membership and Temporal's worker heartbeat — universal across "broker tracks worker liveness via timeout."

7c. Poison pills — jobs that always fail

Naïve solution: "Retry until it succeeds."

How it breaks: A bug in the job code causes it to crash on a specific input. The job retries 25 times with exponential backoff, all failing. The retry sorted set fills up, and every doomed attempt takes a worker slot away from healthy work. The retry count amplifies the cost: if the bug affects 1% of jobs (a malformed user input), each of those jobs consumes up to 26 executions (1 + 25 retries), so doomed attempts can reach roughly 21% of all executions (26 / (99 + 26)), not 1%. Worse, if the bug is deterministic and the input is common, the queue can fall behind permanently.

The actual fix: DLQ after N attempts. Every modern job queue has a built-in retry budget. Sidekiq's default is 25 attempts (~21 days of backoff). SQS's RedrivePolicy specifies maxReceiveCount (default off; set to 3-5 typically). After exhausting attempts, the job moves to the DLQ. The DLQ has:

  • A separate metric (dlq_depth).
  • An alert when depth crosses a threshold (e.g., >100 jobs in DLQ).
  • A manual inspection workflow: the operator looks at the failed jobs, identifies the root cause, fixes the code, and replays the DLQ (or deletes the bad inputs).

The retry budget is a circuit breaker for poison pills. Without it, one bad input poisons the queue indefinitely.
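A quick calculation shows why the size of the retry budget matters. The sketch below counts executions only (it ignores how backoff spreads them over time) and assumes healthy jobs succeed on their first attempt; the function name is illustrative.

```python
def doomed_attempt_share(poison_fraction, max_retries):
    """Share of all executions spent on jobs that can never succeed."""
    attempts_per_poison_job = 1 + max_retries
    doomed = poison_fraction * attempts_per_poison_job
    healthy = 1 - poison_fraction          # one attempt per healthy job
    return doomed / (doomed + healthy)


# 1% poison jobs with a 25-retry budget versus a 3-retry budget.
share_25_retries = round(doomed_attempt_share(0.01, 25), 3)
share_3_retries = round(doomed_attempt_share(0.01, 3), 3)
```

With these inputs the 25-retry budget spends about a fifth of all executions on doomed work, while a 3-retry budget keeps it under 4%.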

Multi-domain example: webhook delivery. Stripe retries failed webhook deliveries with exponential backoff for up to 3 days and then gives up. GitHub goes further and does not automatically redeliver failed deliveries at all; the receiver has to request redelivery. Neither tries forever. Once the budget is exhausted, the developer must inspect and either fix their endpoint or accept the loss.

7d. Priority across queues — fair scheduling vs starvation

Naïve solution: "Process critical queue first, always. Only touch default when critical is empty."

How it breaks: A traffic spike makes critical perpetually non-empty (say, 100 jobs/sec arrival, 80 jobs/sec processing). Workers always have critical jobs to do. default and low are never touched. The queue depth on default grows from 0 to 100k over an hour. Eventually the default queue has hours of latency. Customers using non-critical features (sending reminders, generating reports) experience long delays. The system is producing for critical users at the cost of starving everyone else.

The actual fix: weighted priority. Sidekiq's -q critical,5 -q default,3 -q low,1 flags mean "at each pull, pick a queue by weight." Over time, you process 5/9 critical, 3/9 default, 1/9 low. Even when critical has unlimited jobs, default and low still get one-third and one-ninth of capacity. Trade-off: critical isn't truly highest priority anymore; it's just highest weight. For real "drop everything else" priority, you need separate worker fleets per queue (a Sidekiq Enterprise pattern) so that critical doesn't compete with default at all: default workers ONLY process default, critical workers ONLY process critical. Static allocation, isolated capacity.
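The 5/9, 3/9, 1/9 split can be checked with a small simulation. This is a sketch of weighted random selection in general, not Sidekiq's actual fetch code; pick_queue and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter

WEIGHTS = {"critical": 5, "default": 3, "low": 1}


def pick_queue(weights, non_empty, rng):
    """Pick one non-empty queue with probability proportional to its weight."""
    candidates = [q for q in weights if q in non_empty]
    if not candidates:
        return None
    return rng.choices(candidates, weights=[weights[q] for q in candidates], k=1)[0]


rng = random.Random(7)                      # fixed seed for a repeatable run
all_busy = {"critical", "default", "low"}
pulls = Counter(pick_queue(WEIGHTS, all_busy, rng) for _ in range(90_000))
shares = {q: pulls[q] / 90_000 for q in WEIGHTS}

# When only low has work, it gets every pull regardless of weight.
only_low = pick_queue(WEIGHTS, {"low"}, rng)
```

Skipping empty queues is the design choice that keeps weights from wasting capacity: a weight reserves a floor under contention, not a fixed slice.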

The general rule: never let a single queue's throughput rate equal or exceed total worker capacity, because then it starves everything else. Always reserve a floor of capacity for lower priorities, either via weighted scheduling or via dedicated worker pools.

7e. FIFO vs throughput — the ordering tax

Naïve solution: "I want FIFO, so I'll use SQS FIFO for everything."

How it breaks: SQS FIFO caps at 300 API calls per second per action (send, receive, delete) per queue by default, or 3,000 messages/sec with batches of 10; high-throughput mode raises the quota but has to be enabled. Your peak is 5000/sec, so the queue grows unboundedly. Worse, if all your jobs share one MessageGroupId (because you didn't think hard about partition keys), adding workers doesn't help: FIFO ordering means only one worker can be in-flight per group at a time.

The actual fix: only use FIFO where ordering actually matters, and shard the group key. Most jobs don't need ordering: SendOrderConfirmation for order 42 and order 43 can run in either order. Use Standard SQS or any non-FIFO queue. For the small subset that needs ordering (e.g., per-user ledger updates), use FIFO with MessageGroupId = user_id. This gives ordering per user but parallelism across users. With 10k active users, you have 10k independent groups that workers can process concurrently, each strictly ordered. The queue-level throughput quota still applies across all groups, so enable high-throughput FIFO mode if the ordered subset's peak exceeds it.

The deeper insight: ordering and parallelism are in tension. Strict total ordering = single-worker bottleneck. Per-key ordering = parallel across keys, ordered within. Choose the smallest unit of ordering you actually need.

7f. Job ordering with retries — the silent reorder

Naïve solution: "Jobs go in FIFO, they come out FIFO."

How it breaks: Job 1 (UpdateBalance(user: 7, delta: +10)) is enqueued at T0. Job 2 (UpdateBalance(user: 7, delta: -5)) at T1. Job 3 (UpdateBalance(user: 7, delta: +3)) at T2. Worker A picks up job 1, fails (network error), goes to retry with 30-second backoff. Worker B picks up job 2, succeeds. Worker C picks up job 3, succeeds. 30 seconds later, worker D picks up job 1's retry and succeeds. Applied order: -5, +3, +10 instead of +10, -5, +3. Because addition commutes, the final balance is still +8, but the intermediate states are wrong: the balance dips to -5 after job 2, so an overdraft check would reject a legitimate debit, and any non-commutative operation (set, multiply, state transition) would end in the wrong state. Order broken by retry.

Worse: suppose the worker that originally processed job 1 did apply the +10 and crashed after the update but before the ack; the retry adds another +10. Final balance: 0 + 10 - 5 + 3 + 10 (retry) = 18 instead of 8. Now you have double-application due to non-idempotency plus order break due to retry. Two compounding bugs.

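The actual fix: for any sequence of jobs that mutates shared state, the sequence must be either (a) FIFO-ordered with single in-flight (SQS FIFO with MessageGroupId = user_id), accepting the FIFO throughput quota; or (b) commutative + idempotent (e.g., deltas recorded under a unique job ID, so a retry cannot apply twice and the order of application doesn't change the result), so order doesn't matter; or (c) routed to a workflow engine that owns ordered execution as a first-class primitive. Note that set-style writes ("set user 7's balance to X") are idempotent but not commutative: a delayed retry can overwrite a newer value unless the write carries a version check.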
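Option (b) can be illustrated with a ledger that remembers which job IDs it has applied. This is an in-memory sketch with hypothetical names; in a real system the "already applied" record and the balance change would have to commit in the same database transaction.

```python
class Ledger:
    """Applies each delta at most once, keyed by job ID."""

    def __init__(self):
        self.balance = 0
        self.applied = set()    # job IDs whose delta is already in the balance

    def apply(self, job_id, delta):
        if job_id in self.applied:
            return False        # duplicate delivery: skip the side effect
        self.applied.add(job_id)
        self.balance += delta
        return True


ledger = Ledger()
ledger.apply("job-1", +10)      # applied, then the worker crashes before ack
ledger.apply("job-2", -5)
ledger.apply("job-3", +3)
retry_applied = ledger.apply("job-1", +10)   # the retry arrives 30 seconds later
```

The retry is rejected, so the balance ends at 8 rather than 18, and because the deltas commute the result would be the same in any delivery order.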

Multi-domain example: order-fulfillment pipelines (commerce) typically use FIFO per order_id — you don't want "ship order" to fire before "allocate inventory." Notification fanout (LinkedIn-style social feeds) often does NOT use FIFO — a thousand likes on a post don't need ordering, just eventual consistency.

7g. Backpressure — queues that grow forever

Naïve solution: "The queue is durable. It can absorb spikes."

How it breaks: Workers are processing 1000 jobs/sec. Producers are enqueueing 1500 jobs/sec. Net growth: 500 jobs/sec. Over an hour, queue depth grows by 1.8M jobs. Over a day, 43M. Eventually:

  • Redis fills its memory; OOM killer fires; Redis crashes; entire queue lost (if persistence isn't tuned hard).
  • Postgres job table grows; index queries slow down; throughput collapses further.
  • Latency from enqueue to execution grows from seconds to hours to days. The user who clicked "send confirmation" at noon gets the email two days later, when it is useless.
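The growth figures follow from simple rate arithmetic, and the same arithmetic gives the drain time once capacity is added. A small sketch; the function names and the 2000 jobs/sec recovery rate are assumptions chosen for the example.

```python
def backlog_after(seconds, enqueue_rate, process_rate, start_depth=0):
    """Queue depth after a period of constant enqueue and processing rates."""
    return max(0, start_depth + (enqueue_rate - process_rate) * seconds)


def drain_seconds(depth, enqueue_rate, process_rate):
    """Time to clear a backlog; infinite if workers cannot outpace producers."""
    surplus = process_rate - enqueue_rate
    if surplus <= 0:
        return float("inf")
    return depth / surplus


depth_after_hour = backlog_after(3_600, enqueue_rate=1500, process_rate=1000)
depth_after_day = backlog_after(86_400, enqueue_rate=1500, process_rate=1000)

# Scaling workers to 2000 jobs/sec while producers stay at 1500 jobs/sec.
recovery = drain_seconds(depth_after_hour, enqueue_rate=1500, process_rate=2000)
```

Doubling worker throughput after an hour of overload still takes another hour to clear the 1.8M-job backlog, because only the surplus over the enqueue rate drains it.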

The actual fix: monitor queue depth and alert at thresholds. The typical setup:

  • Alert at queue depth > 10k for critical queue (which should normally be ~0).
  • Alert at queue lag (time from enqueue to dequeue) > 60 seconds.
  • Autoscale worker count based on queue depth (most modern setups do this via Kubernetes HPA on a custom metric).

Producer-side backpressure is harder. If your web tier enqueues 1500/sec because the user is hitting a button, you can't easily make the user wait. The standard moves:

  • Rate-limit enqueueing per user (token bucket).
  • Push back at the API layer ("429 Too Many Requests").
  • Drop low-priority jobs at the source if the queue is saturated (e.g., shed low queue enqueues when queue depth > threshold).
  • Use a "leaky bucket" enqueue that smooths bursts.
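The per-user token bucket from the list above might look like the following. It is a minimal single-process sketch: the class, its parameters, and the explicit now argument (used instead of a real clock so the behavior is easy to follow) are all assumptions, and a multi-server web tier would need the bucket state in shared storage.

```python
class TokenBucket:
    """Allows bursts up to capacity, then a steady rate_per_sec."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated_at = 0.0

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True         # enqueue the job
        return False            # reject, e.g. respond with 429


bucket = TokenBucket(rate_per_sec=2, capacity=5)
burst = [bucket.allow(now=0.0) for _ in range(7)]   # seven enqueues at once
one_second_later = bucket.allow(now=1.0)
```

The first five enqueues in the burst pass and the last two are rejected; a second later the bucket has refilled enough to accept work again.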

For very large systems (Shopify, Stripe), tier the queues by criticality and isolate. The critical queue can never be starved by analytics jobs because they're on different Redis clusters with different worker fleets.


§8. Failure mode walkthrough

Systematically: crash mid-job, broker outage, job that runs forever, retry storm.

8a. Worker crash mid-job

Crash mid-job, before any side effect. Worker hadn't called the API yet. Visibility timeout / heartbeat reaper detects the dead worker, re-enqueues the job. Another worker picks it up, calls the API once. No customer impact. At-least-once with single-execution.

Crash mid-job, after side effect, before ack. Worker called Stripe (charged customer), crashed before deleting the job from the queue. Visibility timeout expires; another worker picks up the job; calls Stripe again. Customer charged twice — unless the job uses Stripe's idempotency key. This is the entire reason the idempotency-key pattern exists.

Crash with partial side effects. Worker called Stripe (success), updated the orders table (success), was about to update analytics service (failed). Visibility timeout returns the job. Retry calls Stripe (idempotency dedupes), updates orders again (idempotent UPDATE no-ops because confirmation_sent = true), updates analytics (succeeds). End state consistent, no double-charge, but only because every external call was idempotent.

Recovery procedure: none required from the operator — at-least-once delivery handles it automatically, provided idempotency is in place. Without idempotency, the operator gets paged about duplicate charges, has to manually refund.

8b. Broker outage (Redis is down)

Mid-enqueue: producer (web tier) calls LPUSH, Redis is unreachable. Client raises Redis::CannotConnectError. The job is not enqueued. The producer must pick one of:

  • Retry the enqueue (transient blip).
  • Fail the request to the user ("we're having trouble, try again").
  • Buffer locally and retry later (rare; usually too complex).

In Rails / Sidekiq, the default is to raise the error to the request handler. The job is lost unless the request handler retries or returns an error.

Mid-processing: worker is in the middle of ZADD retry ... to record a retry. Redis goes down. Worker's connection drops. The job's state is unclear: the in-progress hash entry is in the now-unavailable Redis. When Redis comes back (Sentinel failover, ~5 seconds), the job's in-progress entry will be reaped by the next reaper sweep.

Broker fully down for hours: the queue is unavailable. Jobs are lost (if producers fail open) or queued in-app (if producers buffer). Workers idle. After Redis comes back, queue resumes — assuming persistence (AOF) survived the outage.

Recovery durability point: Redis AOF + replicated Redis (Sentinel) gives RPO ~1 second (one second of writes can be lost). For stricter durability, set appendfsync always (per-write fsync) — RPO 0 at ~10x throughput cost. Or use a DB-backed queue where the broker's durability IS the application database's durability — same RPO as the rest of the app.

8c. Job that runs forever

Mode 1: infinite loop. Bug in the worker code causes while true; do_nothing; end. The job never returns. The visibility timeout / heartbeat handles this — if the worker doesn't heartbeat, the broker reclaims after timeout. Unless the worker heartbeats but is stuck. This is harder. The worker is "alive" from Redis's perspective but isn't making progress.

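The fix: wall-clock job timeout in the worker. Some job queues support this directly (Celery's time_limit and soft_time_limit task options, for example): if the job has been running for >N seconds, the worker forcibly terminates the thread or process, and the job fails and can be retried. Sidekiq's :timeout option is not this; it is the shutdown grace period, so Sidekiq jobs usually rely on per-call timeouts inside the job instead. Set the timeout to be longer than the longest legitimate job (e.g., 10 minutes for media transcoding) but shorter than infinity.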

Mode 2: slow external dependency. The job is calling SendGrid, but SendGrid is degraded — each call takes 30 seconds. The worker is "working" but spending all its time on one job. The broker doesn't know to reclaim because the worker is heartbeating. The fix: per-call timeouts. The worker sets a 5-second HTTP timeout on the SendGrid call; if it exceeds, the call fails fast, the job goes to retry. The worker is now free to handle the next job.

8d. Retry storm — downstream is down, all jobs retry, DDoS amplification

The scenario: SendGrid's API is degraded; 50% of email-send jobs fail. The retry sorted set fills with retries. After backoff, they retry. They fail again. Backoff. Retry. At sustained 50% failure, the queue effectively halves its throughput (each job costs 2 attempts). At 80% failure, throughput is one-fifth. The retry sorted set grows; workers spend most of their time on retries; healthy jobs starve.

Worse: if the SendGrid issue is severe and you're retrying aggressively, your retries make SendGrid's problem worse. You're DDoS-ing your own dependency.

The actual fix: circuit breaker. When the failure rate to a downstream exceeds a threshold (say 50% over the last 60 seconds), the worker stops calling that downstream entirely. Jobs targeting that downstream go to a "downstream-degraded" queue with an exponentially longer backoff. After the circuit breaker closes (downstream recovers, success rate restored), normal processing resumes.
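A minimal circuit breaker along these lines, assuming a sliding 60-second window, a 50% failure threshold, and a fixed cooldown before a probe call is let through. The class and its parameters are illustrative and not taken from any of the libraries named in this section.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window=60.0, cooldown=30.0, min_calls=10):
        self.failure_threshold = failure_threshold
        self.window = window          # seconds of history to consider
        self.cooldown = cooldown      # seconds to stay open before probing
        self.min_calls = min_calls    # don't trip on a handful of samples
        self.events = []              # (timestamp, succeeded) while closed
        self.opened_at = None         # None means the circuit is closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        # Open: block calls until the cooldown has passed, then allow a probe.
        return now - self.opened_at >= self.cooldown

    def record(self, now, succeeded):
        if self.opened_at is not None:
            if succeeded:
                self.opened_at = None     # probe succeeded: close the circuit
                self.events = []
            else:
                self.opened_at = now      # probe failed: restart the cooldown
            return
        self.events = [(t, ok) for t, ok in self.events if now - t <= self.window]
        self.events.append((now, succeeded))
        failures = sum(1 for _, ok in self.events if not ok)
        rate = failures / len(self.events)
        if len(self.events) >= self.min_calls and rate > self.failure_threshold:
            self.opened_at = now


breaker = CircuitBreaker()
outcomes = [True] * 4 + [False] * 6           # one call per second, 60% failing
for second, ok in enumerate(outcomes):
    breaker.record(now=float(second), succeeded=ok)

allowed_while_open = breaker.allow(now=10.0)
probe_allowed = breaker.allow(now=39.0)       # cooldown elapsed since t=9
breaker.record(now=39.0, succeeded=True)
allowed_after_recovery = breaker.allow(now=40.0)
```

A worker would call allow() before touching the downstream and, when it returns False, push the job to the longer-backoff queue instead of making the call.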

Libraries: Hystrix (Netflix, now deprecated), Resilience4j (JVM), circuitbox (Ruby), Polly (.NET). The pattern is universal: don't hammer a sick service; back off, give it room to recover.

Recovery: once the downstream is healthy, the queue drains the backlog. If the backlog is large (millions of jobs), drain time may be hours. Worker autoscaling helps; rate-limit the drain so you don't re-DDoS the dependency on recovery.


§9. Why not just synchronous execution in the request

The case for background jobs is most clearly made by walking through what happens without them.

Naïve setup: the user submits a checkout. The web handler:

  1. Validates the input.
  2. Calls Stripe to charge the card. (~500 ms)
  3. Writes the order to Postgres. (~10 ms)
  4. Calls SendGrid to send a confirmation email. (~300 ms)
  5. Calls warehouse fulfillment API to allocate inventory. (~700 ms)
  6. Calls Elasticsearch to update search index. (~200 ms)
  7. Returns 200 to the user.

Total: ~1.7 seconds under happy path. The user sees a spinner for 1.7 seconds before the page loads. P99 (because some calls hit retries or slow networks): 4-6 seconds. The browser may time out.

Now what happens when one call fails?

Stripe is slow (5 sec p99): the entire request takes 5+ seconds. Many user agents time out at 30 seconds for the connection, but the user gives up at 3 seconds and reloads — now they may double-submit. The handler is half-done (Stripe charged but order not written); the reload tries to charge again.

SendGrid is down: the email send fails. Should the handler:

  • Fail the request to the user and refund the charge? (Now the user sees an error after their card was charged.)
  • Return success without sending the email? (Now there's no record of "we owe this user an email"; it's silently lost.)
  • Retry the SendGrid call inline? (Adds more latency, may not succeed.)

Server crashes mid-request (instance dies between steps 3 and 4): Stripe charged, order written, email not sent, inventory not allocated, search not indexed. The system is in an inconsistent state. When the user retries, the new attempt re-charges (the handler's dedup logic does not find the original order). Two charges, one order, no email, no inventory.

With a job queue:

  1. Validate input.
  2. Write order to Postgres with state = 'pending'. (~10 ms)
  3. Enqueue jobs: ChargePayment(order_id), SendConfirmation(order_id), AllocateInventory(order_id), ReindexSearch(order_id). (~5 ms total)
  4. Return 200 to the user. Total: ~20 ms.

The user gets an immediate response. The jobs run asynchronously. If SendGrid is down, the email job retries with backoff for hours, eventually succeeding or moving to DLQ for manual handling — the user got their order. If the web server crashes after step 3, the jobs are durable in the queue; another worker runs them. No inconsistent state, no double-charge (with idempotency keys), no user-facing failure.

The web tier is now stateless, fast, and resilient. The complexity of "do many things reliably" is pushed into the job queue layer, where retry, durability, and crash-recovery are first-class primitives.

This is the entire reason job queues exist. If you find yourself implementing retry logic, persistence of pending work, and crash-recovery in your request handler, you're rebuilding a job queue badly.


§10. Scaling axes

How do job queues scale, and where do the inflection points hit?

Type 1: Uniform growth — more jobs, more workers

The simplest growth pattern. You started at 1000 jobs/sec; now you're at 10000 jobs/sec. The path:

  • Add worker processes / threads. Each Sidekiq process runs a configurable number of threads (25 was the long-standing default; recent versions default to fewer); adding processes scales linearly until the broker saturates. At 10k jobs/sec on a single-Redis Sidekiq, you might need 20 worker processes across 5 machines.
  • Add broker capacity. Redis can sustain ~100k-200k QPS (Queries Per Second) on a single primary. Beyond that, shard Redis by queue (one Redis per queue, or one Redis per tenant). Or move to Redis Cluster for transparent sharding.
  • Partition queues by purpose / tenant. Instead of one default queue handling everything, split into email, payment, analytics, etc. Each gets its own worker pool, its own Redis (if needed), its own scaling.

Inflection points:

  • Around 50k jobs/sec on single Redis: shard.
  • Around millions/sec across the fleet: consider managed (SQS) or move to Kafka-as-queue.
  • Worker scaling: ~hundreds of worker processes is usually fine; thousands becomes operationally heavy (deploys, restarts).

Type 2: Hotspot intensification — one queue gets hot

A single tenant or a single queue suddenly dominates. Black Friday lands and all the e-commerce traffic goes through priority_checkout queue. The path:

  • Isolate the hot queue. Give it its own broker (dedicated Redis), its own worker pool. Other tenants' queues don't fight for capacity.
  • Split the hot queue by sub-key. priority_checkout becomes priority_checkout_shard_0, ..._shard_1, ..., ..._shard_N, partitioning by hash(tenant_id) % N. Workers consume from all shards round-robin.
  • Add a dedicated worker fleet just for the hotspot. Shopify's pod isolation model — each merchant tier has its own pod of workers, with capacity allocated proportionally.
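Routing a job to one of the sub-key shards needs a hash that is stable across processes. The sketch below assumes SHA-256 for that purpose (Python's built-in hash() is salted per process for strings, so producers and workers would disagree); the function name and the four-shard setup are illustrative.

```python
import hashlib


def shard_queue(base, tenant_id, num_shards):
    """Map a tenant to a fixed shard queue: hash(tenant_id) % N."""
    digest = hashlib.sha256(str(tenant_id).encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_shards
    return f"{base}_shard_{shard}"


# The same tenant always lands on the same shard queue.
first = shard_queue("priority_checkout", "tenant-7", 4)
again = shard_queue("priority_checkout", "tenant-7", 4)

# Across many tenants, every shard receives traffic.
used = {shard_queue("priority_checkout", t, 4) for t in range(1000)}
```

Changing N remaps most tenants to different shards, so resizing is usually done while the old shard queues are drained rather than in place.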

Why is this different from Type 1? Type 1 is "add capacity uniformly." Type 2 is "hot key needs special treatment, others don't." If you uniformly scale up workers when one queue is hot, you waste 80% of the new capacity on cold queues. You need to target the hot queue with isolated resources.

Inflection points:

  • When one queue's throughput exceeds 20% of total worker capacity, isolate it.
  • When one tenant's job rate exceeds 5x the median tenant, give them dedicated workers.
  • When a recurring hotspot is predictable (Black Friday), pre-provision dedicated capacity.


§11. Decision matrix vs adjacent technology categories

When to pick which.

Dimension | Redis-backed job queue (Sidekiq) | DB-backed (Que / river) | Managed (SQS / Cloud Tasks) | Workflow engine (Temporal)
----------|----------------------------------|-------------------------|-----------------------------|---------------------------
Latency to dispatch | < 1 ms | 1-5 ms | 20-100 ms | 5-50 ms
Throughput / instance | 10k-50k jobs/sec | 5k-20k jobs/sec | 100k+ (managed scaling) | 10k+ workflows
Durability | Redis AOF + replication | DB durability | Managed (11-9s) | Quorum DB
Atomicity with app DB | No (separate broker) | Yes (same DB) | No | No
Job state model | Job blobs in lists / sets | Rows in table | Managed messages | Event-sourced history
Multi-step workflows | DIY (job chains) | DIY | DIY | First-class
Operational burden | Run Redis | Already run DB | None | Run Temporal cluster
Cost at million jobs/day | ~$200/mo (Redis + workers) | Negligible incremental | ~$400/mo SQS + workers | ~$500-2000/mo Temporal infra
When to pick | Mature ecosystem fit (Rails, Python, Node) | Postgres-centric, need transactional outbox | AWS-heavy, no infra appetite | Multi-step processes with state

Pick rules of thumb:

  • Redis-backed (Sidekiq / Celery / BullMQ): if you have a high-volume Rails/Django/Node app, want the fastest enqueue + dequeue, accept Redis as your durability story, and your jobs are stateless short tasks. This is the default for most apps.
  • DB-backed (Que / river / GoodJob): if you're Postgres-centric, want transactional outbox, accept slightly lower throughput, and want to avoid operating a separate broker. Especially compelling for "I want to write to DB and enqueue a job atomically."
  • Managed (SQS / Cloud Tasks): if you're on AWS / GCP, want zero operational overhead, accept 50-100 ms latency, and your jobs are HTTP targets or Lambda invocations. AWS Lambda + SQS is one of the cleanest serverless job patterns.
  • Workflow engine (Temporal): if you have multi-step workflows with state across steps, long durations, or branching logic. A job queue is wrong for "30-day onboarding."

§12. Job patterns — the canonical types of jobs

Beyond "execute a function async," the field has settled on several named patterns. Knowing them by name saves design effort.

12a. Delayed jobs

"Run this at a specific future timestamp." E.g., "send the reminder 24 hours from now." Implementation: a schedule sorted set keyed by execute_at (Sidekiq, BullMQ), a run_at column (Que, river), a scheduleTime field on the task (Cloud Tasks), or the DelaySeconds parameter (SQS, max 15 min). For arbitrary future times beyond 15 min, SQS users typically use an EventBridge scheduled rule that enqueues to SQS at the right moment. The scheduler that polls for due jobs is the runtime cost of delay.

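As a rough illustration of the sorted-set idea, the sketch below keeps delayed jobs in an in-memory heap ordered by execute_at and pops whatever is due on each poll. The class name DelayedQueue and the job strings are hypothetical; a real broker would hold this structure in Redis or a table rather than in process memory.

```python
import heapq
import itertools


class DelayedQueue:
    """In-memory stand-in for a schedule sorted set keyed by execute_at."""

    def __init__(self):
        self._heap = []                # (execute_at, seq, job), smallest execute_at first
        self._seq = itertools.count()  # tie-breaker so job payloads are never compared

    def schedule(self, job, execute_at):
        heapq.heappush(self._heap, (execute_at, next(self._seq), job))

    def pop_due(self, now):
        """Return every job whose execute_at is <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job = heapq.heappop(self._heap)
            due.append(job)
        return due


queue = DelayedQueue()
queue.schedule("send_reminder:42", execute_at=1_000.0)
queue.schedule("expire_cart:7", execute_at=500.0)
early = queue.pop_due(now=600.0)    # only the cart job is due
late = queue.pop_due(now=1_000.0)   # the reminder becomes due at its timestamp
```

The polling loop that calls pop_due on an interval is the "runtime cost of delay" mentioned above: a shorter interval gives tighter timing at the price of more polls.
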
12b. Scheduled / cron-style jobs

"Run every day at 2am." Daily reports, weekly cleanup, hourly cache warming. Most job queues have an external scheduling component:

  • Sidekiq Cron (gem): cron expressions; each tick enqueues a job into the regular queue.
  • Celery beat: separate process that holds the schedule; ticks enqueue jobs.
  • Hangfire RecurringJob: built-in.
  • AWS EventBridge → SQS / Lambda.
  • Cloud Scheduler → Cloud Tasks.

The pattern: a scheduler emits jobs into the queue, the queue's normal machinery handles execution. Cron itself is rarely used directly because cron has no distributed coordination — three application servers each running their own cron would fire three duplicate jobs.

12c. Recurring jobs with stateful intervals

Sidekiq Cron and similar support more than fixed cron — e.g., "run every 5 minutes, but only if the last run completed." This requires the scheduler to track last-run state. Implementation: a record in Redis or DB of last_run_at per recurring job. The scheduler checks before enqueueing.
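
A minimal sketch of that check, assuming the state lives in a plain dictionary instead of Redis or a DB row. RecurringScheduler, tick, and mark_done are hypothetical names, and the "running" flag is one possible way to express "only if the last run completed".

```python
class RecurringScheduler:
    """Enqueue a recurring job only if its interval elapsed and the last run finished."""

    def __init__(self, enqueue):
        self._enqueue = enqueue  # callable that pushes a job name onto the real queue
        self._state = {}         # job name -> {"last_run_at": float | None, "running": bool}

    def tick(self, name, interval_s, now):
        state = self._state.setdefault(name, {"last_run_at": None, "running": False})
        if state["running"]:
            return False         # previous run still in flight: skip this tick
        last = state["last_run_at"]
        if last is not None and now - last < interval_s:
            return False         # interval has not elapsed yet
        state["running"] = True
        state["last_run_at"] = now
        self._enqueue(name)
        return True

    def mark_done(self, name):
        """Called by the worker when the job finishes (success or terminal failure)."""
        self._state[name]["running"] = False


enqueued = []
scheduler = RecurringScheduler(enqueued.append)
results = [
    scheduler.tick("sync", 300, now=0),    # first run: enqueued
    scheduler.tick("sync", 300, now=300),  # still running: skipped
]
scheduler.mark_done("sync")
results.append(scheduler.tick("sync", 300, now=310))  # finished and interval elapsed
```

In a multi-process deployment the read-check-write in tick would need to be atomic (a Redis script or a row lock), which this single-process sketch does not attempt.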

12d. Job chaining

"Job A completes, which triggers job B." The output of A is the input to B. Common implementations:

  • In-job enqueue: at the end of job A's perform, call JobB.perform_async(result). Simple but couples A to B.
  • Event-driven: job A writes its result to a topic/event store; a separate consumer triggers job B. Cleaner but more infrastructure.
  • Workflow engine: Temporal handles this natively as part of the workflow definition.

If you find yourself chaining 5+ jobs, switch to a workflow engine.

12e. Batch jobs with parallelism limits

"Process these 10k records, but only N at a time." Sidekiq Pro has Sidekiq::Batch: enqueue all 10k jobs, and callbacks fire when the batch completes (success/failure). Celery has group() and chord() primitives. BullMQ has flows (the FlowProducer class) for parent-child job trees. For very large batches, partition into chunks and enqueue chunk-jobs that each process M records.

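The chunking step can be sketched as follows. The job name ProcessChunk and the chunk size of 250 are illustrative assumptions, not values from any particular library.

```python
def chunk_ids(record_ids, chunk_size):
    """Yield consecutive slices of at most chunk_size ids, one slice per chunk-job."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(record_ids), chunk_size):
        yield record_ids[start:start + chunk_size]


record_ids = list(range(10_000))
# Each dict stands in for one enqueued chunk-job descriptor.
chunk_jobs = [{"job": "ProcessChunk", "ids": ids} for ids in chunk_ids(record_ids, 250)]
```

With this shape, the parallelism limit comes from how many workers consume the chunk queue, not from the batch itself.
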
12f. Idempotency keys (Stripe pattern)

The pattern: the client (your job, or the user) provides a unique Idempotency-Key; the server (Stripe, GitHub, etc.) checks if it has seen this key; if yes, returns the cached response of the original request; if no, processes normally and caches the response. Window: typically 24 hours for Stripe; 60 days for some APIs. This is the canonical pattern for making external API calls safe under retry. Always pass an idempotency key when calling a non-idempotent external API from a retrying job.
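
A small sketch of both halves of the pattern, under the assumption that the job derives its key deterministically from its own id so that every retry sends the same value. FakePaymentApi is a hypothetical stand-in for the remote server, not Stripe's API.

```python
import hashlib


def idempotency_key(job_id, operation):
    """Derive a stable key: every retry of the same job sends the same value."""
    return hashlib.sha256(f"{operation}:{job_id}".encode()).hexdigest()


class FakePaymentApi:
    """Hypothetical server that caches the first response per idempotency key."""

    def __init__(self):
        self._responses = {}
        self.charges_created = 0

    def charge(self, amount, key):
        if key in self._responses:
            return self._responses[key]   # replay: return the cached response
        self.charges_created += 1
        response = {"charge_id": f"ch_{self.charges_created}", "amount": amount}
        self._responses[key] = response
        return response


api = FakePaymentApi()
key = idempotency_key(job_id="job-123", operation="charge")
first = api.charge(100, key)
retry = api.charge(100, key)   # simulated retry after a timeout
```

Generating the key inside the job body with a random UUID would defeat the purpose, because each retry would then look like a new request.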

12g. Dead-letter queues (DLQ)

A separate queue (or sorted set, or table) where jobs that exhaust retries land. Operators monitor DLQ depth as a leading indicator of systemic issues. The DLQ supports:

  • Manual inspection: view the job, see the last error, decide what to do.
  • Replay: push the job back to the main queue (e.g., after fixing a bug).
  • Drop: delete jobs that are permanently broken (e.g., orphan jobs for deleted users).

Without a DLQ, poison-pill jobs either retry forever (blocking workers) or get silently dropped (data loss). The DLQ is the explicit "we couldn't handle this — human, please look" channel.
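
The retry-then-quarantine flow can be sketched with an in-memory queue. The function name run_with_dlq and the limit of 3 attempts are assumptions for illustration; backoff between attempts is deliberately left out.

```python
from collections import deque


def run_with_dlq(jobs, handler, max_attempts=3):
    """Process jobs; a job that fails max_attempts times moves to the DLQ with its last error."""
    queue = deque({"payload": job, "attempts": 0} for job in jobs)
    done, dlq = [], []
    while queue:
        job = queue.popleft()
        job["attempts"] += 1
        try:
            handler(job["payload"])
            done.append(job["payload"])
        except Exception as exc:          # sketch: treat every error as retryable
            job["last_error"] = str(exc)
            if job["attempts"] >= max_attempts:
                dlq.append(job)           # quarantined for a human to inspect
            else:
                queue.append(job)         # retry later (backoff omitted here)
    return done, dlq


def handler(payload):
    if payload == "poison":
        raise ValueError("cannot parse payload")


done, dlq = run_with_dlq(["a", "poison", "b"], handler)
```

The poison job stops consuming worker time after its third failure, while the healthy jobs around it complete normally.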


§13. Postgres SKIP LOCKED pattern — the transactional outbox

A dedicated section because this pattern has become foundational in modern backend design, and the job queue is one of its main applications.

13a. The dual-write problem

When a service must update its database AND publish an event to a downstream system (Kafka, another service, a job queue), the two writes are not atomic. Four outcomes are possible, two of them harmful:

  1. DB write succeeds, event publish fails → downstream missed the event.
  2. DB write fails, event publish succeeds → downstream sees an event that doesn't reflect reality.
  3. Both fail.
  4. Both succeed (happy path).

Outcomes 1 and 2 are the "dual-write problem." They are inherent any time you have two systems that must be updated together but lack a distributed transaction.

13b. The transactional outbox solution

Write to your DB and write to an outbox table in the same transaction. A separate worker tails the outbox and forwards each event/job to the downstream system. Because the outbox table is in the same DB, the insert is atomic with your business update: either both happen or neither does. The outbox worker is at-least-once (it sees the row, ships it, deletes it; if it crashes between shipping and deleting, it re-ships next time). At-least-once is acceptable as long as the downstream consumer is idempotent.

BEGIN;

UPDATE orders
SET status = 'paid', paid_at = now()
WHERE id = 42;

INSERT INTO outbox (event_type, payload, created_at)
VALUES ('OrderPaid', '{"order_id":42,"amount":100}', now());

COMMIT;

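A separate worker claims a batch inside a transaction, so the row locks taken by FOR UPDATE are held until the rows are marked:

BEGIN;

-- Claim up to 10 unprocessed rows; SKIP LOCKED passes over rows
-- that another worker's open transaction has already locked.
WITH next AS (
  SELECT id, event_type, payload
  FROM outbox
  WHERE processed_at IS NULL
  ORDER BY id
  LIMIT 10
  FOR UPDATE SKIP LOCKED
)
SELECT * FROM next;

Worker publishes each event to Kafka, then, still in the same transaction:

UPDATE outbox
SET processed_at = now()
WHERE id IN (...);

COMMIT;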

(Or DELETE; some teams keep historical outbox rows for audit.)
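
To make the atomicity claim concrete, here is a runnable sketch that uses SQLite from the Python standard library as a stand-in for Postgres. SQLite has no FOR UPDATE SKIP LOCKED, so this shows only the single-transaction write and a single-worker relay, not concurrent claiming. The function names mark_paid and relay are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "event_type TEXT, payload TEXT, processed_at TEXT)"
)
conn.execute("INSERT INTO orders (id, status) VALUES (42, 'pending')")
conn.commit()


def mark_paid(conn, order_id, fail=False):
    """Business write and outbox insert commit together or not at all."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        if fail:
            raise RuntimeError("simulated crash before the outbox insert")
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPaid", json.dumps({"order_id": order_id})),
        )


def relay(conn, publish, batch=10):
    """Ship unprocessed rows in id order, marking each one after it is published."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE processed_at IS NULL ORDER BY id LIMIT ?",
        (batch,),
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # at-least-once: may repeat after a crash
        with conn:
            conn.execute(
                "UPDATE outbox SET processed_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
    return len(rows)


try:
    mark_paid(conn, 42, fail=True)
except RuntimeError:
    pass
status_after_failure = conn.execute("SELECT status FROM orders WHERE id = 42").fetchone()[0]

mark_paid(conn, 42)
published = []
shipped = relay(conn, lambda kind, body: published.append((kind, body)))
```

The failed call leaves the order untouched and no outbox row behind, which is the property a separate broker cannot provide.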

13c. Why this is also a job queue

If you squint, the outbox IS a job queue:

  • INSERT into outbox = enqueue.
  • SELECT FOR UPDATE SKIP LOCKED = claim.
  • DELETE = ack.
  • UPDATE with retry_at = retry with backoff.
  • A separate failed_outbox table = DLQ.

This is precisely how Que, GoodJob, and river work. The transactional outbox is the canonical use case for DB-backed job queues — it's the one place where atomicity with the business DB is structurally required, and a separate broker (Redis, SQS) can't give you that.

13d. When the outbox pattern is the right fit

  • You're already using Postgres / MySQL.
  • You need atomicity between a business write and a job enqueue.
  • Your job throughput is below ~30k/sec (the practical ceiling of PG-backed queues).
  • You want one durability story (the application DB), not two.

When it's not the right fit:

  • Jobs at very high throughput (>50k/sec): Redis or a managed queue scales better.
  • Jobs that don't need atomicity with the DB: the outbox pattern is overhead you don't need.

13e. The pattern in production

LinkedIn's Brooklin (a CDC system, conceptually similar to the outbox tailing pattern), Debezium (a CDC connector that tails MySQL binlog and Postgres WAL — Write-Ahead Log), and "outbox publishers" in many e-commerce and fintech systems are all realizations of the same insight. The DB is the durable single source of truth; the outbox makes the "and-then" of "DB update and-then publish event" atomic.


§14. SQS deep dive

A dedicated section because SQS is the canonical managed job queue and the abstractions matter.

14a. Standard queue

  • Throughput: virtually unlimited. Per-queue billions of messages/day demonstrated.
  • Ordering: best-effort; in practice mostly in-order but not guaranteed. Concurrent producers and replication can reorder.
  • Delivery: at-least-once. Duplicates are common (low single-digit percentage in practice).
  • Use for: the vast majority of work-queue patterns where order doesn't matter and idempotency is in place.

14b. FIFO queue

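  • Throughput: 300 API calls/sec per action (send, receive, delete) per queue without batching, or 3,000 messages/sec with 10-message batches. High-throughput mode raises these limits and scales with the number of distinct MessageGroupId values.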
  • Ordering: strict FIFO within each MessageGroupId.
  • Delivery: exactly-once within a 5-minute dedup window (based on MessageDeduplicationId).
  • Use for: per-user ledger updates, per-order fulfillment sequences, anywhere ordering matters.

The 5-minute window is the key constraint. If a producer sends MessageDeduplicationId = "abc123" at T0 and again at T0+4 minutes, the second is silently deduped. At T0+6 minutes, the second one would be accepted (window expired). So "exactly-once" is "exactly-once within a 5-minute window." For longer dedup needs, your application must layer its own idempotency.
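
The T0 / T0+4 / T0+6 timeline can be reproduced with a toy model. This is a sketch of the observable behavior only, not of how SQS implements deduplication; DedupWindow and offer are hypothetical names.

```python
DEDUP_WINDOW_S = 5 * 60  # the 5-minute window described above


class DedupWindow:
    """Accept a message unless the same deduplication id was accepted within the window."""

    def __init__(self, window_s=DEDUP_WINDOW_S):
        self._window_s = window_s
        self._accepted_at = {}  # dedup id -> time the id was last accepted

    def offer(self, dedup_id, now):
        last = self._accepted_at.get(dedup_id)
        if last is not None and now - last < self._window_s:
            return False        # duplicate inside the window: dropped
        self._accepted_at[dedup_id] = now
        return True


window = DedupWindow()
at_t0 = window.offer("abc123", now=0)
at_4_min = window.offer("abc123", now=4 * 60)
at_6_min = window.offer("abc123", now=6 * 60)
```

An application-level idempotency table is the same structure with a longer window and durable storage.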

14c. Visibility timeout

Default 30 seconds. Configurable up to 12 hours. The worker can extend at any time via ChangeMessageVisibility. The standard pattern is short timeout (30 sec) + heartbeat extension every 20 seconds for long jobs.
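
One way to sketch the heartbeat, assuming sqs is a boto3 SQS client (or any object exposing change_message_visibility with the same keyword arguments). The helper name run_with_heartbeat is hypothetical.

```python
import threading


def run_with_heartbeat(sqs, queue_url, receipt_handle, work,
                       timeout_s=30, heartbeat_s=20):
    """Run work() while a background thread keeps extending the visibility timeout."""
    stop = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout (extend again) and True once stop is set.
        while not stop.wait(heartbeat_s):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=timeout_s,
            )

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        return work()
    finally:
        stop.set()       # stop extending as soon as the work ends or raises
        thread.join()
```

If the worker process dies, the heartbeat dies with it, and the message becomes visible again after at most timeout_s seconds. That is the reason to prefer a short timeout plus extension over one long timeout.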

14d. DLQ via RedrivePolicy

Configure on the source queue: RedrivePolicy = {"deadLetterTargetArn": "...", "maxReceiveCount": 5}. After 5 receives without successful delete (i.e., 5 timeouts), SQS auto-moves to the DLQ. The DLQ is just another SQS queue; you can configure alarms on its depth.

14e. Long polling

ReceiveMessage with WaitTimeSeconds=20 (the max). SQS holds the connection open for up to 20 seconds, returning immediately when a message arrives. Saves cost (fewer empty receives) and reduces latency (no polling delay). Always use long polling. Short polling exists for legacy reasons.
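
A minimal receive loop body under the same assumption that sqs is a boto3 SQS client or a compatible object. drain_once is a hypothetical helper name.

```python
def drain_once(sqs, queue_url, handler, wait_s=20, max_messages=10):
    """One long-poll receive; delete each message only after handler succeeds."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=wait_s,  # long polling: wait up to wait_s for a message
    )
    handled = 0
    for message in response.get("Messages", []):  # key is absent when nothing arrived
        handler(message["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

If handler raises, the exception propagates before the delete, so the message reappears after its visibility timeout and counts toward maxReceiveCount.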

14f. AWS Lambda + SQS event source

Configure a Lambda function with an SQS event source. AWS auto-polls the queue, batches messages (10 per invocation by default; the batch size is configurable), and invokes the Lambda. Lambda concurrency auto-scales: an empty queue means no invocations and no cost; spiky load means more concurrent Lambdas (up to your account limit). Effectively a serverless job queue. Used heavily in event-driven cloud architectures.

The pattern: producer publishes to SQS, Lambda consumes. You manage the Lambda code; AWS manages everything else. Operational burden ~0. Cost: ~$0.40 per million SQS requests + Lambda execution cost ($0.20 per million invocations + duration cost). Each job typically costs about three SQS requests (send, receive, delete), so for modest workloads (~1M jobs/day), total cost is ~$50-100/month, cheaper than running a single EC2 instance.
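
The request-level arithmetic can be checked with a few lines. The two unit prices are the figures quoted above; three SQS requests per job (send, receive, delete), one job per invocation, and a 30-day month are assumptions, and Lambda duration cost and free tiers are left out because they depend on the workload.

```python
SQS_PER_MILLION_REQUESTS = 0.40        # USD, figure quoted above
LAMBDA_PER_MILLION_INVOCATIONS = 0.20  # USD, figure quoted above


def monthly_request_cost(jobs_per_day, sqs_requests_per_job=3,
                         jobs_per_invocation=1, days=30):
    """Request-level cost only; Lambda duration cost is workload-specific and excluded."""
    jobs = jobs_per_day * days
    sqs = jobs * sqs_requests_per_job / 1_000_000 * SQS_PER_MILLION_REQUESTS
    invocations = jobs / jobs_per_invocation
    lam = invocations / 1_000_000 * LAMBDA_PER_MILLION_INVOCATIONS
    return round(sqs + lam, 2)


one_million_per_day = monthly_request_cost(1_000_000)  # about 42 USD before duration cost
```

The result scales linearly with job volume, and batching (more jobs per SQS request or per invocation) is the main lever for reducing it.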


§15. Use cases across domains

Different domains use job queues in different ways.

15a. Email sending (background SMTP delivery)

The original use case. User signs up → enqueue welcome email. The email job calls SendGrid / Mailgun / Postfix; on transient failure (5xx, timeout), the job retries with backoff. On permanent failure (bounce, invalid address), the job moves to DLQ for the support team to inspect. Why a queue? Email APIs are slow (~300 ms p99) and occasionally degraded; running them inline blocks the user request. Why not a workflow engine? Single step, no state — overkill.

15b. Image / video processing

User uploads a video; backend enqueues TranscodeVideo(video_id). Worker downloads the original from S3, runs ffmpeg with multiple output profiles, uploads results back to S3, updates the DB with the result URLs. Each video is ~5-30 minutes of CPU. Why a queue? The work is expensive and parallelizable; user shouldn't wait. Workers can be GPU-equipped machines that autoscale on queue depth. Notification fans out when the job completes (push to user's app).

15c. Payment retries (webhook delivery to merchants)

Stripe's webhook delivery model. A merchant subscribes to webhook events (payment_intent.succeeded, etc.); Stripe enqueues a delivery job for each event-merchant pair. The job POSTs to the merchant's URL; on a non-2xx response, it retries with exponential backoff (Stripe's documented schedule: increasing delays, from minutes to hours, for up to 3 days, then a DLQ-equivalent terminal state). Why a queue? Merchants' endpoints are slow, sometimes down, sometimes broken. The retry budget is the contract. Stripe's webhook reliability is one of their core developer-experience promises.
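
A retry budget of this shape can be sketched as a schedule generator. The 60-second base and the factor of 4 are illustrative assumptions and are not Stripe's actual schedule; only the 3-day cap comes from the text above.

```python
def backoff_schedule(base_s=60, factor=4, budget_s=3 * 24 * 3600):
    """Return cumulative retry times (seconds after the first attempt) within the budget."""
    times, delay, elapsed = [], base_s, 0
    while elapsed + delay <= budget_s:
        elapsed += delay
        times.append(elapsed)
        delay *= factor   # exponential growth between consecutive retries
    return times


schedule = backoff_schedule()
```

With these parameters the last retry lands a little under 23 hours after the first attempt, because the next delay would overshoot the 3-day budget. Production systems usually add random jitter to each delay so that many failing deliveries do not retry in lockstep.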

15d. Order fulfillment

User completes checkout → order written to DB → enqueue FulfillOrder(order_id). The fulfillment job calls the warehouse system, allocates inventory, generates a shipping label, schedules pickup. On any failure, retries with backoff; on permanent failure, alerts the operations team. Why a queue? Many independent backend calls, some of which are slow (warehouse API takes seconds). Decouples checkout latency from fulfillment correctness. Many large e-commerce platforms run this pattern at peak throughputs of hundreds of thousands of orders/hour during major sales events.

15e. Data import / export

User uploads a 100 MB CSV with 1M rows; backend enqueues ImportCSV(file_id). The import job streams the file from S3, parses chunks of 1000 rows, INSERTs into Postgres in batches, updates a progress counter visible to the user. Total run time: minutes to hours. Why a queue? A request handler can't run for hours; the user wants to navigate away and come back. Status displayed via a separate GET /imports/{id} endpoint that reads the progress counter.
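
The chunk-and-report loop can be sketched with the standard csv module. The callbacks insert_batch and on_progress are hypothetical stand-ins for the Postgres batch INSERT and the progress-counter update.

```python
import csv
import io


def import_csv(stream, insert_batch, on_progress, chunk_rows=1000):
    """Stream rows, flush them in batches, and report cumulative progress after each flush."""
    reader = csv.DictReader(stream)
    batch, total = [], 0
    for row in reader:
        batch.append(row)
        if len(batch) == chunk_rows:
            insert_batch(batch)
            total += len(batch)
            on_progress(total)
            batch = []
    if batch:  # final partial chunk
        insert_batch(batch)
        total += len(batch)
        on_progress(total)
    return total


data = io.StringIO("id,name\n" + "".join(f"{i},user{i}\n" for i in range(2500)))
batches, progress = [], []
total = import_csv(data, batches.append, progress.append)
```

Because the job may be retried after a crash, the batch INSERT in a real implementation would need to be idempotent (for example an upsert keyed on the row id), which this sketch does not show.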

15f. Notification fanout

A user on a LinkedIn-style social feed posts an update. The service writes the post to its DB, then enqueues FanOutNewPost(post_id, author_id). The fan-out job reads the author's follower list (maybe 10k users) and enqueues DeliverNotification(user_id, post_id) for each. Each delivery job pushes to the user's app (push notification, in-app feed update). Total: 1 enqueue produces 10k child jobs. Why a queue? Fan-out is bursty and parallel; the original poster's request can't wait for 10k notifications.

15g. ML batch inference

Daily job: score all users' likelihood to churn. Enqueue ScoreUserChurn(user_id) for each of 10M users. Workers (often GPU pods) pull jobs, batch them in groups of 256 for GPU efficiency, run the model, write predictions to a feature store. Total: 10M jobs/day batched into ~40k GPU calls. Why a queue? Predictable burst of work that needs parallelism + idempotency (re-scoring is fine). The queue smooths the work over an hour, avoiding the "big SQL transaction" anti-pattern.

15h. Database maintenance

Periodic cleanup: delete expired sessions, archive old data, compact tables. Cron-style scheduled jobs enqueue work jobs. Why a queue? Cron alone doesn't distribute work across multiple servers; the queue does. Also: the maintenance jobs can be partitioned ("clean up 100k rows at a time, in 10 parallel chunks") for parallelism without locking.

15i. Webhook delivery (Stripe, GitHub, Shopify style)

Outbound webhook delivery is itself a job queue: the platform records "I owe webhook X to merchant Y" and a worker delivers it with retries. GitHub's webhook delivery system reportedly handles tens of millions of webhook deliveries per day; Stripe's is similar. Both have public documentation of their retry schedules and DLQ behavior.

15j. Search index updates

User updates a product (e.g., changes the title). Backend writes to Postgres + enqueues ReindexProduct(product_id). Worker reads the latest state from Postgres, computes the indexable document, calls Elasticsearch's _update API. Why a queue? Decouples the write path (fast) from the indexing path (sometimes slow during ES degradation). Also batches updates — if the same product is updated 5 times in 10 seconds, the deduplicating queue (Sidekiq Unique Jobs, BullMQ deduplication) processes only the latest.
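
The coalescing behavior can be sketched as a queue that holds at most one pending job per product. This is an illustration of the idea only, not of how Sidekiq Unique Jobs or BullMQ deduplication are implemented; CoalescingQueue is a hypothetical name.

```python
class CoalescingQueue:
    """Keep at most one pending job per key; enqueues for an already-pending key are dropped."""

    def __init__(self):
        self._pending = {}  # dicts preserve insertion order (Python 3.7+)

    def enqueue(self, product_id):
        if product_id in self._pending:
            return False    # already queued: the pending job will read the latest state
        self._pending[product_id] = True
        return True

    def drain(self, reindex):
        while self._pending:
            product_id = next(iter(self._pending))
            del self._pending[product_id]  # remove before running so new updates re-enqueue
            reindex(product_id)


queue = CoalescingQueue()
accepted = [queue.enqueue(pid) for pid in (7, 7, 7, 9, 7)]
reindexed = []
queue.drain(reindexed.append)
```

Dropping the duplicates is safe here only because the worker reads current state from Postgres at execution time and does not carry the changed fields in the job arguments.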


§16. Real-world implementations with numbers

Concrete deployments illustrating the scale range.

Shopify's Sidekiq Enterprise deployment is one of the largest known Sidekiq installations. Engineering blog posts and public talks describe a fleet of thousands of worker pods organized by merchant tier (pods per tier provide isolation — a noisy merchant doesn't starve others). Peak job rates during Black Friday Cyber Monday have been reported in the hundreds of thousands per second across the fleet, with daily volumes reaching the hundreds of millions on peak days. Features used: Sidekiq Enterprise (Periodic, Unique Jobs, Encryption, Rate Limiting), pod-per-shard isolation, custom queue routing.

GitHub runs background jobs heavily for webhook delivery, repository indexing, notification fan-out, and CI orchestration. Their public engineering blog has described the evolution of their internal job queue from Resque (their original, open-sourced in 2009) toward more modern infrastructure as scale grew.

Stripe's webhook delivery system delivers tens of millions of webhooks per day. The public documentation outlines the retry schedule (immediate, then increasing delays over 3 days), the idempotency-key model, and the DLQ-style "event will not retry further" terminal state. The system is one of the more battle-tested public examples of webhook-delivery-as-job-queue.

AWS SQS reportedly handles billions of messages per day across all customers (number widely cited in AWS marketing and confirmed in re:Invent talks). A single high-volume customer can hit a few billion messages/month — game backends, log delivery, media processing platforms.

Discord has publicly described using Celery (Python) for asynchronous background work, including notification delivery and trust-and-safety pipelines. Scale claimed in posts: tens of millions of users, hundreds of millions of jobs per day.

Airbnb originally built a custom job system (Chronos-derived) before migrating to other tools as the company grew. Their public engineering content describes the iterative path from "custom thing" to managed solutions.

Resque (the GitHub-built predecessor to Sidekiq, which was developed independently of GitHub) is now largely legacy but is still in use in older Rails codebases. Sidekiq is the dominant modern Ruby choice; Resque survives in maintenance mode.

Hangfire is the dominant .NET background job library, used in many enterprise .NET shops. Scale per public references: thousands of jobs/sec on typical SQL Server backends.

Que / GoodJob / river are smaller-scale by deployment count, but the DB-backed approach has been gaining traction. river specifically (2023+) is the modern Go entry and is being adopted in greenfield projects that want transactional outbox semantics without a separate broker.

Cloud Tasks (Google) has fewer public scale numbers but is widely used in GCP-hosted serverless apps, where it fills the same role that SQS fills on AWS.

The takeaway: at low to mid scale, the queue choice is mostly about ecosystem fit (Ruby → Sidekiq, Python → Celery, Node → BullMQ, .NET → Hangfire, Go → river or asynq). At high scale, the pattern matters more than the tool — pod isolation, queue sharding, dedicated worker fleets per tier. The named tools are largely converging in capability; the operational sophistication is where the engineering investment lives.


§17. Summary

A background job queue is the substrate that turns "do many things reliably after returning to the user" from an application-level concern into infrastructure. It accepts work-to-be-performed, durably stores it, dispatches it to a worker fleet at-least-once, retries on failure with backoff, schedules delayed execution via sorted sets or run_at columns, and quarantines poison pills in a dead-letter queue. It is NOT a message log (events for many consumers, replayable) and NOT a workflow engine (multi-step stateful processes); it is the one-shot, short-atomic-job primitive that lives between the request handler and the side effect. The Redis-backed realization (Sidekiq, Celery, BullMQ) gives sub-millisecond enqueue + dequeue and tens of thousands of jobs/sec per instance on commodity hardware; the Postgres SKIP LOCKED realization (Que, river) gives transactional atomicity with your application data, the killer feature for the outbox pattern; the managed realizations (SQS, Cloud Tasks) trade latency for zero operational overhead. Across all variants, the contract is the same: at-least-once delivery (you layer idempotency), per-queue ordering (you partition by key for finer ordering), visibility-timeout-driven liveness (the broker doesn't trust the worker is alive without a heartbeat). The technology is mature, the patterns are settled, and the distinction between "make a job idempotent and trust the queue" versus "fight the queue with custom retry logic" is the difference between a system that scales and one that creates incidents.