A technology reference. Workflow engines as a class — Temporal, Cadence, AWS Step Functions, Apache Airflow, Argo Workflows, Netflix Conductor — sit at a specific layer in the backend stack, solve a specific problem (durable orchestration of multi-step processes), and have a specific internal structure (event sourcing plus deterministic replay, or some compromised variant of it). This doc is about the technology. We will see it serving payment lifecycles, ride dispatch, ETL (Extract Transform Load) pipelines, ML (Machine Learning) training, video transcoding, CI/CD (Continuous Integration / Continuous Delivery), and notification campaigns — different domains exploiting the same primitives.
§1. What workflow engines ARE
A workflow engine is the piece of infrastructure that makes a multi-step process survive crashes, restarts, deploys, and time. You write code (or declare JSON, or draw a graph) that says "do A; if A succeeds, do B; if A fails, compensate and notify; wait three days for human approval; then do C." The engine guarantees that — once started — that process runs to completion regardless of who or what dies along the way. State is durable. Step retries are managed. The path is auditable.
Three deliberately broad descriptions: the engine orchestrates (it tells other systems what to do, but the work happens in those systems), it is durable (state survives process restarts, persisted to a database before any reply is acknowledged), and it is long-running-capable (workflows can sit blocked for milliseconds or for months waiting for a signal, a timer, or a human).
Where it sits in the stack:
┌─────────────────────────────────────────────────┐
│ Application / business logic │
│ (the workflow code or definition) │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ ★ WORKFLOW ENGINE (this layer) │
│ durable state, retries, timers, history, │
│ sticky dispatch, idempotency support │
└─────────────────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Persistent store │ │ Worker fleet │
│ (Cassandra, │ │ (executes activity │
│ MySQL, DynamoDB) │ │ business logic) │
└─────────────────────┘ └─────────────────────┘
│
▼
External systems (payment gateway, DB, message queue, S3,
k8s pods, ML training cluster, ...)
The workflow engine is not the worker. It is the coordinator that tells workers what to do next based on the history of what they have already done.
Distinguish from adjacent categories. Several technologies look like workflow engines from a distance. They are not.
- Cron / scheduler. Triggers on a clock. No concept of multi-step dependencies, retries beyond rerun-from-scratch, durable state, or compensation. Cron + status table is the naïve replacement; §9 dissects why it dies.
- Message queues (Kafka, RabbitMQ, SQS). Move messages. No concept of "what step the workflow is on" or "retry this specific activity within the larger flow." You can choreograph on top, but you've moved every workflow concern — state, ordering, compensation, audit — into your application code.
- Saga libraries (Axon, Eventuate). Code-level libraries that help structure compensation. Still need something underneath that owns durable state. The engine replaces both.
- Batch DAG schedulers (Airflow, Luigi, Prefect, Dagster). Workflow engines, but the batch shape — daily/hourly DAGs (Directed Acyclic Graphs) of data tasks, where a run is one execution of a fixed DAG against a fixed window. Different concerns dominate: backfill, idempotent task reruns over a partition key, data lineage. They do not optimize for "millions of concurrent workflows each at a unique point." The design space splits hard between them and Temporal-class engines.
- Function-as-a-service event chains (Lambda + EventBridge). Choreography without a coordinator. State in databases between functions; retries per-function, not per-workflow.
Not good for:
- Sub-millisecond request paths. The engine writes to a quorum-replicated store on every transition; not in a 10ms p99 RPC path. Workflows trigger fan-out work; the trigger itself is async.
- Pure data transformation at line-rate. 10M events/sec through a transform wants a stream processor (Flink). Per-transition overhead too high.
- Arbitrary scale per single workflow. A workflow's history is bounded (Temporal: 10MB/50k events). One workflow processing a billion items is wrong; the pattern is parent-fanout-to-child or
ContinueAsNew(§7). - Bypassing determinism. Code-as-workflow engines sandbox workflow code: no
time.Now, no random, no I/O outside activities. Fighting this is fighting the technology.
§2. Inherent guarantees
Provided by design.
- Durable workflow state. Every state transition is persisted to a quorum-replicated store before being acknowledged. If a process dies mid-workflow, the next worker that picks it up reaches an identical state by reading the persisted history. No workflow is silently lost.
- Exactly-once activity dispatch from the engine's perspective. A given invocation — identified by a stable
activity_id— has at most one in-flight attempt at any moment. Retries happen with stable identity, after the previous attempt has been declared failed or timed out. Never two parallel schedules of the same activity. - Automatic retries with configurable backoff. Declarative policy: initial interval, multiplier, max attempts, max duration, non-retryable error types. Activity failures caught by the engine, not user code.
- Deterministic replay (Temporal/Cadence-class). State is reconstructible by re-running workflow code from event 1 against persisted history. Same code + same history always reaches the same next decision. Durability is cheap — no mid-workflow checkpointing, just replay.
- Long-running waits as a first-class primitive. A workflow can sleep 30 days, wait for a signal, or wait for a timer, without holding process resources. The engine durably schedules the wake-up.
- Complete audit history. Every event — start, decision, activity scheduled/completed/retried, timer fired, signal received, completion — with timestamps, payloads, failure traces.
- Idempotency on workflow start.
StartWorkflow(workflow_id="order-123")twice produces one workflow — the second returns the existing run.
Not provided. Must be layered on.
- Exactly-once side effects. Stable
activity_idacross retries; not exactly-once external effect. If your activity charges Stripe and the network drops after the charge, the retry charges again unless you pass an idempotency key. Application's job. - Low-latency dispatch under load. Single-digit ms at p50; intentionally not sub-ms. Wrong tool inside a single RPC budget.
- Arbitrary throughput per single workflow. A workflow is a single shard owner; one workflow at 10k events/sec hammers one persistent-store partition. The engine scales by workflow count, not per-workflow rate (§10).
- Bypass of determinism. Call
time.Now()in workflow code and replay diverges —NonDeterministicWorkflowError. Load-bearing safety, but it exposes bugs rather than masking them. - Cross-workflow distributed transactions. Two workflows atomically committing a joint outcome — not provided. Build a saga (a workflow that calls compensating activities) on top.
The contract: engine owns workflow-level durability, retry semantics, history, timer machinery. Application owns side-effect idempotency, business-correct compensation, and the data plane outside the workflow.
§3. The design space
Four distinct designs in the workflow-engine family, each with a different center of gravity, plus Argo as a fifth Kubernetes-native variant.
A. Code-as-workflow with deterministic replay (Temporal, Cadence). You write workflow code in a real language (Go, Java, Python, TypeScript, .NET). The engine intercepts every async point — await ChargePayment(...), workflow.Sleep(2h), signal.Receive(ctx) — and persists results as history events. Replay rebuilds state by re-running the code over the event log. Strengths: full expressivity, IDE support, native compensation (defer cancelBlock), long-running waits trivial, signals/queries first-class. Mainstream for business process orchestration — payments, order lifecycles, KYC (Know Your Customer) onboarding. Costs: determinism constraint forces discipline; SDK magic must be understood to debug; significant engineering investment to operate self-hosted.
B. Declarative state machine (AWS Step Functions, Azure Durable Functions in some modes). Workflows are JSON/YAML state machines: states, transitions, choices, parallel branches, retries. Activities referenced by name (e.g., Lambda ARN — Amazon Resource Name). Strengths: no determinism constraint, strong visual tooling, tight cloud integration (Step Functions can call ~200 AWS services directly). Costs: limited expressivity, JSON sprawls for complex flows, vendor lock-in. Step Functions Standard caps ~4k transitions/sec/account. Best for cloud-native short orchestrations and AWS-heavy ETL.
C. Batch DAG schedulers (Airflow, Prefect, Dagster, Luigi). A DAG of tasks, typically in Python. The engine schedules DAG runs against a parameterization — usually a date partition. A "workflow" is a DAG run, often on a schedule. Strengths: tailored for batch ETL/ML — backfills, idempotent task reruns over a partition key, data lineage, SQL/Spark integrations. Costs: wrong model for "10M concurrent workflows each at a unique point." Long-running waits awkward (sensors). Per-task overhead hundreds of ms.
D. JSON workflow DAG with worker poll (Netflix Conductor). JSON DAG of tasks (HTTP_TASK, DECISION_TASK, FORK_JOIN, SUB_WORKFLOW). Workers poll, execute, POST result. State in the engine's DB. Strengths: language-agnostic, good for polyglot teams. Netflix uses it for video encoding (Cosmos), recommendation refresh, content ingestion. Costs: less expressive than code-as-workflow; no deterministic replay (state is mutable); operational maturity behind Temporal.
E. Argo Workflows — Kubernetes-native. Kubernetes CRDs (Custom Resource Definitions) define workflows; each step is a pod. State in etcd via k8s. Targets CI/CD pipelines, ML training, infrastructure automation. Strengths: zero distance to Kubernetes, native to GitOps. Costs: each step a pod (overhead ~seconds), not viable for high-frequency micro-orchestrations.
Comparison table
| Dimension | Temporal/Cadence | Step Functions | Airflow | Conductor | Argo Workflows |
|---|---|---|---|---|---|
| Definition model | Code (Go/Java/Python/TS/.NET) | JSON state machine | Python DAG | JSON DAG | YAML CRD |
| Persistence | Event-sourced history (Cassandra/MySQL/Postgres) | Managed | Metadata DB (Postgres/MySQL) | State DB (Postgres/Dynamo/ES) | etcd via k8s |
| Replay model | Deterministic replay | N/A (declarative) | Task rerun within DAG run | Resume from current task | Resume from current step |
| Determinism on user code | Yes (sandboxed) | None | None | None | None |
| Long-running waits | Native (years) | Native (≤1 yr Standard) | Awkward (sensors) | Native | Pod sleep / external |
| Throughput ceiling | ~100k+ transitions/sec sharded | ~4k/sec Standard | ~10k tasks/min | ~10k workflows/min | ~hundreds/min (pod overhead) |
| Concurrency in flight | Millions | Up to ~1M | Thousands (DAG runs) | Hundreds of thousands | Limited by k8s |
| Compensation | First-class via code | Catch / parallel branches | Painful | Built-in tasks | DIY |
| Visibility | Full event UI | Visual graph | DAG-run UI | UI + REST | argo CLI / UI |
| Versioning of long-running flows | GetVersion / patches | State machine versions | DAG file replacement | JSON definition update | YAML update |
| Lock-in | Self-host or Temporal Cloud | AWS only | OSS / Astronomer | OSS | Self-host on k8s |
| Typical fit | Business processes, sagas, payments | AWS-heavy short orchestrations | Scheduled batch ETL/ML | Polyglot media/content | k8s-native CI/CD, ML |
The variants overlap but each has a center of gravity. Picking is mostly about which center you live closest to.
§4. Underlying data structure — byte-level mechanics
This is the section that separates Staff-level understanding from a feature-list reading. The depth differentiator: event sourcing plus deterministic replay.
4a. Event-sourced history
The fundamental abstraction: a workflow is its history, not a row. History is an append-only sequence of typed events; state is derived by re-running workflow code over the event log.
Two tables in a Temporal/Cadence-class engine (Cassandra canonical; MySQL/Postgres for smaller deployments):
CREATE TABLE history_node (
shard_id int,
workflow_id text,
run_id uuid,
branch_id uuid, -- supports fork-on-reset
node_id bigint, -- first event id in this batch
txn_id bigint, -- optimistic concurrency token
events blob, -- batch of N serialized events
PRIMARY KEY ((shard_id), workflow_id, run_id, branch_id, node_id)
);
CREATE TABLE executions (
shard_id int,
namespace_id uuid,
workflow_id text,
run_id uuid,
execution blob,
next_event_id bigint,
PRIMARY KEY ((shard_id), namespace_id, workflow_id, run_id)
);
history_node is the immutable event log (append-only). executions is a small mutable pointer per workflow: next event ID, current branch, last status. Mutations to executions use a txn_id compare-and-swap — optimistic concurrency per workflow.
An event is a protobuf:
EventID: int64 (monotonic within a run)
EventType: enum (WorkflowExecutionStarted, ActivityTaskScheduled,
ActivityTaskCompleted, TimerStarted, SignalReceived, ...)
EventTime: timestamp
TaskID: int64 (global ordering within shard)
Version: int64 (multi-cluster replication version)
Attributes: oneof (per-event-type payload)
Typical event size: 200B-2KB. A batch in history_node holds 5-20 events to amortize row overhead. Hard cap: 50,000 events or 10MB, whichever first.
4b. Why event sourcing fits the access pattern
Two dominant operations:
- Append events at the tail. Pure sequential write. Cassandra's LSM (Log-Structured Merge tree) is built for this: append to memtable, flush to SSTable (Sorted String Table). ~10-20k writes/sec/node comfortable.
- Read history range for replay. Sequential scan inside one Cassandra partition, inside one clustering range. Served from SSTables plus memtable.
Append-mostly + occasional bulk range scan = exactly what LSM trees are designed for. The alternative — a single mutable state row — fails three ways: no audit (mutable rows show only current state), no deterministic replay (intermediate decisions unrecoverable), no retry semantics (start vs. completion must be distinct events). Event sourcing is the only structure that supports all three simultaneously.
4c. Tradeoff: event-sourced history vs. mutable state row
| Dimension | Event-sourced history (LSM) | Mutable state row (B+ tree) |
|---|---|---|
| Audit | Complete by construction | Need a separate change-log |
| Replay | Native | Impossible without change-log |
| Append throughput | ~10-20k/sec/node | ~2-5k/sec/node (page splits) |
| Read latest state | Scan history + replay (cached) | Single point read |
| Storage cost | 10-100x more | Lean |
| Schema evolution | Hard (events immutable forever) | Easy (ALTER TABLE) |
What you give up choosing event sourcing: cheap state queries (mitigated by sticky caches) and schema flexibility (event protos must be backward-compatible forever). What you gain: durability, replay, audit — for free.
This is why Conductor — which uses a more mutable model — has weaker replay semantics than Temporal. The choice is architectural, not cosmetic.
4d. Sharding by workflow_id
The engine shards by hash(namespace + workflow_id) mod N, typically N=4096 (16k at very large deployments). One shard is owned by exactly one history-service node at a time. The shard owns the in-memory mutex for every workflow ID in it — no two requests mutate the same workflow concurrently cluster-wide. All Cassandra rows for the shard's workflows live in one partition; reads/writes are local to one replica set. The shard also maintains transfer/timer/replication queues as rows in the same partition.
When a history node dies, a membership protocol (Ringpop-style) detects and reassigns shards to survivors. New owners reload executions rows into cache — tens of seconds outage, no data loss because every event was quorum-written.
4e. Sticky task queues
Cassandra owns durability; replay is expensive. The sticky-execution cache: once a worker has replayed a workflow, it keeps the in-memory state alive and owns subsequent decision tasks until it crashes, evicts under memory pressure, or the workflow goes idle.
First decision task for workflow_id=X:
matching picks any free worker W from the task-queue partition
W replays full history (N events) → in-memory state
W signals: "I am sticky for X; send future tasks to me"
Subsequent decision tasks for workflow_id=X:
matching routes directly to W's sticky queue
W executes the new code; no replay needed
Per-decision cost: O(new events), not O(total history)
W crashes:
matching detects sticky timeout
next decision goes to the global task queue
some worker W' picks up, replays full history, becomes new sticky owner
At ~95% sticky hit rate, average replay cost amortizes across hundreds of decisions. Without sticky cache, every decision pays full replay — at 50-event median, that's ~50x more worker CPU.
4f. The determinism constraint
Replay must produce the same decisions every time. Workflow code lives under a strict sandbox:
| Forbidden | Why | What you do instead |
|---|---|---|
time.Now() |
Wall clock differs across replays | workflow.Now(ctx) |
rand.Float64() |
Different random on replay | workflow.SideEffect(ctx, ...) — result persisted |
os.Getenv / external config read |
Environment differs on replaying worker | Pass config as workflow input |
| Direct HTTP / DB / Kafka call | Side effects must be replayable | Wrap in an Activity |
| Goroutines / native threads / sleeps | Non-deterministic scheduling | workflow.Go, workflow.Sleep |
| Iterating unsorted maps | Iteration order differs | Sort keys before iterating |
| Mutating global state | Not visible to other replays | Use workflow-local variables |
The SDK enforces this with a custom scheduler: every await goes through the runtime, which consults history; if the event exists, return the stored result; if not, dispatch for real. This is what makes await ChargePayment(input) look like ordinary async code while being durable.
If replay reaches a decision that doesn't match history, the engine raises NonDeterministicWorkflowError and parks the workflow. The error is loud on purpose.
4g. One concrete lifecycle, end-to-end
A canonical payment-orchestration workflow, byte level.
T=0: API server calls client.StartWorkflow(workflow_id="order-abc", ...). Frontend hashes order-abc → shard 1742. History-service node holding shard 1742 takes the per-workflow mutex (in-memory, scoped to the shard) and writes:
event #1: WorkflowExecutionStarted (workflow_type=OrderWorkflow, input, run_id)
event #2: WorkflowTaskScheduled (task_queue=ORDER_TQ)
Cassandra append, quorum write to 2 of 3 replicas. Durability point: when quorum acks, workflow is officially started. The history service enqueues the decision task.
T=2ms: Matching dispatches the decision task. Worker long-polls, gets the task. No cached state (first task). Fetches GetHistory(order-abc, 1, 2). Replays. Workflow code runs to its first workflow.ExecuteActivity(ctx, ChargePayment, input). Returns a future. Engine collects a command.
T=5ms: Worker sends RespondWorkflowTaskCompleted with ScheduleActivityTask(activity_type=ChargePayment, activity_id=1, retry_policy, timeouts). History appends events #3 WorkflowTaskStarted, #4 WorkflowTaskCompleted, #5 ActivityTaskScheduled (one Cassandra batch, quorum). Enqueues activity task on CHARGE_TQ.
T=10ms: Activity worker pulls. Calls payment provider with idempotency_key=order-abc-1 (workflow_id + activity_id). Provider charges. Returns charge_id=ch_xyz. Worker calls RespondActivityTaskCompleted.
T=400ms: History appends events #6 ActivityTaskStarted, #7 ActivityTaskCompleted(result=ch_xyz), #8 WorkflowTaskScheduled. New decision task to the sticky worker.
T=405ms: Same worker, no replay. charge_payment_future.Get(ctx, &charge_id) returns ch_xyz immediately (event in history). Code continues to ExecuteActivity(ctx, ReserveInventory, ...). Events 9-13 written. And so on.
Total events for a 4-activity workflow: ~25. History: 5-10KB.
Crash survival at each step.
- Before T=2ms (history not written):
StartWorkflowRPC hasn't returned. Caller retries with same workflow_id; idempotent. - Between T=2ms and T=5ms (worker took task, didn't respond):
ScheduleToStarttimeout fires, decision task re-enqueued, another worker replays, reissues the sameScheduleActivityTask. Idempotent byactivity_id. - Between T=10ms and T=400ms (activity worker died mid-call):
StartToCloseor heartbeat timeout fires.ActivityTaskTimedOutappended. Retry policy schedules a new attempt. The activity may have completed the side effect once — that is why the idempotency key matters.
The durability point at every step is the quorum write. Until the ack lands, nothing has changed. After it lands, no crash can undo it.
§5. Capacity envelope
Workflow engines span four orders of magnitude.
Small: 10s to 1000s. Startup on AWS Step Functions for cron-driven backend jobs and onboarding. Free tier 4,000 transitions/month, then ~$25/M. Or a small team on self-hosted Temporal on three VMs for a payment-flow POC: ~100 concurrent, ~10 transitions/sec, ~50GB disk.
Mid: Coinbase. Wraps every withdrawal/deposit in a workflow (compliance demands single auditable trails). Mid-2020s: single-digit-millions of concurrent workflows, ~10k transitions/sec sustained, heavy signals for human-approval gates. Sharded MySQL, ~50 history nodes, ~20 matching, ~200 workers. Histories average ~100 events; compliance flows wait weeks for manual review.
Large: Uber Cadence. By year 2 ran millions of workflows in flight, billions of activities/day across hundreds of services — trip lifecycle, fraud review, payouts, driver onboarding. Cassandra-backed, hundreds of nodes per region, multi-region active-active. Peak ~500k decision tasks/sec, ~10B activities/day, ~4 weeks hot retention.
Giant: Stripe Connect, Snap, Netflix Conductor. Stripe Connect: payout flows wait days for ACH (Automated Clearing House) settlement. Tens of millions of in-flight payouts. Netflix Conductor: tens of millions/day across Cosmos (video encoding/transcoding), content ingestion, recommendation refresh. Snap: Temporal for ad-creative review and content moderation, single-digit-millions in-flight.
At this scale, federation is mandatory. A single cluster handles low-millions; beyond that, partition namespaces across clusters with global-namespace async replication. Blast-radius management dominates.
| Scale tier | Example | In-flight | Transitions/sec | Backend |
|---|---|---|---|---|
| Small | Startup w/ Step Functions | 10s | <10/sec | Managed |
| Small-mid | Self-host POC | 100-10k | 10-100/sec | Single MySQL |
| Mid | Coinbase | ~1-5M | ~10k/sec | Sharded MySQL |
| Large | Uber Cadence | ~10M | ~100k/sec | Cassandra (100s nodes) |
| Giant | Stripe / Snap / Netflix | 10M+ | 100k-500k/sec | Federated multi-cluster |
Per-cluster ceiling: ~100k-500k transitions/sec. Past that, federate.
§6. Architecture in context — canonical pattern
The canonical code-as-workflow deployment:
┌─────────────────────┐
│ Workflow client │ (your application code calling
│ (your API server) │ StartWorkflow / SignalWorkflow
└──────────┬──────────┘ over gRPC)
│
gRPC: namespace + workflow_id
▼
┌────────────────────────────┐
│ Frontend Service │ stateless; auth, namespace
│ (load-balanced fleet) │ routing, rate limiting
└─────┬─────────────┬────────┘
│ │
┌────────────┘ └────────────┐
│ (start, signal, query) │ (poll for tasks)
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ History Service │ ◄──persist── │ Matching Service │
│ (sharded by │ │ (task-queue │
│ hash(workflow_id) │ │ partitioned) │
│ mod N shards) │ dispatch │ │
│ │ tasks ─────►│ In-memory queues │
│ - workflow mutex │ │ for decision and │
│ - writes events │ │ activity tasks │
│ - emits visibility │ └──────────┬───────────┘
└──────┬───────────────┘ │ long-poll
│ append + read ▼
▼ ┌──────────────────────────┐
┌─────────────────────┐ │ Worker Pool │
│ Persistent store │ │ │
│ (Cassandra / │ │ - Workflow Worker: │
│ MySQL / Postgres /│ │ replays history, │
│ Dynamo) │ │ decides next step │
│ │ │ - Activity Worker: │
│ Partition: shard_id│ │ executes business │
│ Cluster: workflow_id,│ │ logic (HTTP, DB, │
│ run_id, event_id │ │ Kafka, S3, etc.) │
│ Quorum writes │ │ │
│ │ │ Sticky task queue: │
│ │ │ workflow_id pinned │
│ │ │ to one worker │
└──────┬──────────────┘ └──────────────────────────┘
│ async stream
▼
┌─────────────────────────────────────────────────────┐
│ Visibility pipeline │
│ Persistent store → CDC (Change Data Capture) → │
│ Kafka → Elasticsearch / OpenSearch │
│ Cold storage: history archived to S3 after │
│ workflow close + retention period │
└─────────────────────────────────────────────────────┘
Sharding keys, per layer:
- Frontend → History:
hash(namespace + workflow_id) mod N_shardsselects one history shard. That shard owns the mutex for that workflow ID cluster-wide. - History → Persistent store: partition key =
shard_id, clustering key =(workflow_id, run_id, event_id). Within one partition, workflow histories sit adjacent. - History → Matching: a decision task targets the task-queue partition
hash(task_queue_name + workflow_id) mod num_partitions. Sticky routing: subsequent tasks for the same workflow target the worker's sticky partition. - Matching → Worker: long-poll gRPC. P99 dispatch latency under 5ms on a healthy cluster.
- Visibility: CDC stream → Kafka topic partitioned by namespace; Elasticsearch indexer consumes for search.
Conductor's architecture has the same shape but with a more REST-y interface (workers POST results) and less aggressive sharding. Step Functions hides this all behind an AWS managed front end but the components are conceptually parallel.
§7. Hard problems inherent to this technology
Six challenges anyone using workflow engines will face. None are bugs; all are inherent.
§7.1 Determinism violations after a code change
Setup. Payment workflow in production. Six months later you add a fraud-check step between ReserveInventory and NotifyWarehouse. Deploy.
Naïve fix. Change the code, redeploy.
Failure. Workflow order-abc has events 5-8: ReserveInventory scheduled/completed, then WorkflowTaskScheduled. Replay through new code emits ScheduleActivityTask(CheckFraud). But the next history event is ScheduleActivityTask(NotifyWarehouse) (old path). Mismatch. NonDeterministicWorkflowError. Workflow parks. Multiply by all running workflows — production traffic bricked.
Real fix. GetVersion API patches:
v := workflow.GetVersion(ctx, "add-fraud-check", workflow.DefaultVersion, 1)
if v == 1 {
workflow.ExecuteActivity(ctx, CheckFraud, input).Get(ctx, nil)
}
workflow.ExecuteActivity(ctx, NotifyWarehouse, input).Get(ctx, nil)
First execution records a MarkerRecorded event with the version; replay returns the same version. Pre-deploy workflows see DefaultVersion and skip fraud check; new workflows see version 1.
Appears verbatim in every domain: a Netflix Cosmos transcoding workflow adding a perceptual quality check; an ML training pipeline adding hyperparameter tuning; a CI/CD pipeline in Argo adding a lint stage. Same fix: branch on a version marker.
§7.2 History explosion: the 10MB / 50k-event cap
Setup. A customer subscription workflow — one per customer — lives forever. After two years it has 100,000 events.
Naïve fix. Make the store hold bigger histories.
Failure. Replay cost. Every worker reads 100k events and re-runs workflow code over all of them. At 100 events/ms replay, 1s of CPU per decision. Across 10M long-lived workflows, unrunnable. Beyond CPU: row-size limits, blob serialization, and visibility pipeline all choke past ~10MB.
Real fix. ContinueAsNew. When history has accumulated enough (>5000 events, >1 day, or a business-meaningful boundary), workflow code calls workflow.NewContinueAsNewError(ctx, NextRun, snapshotState). The engine closes the current run with WorkflowExecutionContinuedAsNew, starts a new run (same workflow_id, new run_id) with the snapshot as input. New run starts with one event.
Subscription workflow ContinueAsNews every billing cycle. A long ML training pipeline continues at each epoch. A video transcoding workflow continues at each output-quality tier. Same pattern, different domain.
Cadence and Temporal both refuse to grow histories past 10MB/50k — workflow fails explicitly, forcing the design fix. The cap is load-bearing; prevents accidentally infinite workflows.
§7.3 Workflow versioning across long-running deployments
Setup. An approval workflow waits 30 days for a human. You ship v2 that restructures the steps.
Naïve fix. Redeploy and hope.
Failure. v1 workflows cannot replay under v2 — mismatch storm.
Real fix. Two layers. Patches (modern API on top of GetVersion) for small targeted changes. Task-queue versioning / BuildID compatibility for deep restructuring: declare a BuildID for your worker binary; matching routes tasks to workers running a compatible BuildID with the one that started the workflow. Old workflows drain on old binaries; new on new.
For incompatible v1 → v2: keep v1 workers alive until in-flight drain, or terminate-and-restart with v2-shaped input. No free lunch.
Appears differently in different engines: Step Functions has state machine versions (aliases pin executions). Airflow has DAG-file replacement (each DAG run pinned to its DAG version). Argo pins each WorkflowTemplate execution to the spec at submit time.
§7.4 Activity timeouts vs heartbeats
Setup. A long-running ML training activity — 4 hours on a GPU. Sometimes 4 hours; sometimes the GPU hangs.
Naïve fix. StartToClose = 6 hours.
Failure. Two distinct problems. (1) Hung activity wastes a worker slot for 6 hours. (2) Worker dies mid-activity, engine doesn't know for 6 hours.
Real fix. Heartbeats. Long activities periodically activity.RecordHeartbeat(ctx, progress). If none for HeartbeatTimeout (e.g., 60s), the activity is failed-fast and retried on another worker. Recorded progress lets the new attempt resume.
Now StartToClose = 6 hours (legit slack) and HeartbeatTimeout = 60s (fast death detection). Four-knob problem:
| Timeout | Bounds | Typical |
|---|---|---|
ScheduleToStart |
Queue wait before pickup | 10s-1min |
StartToClose |
Single attempt duration | Minutes to hours |
ScheduleToClose |
Total time including retries | Usually unset |
HeartbeatTimeout |
Between heartbeats | 15-60s |
Most production issues: StartToClose huge (activity is sometimes slow) without HeartbeatTimeout (worker death takes hours to detect). Identical in Airflow, Conductor, Step Functions.
§7.5 Activity idempotency
Setup. ChargePayment calls a provider. Provider returns 200 OK. Network drops before the worker's response reaches the engine. Engine times out, retries. Card charged twice.
Naïve fix. maximum_attempts = 1. No retries.
Failure. Any transient failure fails the entire workflow. Untenable.
Real fix. Application-layer idempotency keys. The activity passes a stable key (workflow_id + activity_id) to the external API. Well-designed APIs dedupe — the second request returns the same response without re-charging.
The engine guarantees activity_id is stable across retries (assigned at ActivityTaskScheduled, never changes). The engine does NOT promise exactly-once execution; it promises identity is stable so the application can implement exactly-once at the side-effect boundary. For systems without idempotency support: a processed_keys table with INSERT IF NOT EXISTS before the side effect.
Applies in every domain. Email campaign workflow must not send the same email twice. Deployment automation applying a Kubernetes manifest uses a stable resource name (kubectl apply is idempotent by name). Video transcoding writes output to S3 with a deterministic key so re-runs overwrite.
§7.6 Signals vs queries vs updates: synchronous interaction
Setup. A running workflow needs to receive "user clicked approve" and expose "current state" to a monitor.
Naïve fix. Poll a DB in workflow code.
Failure. Workflow code is event-driven, not tick-driven. DB polling is non-deterministic (forbidden). Reading state via a DB duplicates state outside the engine.
Real fix. Three primitives.
- Signal: event sent into a workflow. Persisted as
WorkflowExecutionSignaledin history. Workflow code blocks onworkflow.GetSignalChannel("approve").Receive(ctx, &payload). Durable, ordered per workflow, replay-safe. Cost: each signal grows history. - Query: synchronous read of in-memory state. Not persisted. Engine invokes a read-only handler on the sticky worker. Cheap, frequent-OK.
- Update (newer Temporal): synchronous write-and-respond. Engine writes the event, runs workflow logic, returns result in one shot. RPC-like.
Human-approval workflow: signals. Deployment orchestration: queries for status. Long data pipeline: signals for cancellation. "Signal storm" footgun — accidentally signaling the same workflow 1000 times/sec — is a known production issue. Same problem in Conductor (events vs status queries) and Step Functions (sendTaskHeartbeat vs DescribeExecution).
§7.7 Signals, queries, updates — the three external interaction primitives in depth
The trio is so fundamental to durable-execution engines that it deserves a self-contained treatment. Each primitive answers a different question about how the outside world talks to a running workflow.
Signal. Fire-and-forget. The client calls SignalWorkflow(workflow_id, signal_name, payload). The frontend hashes to the shard owner, which appends WorkflowExecutionSignaled to history and enqueues a decision task. Returns to the caller as soon as the event is durably written — not when the workflow has reacted. Inside workflow code, a coroutine awaits on workflow.GetSignalChannel(name).Receive(ctx, &payload). Replay-safe by construction: signals are events, replay reads them back in order.
Cost model: every signal is a quorum write to the history store plus one decision task. A signal storm of 1k/sec on one workflow saturates one shard. Practical signals/sec/workflow ceiling sits around 100-1000 depending on persistence backend. Multiply by number of workflows and you get cluster pressure.
Use cases: human approval/rejection (approve, reject), external state changes (order_cancelled, payment_received), incoming events that influence flow (a driver going online, a fraud alert arriving). The convention "use a signal name per logical event" keeps the workflow easy to read; signalled_data channels become typed handlers.
Query. Synchronous read of workflow state. The client calls QueryWorkflow(workflow_id, query_name, args). The frontend routes to the sticky worker (or any worker if no sticky owner; that worker replays then answers). The worker invokes a read-only handler registered in workflow code that returns a value derived from current state. Crucially, no history mutation. A query is invisible to replay — it does not appear in the event log, it cannot change the workflow's path.
The constraint that bites people: query handlers must be deterministic and read-only. If you mutate workflow-local state in a query handler, replay will see two divergent states across workers and the next decision task can fail non-deterministically. Stripe-internal lore: a query handler that lazily computed a cached value the first time it was called, then returned the cache. On a different worker the cache wasn't populated, behavior diverged. Fix: queries must be pure functions of current state.
Use cases: dashboards showing "what step is this workflow on," monitors showing in-flight inventory, debugging "why is this stuck."
Update (Temporal 1.21+, May 2023 onwards). Synchronous mutation with response. The client calls UpdateWorkflow(workflow_id, update_name, args) and blocks until the workflow returns a result. Internally, the engine writes WorkflowExecutionUpdateAccepted, runs the update handler in the workflow (which can call activities, await timers, etc.), and on completion writes WorkflowExecutionUpdateCompleted carrying the result. Engine streams the result back over the open RPC.
Before Update, the only way to get a synchronous mutate-and-respond was the "signal + query polling" pattern: send a signal that mutates state, then poll a query until the state reflects the mutation. Ugly. Race-prone. Update collapses that into one round trip. Validators (a synchronous check that can reject the update before it lands in history) give a cheap admission gate — reject obviously bad updates without growing history.
Use cases: bidding workflows where the client wants placeBid → accept/reject; deployment workflows where requestPromotion → return promoted_revision; subscription workflows where applyDiscount → return new_total. Anywhere RPC semantics fit better than fire-and-forget.
Comparison.
| Signal | Query | Update | |
|---|---|---|---|
| Direction | Inbound only | Outbound only | In + Out |
| Persisted in history | Yes | No | Yes |
| Mutates state | Yes | No | Yes |
| Sync/async | Async | Sync | Sync |
| Replay impact | Replayed verbatim | Skipped | Replayed verbatim |
| Validator/admission | No | N/A | Yes |
| Backpressure | Signal storm risk | Cheap | Update storm risk; validator gates |
Step Functions equivalents. SendTaskSuccess/SendTaskFailure (signal-like, completes a WaitForCallback task), DescribeExecution (query-like, reads execution state), no native update. Conductor: events trigger workflow progression (signal-like); REST API exposes status (query-like); no update.
§7.8 Workflow cancellation semantics
Cancellation is a signal with extra ceremony. The user calls CancelWorkflow(workflow_id). The engine appends WorkflowExecutionCancelRequested to history and schedules a decision task. Inside workflow code, the cancellation propagates to ctx: ctx.Done() channel closes (Go), CancellationException (Java), CancelledError (Python). The workflow code is expected to handle the cancel — typically by running compensation, then returning.
But here is the asymmetric reality: the engine cannot cancel a running activity in flight. When a workflow is cancelled while an activity is mid-execution, the engine cannot reach inside the activity worker and halt it. What it can do: stop scheduling new activities, send a cancellation request that the activity may observe via heartbeat response.
Heartbeat-driven cancellation contract. An activity that wants to respond to cancellation must heartbeat regularly. When the activity calls RecordHeartbeat(ctx, progress), the engine's response includes a flag — CancelRequested = true if the workflow has been cancelled or the activity itself has been requested-cancelled. The activity code must check that flag and exit cleanly. Pseudocode:
for chunk := range bigDataset {
if err := activity.RecordHeartbeat(ctx, chunk.position); err != nil {
if errors.Is(err, activity.ErrCancelled) {
return cleanup(chunk.position)
}
return err
}
processChunk(chunk)
}
Activities that ignore the flag — long-running CPU work without heartbeat checks, blocking I/O without a context-aware client — will run to completion regardless of the cancel. The "we cancelled but the activity already ran for 10 more minutes" reality is structural, not a bug. Common in ML training activities (4-hour GPU runs that don't check between steps), large file copies (a single io.Copy that doesn't observe context), and external API calls that aren't context-aware.
Cancellation is not termination. TerminateWorkflow is the harder hammer: writes WorkflowExecutionTerminated immediately, no chance for workflow code to compensate. Use Cancel when you want graceful unwinding; Terminate when the workflow is broken and you just want it gone (e.g., a non-deterministic-replay stuck workflow).
Cancellation scopes. Modern SDKs offer CancellationScope — wrap a block of workflow code so a cancel triggers only that block's compensation, not the whole workflow. Saga compensation (§7.9) uses this heavily: each forward step is wrapped in a scope whose cancel handler runs the corresponding compensation.
Operational reality. If you have a workflow stuck in a long activity that ignores cancel, you have two options: wait for StartToClose timeout (could be hours), or TerminateWorkflow and live with no compensation. Some teams adopt a rule: every activity longer than 60 seconds must heartbeat. Enforced via lint or activity wrappers.
§7.9 Sagas and compensation in detail
The canonical multi-step business transaction that can't use a database transaction (because it spans services, hours/days, and external systems) uses the saga pattern: forward steps + compensation steps executed in reverse order on failure.
Structure. A booking workflow:
1. ChargeCard(user, $100) → compensation: RefundCard(user, $100)
2. ReserveInventory(item, qty=1) → compensation: ReleaseInventory(item, qty=1)
3. AssignWarehouseSlot(slot=42) → compensation: ReleaseWarehouseSlot(slot=42)
4. NotifyCustomer(email) → compensation: SendCancellationEmail(email)
If step 3 fails, compensate 2 then 1 (in reverse). If step 4 fails, compensate 3 then 2 then 1. The pattern in Temporal Go SDK looks like:
compensations := []func() error{}
if err := workflow.ExecuteActivity(ctx, ChargeCard, ...).Get(ctx, nil); err != nil {
return err
}
compensations = append(compensations, func() error {
return workflow.ExecuteActivity(disconnectedCtx, RefundCard, ...).Get(ctx, nil)
})
if err := workflow.ExecuteActivity(ctx, ReserveInventory, ...).Get(ctx, &reservationID); err != nil {
return runCompensations(compensations)
}
compensations = append(compensations, func() error {
return workflow.ExecuteActivity(disconnectedCtx, ReleaseInventory, reservationID).Get(ctx, nil)
})
// ... and so on
runCompensations pops in reverse order and executes each. The disconnectedCtx is critical: it's a context that survives workflow cancellation, so compensations still run even if the whole workflow is being cancelled.
Compensation idempotency requirement. Every compensation activity must be idempotent. Why: compensation may itself fail and be retried. Refunding the same charge twice must be a no-op (the payment provider returns "already refunded"); releasing the same inventory item twice must be safe. The same workflow_id + step_id idempotency-key pattern from §7.5 applies.
Partial compensation reality. Real money flows have asymmetries. Charge succeeds, shipping fails: compensation runs RefundCard. The refund may take 5-10 business days to land on the customer's statement even though the activity returns immediately ("refund initiated"). Customer sees the charge, contacts support, gets told "wait for the refund." Workflow's job is to emit auditable events for support; can't make a slow external system faster.
Compensation fails too. The deeper recovery layer. RefundCard itself fails: payment provider returns 500 for 6 hours. Options:
- Retry forever — the workflow stays in compensation indefinitely, alerts page on-call.
- Manual intervention queue — escalate after N retries to a human-approval workflow: "compensation stuck, refund $100 to user X, charge_id=ch_xyz, please process manually." Human runs the refund out of band, sends a signal to mark compensation done.
- Accept the asymmetry — record a
compensation_failedevent, complete the workflow as "failed with un-refunded charge," generate a ticket for the finance ops team.
Most production sagas combine: retry with backoff for 24 hours, then escalate to manual. The audit trail (every attempt logged in workflow history) is the point — finance can prove what happened.
Choreography vs orchestration sagas. Choreography: each service emits "I did my part" events, downstream services listen and react. No central coordinator. Compensation requires each service to listen for "rollback" events. Fragile at scale (no global view of the saga state). Orchestration (the Temporal pattern): one workflow owns the steps and compensation order. Centralized, auditable, debuggable. The industry has converged on orchestration for non-trivial sagas.
Saga vs distributed transactions. A 2-phase commit (2PC) across services would also solve this, but at the cost of synchronous coordination, holding locks across services, and a global blocking phase. Sagas trade atomicity for availability: the system is never atomically consistent across the steps, but it eventually converges to either "fully done" or "fully compensated." For 99% of cross-service business workflows, that tradeoff is right.
§7.10 Workflow versioning at production scale — the hardest problem
§7.1 introduced GetVersion. The full picture at production scale is a year-long migration playbook, not a single API call.
The constraint. A payment workflow runs for 30 days (waiting on ACH settlement and human review). During those 30 days, the team deploys 14 times. Each deployment changes the workflow code. Replay of a 30-day-old workflow under today's code must still produce the same decisions it made yesterday. Multiply by tens of millions of in-flight workflows, each frozen at a different code version.
The three strategies in production use.
1. GetVersion(versionId, minSupported, maxSupported) — in-place compatibility. The most common pattern. For each behavior-changing edit, wrap with a GetVersion block. The engine records a MarkerRecorded event with the version on first execution; replay returns the same version. Old workflows see DefaultVersion, new ones see version 1.
v := workflow.GetVersion(ctx, "v2-fraud-check", workflow.DefaultVersion, 2)
switch v {
case workflow.DefaultVersion:
// original behavior, pre-fraud-check
case 1:
// first fraud-check version
workflow.ExecuteActivity(ctx, CheckFraudV1, ...).Get(ctx, nil)
case 2:
// refined fraud-check with merchant heuristic
workflow.ExecuteActivity(ctx, CheckFraudV2, ...).Get(ctx, nil)
}
minSupported lets you garbage-collect: when no workflow older than version 1 exists, change to GetVersion(ctx, "v2-fraud-check", 1, 2). The engine refuses to replay versions below minSupported, forcing those workflows to be terminated. Hygiene matters — workflows accumulate GetVersion calls across years of evolution and become hard to read.
2. The "patch" pattern — explicit conditional logic. Sometimes a deployment introduces a new check that didn't exist before. Old workflows skip it; new ones execute it. Modern Temporal SDKs ship Patch(ctx, patchId) and DeprecatePatch(ctx, patchId) as cleaner aliases over GetVersion. Patch returns true if the workflow started after the patch was deployed, false otherwise. DeprecatePatch returns true unconditionally — used after you've drained out the old workflows and want to mark the patch removable.
The lifecycle: introduce Patch → wait for old workflows to drain → call DeprecatePatch → in next deployment, remove the patch entirely (assuming you've waited long enough). This is the GC story for GetVersion markers.
3. Forced-replay-fail strategy — last resort. When a code change is fundamentally incompatible (e.g., reordering activities, changing input types, removing a step). Options:
- Worker BuildID versioning. Tag worker binaries with a BuildID. Matching routes tasks for a workflow to a worker that's compatible with the BuildID the workflow started on. Old workflows drain on old binaries; new workflows route to new binaries. When old workflows are done, retire the old binary.
- Terminate and restart. Kill all in-flight v1 workflows, restart them as v2 with whatever state can be ported. Brutal but sometimes the right answer for low-value, short-lifetime workflows.
- Migration workflow. Start a one-time workflow per in-flight v1 workflow that reads its state, signals it to a graceful stop, then starts a fresh v2 with that state.
Concrete walkthrough — a payment workflow that ran for 30 days, code deployed twice during.
Day 0: PaymentWorkflow starts for order-abc. Code version is v1. Event 1 written.
Day 5: Workflow has progressed through steps 1-3 of 7. Engineer ships v2 of the workflow that adds a step between 3 and 4 (a compliance check). v2 wraps the new step in Patch(ctx, "add-compliance-step") so:
if workflow.Patch(ctx, "add-compliance-step") {
workflow.ExecuteActivity(ctx, ComplianceCheck, ...).Get(ctx, nil)
}
order-abc is at step 4 already. Replay reaches the Patch call. The engine looks for a MarkerRecorded event for "add-compliance-step"; finds none (workflow started before the patch). Returns false. Workflow skips the compliance check. Good — workflow continues on its original path.
Day 12: Engineer ships v3 that changes the retry policy of step 5 (which order-abc hasn't reached yet). Pure activity-options change, no Patch needed — when order-abc reaches step 5 it'll use v3's options.
Day 14: order-abc reaches step 5 (the post-settlement reconciliation). Replay under v3 — different retry policy, but the activity hasn't started yet, so no replay conflict. Decision is reached, activity scheduled with v3's options. History captures it.
Day 25: ACH settlement completes. Workflow signals progress to step 6 (notify), then step 7 (close). Replay under v3 of all 25 days' history works because:
- Steps 1-3 still execute the same code (no relevant
GetVersion/Patchcalls). - Step 4:
Patch("add-compliance-step")returnsfalse(no marker), skips. - Step 5: v3's retry policy applies (activity-options live at schedule-time).
- Steps 6-7: same code, no version branches.
Day 30: Workflow completes. Engineer ships v4 that deprecates add-compliance-step (every workflow that could have hit that branch has now finished). Two deployments later, the Patch call is removed entirely from the codebase.
The 18-month migration of long-running workflows. Some workflows are years long. Customer subscriptions, regulatory holds, multi-year warranties. Migrating these from WorkflowV1 to WorkflowV2 (a structural rewrite) takes 18 months because you must keep both workflow types deployed and let v1 workflows drain naturally. Operationally:
- Month 0: Deploy v2 alongside v1. New workflows start as v2.
- Month 6: All "year" workflows started in month -1 have drained. v1 worker pool can be downsized.
- Month 12: All "18-month" workflows drained. Last v1 binary retired.
- Month 18: Cleanup of dead
GetVersionbranches in v2 codebase.
Painful. Inescapable.
Step Functions, Airflow, Argo each have their own version story. Step Functions: state machine versions + aliases pin executions to a version. Airflow: DAG file replacement is uncontrolled — each DAG run is pinned to the DAG version it started on (via a serialized DAG snapshot, in newer Airflow). Argo: a WorkflowTemplate is snapshotted at submit time, so submitted workflows are immune to subsequent template edits. The shape — pin running executions to the code that started them, evolve new executions freely — is universal.
§7.11 Workflow determinism — what's forbidden and why
§4f gave the constraint. The full picture: every workflow SDK ships a deterministic sandbox that polices what workflow code can do. Violations either fail loudly at runtime (caught) or silently corrupt replay (uncaught — the dangerous category).
Caught violations (SDK rejects at register/runtime).
| Violation | Why fails replay | SDK enforcement |
|---|---|---|
time.Now() / Date.now() |
Wall clock varies across replays | Sandboxed in TypeScript/Python; Go panics at runtime |
rand.Float64() |
New random number on replay | Same as above |
os.Getenv / file read |
Worker environment varies | Static analyzer warning; runtime panic |
| Goroutine / thread / native async | Non-deterministic scheduling | Replaced by workflow.Go, workflow.NewFuture |
Direct time.Sleep |
Doesn't survive replay; blocks the workflow task | Use workflow.Sleep |
http.Get / db.Query direct |
Side effect not in history | Must go through an activity |
Uncaught violations (subtle bugs in replay).
These are the disasters. Code that runs fine on first execution but mismatches on replay.
- Iterating maps with non-deterministic iteration. Go's
mapand Python'sdict(pre-3.7) iterate in random order. If workflow code doesfor k, v := range myMap { workflow.ExecuteActivity(ctx, A, k) }, the order of activity scheduling differs between executions. Replay sees a different schedule order, mismatches. Fix: sort keys explicitly before iterating. - Sort with pointer comparison. A real disaster:
sort.Slice(items, func(i, j int) bool { return items[i].ptr < items[j].ptr }). Pointer addresses differ across processes. The sorted order is non-deterministic. The "we added a sort that uses pointer comparison" disaster is canonical lore — found by a workflow stuck inNonDeterministicWorkflowErrorafter a deploy that added the sort. - Floating-point math across architectures. Rare but real: x86 vs ARM workers can produce subtly different floats. Workflow code that branches on a float comparison can diverge. Fix: use integer math in workflow code; do floating-point in activities.
- System-call return values.
os.Hostname(),os.Pid()— different per worker. Replay on a different worker diverges. - Locale-dependent string sorting.
sort.Strings(strs)uses the OS locale in some platforms. Locale differences across workers cause divergent sorts. - Async timers without workflow API. Spawning a goroutine that calls
time.Sleepand then mutates workflow state: that goroutine runs only on first execution; replay never re-creates it. State diverges.
The deterministic sandbox in each SDK.
- Go SDK. Workflow code runs in a custom coroutine scheduler (
workflow.Goinstead ofgo). Time access viaworkflow.Now. The SDK monkey-patches enough to catch most violations; uncaught ones surface asNonDeterministicWorkflowErroron next replay. - TypeScript SDK. Runs workflows in v8 isolates (separate JavaScript contexts) with
globalThispatched: noDate.now, noMath.random, nofetch. The strongest sandbox of the three. - Python SDK. Uses
asynciowith a custom event loop that drives workflow tasks;time.time()and friends are patched. Less hermetic than v8 isolates. - Java SDK. Workflow code runs in a Workflow Thread (a custom scheduler atop standard Java threads); thread-local context tracks workflow identity. Forbidden APIs (e.g.,
Thread.sleep) are detected via byte-code inspection at registration.
Why deterministic at all? Two design choices were on the table: "checkpoint state after every step" (Cadence's path: no) vs "rerun code over event log" (the path: yes). Checkpointing requires serializing all in-memory state — workflow-local variables, channel buffers, deferred work. Across long-lived workflows with rich state, that's huge. Replay over events trades CPU (re-run code) for storage (no state snapshots). Cheap CPU vs expensive storage was the bet, and it's been right for a decade.
Why loud failure? NonDeterministicWorkflowError is loud on purpose. Any silent corruption — workflow takes a different path than it would have — is a money/safety bug. Better to park the workflow and require a human/GetVersion patch than silently send the customer to the wrong fulfillment center.
§7.12 Side effects and local activities
Two patterns that look like activities but aren't full activities. They exist precisely because the determinism constraint is sometimes too strict.
SideEffect. "I need a random number, once." A workflow that picks a random partition assignment, or generates a UUID for downstream use, can't call rand.Float64() directly (non-deterministic). Wrapping in a full activity is heavyweight (round-trip through matching, separate worker, history events for schedule + start + complete). workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} { return rand.Float64() }) is the lightweight path:
- First execution: the function runs in the workflow worker. Result is captured. Engine writes
MarkerRecordedevent with the value. - Replay: workflow reaches
SideEffect. Engine reads the marker, returns the stored value. Function is not re-run.
One event in history vs. three for an activity. Stays inside the workflow worker process. Use cases: generate a UUID, pick a random A/B-test bucket, read a feature flag at a deterministic point.
Caveats: the function should be fast and side-effect-free (don't make HTTP calls from a SideEffect; if the marker is lost, you don't get a retry). The result must be serializable. If the function panics, the workflow fails (no retry).
MutableSideEffect. Variant: re-evaluates the function on each call, but only writes a new marker if the result changed. Useful for "read a slowly changing config value, only update if it actually changed." Most teams just use the regular SideEffect with explicit caching in workflow state.
Local Activity. "I need to call an HTTP endpoint without paying the matching round-trip." A normal activity goes:
workflow worker → frontend → history → matching → activity worker → execute → respond → frontend → history → workflow worker
A local activity skips the matching dispatch:
workflow worker → execute in same process → respond → workflow worker
The result is still written to history as a MarkerRecorded-style event (so replay is correct), but the dispatch latency drops from ~10ms to ~submillisecond. Use cases: a quick local computation that's too complex for SideEffect (because it might fail and benefit from retries), but too cheap to deserve full activity machinery.
Tradeoffs:
| Activity | Local Activity | SideEffect | |
|---|---|---|---|
| Runs on separate worker | Yes | No (same as workflow) | No |
| Recorded in history | Schedule, Start, Complete | Single marker | Single marker |
| Automatic retries via policy | Yes | Limited (within workflow task budget) | None (no retry semantics) |
| Latency | ~10ms+ dispatch | Submillisecond | Submillisecond |
| Timeout flexibility | Full timeout knobs | Capped by workflow task timeout | None |
| Use case | External I/O, long work | Fast in-process calls, retries OK | One-shot non-deterministic value |
The danger with local activities: people overuse them for everything because they're fast. Then a workflow task gets a 20-second local activity that exceeds the workflow task timeout, causing the entire decision to fail. Convention: local activities should be sub-second, max a few seconds.
§8. Failure modes
Worker crash mid-activity. Activity worker dies between ActivityTaskStarted and RespondActivityTaskCompleted. If heartbeating: HeartbeatTimeout fires → engine emits ActivityTaskTimedOut → retry policy schedules a new attempt. If not: StartToClose fires; same outcome, slower. Side effect may have completed once — application-layer idempotency key makes retry a no-op. Durability point: ActivityTaskStarted committed with quorum.
History service shard loss. Node owning shards 1700-1799 crashes. Membership protocol detects, reassigns. New owners reload executions rows from the persistent store. ~30s pause for affected shards; no data loss because every event was quorum-written. Durability rests on RF=3 with quorum writes — 1 lost replica survivable; simultaneous loss of 2 replicas of the same partition is the only failure that loses durability.
Deterministic replay breaks after a code change. Covered in §7.1. Recovery: park workflow on NonDeterministicWorkflowError, deploy a GetVersion patch, unpark. Symptom is sharp and immediate — preferable to silent corruption.
Activity stuck forever. Worker alive but looping. Eventually StartToClose fires regardless of heartbeats. Operational override: terminate workflow via UI, which writes WorkflowExecutionTerminated.
Signal lost / duplicated. Client retries SignalWorkflow; signal persisted twice. Engine does not dedupe by default. Include a deterministic signal-key in the payload and dedupe in workflow code, or use SignalWithStart (idempotent on workflow_id).
Persistent-store quorum loss. Two of three replicas down: writes fail. Workflow progress halts; StartWorkflow returns errors. No data loss because nothing was acked. Recovery: restore replicas; workflows resume.
Matching service partition loss. A matching node holding 1k partitions crashes. Tasks aren't lost — they exist as transfer-queue or timer-queue rows in the persistent store. A new matching node loads them on reassignment. ~10-30s blip; no work lost.
Network partition between client and frontend. Client retries StartWorkflow. Idempotent by workflow_id. Application-layer dedupe is necessary for signals.
§9. Why not cron + a status table?
The naïve replacement engineers reach for first:
CREATE TABLE workflows (
id VARCHAR(64) PRIMARY KEY, current_step VARCHAR(64),
state_blob BLOB, last_updated TIMESTAMP
);
Cron sweeps WHERE current_step != 'DONE' AND last_updated < NOW() - INTERVAL 1 MINUTE and re-runs each row's current step.
This works for ten workflows. It dies before a hundred. Eight failure modes, each requiring you to reinvent a piece of the engine:
- No retry semantics.
charge_paymentfails with a network error. Status stillSTART. Cron calls again. Was the first attempt applied? Double charge unless every call carries an idempotency key — your job to build for every step. - No compensation.
reserve_inventoryfails aftercharge_paymentsucceeded. Who refunds? Hand-writtenrollback_to(state)covering every branch. - No visibility. Workflow stuck at
RESERVEDfor 6 hours. Didnotify_warehousesilently fail? You write a custom dashboard. - Race on status update. Two cron workers pick the same row. Both call
notify_warehouse. You add aversioncolumn and conditional UPDATE — reinventing optimistic concurrency, badly. - Status drift. Operator updates
current_stepto "unstick" a workflow.state_blobinconsistent. Workflow advances down the wrong branch. Money disappears. - No replay. Bug discovered. "What would have happened with the fix?" Status-table has only current state, not path.
- No long-running waits. "Wait 30 days for human approval" — build a scheduler with a
scheduled_taskstable and a poller. Reinventing timers, badly. - No scaling. Cron sweep over 10M rows every minute is a full table scan. Index it; index maintenance slows writes. Shard by
id mod N; reinventing shard-aware engines.
This is exactly why Uber built Cadence: dozens of internal teams had each rolled status-table workflow systems, all suffering subsets of the above. Once you have more than a handful of nontrivial workflows, the engineering ROI of adopting a workflow engine is overwhelming.
§9.1 The canonical migration from cron + state table
The "we already have a cron-driven status table; how do we get to Temporal" migration is one of the most-walked paths in the workflow-engine space. The shape of the legacy system:
workflows table (status, current_step, state_blob, last_updated)
▲
│ UPDATE current_step, state_blob WHERE id = ...
│
cron jobs (every 1 minute)
├─ ProcessChargingWorkflows: SELECT * WHERE current_step = 'charging'
├─ ProcessShippingWorkflows: SELECT * WHERE current_step = 'shipping'
└─ ProcessExpiredWorkflows: SELECT * WHERE last_updated < NOW() - 1h
Each cron handler runs the relevant step and updates the row. The migration usually goes in five phases.
Phase 1: stand up Temporal in shadow mode. Deploy the cluster. Build the v2 workflow that mirrors the cron logic in code. Don't route any traffic to it yet. Validate operationally — workers can start, signals flow, observability works.
Phase 2: shadow workflow comparison. Start a v2 workflow alongside every v1 row, but the v2 is informational only — it logs what it would have done, doesn't perform side effects. Build tooling that compares v1's state and v2's state at each tick; alerts on divergence. Run for 2-4 weeks until divergence is sub-1% and explained.
Phase 3: dual-run with reconciliation. Both v1 and v2 perform side effects, but with idempotency keys scoped to the workflow ID. Both attempt to charge the card, both attempt to ship. The downstream services dedupe on the idempotency key — only the first one actually charges. The race winner is recorded. Build dashboards: % of workflows where v2 won, % where v1 won, % of mismatches. The dual-run phase shakes out subtle bugs (timing differences, race orderings).
Phase 4: gradual cutover. Route 1% of new workflows to v2 only. Monitor for a week. Bump to 10%, then 50%, then 90%, then 100%. In-flight v1 workflows continue to completion on the cron path; new workflows are v2.
Phase 5: drain and decommission. Wait for v1 workflows to drain. Decommission the cron jobs and the workflows table. Archive the state-blob history for compliance.
Patterns that survive the migration.
- The idempotency-key convention. If you've been doing dual-run, you already have stable keys on every downstream call. Carry that into v2.
- The state-blob → workflow-input mapping. The legacy state blob is the source of truth for what data the workflow needs. Promote the relevant fields to workflow inputs.
- The "stuck workflow" dashboard. Legacy systems usually have a dashboard for "workflows with last_updated > 1h." Port the query to Temporal's visibility API: "open workflows with no recent decision task." Familiar to ops.
- The compensation table. If the legacy system has a
compensationstable tracking refunds and rollbacks, that's exactly your saga's compensation list. Move it into the workflow code as a stack.
Patterns that don't survive.
- The "fix it by UPDATE-ing a row" runbook. Operators used to nudge stuck workflows by changing
current_stepin the DB. Equivalent in Temporal isResetWorkflow(rewind to an earlier event) — different mental model. Train ops on the new tooling. - Polling for status. Code that polled the
workflowstable for "is it done yet" must move toDescribeWorkflowExecutionor a query handler. - Batch SELECT-and-update tactics. "Find all workflows in step X and advance them" — Temporal expects each workflow to drive itself. Don't try to recreate the cron sweep; let signals drive transitions.
The end state: cleaner code, cleaner ops, real audit trail. Costs: an engine to operate, a team to know it. Worth it past ~5 distinct workflow types.
§10. Scaling axes
Two fundamentally different growth axes. They demand different fixes.
Type 1: more workflows (uniform expansion). From 1M concurrent to 100M concurrent of roughly the same shape (median 50 events, 5-minute duration).
- 1M concurrent: ~10 history nodes, ~5 matching, 3 persistent-store RF=3, ~50 workers. 4096 shards overkill (~250 workflows/shard).
- 10M concurrent: ~50 history, ~20 matching, 20 persistent-store. Per-shard cache warming matters.
- 100M concurrent: Single-region single-cluster too risky (blast radius). Inflection point: federate. Temporal global namespaces with async cross-region replication. Each cluster handles a subset of namespaces.
Also split namespaces by business domain: orders, fraud, payouts, content moderation — each in its own namespace with quotas and isolation. A fraud bug cannot bring down orders.
Persistent-store scaling: linear. 100M × 100 events × 1KB ≈ 10TB hot. RF=3 on 100 nodes → ~300GB/node, comfortable on NVMe.
Type 2: long-running workflows (hotspot intensification). Same count; each is a beast.
- 50 events: Replay trivial. Sticky cache 95%. CPU idle.
- 5,000 events: Replay cost real. Cold restart ~100ms CPU per decision. Cache eviction hurts. Fix: aggressive sticky tuning +
ContinueAsNewat meaningful boundaries. - 50,000 events: Engine's hard limit. Workflow structurally wrong; redesign.
ContinueAsNewmandatory.
Noisy workflow anti-pattern. One workflow signaling itself 1000/sec is a per-shard hotspot: one persistent-store partition hammered. Other workflows on the same shard suffer write amplification. Fix: redesign (more frequent ContinueAsNew, or split), or move to its own shard. No resharding fix for per-workflow hotspots — the shard owns the workflow; you cannot split one workflow across two shards without breaking serializability. The application must change.
Inflection points. 1M → 10M concurrent: add shards/nodes (linear). 10M → 100M: federate (topology shift). 1k → 50k events/workflow: ContinueAsNew mandatory. Single → multi-namespace: when one workflow class can blast-radius others.
§11. Decision matrix vs adjacent categories
| Dimension | Workflow engine (Temporal-class) | DB-driven saga | Choreography (queue) | Batch DAG (Airflow) |
|---|---|---|---|---|
| Number of steps | 2-100s | 2-3 | 2-10 | 5-100 (per DAG) |
| Workflow duration | ms to years | ms to hours | ms to minutes | Hours (per run) |
| Concurrent in-flight | 1k-100M+ | <10k (DB choke) | Queue limit | <10k (Airflow choke) |
| Compensation | First-class | DIY | DIY | DIY |
| Auditability | Complete event history | Manual change-log | Queue logs | DAG-run history |
| Long waits (days) | Native | DIY scheduler | DIY scheduler | Awkward (sensors) |
| Visibility UI | First-class | DIY | DIY | First-class for DAGs |
| Operational cost | High (run an engine) | Low (just a DB) | Low (just a queue) | Medium |
| Best for | Business processes, sagas | <5 flows, low scale | Loose event-driven | Daily/hourly ETL |
Pick a workflow engine when: multi-step processes with compensation, audit, or long durations; more than a handful of distinct flow types; scale beyond a single DB's saga capacity; need replay/audit; long-running waits.
Pick DB-driven sagas when: one or two simple flows, low scale (<10k concurrent), can't afford engine operational overhead, flows short enough that crash recovery is rare.
Pick choreography (message queue) when: genuinely event-driven (each service reacts independently), no global "workflow state" anyone cares about. Often coexists with an engine: queue moves events; engine coordinates workflows.
Pick batch DAG (Airflow) when: data processing on a schedule, with backfills and idempotent task reruns over a partition key.
Pick Argo Workflows when: sequence of Kubernetes pods (CI/CD, ML training); team already on k8s; per-step overhead in seconds acceptable.
Thresholds for switching from DB-saga to engine: >5-10 distinct workflow types; any workflow >10 minutes; non-trivial compensation (>2-3 rollback steps); path-not-just-state audit; any workflow >10 steps. The real signal: engineers reinventing engine features (timers, signal handlers, retry-with-backoff helpers) in status-table code. Once you see that, you've outgrown the table.
§12. Cron-style scheduled workflows
Cron-on-a-schedule is a workflow engine's most boring but most underestimated feature. Temporal exposes Schedules (formerly cron schedules) — a first-class resource that says "start this workflow on this cron expression with these inputs."
ScheduleSpec:
cron_expression: "0 */4 * * *" # every 4 hours
workflow_type: DailyReconciliation
task_queue: RECONCILIATION_TQ
workflow_input: {...}
catchup_window: 24h
pause_on_failure: false
The Schedule resource lives in the engine. The engine fires the schedule at each tick, starting a fresh workflow execution. Each execution is independent (separate run_id, separate history).
Difference from external cron + workflow start.
| Concern | External cron (k8s CronJob) → StartWorkflow |
Native Schedule |
|---|---|---|
| Idempotency on tick | DIY (cron retries on failure can double-start) | Engine guarantees one start per scheduled time |
| Missed executions during outage | DIY (which ticks were missed? you'd have to log them externally) | catchup_window tracks misses |
| Backfill | DIY | Schedule backfill API replays missed executions |
| Pause | Stop the CronJob, but a tick mid-execution is undefined | Schedule pause is an engine action; in-flight workflows continue |
| Observability | Two systems to query | One UI shows schedule + all triggered runs |
The "we missed 10 executions during the outage, should we backfill?" decision. Real conversation. Cluster went down for 8 hours. Schedule fires every hour. Eight ticks didn't run.
Options:
- Skip them. Acceptable for "best-effort" schedules — e.g., a cache refresh that runs hourly. Today's data was a bit stale; tomorrow will be fine.
- Backfill all 8. Run them sequentially or in parallel, in chronological order. Each tick is a workflow that takes "scheduled time" as input. Activities use that time, not
workflow.Now. Useful for analytics rollups: each tick computes the previous hour's data. - Backfill the most recent only. For "current state" schedules where only the latest matters. E.g., a daily ranking refresh — yesterday's ranking is moot; only run today's.
- Manual triage. Sometimes backfill is unsafe (e.g., a customer-facing email send). Disable the schedule, look at each missed tick individually.
catchup_window is the engine-side knob: if a tick should have fired more than the window ago, skip it. Default in Temporal is 1 minute (paranoid). For "always backfill" schedules, set it to 30 days.
Schedule-on-a-schedule (anti-pattern). Some teams chain a schedule that runs a workflow that itself contains a workflow.Sleep(1h) loop. Don't — schedules are the right primitive for "fire periodically." Sleeping inside a workflow holds history events forever.
Step Functions, Airflow. Step Functions has EventBridge Schedules as the cron mechanism — same shape. Airflow's entire premise is scheduling: every DAG has a schedule_interval. Argo has CronWorkflows. The primitive is universal across the category.
§13. Multi-tenancy and namespaces
A single workflow cluster usually hosts workloads from many teams. Without isolation, a noisy team can starve everyone else. Namespaces are the primary isolation primitive in Temporal/Cadence.
What a namespace is. A logical workflow space. Each namespace has:
- Its own set of workflows (no cross-namespace
workflow_idcollisions; you can haveorder-123inpaymentsandorder-123insubscriptionsand they're different). - Its own task queues (a worker polls a specific namespace + task queue).
- Its own retention policy (how long workflow histories are kept after closure).
- Its own visibility index (Elasticsearch namespace).
- Its own quotas (rate limits on starts, signals, queries, persistence writes).
- Its own ACLs (which user/service can start/signal/query workflows here).
Why namespaces instead of separate clusters. Cluster-level isolation is the strongest but most expensive. Each cluster needs a separate persistent store, separate matching/history/frontend fleet, separate visibility pipeline. Per-team clusters at a company with 50 teams = 50× the operational cost. Namespaces share the cluster but give logical isolation cheaply.
Per-tenant quotas. Critical for shared clusters. Each namespace can be configured with:
MaxWorkflowExecutionsPerSecond— start rate limit.MaxSignalsPerSecond— signal rate limit.MaxWorkflowTaskExecutionsPerSecond— decision-task throttle.MaxActivityExecutionsPerSecond— activity-dispatch throttle.RPSPerNamespace— coarse RPC rate.
The frontend enforces these; over-quota requests get throttled (HTTP 429-equivalent). Without quotas, a runaway team that signals a workflow 10k times/sec can saturate the cluster.
The "noisy tenant blocks others" problem. Even with quotas, hot shards can hurt neighbors. A namespace gets unlucky and 50 of its workflows hash to the same shard; that shard's persistence partition is hot. Other namespaces with workflows on the same shard see write latency. Mitigations:
- Per-namespace sharding hints (some engines support routing namespaces to a subset of shards).
- Persistence rebalancing (Cassandra repair + manual key range moves).
- Federation — move the noisy namespace to its own cluster.
Tenant onboarding pattern. A team requests a namespace. Ops provisions it with conservative quotas, monitors usage for 2 weeks, then either raises quotas or splits the namespace if the team's workflows are heterogeneous (e.g., fast/slow flows in the same namespace would benefit from separate quotas).
Multi-tenant in Step Functions, Argo, Conductor. Step Functions: account + region is the isolation; sub-account isolation via IAM but no native namespace. Argo: Kubernetes namespaces map 1:1 — natural fit. Conductor: workflow names + permission scopes; less aggressive than Temporal.
Cross-namespace coordination. Sometimes a workflow in namespace A needs to start a child workflow in namespace B. Modern Temporal supports this via cross-namespace child workflows; older Cadence required signals to a workflow in the other namespace. Cross-namespace adds latency and coupling — prefer single-namespace designs unless there's a strong reason.
§14. History search and observability
Workflows generate billions of events. Querying them — "show me every order workflow that failed in step 3 between Tuesday and Thursday" — is a distinct concern from running them.
Visibility API. Temporal's Visibility is the search/list interface. Backed by Elasticsearch (or OpenSearch). The history service emits a Change Data Capture stream of visibility records (workflow started, completed, search attributes updated); the visibility indexer consumes the stream and writes to Elasticsearch.
Each workflow has system attributes (WorkflowId, WorkflowType, RunId, ExecutionStatus, StartTime, CloseTime, TaskQueue) automatically indexed. Plus custom search attributes declared per namespace:
workflow.UpsertSearchAttributes(ctx, map[string]interface{}{
"CustomerTier": "Gold",
"OrderAmount": 299.99,
"StuckInStep": "FraudCheck",
})
Each UpsertSearchAttributes call writes a marker event in history and updates the visibility index. The search attributes become queryable.
Querying. Visibility supports a SQL-like query language:
WorkflowType = "OrderWorkflow"
AND ExecutionStatus = "Failed"
AND StuckInStep = "FraudCheck"
AND StartTime BETWEEN "2026-05-15" AND "2026-05-22"
ORDER BY StartTime DESC
LIMIT 100
Used heavily for ops dashboards, batch operations (mass-restarting workflows that hit a known bug), compliance audits, customer support ("here's every workflow we ran for customer X").
Custom search attributes are a contract. Once you index a workflow type with attribute CustomerTier, you must keep the meaning stable. Renaming or repurposing breaks queries. Some teams version the attribute names (CustomerTierV2).
Replay debugging. The killer feature for incidents. When a workflow has a bug, download its history (one binary file ~10-100KB) and replay it locally:
tctl workflow show --workflow_id order-abc --output_filename history.bin
Then in your dev environment:
replayer := worker.NewWorkflowReplayer()
replayer.RegisterWorkflow(OrderWorkflow)
err := replayer.ReplayWorkflowHistoryFromJSONFile(logger, "history.bin")
The workflow runs with all activities mocked (their results come from history). You can set breakpoints, step through, see exactly what decisions the code took. Indispensable for debugging stuck workflows: "why did this workflow take path A instead of B?"
The "find all payment workflows that failed in step 3" query in practice. Real example. A bug shipped that caused payment workflows to fail in the fraud-check step. Ops query:
WorkflowType = "PaymentWorkflow"
AND ExecutionStatus = "Failed"
AND StuckInStep = "FraudCheck"
AND StartTime > "2026-05-20T00:00:00Z"
Returns 4,217 workflow IDs. Then batch operation: tctl workflow batch terminate --query "<above>" or tctl workflow batch reset --reset-type LastWorkflowTask --query "<above>". Resets all matching workflows to before the failed step. They retry under the fixed code.
This batch operation pattern — query + bulk action — is one of the highest-leverage capabilities a workflow engine provides. Without it, "fix 4k workflows" is a team-week of one-off scripts.
Cost of visibility. Elasticsearch is the expensive component for many large clusters. A workflow that updates UpsertSearchAttributes 100 times in its life creates 100 index updates. Multiply by 100M workflows = 10B index updates. Visibility clusters can grow larger than the primary persistence cluster in cost. Hygiene: only Upsert when the attribute actually changes; avoid high-cardinality attributes.
§15. Cost economics
Engine costs split across three axes.
History storage. Each event ~200B-2KB serialized; batched in history_node rows of 5-20 events. Workflow history ~5-50KB typical, ~500KB-1MB for long workflows. Storage cost = (workflow rate) × (avg history size) × (retention period).
Concrete: 10M workflows/day, avg history 20KB, retention 30 days. → 10M × 20KB × 30 = 6 TB hot. Cassandra RF=3 → 18 TB raw. At ~$0.10/GB/month on commodity NVMe → ~$1.8k/month for storage alone. Add cold archival to S3 after retention → cents/GB/month, near-free.
Compute fleet.
- History service. Sized by shard count × write rate. Rule of thumb: each history node owns 100-500 shards, handles 5-10k writes/sec. 4096 shards / 200 shards-per-node = 20 history nodes for a mid-large cluster.
- Matching service. Sized by task-queue throughput. ~30-100 nodes for 100k tasks/sec.
- Frontend. Stateless, scales with API QPS. Often 10-30 nodes.
- Worker fleet. Scales with workflow execution rate × activity rate. Long-polling workers wait idle most of the time; sized for peak concurrency, not average load. The "long polls, idle cost" reality: a worker waiting for tasks consumes connection slots in the matching service and a small CPU baseline. With 1000 workers each holding 10 long polls, that's 10k open connections continuously.
Worker idle cost is real. A team that over-provisions workers (10x what's needed) is paying for 10x the idle long polls. Mitigation: tune MaxConcurrentActivityExecutionSize and MaxConcurrentWorkflowTaskExecutionSize per worker; reduce worker fleet size; use auto-scaling tied to task-queue backlog.
Visibility / Elasticsearch. Often the surprise cost. Each workflow generates 3-10 visibility events (start, status changes, completion, plus any UpsertSearchAttributes). 100M workflows/year × 5 events × 1KB = 500 GB. Elasticsearch storage + replication + hot-warm-cold tiering. Easily 2-5x the primary persistence cost.
The "workflow rate × event rate × retention" multiplication. This product is the headline number. Examples:
- Small cluster: 1k workflows/day × 20 events/workflow × 30 days retention = 600k events stored. Trivial.
- Mid: 1M workflows/day × 30 events × 30 days = 900M events. Solid mid-sized Cassandra deployment.
- Large: 100M workflows/day × 50 events × 30 days = 150B events. Federated cluster, multi-region, dedicated team.
Cost-saving levers.
- Shorten retention. 30 days → 7 days. Most workflows are debugged within 7 days of completion. Archive to S3 for compliance, drop from hot.
ContinueAsNewmore aggressively. A subscription workflow that lives forever and grows to 50k events is a million times more expensive to store than one thatContinueAsNews monthly with a 100-event history per cycle.- Snapshotting via
ContinueAsNew. Truncate history by closing the current run and starting fresh with a compact snapshot as input. Every event before theContinueAsNewis archived to S3 after retention. This is the canonical pattern for long-lived workflows. - Reduce
UpsertSearchAttributesfrequency. Each Upsert is one history event + one index update. - Right-size worker fleet. Use auto-scaling tied to
WorkflowTaskBacklog/ActivityTaskBacklogmetrics.
Temporal Cloud pricing for reference. Charged per action (workflow start, signal, query, activity completion, timer fire, etc.) plus storage. At list, ~$25/M actions for the first tier, scaling down. A workflow with 30 events ≈ 30 actions. 1M workflows/day = ~30M actions/day = ~900M/month → ~$22k/month at list. Add storage. Add support. Big teams negotiate down significantly. Compare to self-hosted: a 20-node Cassandra cluster + 50-node engine fleet + ops headcount usually crosses $50k+/month before you've shipped anything. Cloud breaks even early; self-host pays off at large scale.
§16. Workflow vs activity idempotency
Two distinct flavors of idempotency live in the engine, with different enforcement points.
Workflow start idempotency. Built in. StartWorkflow(workflow_id="order-123") is idempotent on workflow_id. The engine maintains the invariant "at most one running workflow per workflow_id at any time" via the shard mutex. Calling StartWorkflow twice with the same workflow_id:
- If the first is still running: the second returns the existing run's handle (or
WorkflowExecutionAlreadyStartedError, depending onWorkflowIDReusePolicy). - If the first has completed: behavior depends on
WorkflowIDReusePolicy: AllowDuplicate: start a new run with the same ID, new run_id.AllowDuplicateFailedOnly: only allow if the previous run failed.RejectDuplicate: reject.TerminateIfRunning: cancel the in-flight and start fresh.
The workflow_id is your business-domain idempotency key. A common convention: workflow_id = "order-{order_id}" ensures one workflow per order forever.
Activity idempotency. Not built in. The engine guarantees:
activity_idis stable across retries (the same logical activity invocation, possibly multiple physical attempts).attemptis incremented on each retry.
The engine does NOT guarantee:
- The side effect is performed at most once. (Retries can re-execute a side effect that completed but failed to ack.)
- The application-layer key is constructed correctly.
Application's job: construct an idempotency key from activity_id + attempt (or just activity_id if you want all retries to dedupe to the same external operation) and pass it to the external system.
func ChargePayment(ctx context.Context, input PaymentInput) (string, error) {
activityInfo := activity.GetInfo(ctx)
idempotencyKey := fmt.Sprintf("%s-%s", activityInfo.WorkflowExecution.ID, activityInfo.ActivityID)
return stripeClient.Charge(input, stripe.WithIdempotencyKey(idempotencyKey))
}
If Stripe receives two requests with the same key, it returns the cached result of the first — the side effect happens once. Without the key, the second retry charges the card again. This is the boundary where exactly-once semantics actually live.
External idempotency keys vs activity-derived keys. Two flavors:
- Activity-derived (above): key is a function of
workflow_id + activity_id. Stable across retries; different across logical invocations. The default for most cases. - Externally provided: the caller of the workflow provides a business idempotency key. Example: a deduplicated job submission API where the caller passes
request_id, and the workflow uses that as the idempotency key for downstream services. Necessary when the caller might call the workflow multiple times for the same logical request.
The "two activities in the same workflow with the same external effect" footgun. If two activities in the same workflow both call Stripe.Charge with idempotencyKey = workflow_id, they collide — the second returns the first's response. Always include the activity_id in the key, not just the workflow_id.
§17. Temporal Cloud and managed offerings
When to run an engine yourself vs. let someone else run it.
Self-host arguments.
- Deep customization. Custom persistence backend (your in-house Cassandra fork), custom matching algorithm, integration with proprietary auth (your internal IDP — Identity Provider), bring-your-own-key encryption.
- On-prem requirement. Regulated industries (banking, healthcare in some geographies) that can't ship data to a third party.
- Cost at very large scale. Past ~1B actions/month, Cloud's per-action pricing exceeds the all-in cost of self-hosting + a small operations team.
- Co-location with workloads. Latency-sensitive workflows where the engine + workers + persistence + downstream services all need to be in the same network.
Managed (Temporal Cloud, AWS Step Functions, Conductor managed via Orkes) arguments.
- Small team. A 5-person team building a payment system shouldn't spend half its time on Cassandra repair.
- Fast startup. Cloud is a 1-day onboarding vs a 3-month self-hosted cluster bring-up.
- Operational coverage. 24/7 ops, upgrade management, hot-fixes for engine bugs, capacity planning — all included.
- Multi-region failover. Temporal Cloud offers cross-region replication as a feature; self-hosted requires you to build it.
Temporal Cloud pricing model. Pay-per-action + storage + support tier. Actions billed at ~$25/M list (negotiable). Storage at ~$0.05/GB-hour. Support tiers from Bronze (best-effort) to Platinum (24/7 phone). Predictable economics — you pay roughly proportional to use.
Hybrid pattern. Some teams self-host their primary clusters (cost + customization) but use Cloud for staging/dev environments (low volume, fast iteration). Operational footprint scales sub-linearly.
When the calculus flips.
- Crossing ~1k namespaces (per-namespace ops overhead in Cloud becomes a constraint; self-host wins).
- Crossing ~1B actions/month (cost curve favors self-host).
- Crossing geographic boundaries that Cloud doesn't serve (e.g., China, certain government regions).
For most teams under 100 engineers building business workflows, Cloud wins on TCO (Total Cost of Ownership). For platform teams at large companies offering workflow-as-a-service internally, self-host is usually the right answer.
§18. Durable execution as a movement
"Durable execution" has emerged as a category label. Temporal coined it; an ecosystem of lighter-weight alternatives has grown alongside.
Temporal-class: full durable execution with event-sourced replay. Temporal, Cadence, Restate. Heavy clusters, deterministic SDKs, designed for millions of in-flight workflows.
Lighter-weight serverless durable execution.
- Inngest. SaaS, serverless. Workflows are TypeScript functions deployed to Inngest's runtime. Triggers via events. Step functions within Inngest are durable (each step's result is persisted). Targets the "we want durability for short workflows" wedge — typical workflows are seconds to minutes, not days.
- Restate. Open-source. Durable RPC + Kafka-style log. Less coupled to a specific SDK; works with HTTP services that adhere to the Restate protocol. Aims for "durable execution as a library, not a platform."
- Trigger.dev. Like Inngest but more developer-experience focused. Workflows are TypeScript; Web UI for triggering and monitoring. Free tier; pricing kicks in around mid-volume.
- Cloudflare Workflows. Cloudflare's recent entry. Durable execution embedded in the Workers runtime. JavaScript-only, ties into Workers ecosystem.
The "we have small workflows but want durability" wedge. A team running 1k workflows/day, each 30 seconds, doesn't need a Temporal cluster. They want:
- Durable retries on failures.
- Activity-equivalent step boundaries.
- A UI to see what's running.
Inngest/Trigger.dev/Restate all fit. Temporal would be overkill — both operationally (running a cluster) and conceptually (the deterministic-replay model has more rules than needed for 30-second workflows).
When these beat Temporal.
- Cold-start matters. Inngest's serverless runtime spins up a function per workflow step; no idle worker fleet. For sub-100/day workflows, that's a win.
- JavaScript-first stacks. Trigger.dev and Inngest are TypeScript-native; the dev experience is tighter than Temporal's TS SDK (which is excellent but still requires the deterministic-replay mindset).
- Event-driven by default. Inngest's primary trigger is an event; Temporal's is a
StartWorkflowcall. For "react to user signed up" workflows, Inngest's mental model fits better.
When Temporal still wins.
- Long-running workflows. Anything >1 day favors Temporal. Inngest can do it but you're paying serverless overhead for idle time.
- Complex sagas with custom retry/compensation. Temporal's first-class support is hard to beat.
- Self-hosting requirement. Most of the new entrants are SaaS-only.
- High throughput. Temporal scales to 100k+ transitions/sec per cluster; Inngest is multi-tenant SaaS with shared capacity.
The category is consolidating. Expect 2-3 winners (Temporal, Inngest, maybe Restate) over the next few years. The shape — "durable functions with replay or persisted-step semantics" — is converging across all of them.
§19. Airflow vs Temporal — the confusion that won't die
Two engines, two completely different problem shapes, frequently confused. The confusion is so common it deserves its own section.
Airflow's center of gravity: batch DAGs on a schedule. A DAG is a static graph of tasks. A run is one execution of the DAG against a parameterization (typically a date partition). Tasks are usually heavyweight (Spark jobs, dbt runs, ML training scripts). Per-task overhead measured in hundreds of milliseconds to seconds (scheduler poll loop + worker dispatch).
Strengths:
- Backfills (rerun the DAG for past dates).
- Data lineage (which task produced which dataset, dating across runs).
- Sensor-style waits (poll for an external condition).
- Catchup logic when the scheduler was down.
- Integration with the data ecosystem: SQL operators, S3 sensors, dbt, Spark, etc.
Weaknesses:
- Long-running concurrent workflows (each at a unique point) — not designed for. Airflow can run thousands of DAG runs concurrently, but not millions.
- Per-task latency too high for fast orchestrations.
- Awkward for "wait for an external signal" — you build a sensor that polls.
Temporal's center of gravity: long-running stateful workflows with millions of concurrent instances. A workflow is a piece of code that runs durably. Each workflow is unique — different inputs, different state. Tasks (activities) can be heavyweight or lightweight (a single HTTP call). Per-activity overhead measured in milliseconds.
Strengths:
- Millions of concurrent workflows, each at a different point.
- Long-running waits (days, months) without holding resources.
- Signals from external systems (vs polling).
- Code-as-workflow with full language expressivity.
- Compensation / sagas as first-class patterns.
Weaknesses:
- Batch backfills over a date range — awkward. Temporal has no built-in concept of "rerun this workflow for every day in 2024."
- Data lineage — not provided.
- Per-task overhead too high for line-rate data transformations.
- Determinism constraints make some data manipulation patterns ugly.
The "we put a 1-second-running workflow in Airflow" anti-pattern. A team wanted to orchestrate a 4-step API call sequence that takes 1 second total. They put it in Airflow. Each task has ~500ms of scheduler overhead. The whole "workflow" takes 3 seconds. They run 10k/day. Airflow's scheduler is saturated with task scheduling for these tiny workflows; their real DAGs (the actual ETL) are delayed. Wrong tool. Temporal would handle this in 50ms per workflow with no scheduler pressure.
The "we put a daily ETL in Temporal" anti-pattern. A team built their daily data pipeline as Temporal workflows. Every morning at 2 AM, one workflow per partition (1k partitions/day) starts. Each runs a Spark job. The Spark jobs are 30 minutes to 4 hours. Workflows hold history events for each activity (Spark job start, complete, retry on failure). After a year, the visibility cluster is full of completed workflows. Backfills are painful — each requires triggering a new workflow run, no native concept of "rerun the pipeline for last week." Wrong tool. Airflow would handle this with native backfill support, DAG visualization showing dependencies between days, and built-in data-lineage hooks.
The decision rule.
- Time axis is the calendar? (Daily, hourly partitions.) Airflow.
- Time axis is the workflow's own life? (This workflow lives from when the order is placed until delivery.) Temporal.
Restated: if your workflow's mental model is "today's run vs yesterday's run," it's batch DAG (Airflow). If the mental model is "this workflow is processing one customer's transaction, born and dies independently of any clock," it's durable execution (Temporal).
Many teams need both. A daily ETL pipeline (Airflow) feeds data that a real-time order workflow (Temporal) reads. They live in the same stack and coexist; they don't replace each other.
§20. Failure modes not covered
§8 covered the routine failure modes — worker crashes, shard losses, signal duplicates. The deeper category: failures of the engine itself, the ones that require engineering escalation.
History database split-brain. Cassandra's eventual consistency is mostly hidden behind quorum reads/writes, but it can leak. Scenario: a partition between two Cassandra DCs. Both sides keep accepting writes (active-active). The same shard's executions row is updated on both sides with different txn_id values. When the partition heals, repair reconciles using timestamps — and the loser's writes are silently discarded. If those writes corresponded to acked workflow events, the workflow's view diverges from what clients were told.
Mitigation: deploy Cassandra with proper consistency (CL_QUORUM for both reads and writes, RF=3, ideally RF=3 per DC with cross-DC quorum). Periodically run consistency checks comparing history hashes between DCs. Critical workflows pin to a single DC for write traffic.
Recovery from divergence: extremely painful. Possibly involves replaying events from backups against a single-DC source of truth, manually reconciling discrepancies. Usually weeks of forensics + ops.
Workflow stuck in NotFound because shard moved during partition. A network partition isolates a history node from the shard registry. The registry believes the node is dead, reassigns its shards to a survivor. The original node still thinks it owns those shards (split-brain). Clients hitting the original see "workflow exists" responses; clients hitting the survivor see "NotFound" because the survivor hasn't loaded the shard state yet. After the partition heals, the original gives up ownership; clients see consistent behavior, but during the partition some signal/query traffic was misrouted.
Recovery: usually self-heals once the partition closes. For workflows that received conflicting responses (signal accepted on old owner, query NotFound on new owner), reset to before the partition.
Replay break after Temporal version upgrade. The engine itself ships with internal protocol changes. Upgrading from Temporal 1.18 to 1.22 across a multi-month period can introduce replay differences if the upgrade affects how certain event types are serialized. The official upgrade path is conservative — N-2 version compatibility — but customers occasionally hit edge cases where an old workflow's history can't be replayed under a new engine version.
Mitigation: pre-production replay tests using WorkflowReplayer against a representative sample of in-flight workflow histories. If any fail, hold the upgrade.
Recovery: pin the workflow to a specific Temporal SDK version, run an older worker pool, drain the workflow. Engineering escalation may be needed (Temporal support helps with upgrade-related replay issues).
Visibility-index lag during scale events. A traffic spike triples workflow start rate. The visibility indexer (consumer of the CDC stream from persistence to Elasticsearch) falls behind. UIs and dashboards see stale data — workflows that completed 10 minutes ago still appear "Running." Programmatic queries against visibility return wrong results.
Mitigation: monitor visibility lag (Kafka consumer lag metric). Scale the visibility indexer pre-emptively. For queries that need fresh data, use DescribeWorkflowExecution (reads directly from history, not visibility) instead of ListWorkflows.
Recovery: scale up visibility, wait for catchup, audit any decisions made during the lag window.
Persistent-store schema migration. Adding an index, changing a column type. Cassandra requires running schema changes one column family at a time, with replication factor coordination. Done wrong, half the cluster sees the old schema and half the new — writes from one side are unreadable by the other. Practical mitigation: run all schema changes in a dedicated maintenance window, with replication paused, and validate on a staging cluster first.
Worker memory exhaustion from large workflow histories. A workflow that grew to 8MB of history (close to the 10MB limit). Workers replay it; each replay holds the full history in memory plus workflow state. 100 such workflows on one worker = 1GB of replay memory. Worker OOMs (Out Of Memory), task fails over to another worker, that one also OOMs. Cascading failure of the worker fleet for that workflow type. Mitigation: per-worker concurrency limits scaled to expected history size; aggressive ContinueAsNew.
The "task queue partition hotspot" failure. A task queue with 4 partitions. One workflow type accidentally targets only partition 0 (e.g., via a sticky-routing bug). Partition 0's matching shard becomes saturated; tasks back up; new workflow starts time out. Other partitions are idle. Mitigation: monitor per-partition task counts; rebalance partition assignments; in extreme cases, increase partition count and migrate.
§21. Use case gallery
The same technology serves wildly different domains.
Payment lifecycle orchestration — Stripe Connect, Coinbase. Each money movement is a workflow: validate, fraud check, charge/transfer, wait for external clearing (ACH 1-3 business days), reconcile, notify. Compensation: refund or reverse on failure. Demands: audit (SOX — Sarbanes-Oxley, PCI-DSS — Payment Card Industry Data Security Standard), exactly-once side effects, long durations, human-approval signals. Variant fit: code-as-workflow (Temporal); too complex for declarative state machines, too audit-strict for choreography.
Ride lifecycle — Uber, Lyft. Each ride a workflow: match driver, accept, navigate to pickup, ride, drop off, charge rider, pay driver, handle disputes. Minutes to hours, days for disputes. Tens of millions/day. Variant fit: code-as-workflow at high scale; Uber's original Cadence use case.
Order fulfillment — DoorDash, Amazon. Place → confirm → batch → assign courier → pickup → deliver → reconcile. Tens of millions of new workflows/day at peak. 30-60 minutes typical, hours for scheduled. Signals from couriers, customers, support. Variant fit: code-as-workflow with heavy signal use.
CI/CD orchestration — Argo Workflows on Kubernetes. Pipeline = clone → build → test → security scan → deploy → smoke test → promote. Each step a containerized pod. Hundreds to thousands of pipelines/day. Variant fit: Argo. The work is containers; co-locating the engine with Kubernetes is right. Temporal would be overkill for per-step overhead.
ML training pipelines — Argo + Airflow + sometimes Temporal. Batch training on a schedule (daily recommendation model retrain): Airflow — DAG runs, backfills, lineage. Containerized triggered pipelines: Argo on k8s. Experiment management or online learning loops with feedback signals: Temporal (each experiment a long-lived workflow that mutates as data arrives). Variant fit: depends on shape.
Video transcoding — Netflix Cosmos (Conductor). Each upload triggers: validate, demux, transcode to multiple formats (bitrates, codecs, resolutions), QC (Quality Control), package, publish to CDN (Content Delivery Network). Tens of millions/day. Heterogeneous worker pools. Variant fit: Conductor. Polyglot workers, JSON definitions, language-agnostic poll/POST.
Email/notification campaigns — Mailchimp, marketing platforms. Compute audience, render template, send batch, track opens/clicks, retry bounces. Hundreds of millions of recipients per campaign; campaigns last days. Variant fit: workflow engine for campaign orchestration (one workflow per campaign); stream processor for the per-email send path.
Data ETL DAGs — Airbnb, Lyft, every data team. Daily ETL = extract, transform, validate, load to warehouse, refresh downstream tables, quality checks. Hundreds of DAGs per team; hourly/daily schedules. Variant fit: Airflow. Built for exactly this shape.
The breadth is the point. Workflow engines are not "the payments tool" or "the data pipeline tool" — they are an orchestration primitive that, in different variants, serves different shapes of multi-step process.
§22. Real-world implementations with numbers
Uber Cadence (2017-present). Built by the team led by Maxim Fateev. By 2019 ran the majority of Uber's stateful business logic. Millions of workflows in flight, billions of activity executions per day, Cassandra-backed, hundreds of nodes per region, multi-region active-active. Foundational reference for code-as-workflow.
Temporal (Cadence fork, 2019-present). Same architectural lineage with more aggressive SDK iteration (Go, Java, Python, TypeScript, .NET, PHP). Temporal Cloud's largest customers run single-digit-millions of concurrent workflows with tens of thousands of actions/sec/namespace. Self-hosted at Stripe, Snap, Datadog, Box, Coinbase.
Stripe Connect. Temporal-class durable execution for payout flows. Workflow waits days for ACH settlement and must compensate on failure. Tens of millions of in-flight payouts at any time.
Coinbase. Wraps every withdrawal in a workflow because compliance demands a single auditable trail. Single-digit-millions in-flight.
DoorDash order lifecycle. Order = place → confirm → batch → assign → pickup → deliver → reconcile, 30-60 minutes typical. Tens of millions of new workflows/day at peak.
Netflix Conductor (2016-present). Open-source. Powers Cosmos (video encoding/transcoding) plus content ingestion and recommendation refresh. Tens of millions of workflows/day. Also used by Airbnb for async payouts.
Apache Airflow (2014-present, Airbnb origin). Dominant choice for scheduled ETL/ML. Largest deployments (Airbnb, Lyft, Pinterest) run tens of thousands of DAG runs/day.
AWS Step Functions (2016-present). Managed declarative state machines. Standard: 4k transitions/sec/account, ~$25/M transitions. Express: at-most-once, 5-min max. Widely used in AWS-heavy shops.
Argo Workflows (2017-present, Intuit origin). Kubernetes-native via CRDs. Heavy use for CI/CD and ML training on k8s. Adopted across CNCF (Cloud Native Computing Foundation) ecosystem (Adobe, Intuit, Tesla).
Snap. Temporal for ad-creative review, content-moderation pipelines. Single-digit-millions in-flight.
The pattern: any organization with multi-step business processes, long durations, and audit needs converges on a workflow engine after one or two attempts at "just a status table." The shape determines which variant fits — but the conclusion that some engine is needed is robust across domains.
§23. Summary
A workflow engine is event sourcing weaponized for business processes: every workflow is its append-only history, every worker is a deterministic replay machine, every activity is at-least-once with stable identity so application-layer idempotency keys can give exactly-once side effects, and the whole thing is sharded by hash(workflow_id) onto a quorum-replicated store with sticky in-memory caches on workers so we pay full replay cost once per worker handoff and zero on the steady-state path — buying durable state, retries, long waits, replay, and audit in exchange for forcing every line of workflow code to be deterministic.