Architecture¶
FastAPI Task Manager uses a Redis Streams-based architecture with leader election to provide distributed, fault-tolerant task scheduling.
Component Overview¶
graph TD
TM[TaskManager] --> R[Runner]
TM --> TG1[TaskGroup 1]
TM --> TG2[TaskGroup N]
R --> LE[LeaderElector]
R --> CO[Coordinator]
R --> SC[StreamConsumer]
R --> RE[Reconciler]
CO -->|publishes| RS[(Redis Stream)]
SC -->|consumes| RS
LE -->|leader lock| RD[(Redis)]
RE -->|checks stale| RS
The library follows a hierarchical structure: TaskManager -> TaskGroup -> Task
Core Components¶
TaskManager¶
The entry point that integrates with FastAPI's lifespan. It creates the Redis connection, spawns the Runner, and provides the API router via get_manager_router(). On startup, it also loads any dynamic tasks persisted in Redis.
TaskGroup¶
Organizes related tasks. Tasks are registered via the @task_group.add_task(cron_expr) decorator. Supports multiple cron expressions and kwargs per task function. Also provides @task_group.register_function() for the dynamic task system.
Runner¶
Orchestrates all stream mode components: LeaderElector, Coordinator, StreamConsumer, and Reconciler. Manages the worker lifecycle and coordinates graceful shutdown.
LeaderElector¶
Distributed leader election via Redis. Uses a lock key with TTL and periodic heartbeat renewal. Only the leader instance runs the Coordinator and Reconciler; all instances run the StreamConsumer.
Coordinator¶
Runs on the leader only. Determines which tasks are due based on their cron expressions and next_run timestamps, then publishes task messages to the Redis Stream.
StreamConsumer¶
Runs on all instances. Consumes task messages from the Redis Stream using consumer groups (XREADGROUP). This distributes execution across all workers while ensuring each task is processed exactly once.
Reconciler¶
Runs on the leader only. Periodically scans for:
- Overdue tasks: tasks whose
next_runhas passed but were not scheduled (e.g. due to a leader failover) - Stale pending messages: messages claimed by a consumer but not acknowledged within the timeout (e.g. worker crash)
Detected issues are resolved by republishing tasks or reclaiming pending messages.
Execution Flow¶
sequenceDiagram
participant C as Coordinator (Leader)
participant S as Redis Stream
participant W1 as Worker 1 (Consumer)
participant W2 as Worker 2 (Consumer)
participant R as Redis
C->>R: Check task next_run timestamps
C->>S: XADD task message
S-->>W1: XREADGROUP (claims message)
W1->>R: SET running heartbeat
W1->>W1: Execute task function
W1->>R: Record stats (XADD)
W1->>R: Update next_run
W1->>S: XACK (acknowledge message)
W1->>R: DEL running heartbeat
- The Coordinator (leader only) checks which tasks are due and publishes them to the Redis Stream
- StreamConsumers on all instances read from the stream via consumer groups
- The worker sets a running heartbeat key, executes the task, records statistics, and acknowledges the message
- If a worker crashes, the heartbeat expires and the Reconciler reclaims the pending message
Fault Tolerance¶
Leader Failover¶
If the leader crashes, its lock expires after leader_heartbeat_interval * 3 seconds. Another instance acquires leadership and resumes scheduling. The Reconciler detects any tasks that were missed during the failover window.
Worker Crash Recovery¶
Each executing task maintains a heartbeat key in Redis with a short TTL. If a worker crashes:
- The heartbeat key expires (after
running_heartbeat_interval * 3seconds) - The Reconciler detects the pending message has been idle too long
- The message is reclaimed and re-executed by another worker
Exponential Backoff¶
When a task fails, the system applies exponential backoff: the next retry is delayed by retry_backoff * retry_backoff_multiplier ^ (failures - 1), capped at retry_backoff_max. This prevents rapid failure loops. Backoff state can be reset via the POST /tasks/reset-retry API endpoint.