Problem Statement
Long-running operations — generating large reports, exporting datasets, sending bulk notifications, processing uploaded files — block request handlers when run synchronously, causing timeouts and a poor user experience. Without a structured job system, these tasks end up run inline (blocking), scheduled via cron (inflexible), or implemented inconsistently across services. The goal was a unified, observable, and reliable async job framework covering the full task lifecycle, from dispatch through completion and failure handling.
Key Challenges:
- Task durability — jobs must survive worker restarts without being lost
- Reliable failure handling with appropriate retry strategies
- Priority queuing ensuring urgent tasks are never delayed behind batch workloads
- Progress reporting for long-running tasks visible to end users
- Dead-letter handling for tasks that exhaust retries
System Architecture
Celery workers consume tasks from Redis queues. Multiple named queues handle different priority levels and task categories, with dedicated workers per queue ensuring critical tasks are never starved. Task metadata is stored in a results backend tracking state, progress, and outcomes.
Queue Architecture
Separate Celery queues for critical, standard, and bulk tasks. Critical queue workers run at higher concurrency and are isolated from batch workloads. Queue depth monitoring triggers auto-scaling of worker processes during peak periods.
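A minimal Celery configuration along these lines might look as follows. This is a sketch, not the production config: the broker/backend URLs, task module paths, and concurrency numbers are illustrative placeholders.

```python
from celery import Celery
from kombu import Queue

# Broker and results backend URLs are placeholders.
app = Celery(
    "jobs",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

# Three named queues: critical, standard, bulk.
app.conf.task_queues = (
    Queue("critical"),
    Queue("standard"),
    Queue("bulk"),
)
app.conf.task_default_queue = "standard"

# Route task types to queues (module paths are hypothetical).
app.conf.task_routes = {
    "jobs.tasks.send_notification": {"queue": "critical"},
    "jobs.tasks.generate_report": {"queue": "standard"},
    "jobs.tasks.bulk_export": {"queue": "bulk"},
}

# Each worker pool consumes only its own queue, e.g.:
#   celery -A jobs worker -Q critical --concurrency=8
#   celery -A jobs worker -Q bulk --concurrency=2
```

Binding each worker pool to a single queue via `-Q` is what provides the capacity isolation: a flood of bulk tasks can never occupy a critical-queue worker slot.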
Task Lifecycle Management
Task states (Pending → Running → Success, or Failure → Retrying → back to Pending) are tracked in the results backend. Application services poll or subscribe for completion events to notify users and trigger downstream actions.
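The lifecycle can be sketched as a small transition table. This is a simplified model of the states described above (Celery's built-in state names differ slightly), useful for validating state changes before writing them to the backend:

```python
# Allowed transitions for the Pending -> Running -> Success/Failure
# -> Retrying lifecycle (simplified model).
TRANSITIONS = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCESS", "FAILURE"},
    "FAILURE": {"RETRYING", "DEAD_LETTER"},
    "RETRYING": {"PENDING"},  # a retry re-enqueues the task
    "SUCCESS": set(),         # terminal
    "DEAD_LETTER": set(),     # terminal
}

def advance(state: str, new_state: str) -> str:
    """Validate and apply a task state transition."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```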
Retry Strategy
Configurable retry policies per task type: exponential backoff with jitter for transient failures, fixed intervals for external API rate limits, and immediate retry for worker crash recovery. Max retry counts and total retry windows are enforced.
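The exponential-backoff-with-jitter policy can be sketched as a small delay function ("full jitter": a uniform draw up to the capped exponential delay). The base and cap values here are illustrative; Celery itself also offers `retry_backoff`, `retry_jitter`, and `retry_backoff_max` options on task declarations.

```python
import random

def backoff_delay(retries: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: a random delay drawn
    uniformly from [0, min(cap, base * 2**retries)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** retries))
```

The jitter matters under load: without it, a burst of tasks failing together would all retry at the same instant and hammer the recovering dependency again.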
Dead-Letter & Monitoring
Tasks exhausting retry limits are moved to a dead-letter queue with full failure context. A monitoring dashboard tracks queue depths, worker throughput, failure rates, and dead-letter accumulation with alerting on anomalies.
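The "full failure context" captured on dead-lettering can be sketched as below. A plain list stands in for the dead-letter store here; in production this would be a Redis list or a dedicated queue, and the field names are illustrative.

```python
import json
import time

def to_dead_letter(dlq: list, task_id: str, task_name: str,
                   args, exc: Exception, retries: int) -> None:
    """Record a task that exhausted its retries, with enough context
    (arguments, final error, retry count) to investigate and replay it."""
    dlq.append(json.dumps({
        "task_id": task_id,
        "task": task_name,
        "args": list(args),
        "error": repr(exc),
        "retries": retries,
        "failed_at": time.time(),
    }))
```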
Key Engineering Challenges
Task Durability on Worker Crash
Challenge: A worker crash mid-task leaves the task in a running state indefinitely without recovery.
Solution: Late task acknowledgement in Celery (acknowledging after completion rather than on receipt), so an unacknowledged task is redelivered after a crash, combined with a task heartbeat monitor that detects stale running tasks and reschedules them after a configurable timeout.
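In Celery this maps to the late-acknowledgement settings; a configuration sketch (the task name and timeout value are hypothetical):

```python
from celery import Celery

app = Celery("jobs", broker="redis://localhost:6379/0")

# Acknowledge only after the task finishes, and re-deliver the message
# if the worker process is killed mid-task.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True

# With the Redis broker, unacknowledged tasks become visible again
# after this window, allowing another worker to pick them up.
app.conf.broker_transport_options = {"visibility_timeout": 3600}

@app.task(bind=True)
def generate_report(self, report_id):  # hypothetical task
    ...
```

The trade-off is that late acknowledgement can deliver a task more than once, which is why it is paired with the idempotency work described below.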
Idempotent Task Execution
Challenge: A task retried after partial completion may duplicate side effects (e.g., sending a report twice or double-inserting records).
Solution: Task idempotency keys checked before execution. Completed subtasks are checkpointed so retried tasks resume from the last successful step rather than restarting from scratch.
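The checkpoint-and-resume pattern can be sketched as follows. An in-memory dict stands in for what would be Redis keys in production, and the step structure is illustrative: each named step is recorded once completed, so a retried task skips straight past it.

```python
# Stand-in for a Redis hash keyed by idempotency key.
_checkpoints: dict = {}

def run_idempotent(key: str, steps, sink: list) -> None:
    """Run `steps` (a list of (name, callable) pairs) under an
    idempotency key, resuming from the last checkpoint on retry."""
    done = _checkpoints.setdefault(key, set())
    for name, step in steps:
        if name in done:
            continue  # completed on a previous attempt; skip
        step(sink)
        done.add(name)  # checkpoint only after the step succeeds
```

A retry after a mid-task failure therefore re-executes only the steps that never checkpointed, so side effects such as record inserts or outbound emails are not duplicated.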
Progress Reporting
Challenge: Users submitting long-running export or report tasks need visibility into progress, not just eventual completion.
Solution: Tasks publish incremental progress updates to the results backend at configurable intervals. A WebSocket endpoint streams these updates to the requesting client's browser in real time.
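The worker-side half of this can be sketched as below. The `publish` callback stands in for writing progress to the results backend (in Celery, typically via `self.update_state` with a custom state and progress metadata); the interval and payload shape are illustrative.

```python
def process_rows(rows, publish, every: int = 100):
    """Process items, publishing {done, total} progress every `every`
    items and once more at completion. `publish` stands in for the
    results-backend write that the WebSocket layer streams onward."""
    total = len(rows)
    for i, row in enumerate(rows, 1):
        # ... process `row` here ...
        if i % every == 0 or i == total:
            publish({"done": i, "total": total})
```

Publishing on an interval rather than per item keeps backend write volume bounded even for exports of millions of rows.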
Queue Starvation
Challenge: Large batch jobs monopolise worker capacity, delaying critical user-facing tasks.
Solution: Dedicated worker pools per queue with capacity isolation. Critical queue workers are never allocated to batch queues, guaranteeing sub-second critical task pickup regardless of batch queue depth.
Solutions Implemented
- Priority Queue Isolation: Named queues with dedicated worker pools ensuring critical task capacity is never consumed by batch workloads.
- Idempotent Task Design: Completion checkpointing enabling safe retries on partial failures without duplicating side effects.
- Real-Time Progress Streaming: Worker-to-client progress updates via WebSocket for long-running task visibility.
- Dead-Letter Queue: Persistent capture of exhausted-retry tasks with full failure context for investigation and manual reprocessing.
- Operations Dashboard: Live queue depth, worker throughput, failure rate, and dead-letter metrics with alerting on threshold breaches.
Outcome & Impact
- Task durability: jobs survive worker crash or restart without loss
- Priority isolation: critical queue capacity remains available regardless of batch load
- Progress visibility: users see real-time progress for long-running tasks
- Failure handling: exhausted-retry tasks are captured in the dead-letter queue with full context