
Hosted By
- Kenten Danas, Lead Developer Advocate, Astronomer
- Alex Kennedy, Airflow Engineer, Astronomer
1. Key Points About Scaling Airflow
- Virtually unlimited scaling potential
- Use CeleryExecutor and KubernetesExecutor
- Tune parameters to fit your needs
- Easy to add more capacity when needed
- Aggregate logging is important
2. High-Level Steps to Scale Airflow
3. Why Scale Apache Airflow?
- Because workload outgrows your initial infrastructure
- More DAGs, more Tasks
- More intensive individual tasks
- Because your core Airflow components need more durability
- To prepare for more DAG runs and compute load
- To benefit from elasticity and save money by scaling only as needed
4. Symptoms That Mean You’re Ready to Scale
- Many tasks stuck in Queued or Scheduled state
- Unacceptable latency between tasks
- Missing SLAs
- High resource usage on Scheduler or Webserver
- Out of Memory (OOM) errors on Tasks
Principles of Scaling Systems
5. Basics of Scaling Systems
- Vertical Scaling
  - Increase the size of the instance (RAM, CPU, etc.)
  - Adds more power to an existing worker
  - Gives individual tasks more horsepower
  - Gets very expensive, very quickly
  - If vertical scaling reaches a threshold, think about delegating work to a dedicated distributed processing engine, e.g., Spark, Dask, or Ray
- Horizontal Scaling
  - Increase the number of instances
  - Adds more nodes to the cluster
  - Increases the maximum number of tasks and DAG runs the system can handle
  - Fits Airflow's orchestration model
  - The Celery and Kubernetes executors are well designed for horizontal scaling
Scaling Airflow as a Distributed Platform
6. Scaling with CeleryExecutor
https://airflow.apache.org/docs/apache-airflow/stable/executor/celery.html
- Allows for easy horizontal scaling
- Runs worker processes that execute task instances
- To scale, add a new worker process (see the sketch after this list)
  - Can be on a new node or an existing node
  - Workers are connected through the Celery broker and the metadata database
  - $AIRFLOW_HOME should look identical across all worker nodes
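A minimal sketch of adding capacity, assuming a working CeleryExecutor setup with the broker and metadata database already configured: start another worker process on any node that shares the same Airflow configuration.

    airflow celery worker --concurrency 16

The new worker registers with the broker and starts pulling queued task instances; the --concurrency flag caps how many tasks this worker runs at once (it defaults to worker_concurrency from airflow.cfg).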
7. Scaling with KubernetesExecutor
https://airflow.apache.org/docs/apache-airflow/stable/executor/kubernetes.html
- Task instances run on Kubernetes pods
- Task instance pods are ephemeral
- Each task gets its own pod, so resources can be tuned per task (see the sketch after this list)
- No worker processes
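A minimal sketch of per-task resource tuning, assuming Airflow 2 with the KubernetesExecutor and the kubernetes Python client installed (the DAG and task names are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from kubernetes.client import models as k8s

    with DAG(
        dag_id="k8s_scaling_example",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
    ) as dag:
        # Ask for a bigger pod for one heavy task; all other tasks keep
        # the deployment's default pod template.
        heavy_task = PythonOperator(
            task_id="heavy_task",
            python_callable=lambda: print("crunching"),
            executor_config={
                "pod_override": k8s.V1Pod(
                    spec=k8s.V1PodSpec(
                        containers=[
                            k8s.V1Container(
                                name="base",  # Airflow runs the task in a container named "base"
                                resources=k8s.V1ResourceRequirements(
                                    requests={"cpu": "1", "memory": "2Gi"},
                                    limits={"cpu": "2", "memory": "4Gi"},
                                ),
                            )
                        ]
                    )
                )
            },
        )

Because each pod is ephemeral, these resources are claimed only while the task runs, which is where the elasticity savings come from.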
Parameter Tuning when Scaling Airflow
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
- parallelism
- max_active_runs_per_dag
- max_active_tasks_per_dag
- worker_concurrency
- Pool size
These parameters control the number of tasks that can be run at a time.
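A sketch of where these settings live, in airflow.cfg or the matching AIRFLOW__SECTION__KEY environment variables (the values below are illustrative, not recommendations):

    [core]
    # Max task instances running at once across the entire deployment
    parallelism = 64
    # Per-DAG caps on concurrent tasks and concurrent runs
    max_active_tasks_per_dag = 32
    max_active_runs_per_dag = 4

    [celery]
    # Tasks a single Celery worker will run at once
    worker_concurrency = 16

Pool size is the exception: pools are managed in the UI or CLI rather than in airflow.cfg (see the next section).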
8. Sizing Pools
- Pools are another way to limit the number of tasks that can run
- Can be used to keep executor slots from being hogged by heavy DAGs with many tasks
- Limit the number of running and queued tasks (active tasks under the executor's control)
- Group tasks together to limit active instances per group (see the sketch after this list)
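A minimal sketch with a hypothetical pool name and size: create the pool in the Admin UI or with the CLI, then have tasks opt in via the pool argument.

    # One-time setup, from the CLI:
    #   airflow pools set heavy_pool 4 "slots for resource-hungry tasks"

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="pool_example",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
    ) as dag:
        # Each task holds one slot in heavy_pool, so at most four such
        # tasks run at the same time, regardless of executor capacity.
        crunch = BashOperator(
            task_id="crunch",
            bash_command="python crunch.py",  # hypothetical workload
            pool="heavy_pool",
        )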
High Availability Airflow Components
Other Airflow Components can Scale!
Scheduler
- Airflow allows for multiple schedulers
- Increases the number of tasks that can be scheduled
- Makes the scheduling platform more stable (see the sketch after this section)
Web Server
- Multiple web servers spread load and increase the capacity of the Web UI
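On Airflow 2.x, adding a scheduler is, as a sketch, just starting another scheduler process against the same metadata database; no extra configuration is needed, though the database must support row-level locking (for example PostgreSQL 9.6+ or MySQL 8):

    # On another node sharing the same airflow.cfg and metadata database:
    airflow scheduler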
Logging on Distributed Airflow
9. Traits of a Great Logging System
- Aggregated
- Historical
- Indexed
- Searchable
10. Importance of Good Logging Practices
- 85% of problems have clues for a solution in the scheduler logs
- With multiple schedulers, it's important to collect all of their logs in one place, since those components work together on the same problem
- Keep a searchable history of logs so you can diagnose problems and work out solutions (see the configuration sketch after this list)
- The first step in debugging is to correlate timestamps: match the problem in the task logs against the scheduler logs from the same moment
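One common way to get aggregated, historical task logs is Airflow's remote logging. A sketch assuming an S3 bucket and an existing AWS connection (both names hypothetical):

    [logging]
    remote_logging = True
    remote_base_log_folder = s3://my-airflow-logs/logs
    remote_log_conn_id = aws_default

Scheduler and web server logs are not shipped this way, so pair this with a log collector (such as Fluentd or Vector) that forwards component logs to a searchable, indexed store.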
11. Debugging Distributed Airflow
- 90% of problems have clues for a solution in the scheduler logs
- 8% of problems are resource consumption issues
- Out of memory
- Limited CPU cycles for tasks
- The other 2% is the hard part
How to Scale your Deployment - Demo

Get Apache Airflow Certified
If you want to learn more about how to get started with Airflow, you can join the thousands of other data engineers who have received the Astronomer Certification for Apache Airflow Fundamentals. This exam assesses an understanding of the basics of the Airflow architecture and the ability to create simple data pipelines for scheduling and monitoring tasks.
Learn More About Certification