Pattern · Python · Moderate
Airflow operators: choosing between PythonOperator, BashOperator, and KubernetesPodOperator
Tags: airflow operator types, KubernetesPodOperator, airflow worker OOM, airflow task isolation, BashOperator dbt
Problem
Teams reach for PythonOperator for everything, so heavy data processing runs inside the Airflow worker process, where it competes for memory with the scheduler and other tasks and triggers OOM kills.
Solution
Match operator to workload:
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# PythonOperator: lightweight coordination, API calls, small transforms
PythonOperator(task_id='notify', python_callable=send_slack_alert)

# BashOperator: wrapping existing CLI tools
BashOperator(task_id='dbt_run', bash_command='dbt run --select fct_orders')

# KubernetesPodOperator: isolated, resource-bounded heavy workloads
KubernetesPodOperator(
    task_id='spark_job',
    image='my-spark-image:latest',
    cmds=['spark-submit'],
    arguments=['s3://bucket/job.py'],
    container_resources=k8s.V1ResourceRequirements(
        requests={'memory': '4Gi', 'cpu': '2'},
        limits={'memory': '8Gi', 'cpu': '4'},
    ),
)
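For context, a minimal sketch of how these three task types might be wired into one DAG, assuming Airflow 2.4+ with the cncf.kubernetes provider installed; the DAG id, schedule, dependency order, and the send_slack_alert placeholder are illustrative, not part of the pattern itself:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s


def send_slack_alert():
    # Placeholder notification callable; swap in your own alerting logic.
    print('pipeline finished')


with DAG(
    dag_id='orders_pipeline',       # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule='@daily',
    catchup=False,
):
    # Heavy Spark processing runs in its own pod with explicit resource bounds.
    spark_job = KubernetesPodOperator(
        task_id='spark_job',
        image='my-spark-image:latest',
        cmds=['spark-submit'],
        arguments=['s3://bucket/job.py'],
        container_resources=k8s.V1ResourceRequirements(
            requests={'memory': '4Gi', 'cpu': '2'},
            limits={'memory': '8Gi', 'cpu': '4'},
        ),
    )

    # dbt models build after the Spark output lands.
    dbt_run = BashOperator(task_id='dbt_run', bash_command='dbt run --select fct_orders')

    # Lightweight notification stays on the worker.
    notify = PythonOperator(task_id='notify', python_callable=send_slack_alert)

    spark_job >> dbt_run >> notify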
Why
KubernetesPodOperator spins up an isolated pod per task with defined resource limits, preventing one task from starving others. It also enables custom Docker images per task without polluting the Airflow worker environment.
Gotchas
- KubernetesPodOperator logs live in the pod, not on the Airflow worker; set get_logs=True to stream them into the task log (see the sketch after this list)
- PythonOperator shares the worker process memory — large DataFrames in multiple concurrent tasks cause OOM
- BashOperator inherits the Airflow worker's environment variables; be careful with secrets in ENV, or pass an explicit env mapping as sketched below
- DockerOperator is simpler than KubernetesPodOperator but requires a Docker daemon on the worker node
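A hedged sketch of the log-streaming and environment gotchas above, assuming Airflow 2.x with the cncf.kubernetes provider; the task ids reuse the solution snippet, and DBT_TARGET is a made-up variable name:

import os

from airflow.operators.bash import BashOperator
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Stream pod logs back into the Airflow task log instead of leaving them only in the pod.
KubernetesPodOperator(
    task_id='spark_job',
    image='my-spark-image:latest',
    cmds=['spark-submit'],
    arguments=['s3://bucket/job.py'],
    get_logs=True,
)

# Pass only the variables the command needs instead of inheriting the whole worker env.
BashOperator(
    task_id='dbt_run',
    bash_command='dbt run --select fct_orders',
    env={'DBT_TARGET': os.environ.get('DBT_TARGET', 'prod')},  # hypothetical variable
)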
Context
Designing Airflow tasks that run resource-intensive data processing