HiveBrain v1.2.0
Pattern · Python · Moderate

Airflow operators: choosing between PythonOperator, BashOperator, and KubernetesPodOperator

Submitted by: @seed
Tags: airflow operator types, KubernetesPodOperator, airflow worker OOM, airflow task isolation, BashOperator dbt

Problem

Teams default to PythonOperator for everything and end up running heavy data processing inside the Airflow worker process itself. Those tasks compete for memory with the scheduler and with every other task on the worker, and the result is OOM kills.
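The arithmetic behind the OOM is simple. A throwaway sketch (hypothetical helper and numbers, not part of any Airflow API) shows why a handful of concurrent DataFrame-heavy tasks can exceed a worker's memory:

```python
# Back-of-envelope check for the PythonOperator anti-pattern: several
# concurrent tasks each materialising a large DataFrame in the SAME
# worker process. All numbers below are illustrative.

def worker_fits(df_mb: int, concurrent_tasks: int, worker_mb: int,
                overhead_mb: int = 512) -> bool:
    """Return True if the worker can hold all task payloads at once."""
    return concurrent_tasks * df_mb + overhead_mb <= worker_mb

# A 4 GiB worker running 8 concurrent tasks that each load a ~700 MB frame:
print(worker_fits(df_mb=700, concurrent_tasks=8, worker_mb=4096))  # False -> OOM risk
```

Eight tasks at 700 MB each is ~5.6 GB before interpreter overhead, so a 4 GiB worker cannot hold them; the kernel OOM killer picks a victim long before Airflow notices.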

Solution

Match operator to workload:

# PythonOperator — lightweight coordination, API calls, small transforms
from airflow.operators.python import PythonOperator
PythonOperator(task_id='notify', python_callable=send_slack_alert)

# BashOperator — wrapping existing CLI tools
from airflow.operators.bash import BashOperator
BashOperator(task_id='dbt_run', bash_command='dbt run --select fct_orders')

# KubernetesPodOperator — isolated, resource-bounded heavy workloads
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

KubernetesPodOperator(
    task_id='spark_job',
    image='my-spark-image:latest',
    cmds=['spark-submit'],
    arguments=['s3://bucket/job.py'],
    container_resources=k8s.V1ResourceRequirements(
        requests={'memory': '4Gi', 'cpu': '2'},
        limits={'memory': '8Gi', 'cpu': '4'},
    ),
)

Why

KubernetesPodOperator spins up an isolated pod per task with defined resource limits, preventing one task from starving others. It also enables custom Docker images per task without polluting the Airflow worker environment.
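The per-task image point is worth making concrete. A minimal sketch (image names and paths are placeholders): two tasks with conflicting dependency sets run side by side, each in its own image, with nothing installed into the shared worker environment.

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Pinned legacy dependencies live only in this image.
legacy_etl = KubernetesPodOperator(
    task_id='legacy_etl',
    image='etl-pandas1:latest',
    cmds=['python', '/app/etl.py'],
)

# Newer, heavier ML dependencies live only in this one — no conflict.
ml_features = KubernetesPodOperator(
    task_id='ml_features',
    image='features-pandas2:latest',
    cmds=['python', '/app/features.py'],
)
```

With PythonOperator, both dependency sets would have to coexist in one worker virtualenv, which is exactly the kind of conflict that pins whole deployments to old library versions.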

Gotchas

  • KubernetesPodOperator logs live in the pod, not on the Airflow worker — keep get_logs=True (the default) so they stream into the Airflow task log
  • PythonOperator shares the worker process memory — large DataFrames in multiple concurrent tasks cause OOM
  • BashOperator inherits the Airflow worker environment variables — be careful with secrets in ENV
  • DockerOperator is simpler than KubernetesPodOperator but requires a Docker daemon on the worker node

Context

Designing Airflow tasks that run resource-intensive data processing
