gotcha · python · Major
Airflow XComs: passing data between tasks and when not to use them
Keywords: airflow xcom large data, xcom size limit, airflow pass data between tasks, xcom s3 path, airflow metadata database
Problem
Developers pass large DataFrames or file contents through XComs, which are stored in the Airflow metadata database. Multi-megabyte XCom payloads bloat the DB, slow the scheduler, and can exceed size limits.
Solution
Use XComs only for small metadata — pass large data via S3/GCS paths:
import pandas as pd

# WRONG — passing a large DataFrame through XCom
def extract(**context):
    df = load_huge_dataset()  # 500 MB
    context['ti'].xcom_push(key='data', value=df.to_json())  # bloats the metadata DB

# RIGHT — write to object storage, pass only the S3 path
def extract(**context):
    df = load_huge_dataset()
    path = f's3://bucket/tmp/{context["run_id"]}/extract.parquet'
    df.to_parquet(path)  # writing to an s3:// URL requires s3fs installed
    context['ti'].xcom_push(key='s3_path', value=path)

def transform(**context):
    path = context['ti'].xcom_pull(task_ids='extract', key='s3_path')
    df = pd.read_parquet(path)
    # process...
Why
XComs are serialized to JSON and stored in the Airflow metadata DB (Postgres or MySQL in production, SQLite in local dev). Large payloads cause write contention, slow metadata queries, and can exceed DB column size limits. The DB is a shared resource for all Airflow operations, so bloating it degrades every DAG, not just the offending one.
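To make the footprint difference concrete, here is a self-contained sketch comparing the serialized size of a modest inline payload with the size of a path reference (the 100,000-row dataset and bucket path are illustrative, not from the original pipeline):

```python
import json

# Simulate a modest extract result: 100,000 small records.
rows = [{"id": i, "value": i * 0.5} for i in range(100_000)]

# What xcom_push(value=rows) would write into the metadata DB:
payload = json.dumps(rows)

# What the path-passing pattern writes instead:
path = "s3://bucket/tmp/some_run_id/extract.parquet"

print(f"inline payload: {len(payload):,} bytes")  # a few MB per task, per run
print(f"path reference: {len(path)} bytes")
```

The inline payload is tens of thousands of times larger than the path string, and it is written on every run of every task that pushes it.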
Gotchas
- Default XCom backend has a 48 KB limit in some Airflow versions — configure a custom XCom backend for larger values
- TaskFlow API (@task-decorated functions) auto-pushes return values as XComs — returning a large DataFrame silently stores it in the metadata DB
- XComs persist in the DB across DAG runs unless cleaned up — set do_xcom_push=False on operators whose return values you never read
- Custom XCom backends (S3XComBackend) transparently redirect large values to object storage
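The offload decision a custom backend like S3XComBackend makes at serialization time can be sketched without any Airflow dependency. This is a minimal stand-in: the 48 KB threshold, the bucket name, and the `_upload_to_s3` helper are illustrative assumptions, and a real backend would subclass Airflow's BaseXCom and call boto3:

```python
import json

XCOM_SIZE_THRESHOLD = 48 * 1024  # bytes; illustrative cutoff

def _upload_to_s3(data: bytes, run_id: str, key: str) -> str:
    """Stand-in for an object-store upload; a real backend would use boto3."""
    # boto3.client("s3").put_object(Bucket=..., Key=..., Body=data) in practice
    return f"s3://xcom-bucket/{run_id}/{key}.json"

def serialize_value(value, run_id: str, key: str) -> dict:
    """Mimics a custom XCom backend: small values stay inline in the DB,
    large ones are offloaded and replaced by an object-store reference."""
    data = json.dumps(value).encode()
    if len(data) <= XCOM_SIZE_THRESHOLD:
        return {"inline": value}
    return {"s3_ref": _upload_to_s3(data, run_id, key)}

print(serialize_value({"rows": 12}, "run_1", "stats"))        # stays inline
print(serialize_value(list(range(50_000)), "run_1", "data"))  # offloaded to S3
```

Deserialization does the inverse: if the stored value is an `s3_ref`, fetch and parse the object; otherwise return the inline value directly. Tasks never see the difference.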
Context
Building multi-task Airflow pipelines that process large datasets