gotcha · python · Major

Airflow XComs: passing data between tasks and when not to use them

Submitted by: @seed
Keywords: airflow xcom large data, xcom size limit, airflow pass data between tasks, xcom s3 path, airflow metadata database

Error Messages

xcom value exceeds limit
OperationalError: string too long

Problem

Developers pass large DataFrames or file contents through XComs, which are stored in the Airflow metadata database. Multi-megabyte XCom values bloat the DB, slow the scheduler, and run into hard size limits.

Solution

Use XComs only for small metadata — pass large data via S3/GCS paths:

import pandas as pd

# WRONG: passing a large DataFrame through XCom
def extract(**context):
    df = load_huge_dataset()  # 500 MB
    context['ti'].xcom_push(key='data', value=df.to_json())  # serialized into the metadata DB

# RIGHT: write to object storage, pass only the S3 path
def extract(**context):
    df = load_huge_dataset()
    path = f's3://bucket/tmp/{context["run_id"]}/extract.parquet'
    df.to_parquet(path)  # needs s3fs installed for s3:// URLs
    context['ti'].xcom_push(key='s3_path', value=path)  # a short string, not 500 MB

def transform(**context):
    path = context['ti'].xcom_pull(task_ids='extract', key='s3_path')
    df = pd.read_parquet(path)
    # process...
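
The same pattern applies with the TaskFlow API, where return values are auto-pushed as XComs: return the path so the stored value stays small. A sketch assuming Airflow 2.x; load_huge_dataset and the bucket are placeholders carried over from the example above:

from airflow.decorators import task
from airflow.operators.python import get_current_context
import pandas as pd

@task
def extract() -> str:
    df = load_huge_dataset()  # placeholder helper from the example above
    run_id = get_current_context()['run_id']
    path = f's3://bucket/tmp/{run_id}/extract.parquet'
    df.to_parquet(path)
    return path  # the short string, not the DataFrame, becomes the XCom

@task
def transform(path: str):
    df = pd.read_parquet(path)
    # process...

# inside a DAG definition: transform(extract())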

Why

XComs are serialized to JSON by the default backend and stored in the Airflow metadata DB (typically Postgres or MySQL; SQLite only in local development). Large payloads cause write contention, slow metadata queries, and can exceed DB column size limits. The DB is a shared resource for all Airflow operations, so bloated XComs degrade the whole deployment.
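
Because the real ceiling depends on the backing database, one defensive pattern is to measure the payload before pushing. A minimal sketch; safe_xcom_push and the 48 KB threshold are illustrative choices, not an Airflow API:

import json

MAX_XCOM_BYTES = 48 * 1024  # conservative ceiling; the real limit depends on the DB

def safe_xcom_push(ti, key, value):
    # Mirror the default backend's JSON serialization to measure the payload
    payload = json.dumps(value).encode('utf-8')
    if len(payload) > MAX_XCOM_BYTES:
        raise ValueError(
            f'XCom {key!r} would be {len(payload)} bytes; push a storage path instead'
        )
    ti.xcom_push(key=key, value=value)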

Gotchas

  • The default XCom backend has a 48 KB size limit in some Airflow versions; configure a custom backend for larger values
  • The TaskFlow API (@task-decorated functions) auto-pushes return values as XComs, so returning a DataFrame is a bug; return a storage path instead (see the TaskFlow sketch above)
  • XComs from previous DAG runs persist in the DB; prune old rows (Airflow 2.3+ ships airflow db clean) and set do_xcom_push=False on operators whose results you never read
  • Custom XCom backends (e.g. an S3-backed backend) can transparently redirect large values to object storage; a minimal sketch follows this list
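
A minimal sketch of such a backend, assuming the Airflow 2.x BaseXCom hooks; MyS3XComBackend and the bucket name are illustrative, not a class Airflow ships:

import uuid
import pandas as pd
from airflow.models.xcom import BaseXCom

BUCKET = 's3://my-xcom-bucket'  # hypothetical bucket

class MyS3XComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value, **kwargs):
        # Divert DataFrames to object storage; only the path reaches the DB
        if isinstance(value, pd.DataFrame):
            path = f'{BUCKET}/{uuid.uuid4()}.parquet'
            value.to_parquet(path)  # needs s3fs installed
            value = path
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith(BUCKET):
            return pd.read_parquet(value)
        return value

Airflow picks the backend up via the xcom_backend setting in the [core] config section, given as a dotted import path to the class.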

Context

Building multi-task Airflow pipelines that process large datasets
