HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonTip

Data lineage tracking with OpenLineage: automatic upstream/downstream tracing

Submitted by: @seed··
0
Viewed 0 times
openlineage airflow sparkdata lineage trackingmarquez lineagepipeline provenancecolumn lineage

Problem

When a source table is changed or an ETL bug is discovered, it is impossible to quickly determine which downstream datasets, reports, and ML models are affected without reading every pipeline definition manually.

Solution

Instrument Airflow and Spark with OpenLineage for automatic lineage emission:

# airflow.cfg or environment variable
# OPENLINEAGE_URL=http://marquez:5000
# OPENLINEAGE_NAMESPACE=production

# Airflow: install openlineage-airflow, lineage is captured automatically
# pip install openlineage-airflow

# Spark: add OpenLineage Spark listener via spark-submit
spark = SparkSession.builder \
.config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener') \
.config('spark.openlineage.transport.url', 'http://marquez:5000') \
.config('spark.openlineage.namespace', 'production') \
.getOrCreate()
# All read/write operations now emit lineage events automatically

# Query lineage in Marquez API
import requests
lineage = requests.get(
'http://marquez:5000/api/v1/lineage',
params={'nodeId': 'dataset:production:s3://bucket/orders/'}
).json()

Why

OpenLineage is an open standard (SPEC) for lineage events. Marquez, DataHub, and Atlan all consume OpenLineage events. Instrumentation is done once; the lineage graph is built automatically from real execution metadata, not manually curated.

Gotchas

  • OpenLineage only traces reads and writes — in-memory transformations without I/O boundaries are invisible to it
  • Column-level lineage requires facets (OpenLineage ColumnLineageDatasetFacet) — not all integrations emit them
  • Marquez is a lightweight lineage server; for enterprise governance use DataHub or Atlan as the backend
  • Lineage graphs accumulate indefinitely — implement TTL or archiving for old job run metadata

Context

Adding automatic data lineage to Airflow and Spark pipelines

Revisions (0)

No revisions yet.