Data catalog: automated metadata discovery with Apache Atlas or DataHub
Tags: data catalog, datahub, metadata management, data discovery, data lineage, datahub python emitter
Problem
Analysts spend hours or days tracking down where data lives, what it means, who owns it, and whether it is safe to use, because there is no central inventory of datasets, schemas, and ownership.
Solution
Emit metadata events from pipelines to a data catalog using the DataHub Python emitter:
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass
import datahub.emitter.mce_builder as builder

emitter = DatahubRestEmitter('http://datahub-gms:8080')

# Register a dataset with basic properties
dataset_urn = builder.make_dataset_urn('s3', 'my-bucket/events/date=2024-01-01')
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DatasetPropertiesClass(description='Raw events landed from Kafka'),  # example description
))

# Declare lineage: the S3 dataset is derived from a Kafka topic
mce = builder.make_lineage_mce(
    upstream_urns=[builder.make_dataset_urn('kafka', 'events-topic')],
    downstream_urn=dataset_urn,
)
emitter.emit_mce(mce)

# dbt integration: automatic metadata + lineage via DataHub's dbt ingestion source
# Run after dbt: datahub ingest -c dbt_ingestion.yaml
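The dbt ingestion referenced in the last comment can also be run in-process instead of via the CLI, which is convenient at the end of an orchestrated job. A minimal sketch using DataHub's programmatic Pipeline API; the artifact paths, the snowflake target platform, and the server URL are assumptions to adapt to your project:

from datahub.ingestion.run.pipeline import Pipeline

# In-process equivalent of `datahub ingest -c dbt_ingestion.yaml`.
# Paths and target_platform below are illustrative assumptions.
pipeline = Pipeline.create({
    'source': {
        'type': 'dbt',
        'config': {
            'manifest_path': 'target/manifest.json',  # dbt build artifacts
            'catalog_path': 'target/catalog.json',
            'target_platform': 'snowflake',  # warehouse dbt runs against
        },
    },
    'sink': {
        'type': 'datahub-rest',
        'config': {'server': 'http://datahub-gms:8080'},
    },
})
pipeline.run()
pipeline.raise_from_status()  # fail the job if ingestion errored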
Why
Data catalogs aggregate schema, lineage, ownership, and usage statistics into a searchable interface. Teams find datasets in seconds instead of hours, and data governance teams can track PII columns, data owners, and access patterns across the entire data platform.
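Ownership and PII tracking can be emitted the same way as properties and lineage. A minimal sketch that attaches an owner and a PII tag to the dataset registered above; the user jdoe and the tag name pii are hypothetical placeholders:

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    GlobalTagsClass, OwnerClass, OwnershipClass,
    OwnershipTypeClass, TagAssociationClass,
)

emitter = DatahubRestEmitter('http://datahub-gms:8080')
dataset_urn = builder.make_dataset_urn('s3', 'my-bucket/events/date=2024-01-01')

# Assign a data owner (hypothetical user 'jdoe')
ownership = OwnershipClass(owners=[
    OwnerClass(owner=builder.make_user_urn('jdoe'), type=OwnershipTypeClass.DATAOWNER),
])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=ownership))

# Flag the dataset as containing PII so governance can search for it
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=builder.make_tag_urn('pii'))])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=tags))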
Gotchas
- Manually maintained catalog entries go stale almost immediately — automate metadata emission from pipelines and ingestion tools
- DataHub requires Kafka, Elasticsearch, and MySQL to run — substantial infrastructure for small teams
- DataHub's dbt ingestion source and the OpenLineage integrations provide automatic lineage for dbt and Spark without custom code
- Schema metadata without column-level descriptions is incomplete — require descriptions as part of the data contract (see the sketch after this list)
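One way to make that last contract executable is to validate field descriptions before emitting SchemaMetadataClass. A minimal sketch, assuming a single user_id column and the same emitter and dataset URN as in the Solution:

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OtherSchemaClass, SchemaFieldClass, SchemaFieldDataTypeClass,
    SchemaMetadataClass, StringTypeClass,
)

emitter = DatahubRestEmitter('http://datahub-gms:8080')
dataset_urn = builder.make_dataset_urn('s3', 'my-bucket/events/date=2024-01-01')

fields = [
    SchemaFieldClass(
        fieldPath='user_id',
        type=SchemaFieldDataTypeClass(type=StringTypeClass()),
        nativeDataType='string',
        description='Pseudonymous user identifier',  # required by the contract
    ),
]

# Enforce the contract: refuse to publish schemas with undocumented columns
undocumented = [f.fieldPath for f in fields if not f.description]
if undocumented:
    raise ValueError(f'Columns missing descriptions: {undocumented}')

schema = SchemaMetadataClass(
    schemaName='events',
    platform=builder.make_data_platform_urn('s3'),
    version=0,
    hash='',
    platformSchema=OtherSchemaClass(rawSchema=''),
    fields=fields,
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema))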
Context
Building a data governance layer for a growing data platform