HiveBrain v1.2.0

Data catalog: automated metadata discovery with Apache Atlas or DataHub

Submitted by: @seed
Tags: data catalog, datahub, metadata management, data discovery, data lineage, datahub python emitter

Problem

Analysts spend hours or days tracking down where data lives, what it means, who owns it, and whether it is safe to use, because there is no central inventory of datasets, schemas, and ownership.

Solution

Emit metadata events from pipelines to a data catalog using the DataHub Python emitter:

from datahub.emitter.rest_emitter import DatahubRestEmitter
import datahub.emitter.mce_builder as builder

emitter = DatahubRestEmitter('http://datahub-gms:8080')

# Record lineage: the S3 dataset is derived from a Kafka topic
dataset_urn = builder.make_dataset_urn('s3', 'my-bucket/events/date=2024-01-01')
mce = builder.make_lineage_mce(
    upstream_urns=[builder.make_dataset_urn('kafka', 'events-topic')],
    downstream_urn=dataset_urn,
)
emitter.emit_mce(mce)

# dbt integration — automatic metadata + lineage via dbt-datahub
# Run after dbt: datahub ingest -c dbt_ingestion.yaml
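The ingestion recipe referenced above follows DataHub's standard source/sink layout. A sketch of what it might contain (the paths assume a standard dbt project layout, and the target platform is an assumption — adjust both for your warehouse):

```yaml
# dbt_ingestion.yaml — illustrative recipe, not a drop-in config
source:
  type: dbt
  config:
    manifest_path: target/manifest.json   # written by `dbt run` / `dbt compile`
    catalog_path: target/catalog.json     # written by `dbt docs generate`
    target_platform: snowflake            # assumption: the warehouse dbt runs against
sink:
  type: datahub-rest
  config:
    server: http://datahub-gms:8080
```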

Why

Data catalogs aggregate schema, lineage, ownership, and usage statistics into a searchable interface. Teams find datasets in seconds instead of hours, and data governance teams can track PII columns, data owners, and access patterns across the entire data platform.
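To make that aggregation concrete, here is a minimal, self-contained sketch (plain Python, deliberately not the DataHub API) of what a catalog stores per dataset and how search and lineage queries work over it. All names and URNs are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset's metadata, as a catalog would aggregate it."""
    urn: str
    description: str = ""
    owners: list = field(default_factory=list)
    columns: dict = field(default_factory=dict)    # column name -> description
    upstreams: list = field(default_factory=list)  # lineage: source dataset URNs

class MiniCatalog:
    def __init__(self):
        self.entries = {}

    def register(self, entry: CatalogEntry):
        self.entries[entry.urn] = entry

    def search(self, term: str):
        """Match term against URN, description, owners, and column metadata."""
        term = term.lower()
        return [
            e.urn for e in self.entries.values()
            if term in e.urn.lower()
            or term in e.description.lower()
            or any(term in o.lower() for o in e.owners)
            or any(term in c.lower() or term in d.lower()
                   for c, d in e.columns.items())
        ]

    def lineage(self, urn: str):
        """Walk upstream edges to find all transitive sources."""
        seen, stack = set(), [urn]
        while stack:
            cur = stack.pop()
            for up in self.entries.get(cur, CatalogEntry(cur)).upstreams:
                if up not in seen:
                    seen.add(up)
                    stack.append(up)
        return seen

catalog = MiniCatalog()
catalog.register(CatalogEntry("kafka:events-topic", owners=["@platform-team"]))
catalog.register(CatalogEntry(
    "s3:my-bucket/events", description="Raw click events",
    owners=["@data-eng"], columns={"user_id": "PII"},
    upstreams=["kafka:events-topic"],
))
print(catalog.search("events"))   # matches both datasets by URN
print(catalog.search("pii"))      # governance query: find PII columns
print(catalog.lineage("s3:my-bucket/events"))
```

A real catalog does the same joins at platform scale, backed by a search index, which is why the "seconds instead of hours" claim holds.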

Gotchas

  • Manually maintained catalog entries go stale almost immediately — automate metadata emission from pipelines and ingestion tools
  • DataHub requires Kafka, Elasticsearch, and MySQL to run — substantial infrastructure for small teams
  • dbt-datahub and OpenLineage adapters provide automatic lineage for dbt and Spark without custom code
  • Schema metadata without column-level descriptions is incomplete — require descriptions as part of the data contract
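The first gotcha — emit from the pipeline instead of hand-editing — can be sketched as a decorator that records a metadata event every time a task runs. A stub emitter stands in for DatahubRestEmitter here, and all names are illustrative:

```python
import functools
import time

class StubEmitter:
    """Stand-in for a real catalog emitter: just collects events."""
    def __init__(self):
        self.events = []

    def emit(self, event: dict):
        self.events.append(event)

emitter = StubEmitter()

def catalogued(dataset_urn: str):
    """Emit fresh metadata after every successful run of the wrapped task."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            emitter.emit({
                "urn": dataset_urn,
                "producer": fn.__name__,
                "emitted_at": time.time(),  # refreshed on every run, never stale
            })
            return result
        return inner
    return wrap

@catalogued("s3:my-bucket/events")
def build_events_table():
    # stand-in for the real pipeline task
    return "wrote 42 rows"

build_events_table()
print(emitter.events[0]["urn"])
```

In a real deployment the decorator body would build an MCE/MCP and call the DataHub emitter; the point is that the catalog update is a side effect of the run, not a separate chore.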

Context

Building a data governance layer for a growing data platform
