pattern · python · Moderate

Data quality monitoring: Great Expectations suite as pipeline gate

Submitted by: @seed
great expectations validation, data quality gate, pipeline data check, expectation suite, great expectations checkpoint

Error Messages

great_expectations.exceptions.ExpectationValidationError

Problem

Data quality issues in production are discovered by analysts or customers, not by the pipeline itself. There is no automated system to detect schema drift, unexpected nulls, or out-of-range values as data flows through the pipeline.

Solution

Add a Great Expectations validation step to the ETL pipeline and stop the run when it fails:

import great_expectations as gx

# df is the pandas DataFrame produced by the upstream ETL step
context = gx.get_context()

# Define expectations (read_dataframe returns a Validator in the 0.16 to 0.18 Fluent API)
batch = context.sources.pandas_default.read_dataframe(df)
batch.expect_column_values_to_not_be_null('order_id')
batch.expect_column_values_to_be_unique('order_id')
batch.expect_column_values_to_be_between('amount', min_value=0, max_value=100_000)
batch.expect_column_values_to_be_in_set('status', ['pending', 'completed', 'cancelled'])
batch.expect_table_row_count_to_be_between(min_value=1000, max_value=10_000_000)

# Run validation against every expectation defined above
results = batch.validate()
if not results.success:
    # Raising here stops the pipeline before the data is loaded downstream
    raise ValueError(f'Data quality check failed: {results}')

Why

Great Expectations runs a configurable suite of statistical and logical checks against a dataset and produces a structured validation result. Raising an exception on failure stops the pipeline before bad data reaches downstream consumers.
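
The gate logic itself can be sketched without the library installed. The snippet below assumes the validation result has already been converted to a plain dict with a top-level success flag and a results list whose entries carry success and expectation_config. Those key names follow the 0.x result layout as an assumption and may differ in other versions. DataQualityError, failed_expectations, enforce_gate, and the sample result are hypothetical, introduced only for illustration.

```python
class DataQualityError(Exception):
    """Raised when a validation result reports at least one failed expectation."""


def failed_expectations(result: dict) -> list[str]:
    """Return a short label for every failed expectation in the result."""
    failures = []
    for entry in result.get("results", []):
        if entry.get("success"):
            continue
        config = entry.get("expectation_config", {})
        # Assumed key names; adjust to the result layout of your GX version.
        name = config.get("expectation_type", "unknown_expectation")
        column = config.get("kwargs", {}).get("column")
        failures.append(f"{name}({column})" if column else name)
    return failures


def enforce_gate(result: dict) -> None:
    """Stop the pipeline by raising if the overall validation did not succeed."""
    if result.get("success"):
        return
    failures = failed_expectations(result)
    raise DataQualityError("Data quality check failed: " + ", ".join(failures))


# Hand-written sample in the assumed layout, not real library output.
sample_result = {
    "success": False,
    "results": [
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "order_id"},
            },
        },
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_unique",
                "kwargs": {"column": "order_id"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_table_row_count_to_be_between",
                "kwargs": {"min_value": 1000, "max_value": 10_000_000},
            },
        },
    ],
}
```

Calling enforce_gate(sample_result) raises DataQualityError with the two failed expectations named in the message, which is usually easier to read in pipeline logs than the full result object. A dedicated exception type also lets an orchestrator distinguish data quality failures from other errors.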

Gotchas

  • The Fluent API used here (context.sources) belongs to the 0.16 to 0.18 releases; it differs significantly from the older V2 API, and GX 1.0 changed it again, so check which version your codebase uses
  • expect_column_mean_to_be_between is a statistical expectation that may fail on legitimate shifts in the data distribution, so tune its bounds carefully
  • Data Docs generation requires a Data Context with a configured store; skip it in simple use cases
  • Initial expectations are often too strict; start with null checks and row count bounds, then add statistical checks over time
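
One way to tune bounds for a statistical check such as expect_column_mean_to_be_between is to derive them from the column mean observed in earlier pipeline runs. A minimal sketch, assuming a list of historical per-run means is available; mean_bounds, the sample history, and the tolerance of three standard deviations are illustrative choices, not library defaults.

```python
import statistics


def mean_bounds(historical_means: list[float], k: float = 3.0) -> tuple[float, float]:
    """Return (min_value, max_value) as the historical mean plus or minus k sample standard deviations."""
    if len(historical_means) < 2:
        raise ValueError("need at least two historical values to estimate spread")
    center = statistics.fmean(historical_means)
    spread = statistics.stdev(historical_means)
    return center - k * spread, center + k * spread


# Hypothetical mean of the 'amount' column from five earlier runs.
history = [102.0, 98.0, 100.0, 101.0, 99.0]
min_value, max_value = mean_bounds(history)
```

The resulting pair can be passed as min_value and max_value to the mean expectation. A wider k reduces false alarms at the cost of missing smaller shifts, so the value is worth revisiting as more history accumulates.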

Context

Adding automated data quality validation to ETL pipelines
