pattern | python | Moderate
Data quality monitoring: Great Expectations suite as pipeline gate
great expectations validation, data quality gate, pipeline data check, expectation suite, great expectations checkpoint
Problem
Data quality issues in production are discovered by analysts or customers, not by the pipeline itself. There is no automated system to detect schema drift, unexpected nulls, or out-of-range values as data flows through the pipeline.
Solution
Add a Great Expectations validation step to the ETL pipeline and fail the run when any expectation is not met:
import great_expectations as gx

context = gx.get_context()

# `df` is the pandas DataFrame produced by the upstream pipeline step.
# In the Fluent API, read_dataframe() returns a Validator for that data.
validator = context.sources.pandas_default.read_dataframe(df)

# Define expectations
validator.expect_column_values_to_not_be_null('order_id')
validator.expect_column_values_to_be_unique('order_id')
validator.expect_column_values_to_be_between('amount', min_value=0, max_value=100_000)
validator.expect_column_values_to_be_in_set('status', ['pending', 'completed', 'cancelled'])
validator.expect_table_row_count_to_be_between(min_value=1000, max_value=10_000_000)

# Run validation against every expectation registered above
results = validator.validate()
if not results.success:
    # Stop the pipeline before bad data reaches downstream consumers
    raise ValueError(f'Data quality check failed: {results}')
Why
Great Expectations runs a configurable suite of statistical and logical checks against a dataset and produces a structured validation result. Raising an exception on failure stops the pipeline before bad data reaches downstream consumers.
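Dumping the whole validation result into the exception message can produce a very long error. As a rough sketch, the gate could instead report only the failed expectations. The helper names `summarize_failures` and `enforce_gate` are hypothetical, and the dict layout (`results`, `expectation_config`, `expectation_type`, `kwargs`, `unexpected_count`) is an assumption based on the JSON form of a 0.x validation result; verify it against the version you run.

```python
from typing import List


def summarize_failures(result: dict) -> List[str]:
    """Return one short line per failed expectation in a validation result dict."""
    lines = []
    for item in result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config") or {}
        # Assumption: 0.x releases store the name under "expectation_type";
        # fall back to "type" in case a newer release uses that key instead.
        name = config.get("expectation_type") or config.get("type") or "unknown_expectation"
        # Table-level expectations have no "column" kwarg.
        column = (config.get("kwargs") or {}).get("column", "<table>")
        unexpected = (item.get("result") or {}).get("unexpected_count")
        detail = f" ({unexpected} unexpected values)" if unexpected is not None else ""
        lines.append(f"{name} on {column}{detail}")
    return lines


def enforce_gate(result: dict) -> None:
    """Raise ValueError with a compact summary when validation did not succeed."""
    if result.get("success"):
        return
    failures = summarize_failures(result)
    raise ValueError("Data quality check failed: " + "; ".join(failures))
```

With a result object from the validation step, this would be called as `enforce_gate(results.to_json_dict())`, assuming the result exposes `to_json_dict()` as the 0.x result classes do.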
Gotchas
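- The Great Expectations API has changed significantly between versions: the Fluent API (`context.sources...`) used above differs from the older block-config API, and GX 1.0 changed it again (for example, `context.sources` became `context.data_sources`). Check which version your codebase pins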
- expect_column_mean_to_be_between is a statistical expectation that may fail on legitimate data distribution shifts — tune bounds carefully
- Data Docs generation requires a Data Context with a configured store; skip it in simple use cases
- New suites tend to be too strict at first; start with null checks and row count bounds, then add statistical checks over time
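One possible way to tune bounds for a statistical check such as expect_column_mean_to_be_between is to derive them from past runs rather than guessing. The sketch below is only an illustration: the helper `mean_bounds`, the sample history, and the choice of k standard deviations are assumptions, not a Great Expectations feature.

```python
from statistics import mean, stdev


def mean_bounds(historical_means, k=3.0):
    """Return (min_value, max_value) as the historical center +/- k sample standard deviations."""
    if len(historical_means) < 2:
        raise ValueError("need at least two historical values to estimate spread")
    center = mean(historical_means)
    spread = stdev(historical_means)
    return center - k * spread, center + k * spread


# Hypothetical column means of `amount` observed in earlier pipeline runs
history = [48.2, 51.0, 49.5, 50.3, 52.1]
low, high = mean_bounds(history, k=3.0)
```

The resulting values could then be passed as `min_value=low, max_value=high` to expect_column_mean_to_be_between. A wider k reduces false alarms on legitimate distribution shifts, at the cost of missing smaller anomalies.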
Context
Adding automated data quality validation to ETL pipelines