patternpythonMajor
Change Data Capture with Debezium: streaming database changes to Kafka
Viewed 0 times
cdc change data capturedebezium postgres kafkareplication slot postgresbinlog cdcstream database changes
Error Messages
Problem
Polling a source database every N minutes for changed rows is slow, misses deletes, adds load to the source, and produces high latency for downstream consumers.
Solution
Use Debezium CDC to stream database change events from the transaction log:
# Debezium connector config (deployed via Kafka Connect REST API)
connector_config = {
'name': 'postgres-cdc',
'config': {
'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
'database.hostname': 'postgres',
'database.port': '5432',
'database.user': 'debezium',
'database.password': 'secret',
'database.dbname': 'app_db',
'database.server.name': 'pgserver1',
'table.include.list': 'public.orders,public.customers',
'plugin.name': 'pgoutput',
'publication.autocreate.mode': 'filtered',
}
}
# Events on Kafka topic pgserver1.public.orders:
# { 'op': 'c', 'before': null, 'after': {...} } -- INSERT
# { 'op': 'u', 'before': {...}, 'after': {...} } -- UPDATE
# { 'op': 'd', 'before': {...}, 'after': null } -- DELETE
# Debezium connector config (deployed via Kafka Connect REST API)
connector_config = {
'name': 'postgres-cdc',
'config': {
'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
'database.hostname': 'postgres',
'database.port': '5432',
'database.user': 'debezium',
'database.password': 'secret',
'database.dbname': 'app_db',
'database.server.name': 'pgserver1',
'table.include.list': 'public.orders,public.customers',
'plugin.name': 'pgoutput',
'publication.autocreate.mode': 'filtered',
}
}
# Events on Kafka topic pgserver1.public.orders:
# { 'op': 'c', 'before': null, 'after': {...} } -- INSERT
# { 'op': 'u', 'before': {...}, 'after': {...} } -- UPDATE
# { 'op': 'd', 'before': {...}, 'after': null } -- DELETE
Why
CDC reads from the database replication log (WAL for Postgres, binlog for MySQL), capturing every INSERT/UPDATE/DELETE with microsecond latency and zero polling load on the source. Deletes are captured as explicit events, solving the main limitation of updated_at-based polling.
Gotchas
- Postgres requires wal_level=logical and a replication slot — replication slots accumulate WAL until consumed; monitor slot lag
- Debezium initial snapshot may take hours on large tables — plan for downtime or use snapshot.mode=initial_only
- Schema changes in the source (ALTER TABLE) require Debezium schema history topic to replay correctly
- A stale replication slot with no active consumer fills WAL indefinitely and can cause disk exhaustion
Context
Replacing polling-based incremental loads with real-time CDC streaming
Revisions (0)
No revisions yet.