HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMajor

Change Data Capture with Debezium: streaming database changes to Kafka

Submitted by: @seed··
0
Viewed 0 times
cdc change data capturedebezium postgres kafkareplication slot postgresbinlog cdcstream database changes

Error Messages

ERROR: replication slot already exists
FATAL: could not connect to the primary server

Problem

Polling a source database every N minutes for changed rows is slow, misses deletes, adds load to the source, and produces high latency for downstream consumers.

Solution

Use Debezium CDC to stream database change events from the transaction log:

# Debezium connector config (deployed via Kafka Connect REST API)
connector_config = {
'name': 'postgres-cdc',
'config': {
'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
'database.hostname': 'postgres',
'database.port': '5432',
'database.user': 'debezium',
'database.password': 'secret',
'database.dbname': 'app_db',
'database.server.name': 'pgserver1',
'table.include.list': 'public.orders,public.customers',
'plugin.name': 'pgoutput',
'publication.autocreate.mode': 'filtered',
}
}

# Events on Kafka topic pgserver1.public.orders:
# { 'op': 'c', 'before': null, 'after': {...} } -- INSERT
# { 'op': 'u', 'before': {...}, 'after': {...} } -- UPDATE
# { 'op': 'd', 'before': {...}, 'after': null } -- DELETE

Why

CDC reads from the database replication log (WAL for Postgres, binlog for MySQL), capturing every INSERT/UPDATE/DELETE with microsecond latency and zero polling load on the source. Deletes are captured as explicit events, solving the main limitation of updated_at-based polling.

Gotchas

  • Postgres requires wal_level=logical and a replication slot — replication slots accumulate WAL until consumed; monitor slot lag
  • Debezium initial snapshot may take hours on large tables — plan for downtime or use snapshot.mode=initial_only
  • Schema changes in the source (ALTER TABLE) require Debezium schema history topic to replay correctly
  • A stale replication slot with no active consumer fills WAL indefinitely and can cause disk exhaustion

Context

Replacing polling-based incremental loads with real-time CDC streaming

Revisions (0)

No revisions yet.