Overview
Distributed systems hold the same fact in multiple places: a payment lives in the core ledger, a search index, a fraud feature store, and an analytics warehouse. Keeping these in sync is the single most consequential distributed-systems problem in financial services. Get it wrong and the bank shows different balances on the app vs the statement — and you find out in a complaint.
The robust pattern is change data capture (CDC): read the transaction log of the source database, ship the changes through Kafka, apply them downstream. Debezium is the de-facto open-source implementation; for IBM Db2 and z/OS, IBM’s InfoSphere Data Replication (IIDR) plays the same role.
If two services write to the same data, you have a synchronisation problem you can’t solve with synchronisation. The first move is always: pick one writer, derive the rest. CDC then becomes feasible because it has a single log to follow.
The dual-write problem
The naive pattern: when an event occurs, write to the database and publish to Kafka. The two writes are not atomic, which leaves four failure modes:
- The database commit succeeds but the publish fails: downstream never hears about the event.
- The publish succeeds but the database transaction rolls back: downstream acts on an event that never happened.
- The process crashes between the two writes: one of the above, chosen nondeterministically.
- A retry after a timeout publishes the event twice, producing duplicates or reordering.
The solution is to make the two writes one write. Two patterns achieve this: CDC (read the DB log, ship to Kafka) and outbox (write the event to a DB table inside the same transaction, then ship the table’s rows via CDC). Outbox is CDC applied to your own table; you’ll usually use both.
CDC with Debezium
Debezium is a Kafka Connect source connector that reads the database’s replication log (Postgres logical decoding, MySQL binlog, Db2 transaction log, SQL Server CDC). Each row change becomes a Kafka message with the before-image, after-image, operation type, and a position marker (LSN, GTID, log sequence).
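A Postgres change event for an UPDATE looks roughly like this (payload only, fields abbreviated, values illustrative):

```json
{
  "op": "u",
  "before": { "id": "pay-1", "status": "SUBMITTED" },
  "after":  { "id": "pay-1", "status": "SETTLED" },
  "source": { "connector": "postgresql", "table": "payments", "lsn": 24023128 },
  "ts_ms": 1700000000000
}
```

The op field is c/u/d for insert/update/delete (r for snapshot reads); the source block carries the position marker — here the Postgres LSN.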
Transactional outbox
The application writes the business row and the event into the same transaction. An outbox table holds events; CDC ships them to Kafka. If the transaction commits, both are present; if it rolls back, neither is. Atomicity solved.
CREATE TABLE outbox (
  id             uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_id   varchar(64) NOT NULL,
  aggregate_type varchar(64) NOT NULL,
  event_type     varchar(64) NOT NULL,
  payload        jsonb NOT NULL,
  created_at     timestamptz NOT NULL DEFAULT now()
);
-- Debezium — using its outbox event router SMT — expects:
-- id → message id (deduplication key downstream)
-- aggregate_id → Kafka message key (partitioning + ordering)
-- aggregate_type → topic suffix (one outbox → many topics)
-- event_type → header for filtering
-- payload → message value
@Transactional
public void submit(Payment p) {
  paymentRepo.save(p);                 // business row
  outboxRepo.save(new OutboxEvent(     // event in same txn
      p.getId(),                       // aggregate_id
      "Payment",                       // aggregate_type → payment topic
      "PaymentSubmitted",              // event_type
      Json.toJson(p)));                // payload
  // commit happens at method exit; both rows appear or neither does
}
Debezium reads the outbox table’s WAL entries, applies the Outbox Event Router SMT (Single Message Transform) to route by aggregate_type, and produces to Payment.events with the aggregate_id as message key. The original outbox row can be deleted asynchronously — the WAL entry is what carried the event.
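Conceptually, the record that lands on Kafka after routing looks like this (a sketch, not a wire format — under Avro the value is binary, and the field names here are illustrative):

```json
{
  "topic": "Payment.events",
  "key": "pay-1",
  "headers": { "id": "4e6f0a1c-2b3d-4c5e-8f90-123456789abc", "eventType": "PaymentSubmitted" },
  "value": { "paymentId": "pay-1", "amount": "100.00", "status": "SUBMITTED" }
}
```

The outbox row id travels as a header, which is what the idempotent consumer later uses to dedupe.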
Deploying CDC
- Enable logical replication on the source. Postgres: wal_level=logical, a REPLICATION user, a publication, and a replication slot. MySQL: row-format binlog. Db2: DATA CAPTURE CHANGES.
- Decide the snapshot strategy. The initial snapshot of existing rows is the heaviest part of CDC startup: initial for the first deployment; schema_only when historical state isn’t needed; incremental for tables too large to read in one go.
- Configure the connector. Filter to only the tables CDC actually needs (otherwise you’re shipping every internal table’s changes). Pick the right SMT — the Outbox Event Router, or ExtractNewRecordState to unwrap plain table changes.
- Topic conventions. One topic per table by default. With outbox, route by aggregate_type → one topic per business event class (Payment.events, Customer.events).
- Consumer-side dedupe. CDC is at-least-once. Use the source LSN or outbox row id as the idempotency key downstream.
- Wire alerts. Connector lag (LSN behind), restart count, snapshot duration, slot disk usage. A slot with no consumer retains WAL forever and fills the source DB’s disk — this is the single most common Debezium production incident.
Connector configuration
{
  "name": "payments-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "payments-db.svc.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/db.pwd:db.pwd}",
    "database.dbname": "payments",
    "database.server.name": "payments",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_payments",
    "publication.name": "dbz_payments",
    "table.include.list": "public.outbox",
    "snapshot.mode": "never",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.route.topic.replacement": "${routedByValue}.events",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.table.field.event.id": "id",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "heartbeat.interval.ms": "10000",
    "heartbeat.action.query": "INSERT INTO heartbeat (id, ts) VALUES (1, now()) ON CONFLICT (id) DO UPDATE SET ts = now()"
  }
}
The heartbeat.action.query is the silent hero. Without traffic on the source DB the WAL doesn’t advance and the slot holds WAL forever; an idle outbox table can fill the DB’s disk in days. The heartbeat keeps WAL flowing.
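The heartbeat action query assumes a single-row table along these lines (a sketch — the name and shape must match whatever heartbeat.action.query targets):

```sql
CREATE TABLE heartbeat (
  id int PRIMARY KEY DEFAULT 1,          -- single row; ON CONFLICT (id) updates it in place
  ts timestamptz NOT NULL DEFAULT now()  -- touched every heartbeat interval
);
```

Each heartbeat write generates a WAL entry, which lets the replication slot confirm progress and release retained segments.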
Idempotent consumer
CDC delivers at-least-once. The consumer must dedupe. The simplest reliable pattern: a processed-events table, keyed by the outbox event id, written in the same transaction as the side effect.
@KafkaListener(topics = "Payment.events")
public void onEvent(ConsumerRecord<String, Payment> r,
                    Acknowledgment ack) {
  // header values are raw bytes — decode them; toString() on a byte[] is garbage
  UUID eventId = UUID.fromString(new String(
      r.headers().lastHeader("id").value(), StandardCharsets.UTF_8));
  int inserted = jdbc.update(
      "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT DO NOTHING",
      eventId);
  if (inserted == 0) { ack.acknowledge(); return; } // duplicate — already processed
  searchClient.index(r.value());
  ack.acknowledge();
}
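The processed_events table needs little more than the key (a sketch; the retention column is an assumption for clean-up jobs):

```sql
CREATE TABLE processed_events (
  event_id     uuid PRIMARY KEY,                      -- outbox event id from the message header
  processed_at timestamptz NOT NULL DEFAULT now()     -- lets a job purge old entries
);
```

The PRIMARY KEY is what makes ON CONFLICT DO NOTHING report zero rows inserted on a duplicate delivery.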
Schema drift
The source DDL changes — a new column, a renamed field, a dropped index. CDC keeps streaming; consumers may break silently if their deserialiser is strict. Three rules:
- Use Schema Registry with BACKWARD compatibility. New columns become optional fields with defaults. Removing a column requires explicit consumer migration first.
- Pin column types in the connector. Don’t let CDC infer; declare the schema in the connector configuration so a type change is a CI failure, not a runtime surprise.
- Always alert on schema-evolution events. Debezium emits a schema-change topic; route it to the platform team for review.
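Under BACKWARD compatibility, adding a column means the new Avro field must be optional with a default, so a reader on the new schema can still decode old records — roughly like this (record and field names illustrative):

```json
{
  "type": "record",
  "name": "PaymentSubmitted",
  "fields": [
    { "name": "payment_id", "type": "string" },
    { "name": "amount", "type": "string" },
    { "name": "channel", "type": ["null", "string"], "default": null }
  ]
}
```

The union with null plus the default is what the registry checks; a new required field without a default would be rejected as incompatible.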
Reconciliation
Reconciliation is the audit that catches what CDC missed: lost events from a slot reset, applied-twice events from a manual replay, downstream divergence from a bad consumer.
Pattern: a daily job hashes a window of source rows and the corresponding downstream rows; any mismatch is an alert. Don’t skip this. CDC alone is not an audit story to a regulator.
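The hash-window comparison can be sketched as follows, assuming in-memory sorted maps stand in for the source and downstream row queries (row id → serialised row); class and method names are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.SortedMap;
import java.util.TreeMap;

public class ReconcileWindow {

    // Hash a window of rows in deterministic (sorted-key) order.
    static String windowHash(SortedMap<String, String> rows) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            rows.forEach((id, row) -> {
                md.update(id.getBytes(StandardCharsets.UTF_8));
                md.update(row.getBytes(StandardCharsets.UTF_8));
            });
            return HexFormat.of().formatHex(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    // Compare source and downstream for one window; a mismatch is an alert.
    static boolean matches(SortedMap<String, String> source,
                           SortedMap<String, String> downstream) {
        return windowHash(source).equals(windowHash(downstream));
    }

    public static void main(String[] args) {
        SortedMap<String, String> src = new TreeMap<>();
        src.put("pay-1", "{\"amount\":100}");
        src.put("pay-2", "{\"amount\":250}");

        SortedMap<String, String> dst = new TreeMap<>(src);
        System.out.println(matches(src, dst));   // in sync: true

        dst.put("pay-2", "{\"amount\":999}");    // downstream diverged
        System.out.println(matches(src, dst));   // mismatch: false → alert
    }
}
```

In practice the windows would come from SQL (e.g. rows where created_at falls in yesterday), and only the two digests cross the network — not the rows themselves.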
Common pitfalls
A Debezium slot with no consumer or a stuck consumer retains WAL indefinitely. Source DB disk fills, write stops, application down. Always alert on slot LSN lag; always have a runbook to drop the slot if Debezium is being decommissioned.
Initial snapshot of a 500M-row table at peak traffic locks for hours. Run the initial snapshot during a maintenance window with an incremental strategy or off a read replica.
The outbox grows; without an index on created_at a clean-up job locks the table. Index the column you’ll use to delete by, and partition by month if volume is high.
Database changes made directly via SQL bypass application code — including any application-level outbox writes. CDC catches the row change but not the “event” the application would have emitted. Treat direct SQL writes as out of bounds, or use the table-CDC pattern instead of outbox for those flows.
When not to use CDC
- The source DB is read-only or rarely changes. A nightly batch export is simpler.
- The downstream needs joins across multiple tables. CDC ships per-table changes; reconstructing joins is hard. Consider materialised views or stream processing instead.
- The DB has no replication log. Some legacy systems and proprietary platforms don’t expose one. Application-level outbox or polling is the only option.
- You need synchronous propagation. CDC is async by definition. For strong cross-system consistency, use a database-level distributed transaction or co-locate the systems.