Overview

Distributed systems hold the same fact in multiple places: a payment lives in the core ledger, a search index, a fraud feature store, and an analytics warehouse. Keeping these in sync is the single most consequential distributed-systems problem in financial services. Get it wrong and the bank shows different balances on the app vs the statement — and you find out in a complaint.

The robust pattern is change data capture (CDC): read the transaction log of the source database, ship the changes through Kafka, apply them downstream. Debezium is the de-facto open-source implementation; for IBM Db2 and z/OS, IBM’s InfoSphere Data Replication (IIDR) plays the same role.

Single source of truth, single write path

If two services write to the same data, you have a synchronisation problem you can’t solve with synchronisation. The first move is always: pick one writer, derive the rest. CDC then becomes feasible because it has a single log to follow.

The dual-write problem

The naive pattern: when an event occurs, write to the database and publish to Kafka. The two writes are not atomic. Four failure modes:

  • The DB commit succeeds but the Kafka publish fails — downstream never hears about the event.
  • The Kafka publish succeeds but the DB transaction rolls back — downstream hears about an event that never happened.
  • The process crashes between the two writes — one side has the fact, the other doesn't, and nothing retries.
  • A retry after an ambiguous failure publishes twice — duplicates, possibly out of order.

The solution is to make the two writes one write. Two patterns achieve this: CDC (read the DB log, ship to Kafka) and outbox (write the event to a DB table inside the same transaction, then ship the table’s rows via CDC). Outbox is CDC applied to your own table; you’ll usually use both.

CDC with Debezium

Debezium is a Kafka Connect source connector that reads the database’s replication log (Postgres logical decoding, MySQL binlog, Db2 transaction log, SQL Server CDC). Each row change becomes a Kafka message with the before-image, after-image, operation type, and a position marker (LSN, GTID, log sequence).
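
A Postgres change event for an UPDATE looks roughly like this (a simplified sketch — the real envelope carries more source metadata, and the exact fields vary by connector). `op` is `c`/`u`/`d` for insert/update/delete and `r` for a snapshot read; `source.lsn` is the position marker, usable as a resume point and a dedupe key:

```json
{
  "before": { "id": "p-42", "status": "PENDING",   "amount": 100.00 },
  "after":  { "id": "p-42", "status": "SUBMITTED", "amount": 100.00 },
  "op": "u",
  "source": {
    "connector": "postgresql",
    "lsn": 24023128,
    "ts_ms": 1700000000000
  },
  "ts_ms": 1700000000123
}
```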

Transactional outbox

The application writes the business row and the event into the same transaction. An outbox table holds events; CDC ships them to Kafka. If the transaction commits, both are present; if it rolls back, neither is. Atomicity solved.

outbox-schema.sql
CREATE TABLE outbox (
  id            uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_id  varchar(64) NOT NULL,
  aggregate_type varchar(64) NOT NULL,
  event_type    varchar(64) NOT NULL,
  payload       jsonb       NOT NULL,
  created_at    timestamptz NOT NULL DEFAULT now()
);

-- Debezium — using its outbox event router SMT — expects:
--   id       → message id (deduplication key downstream)
--   aggregate_id   → Kafka message key (partitioning + ordering)
--   aggregate_type → topic suffix (one outbox → many topics)
--   event_type     → header for filtering
--   payload  → message value
PaymentService.java
@Transactional
public void submit(Payment p) {
  paymentRepo.save(p);                     // business row
  outboxRepo.save(new OutboxEvent(    // event in same txn
      p.getId(),                           // aggregate_id
      "Payment",                           // aggregate_type → payment topic
      "PaymentSubmitted",                  // event_type
      Json.toJson(p)));                    // payload
  // commit happens at method exit; both rows appear or neither does
}

Debezium reads the outbox table’s WAL entries, applies the Outbox Event Router SMT (Single Message Transform) to route by aggregate_type, and produces to Payment.events with the aggregate_id as message key. The original outbox row can be deleted asynchronously — the WAL entry is what carried the event.
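
Cleanup can be as simple as a periodic delete; a sketch, assuming the outbox schema above (the retention window is arbitrary — some teams instead DELETE the row in the same transaction that inserts it, since Debezium reads the WAL insert either way):

```sql
-- Periodic outbox cleanup: the WAL already carried the event, so retained
-- rows are only useful for debugging. Run off-peak; assumes an index on created_at.
DELETE FROM outbox
WHERE created_at < now() - interval '7 days';
```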

Deploying CDC

  1. Enable logical replication on the source

    Postgres: wal_level=logical, a REPLICATION user, a publication, and a replication slot. MySQL: row-format binlog. Db2: DATA CAPTURE CHANGES.

  2. Decide the snapshot strategy

    Initial snapshot of existing rows is the heaviest part of CDC startup. Use initial for the first deployment; schema_only when historical state isn’t needed; incremental snapshots (triggered at runtime via Debezium’s signalling table) for tables too large to read in one go.

  3. Configure the connector

    Filter to only the tables CDC actually needs (otherwise you’re shipping every internal table’s changes). Pick the right SMT: the Outbox Event Router for outbox tables, or ExtractNewRecordState (the “unwrap” transform) to flatten the change envelope for plain table streams.

  4. Topic conventions

    One topic per table by default. With outbox, route by aggregate_type → one topic per business event class (Payment.events, Customer.events).

  5. Consumer-side dedupe

    CDC is at-least-once. Use the source LSN or outbox row id as the idempotency key downstream.

  6. Wire alerts

    Connector lag (LSN behind), restart count, snapshot duration, slot disk usage. A slot with no consumer retains WAL forever and fills the source DB’s disk — this is the single most common Debezium production incident.
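
For Postgres, steps 1 and 6 reduce to a handful of statements. A sketch — role, publication, and table names match the connector configuration in the next section; the password is a placeholder:

```sql
-- Step 1: enable logical replication (wal_level needs a server restart)
ALTER SYSTEM SET wal_level = 'logical';
CREATE ROLE debezium WITH LOGIN REPLICATION PASSWORD 'change-me';
GRANT SELECT ON public.outbox TO debezium;
CREATE PUBLICATION dbz_payments FOR TABLE public.outbox;

-- Step 6: monitoring query — alert when a slot retains too much WAL
-- or shows active = false (no connected consumer)
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;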

Connector configuration

payments-connector.json
{
  "name": "payments-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "payments-db.svc.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/db.pwd:db.pwd}",
    "database.dbname": "payments",
    "database.server.name": "payments",

    "plugin.name": "pgoutput",
    "slot.name": "debezium_payments",
    "publication.name": "dbz_payments",

    "table.include.list": "public.outbox",
    "snapshot.mode": "never",

    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.route.topic.replacement": "${routedByValue}.events",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.table.field.event.id": "id",

    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",

    "heartbeat.interval.ms": "10000",
    "heartbeat.action.query": "INSERT INTO heartbeat (ts) VALUES (now()) ON CONFLICT (id) DO UPDATE SET ts = now()"
  }
}

The heartbeat.action.query is the silent hero. If the captured tables are idle, Debezium has nothing to acknowledge, so the slot’s restart LSN stands still while other activity on the server keeps generating WAL; the retained WAL can fill the source DB’s disk in days. The heartbeat row gives Debezium a change to consume and confirm, so the slot keeps advancing.

Idempotent consumer

CDC delivers at-least-once. The consumer must dedupe. The simplest reliable pattern: a processed-events table, keyed by the outbox event id, written in the same transaction as the side effect.

SearchIndexer.java
@KafkaListener(topics = "Payment.events")
public void onEvent(ConsumerRecord<String, Payment> r,
                    Acknowledgment ack) {

  // Header value is raw bytes; decode before parsing (toString on byte[] is a bug)
  UUID eventId = UUID.fromString(
      new String(r.headers().lastHeader("id").value(), StandardCharsets.UTF_8));

  int inserted = jdbc.update(
      "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT DO NOTHING",
      eventId);
  if (inserted == 0) { ack.acknowledge(); return; }      // duplicate

  searchClient.index(r.value());
  ack.acknowledge();
}

Schema drift

The source DDL changes — a new column, a renamed field, a changed type. CDC keeps streaming; consumers may break silently if their deserialiser is strict. Three rules:

  • Use Schema Registry with BACKWARD compatibility. New columns become optional fields with defaults. Removing a column requires explicit consumer migration first.
  • Pin column types in the connector. Don’t let CDC infer; declare the schema in the connector configuration so a type change is a CI failure, not a runtime surprise.
  • Always alert on schema-evolution events. Debezium emits a schema-change topic; route it to the platform team for review.

Reconciliation

Reconciliation is the audit that catches what CDC missed: lost events from a slot reset, applied-twice events from a manual replay, downstream divergence from a bad consumer.

Pattern: a daily job hashes a window of source rows and the corresponding downstream rows; any mismatch is an alert. Don’t skip this. CDC alone is not an audit story to a regulator.
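
A minimal version of the hashing job, assuming a payments table with id and amount columns and a matching downstream copy (column list and window are illustrative):

```sql
-- Hash yesterday's window. Run the same query against source and target,
-- compare the two digests, and alert on any mismatch.
SELECT md5(string_agg(id::text || ':' || amount::text, ',' ORDER BY id)) AS window_hash
FROM payments
WHERE created_at >= date_trunc('day', now() - interval '1 day')
  AND created_at <  date_trunc('day', now());
```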

Common pitfalls

Replication slot disk fill

A Debezium slot with no consumer or a stuck consumer retains WAL indefinitely. Source DB disk fills, write stops, application down. Always alert on slot LSN lag; always have a runbook to drop the slot if Debezium is being decommissioned.

Snapshot under load

Initial snapshot of a 500M-row table at peak traffic locks for hours. Run the initial snapshot during a maintenance window with an incremental strategy or off a read replica.
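
With Debezium, the incremental path is driven through a signalling table: the connector watches it (via signal.data.collection in the connector config) and reads the target table in chunks interleaved with live changes, with no long-held lock. A sketch — table and id values are illustrative:

```sql
-- Signalling table Debezium polls for ad-hoc commands
CREATE TABLE debezium_signal (
  id   varchar(64) PRIMARY KEY,
  type varchar(32) NOT NULL,
  data varchar(2048)
);

-- Trigger an incremental snapshot of one large table
INSERT INTO debezium_signal (id, type, data)
VALUES ('snap-001', 'execute-snapshot',
        '{"data-collections": ["public.payments"]}');
```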

Outbox without an index

The outbox grows; without an index on created_at a clean-up job locks the table. Index the column you’ll use to delete by, and partition by month if volume is high.
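
A sketch of both measures for the outbox schema above (monthly ranges are an assumption — size them to your volume; note a partitioned table’s primary key must include the partition column, so the uuid alone can no longer be the PK):

```sql
-- Index the column cleanup deletes by; CONCURRENTLY avoids blocking writes
CREATE INDEX CONCURRENTLY idx_outbox_created_at ON outbox (created_at);

-- High volume: range-partition by month so cleanup is a metadata-only DROP
CREATE TABLE outbox_partitioned (
  id             uuid        NOT NULL DEFAULT gen_random_uuid(),
  aggregate_id   varchar(64) NOT NULL,
  aggregate_type varchar(64) NOT NULL,
  event_type     varchar(64) NOT NULL,
  payload        jsonb       NOT NULL,
  created_at     timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

CREATE TABLE outbox_2024_01 PARTITION OF outbox_partitioned
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- cleanup: DROP TABLE outbox_2023_12;
```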

CDC is downstream of the database, not the application

Database changes made directly via SQL bypass application code — including any application-level outbox writes. CDC catches the row change but not the “event” the application would have emitted. Treat direct SQL writes as out of bounds, or use the table-CDC pattern instead of outbox for those flows.

When not to use CDC

  • The source DB is read-only or rarely changes. A nightly batch export is simpler.
  • The downstream needs joins across multiple tables. CDC ships per-table changes; reconstructing joins is hard. Consider materialised views or stream processing instead.
  • The DB has no replication log. Some legacy systems and proprietary platforms don’t expose one. Application-level outbox or polling is the only option.
  • You need synchronous propagation. CDC is async by definition. For strong cross-system consistency, use a database-level distributed transaction or co-locate the systems.