Overview

Event-driven architecture (EDA) is the architectural style where state changes are published as discrete, immutable facts and consumers react asynchronously. In financial services it underpins three classes of system: real-time payments (faster payments, instant transfers), trading and risk (order events, position updates), and core banking modernisation (CDC streams from a system of record into downstream products).

Two brokers dominate this space: Apache Kafka for high-throughput, replayable event streams, and IBM MQ for guaranteed-once messaging with strong transactional semantics. They solve different problems and most banks run both.

Event vs Message

An event is a fact about something that happened (PaymentSubmitted). It has no intended recipient and may be consumed many times. A message is a command or request directed at a specific consumer (SettleTransfer). Kafka is optimised for events; MQ is optimised for messages. Mixing the two in the same broker is the most common architectural mistake in this space.

Kafka vs MQ

The choice is not philosophical — it is determined by the durability model, consumer pattern, and replay requirement.

Property | Kafka | IBM MQ
Storage model | Append-only log, retained for hours/days | Queue, message removed on consume
Consumer model | Pull, offset-based, multiple groups | Push or pull, one consumer per message
Replay | Native (rewind to any offset) | Not supported; needs a separate audit store
Throughput | 100k+ msg/sec per broker | 10–30k msg/sec per queue manager
Transactions | Idempotent producer, EOS | XA, two-phase commit
Ordering | Per partition | Per queue (FIFO)
Best fit | Event sourcing, CDC, analytics streams | Command processing, point-to-point integration

Topology

A production Kafka cluster has three logical layers: brokers (the data plane), the metadata service (KRaft, which replaces ZooKeeper; production-ready since Kafka 3.3, with ZooKeeper removed entirely in 4.0), and the surrounding ecosystem — Schema Registry, Kafka Connect, and Streams applications.

Event flow

Every event in a well-designed system traverses six stages. Understanding which stage handles which concern is what separates a system that scales from one that collapses under its own retries.

Building a flow

From an empty cluster to a production event flow runs through five steps. Topic provisioning, schema registration, and connector deployment should all be GitOps‑managed — clicking through a UI is how production becomes irreproducible.

  1. Define the topic

    Pick partitions for parallelism (16–64 is typical), a replication factor of 3 with min.insync.replicas=2, and a retention policy that matches the consumer SLA. Compaction is a separate decision — only compact topics that represent the latest state of a key.

  2. Register the schema

    Avro or Protobuf — both work. Register the schema in Schema Registry with BACKWARD compatibility for events that consumers must keep reading; FULL for canonical contracts. Never use NONE in production.

  3. Build the producer

    Configure acks=all, enable.idempotence=true, and a deterministic key (account id, customer id, trade id). The key controls partitioning and therefore ordering; pick it once and never change it.

  4. Build the consumer

    Manual offset commit, idempotent processing logic, a structured DLQ for permanent failures, and pause/resume in the poll loop to handle backpressure. Always commit offsets after the side effect succeeds, never before.

  5. Wire observability

    Lag metrics per consumer group, throughput per topic, request rate per producer, ISR shrink/expand events. Without lag alerts, a slow consumer silently rots until disk fills.
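The producer settings from step 3 can be sketched as a plain config map. This is a sketch, not the full client setup: the string keys match the constants in kafka-clients' ProducerConfig, and the Avro serializer class name assumes the Confluent serializer is on the classpath.

```java
import java.util.Properties;

public class ProducerSettings {
    // Sketch of the step-3 settings. String keys are used so the example
    // compiles without kafka-clients; they match the constants in
    // org.apache.kafka.clients.producer.ProducerConfig.
    public static Properties durableProducer(String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("acks", "all");                // wait for all in-sync replicas
        p.setProperty("enable.idempotence", "true"); // broker dedupes producer retries
        p.setProperty("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        p.setProperty("value.serializer",
            "io.confluent.kafka.serializers.KafkaAvroSerializer");
        return p;
    }
}
```

With enable.idempotence=true the producer attaches sequence numbers to each batch, so a broker-side retry cannot create a duplicate; acks=all makes each write wait for min.insync.replicas acknowledgements.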
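For step 5, consumer lag is visible from the stock CLI before any dashboard exists (bootstrap address and group name follow the examples in this article):

```bash
# Shows current offset, log-end offset, and lag per partition for the group
kafka-consumer-groups --bootstrap-server kafka:9092 \
  --describe --group risk-engine
```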

Topic provisioning

create-topic.sh
# 32 partitions, RF=3, min.isr=2, 7-day retention
kafka-topics --bootstrap-server kafka:9092 --create \
  --topic payments.submitted.v1 \
  --partitions 32 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000 \
  --config compression.type=lz4 \
  --config cleanup.policy=delete

Schemas & contracts

The schema is the API. In a synchronous service the API contract is the OpenAPI spec; in event-driven systems it is the schema registered against a topic. Schema evolution rules are the equivalent of versioning rules for a REST API — just enforced earlier.

payment-submitted-v2.avsc
{
  "type": "record",
  "name": "PaymentSubmitted",
  "namespace": "com.acmebank.payments.events",
  "fields": [
    { "name": "paymentId",    "type": "string" },
    { "name": "originator",   "type": "string" },
    { "name": "beneficiary",  "type": "string" },
    { "name": "amount",
      "type": { "type": "bytes", "logicalType": "decimal",
                "precision": 18, "scale": 2 } },
    { "name": "currency",     "type": "string" },
    { "name": "submittedAt",  "type": { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "reference",    "type": ["null", "string"], "default": null },
    { "name": "channel",      "type": "string", "default": "unknown",
      "doc": "added in v2 with default; backward compatible" }
  ]
}

Three rules to enforce in CI:

  • No removed fields in BACKWARD mode. Once a field is published, consumers depend on it.
  • No required fields added. New fields must have a default. This is the difference between an additive change and a breaking change.
  • Use logicalType: decimal for money, never double. Floating point in financial events is an audit issue waiting to happen.
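Both the compatibility mode and the CI gate are plain Schema Registry REST calls. This is a sketch: the registry URL and schema file name are illustrative, and the subject names assume the default TopicNameStrategy (topic name plus "-value").

```bash
# Pin the subject's compatibility mode (once, at provisioning time)
curl -s -X PUT http://schema-registry:8081/config/payments.submitted.v1-value \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"compatibility": "BACKWARD"}'

# CI gate: ask the registry whether the candidate schema is compatible
# with the latest registered version; fail the build on "is_compatible": false
SCHEMA=$(jq -c 'tojson' payment-submitted-v2.avsc)
curl -s -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d "{\"schema\": ${SCHEMA}}" \
  http://schema-registry:8081/compatibility/subjects/payments.submitted.v1-value/versions/latest
```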

Idempotent consumers

At-least-once delivery is the default for both Kafka and MQ in any realistic configuration. The consumer must therefore tolerate duplicates. The standard pattern is a processed-events table keyed by event id, written in the same transaction as the side effect.

PaymentEventConsumer.java
@KafkaListener(topics = "payments.submitted.v1", groupId = "risk-engine")
public void onPayment(ConsumerRecord<String, PaymentSubmitted> record,
                      Acknowledgment ack) {

  // Dedupe on the business id, not the offset: a producer duplicate
  // arrives at a new offset and must still be skipped
  String eventId = record.value().getPaymentId();

  // Single DB transaction: dedupe insert and side effect commit together
  transactionTemplate.executeWithoutResult(status -> {
    try {
      jdbc.update("INSERT INTO processed_events(event_id) VALUES (?)", eventId);
    } catch (DuplicateKeyException e) {
      log.info("skip duplicate {}", eventId);
      return;  // nothing written, nothing to commit
    }
    riskService.scorePayment(record.value());
  });

  ack.acknowledge();  // commit offset only after DB commit succeeds
}
Don’t commit before processing

Auto-commit (enable.auto.commit=true) advances the offset on a timer regardless of whether processing succeeded. A consumer crash after commit but before processing means silent data loss. Always use manual commit, always commit after the side effect persists, never before.

Exactly-once

True exactly-once across two systems requires either two-phase commit (slow, blocking, generally avoided in modern systems) or the transactional outbox pattern. Kafka’s exactly-once semantics (EOS) only cover Kafka-to-Kafka flows: consume, process, and produce to the output topic in one transaction.

kafka-streams-eos.properties
# Kafka Streams exactly-once configuration
application.id=risk-scoring
processing.guarantee=exactly_once_v2
producer.acks=all
producer.enable.idempotence=true
producer.transaction.timeout.ms=60000
consumer.isolation.level=read_committed
state.dir=/var/lib/kafka-streams

For Kafka-to-database flows (the common case), the transactional outbox is the correct pattern: write the event into an outbox table inside the same transaction as the business write, then a CDC connector (Debezium) ships outbox rows to Kafka. This is covered in detail in the Data Synchronisation article.
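A minimal sketch of the Debezium side of the outbox pattern: the EventRouter SMT reads committed outbox rows from the CDC stream and routes each one to a Kafka topic. The connector name, database coordinates, and table name here are assumptions, and a real deployment needs credentials and (on Debezium 2.x) a topic.prefix.

```json
{
  "name": "payments-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "payments-db",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "payments",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type"
  }
}
```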

HA & DR

Production Kafka clusters in financial services run with three brokers per data centre and two data centres. Each cluster stays within a single DC; cross-DC replication uses MirrorMaker 2 or Cluster Linking for active/passive DR.

Three rules:

  • Replicate within an AZ-aware cluster, not across DCs at the broker level. Stretching a cluster across DCs makes min.isr sensitive to inter-DC latency — one network hiccup and producers stall.
  • Consumer offsets are not free in DR. MM2 translates offsets via a checkpoint topic, but failover still requires deliberate offset reset. Build the runbook before you need it.
  • Schema Registry is a SPOF if you ignore it. Run it in a cluster, replicate the _schemas topic, and put the registry behind a load balancer with failover.
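An active/passive MM2 topology is a few lines of the properties file that connect-mirror-maker.sh reads. This is a sketch under assumed names: cluster aliases and bootstrap addresses are illustrative.

```properties
clusters = primary, dr
primary.bootstrap.servers = kafka-dc1:9092
dr.bootstrap.servers = kafka-dc2:9092

# Replicate payment topics one way, plus the checkpoints needed
# for consumer offset translation at failover
primary->dr.enabled = true
primary->dr.topics = payments\..*
primary->dr.emit.checkpoints.enabled = true
dr->primary.enabled = false
```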

Common pitfalls

Hot partition by key choice

If 80% of traffic is for one customer (the bank’s biggest corporate client), keying by customerId sends 80% of events to one partition — one consumer thread — one CPU. Use a composite key (customerId + transactionId) when distribution matters more than per-customer ordering.
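The effect is easy to demonstrate. Illustrative only: Kafka's default partitioner hashes keys with murmur2, not String.hashCode, but the skew mechanics are identical.

```java
import java.util.Arrays;

public class PartitionSkew {
    // Stand-in for the default partitioner: hash the key, mod partition count
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 8;
        int[] byCustomer = new int[partitions];
        int[] byComposite = new int[partitions];
        for (int i = 0; i < 1000; i++) {
            // 80% of traffic belongs to one corporate client
            String customerId = (i % 5 == 0) ? "cust-" + i : "BIGCORP";
            byCustomer[partitionFor(customerId, partitions)]++;
            byComposite[partitionFor(customerId + ":txn-" + i, partitions)]++;
        }
        System.out.println("keyed by customerId: " + Arrays.toString(byCustomer));
        System.out.println("composite key:       " + Arrays.toString(byComposite));
    }
}
```

Every BIGCORP event lands on a single partition under the plain key; the composite key spreads them across all eight while still keeping each transaction's events together.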

Auto-commit data loss

Default Kafka consumer auto-commits offsets every 5 seconds. A crash after commit but before processing means silent message loss. Disable auto-commit. Always.

Topic explosion

One topic per microservice per event type sounds clean until you have 4 000 topics and metadata refresh on every broker takes seconds. Group related events into a single topic with a type field; let consumers filter.
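With a shared topic, consumer-side filtering on the type field is a one-liner. The envelope shape here is an assumption for illustration; real payloads would carry the registered schema discussed above.

```java
import java.util.List;

public class EventTypeFilter {
    // Minimal envelope: one topic, many event types, discriminated by a field
    public record Envelope(String type, String payload) {}

    public static List<Envelope> ofType(List<Envelope> batch, String type) {
        return batch.stream().filter(e -> e.type().equals(type)).toList();
    }
}
```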

Schema Registry as deployment blocker

Hard-failing producer startup when Schema Registry is unreachable means a Schema Registry incident becomes a producer-side incident. Use a local schema cache and tolerate brief Registry unavailability.

When not to use

Event-driven architecture is not a default. It’s a deliberate trade: lower coupling and replay capability in exchange for harder debugging, eventual consistency, and operational complexity.

  • Synchronous request/response with a single answer. A REST call is simpler. Don’t put a queue between two services that always need to talk to each other in lockstep.
  • Strong cross-aggregate consistency. If two writes must succeed or fail together and the business cannot tolerate any window of inconsistency, use a database transaction, not events.
  • Low traffic, low criticality. Running Kafka for 100 messages a day is overkill. Use a simpler queue (RabbitMQ, SQS, or even a database table) until volume justifies the operational cost.
  • Audit-first command flows. When the business needs every command to be acknowledged synchronously and traced individually, IBM MQ with XA still beats Kafka. Kafka is a stream; MQ is a transaction.

Where event-driven wins: when the same fact must reach multiple consumers, when replay is a regulatory requirement, when downstream systems evolve at different speeds, and when peak throughput exceeds what a synchronous call chain can absorb. That’s most modern banking platforms — but not all of them.