Overview
Event-driven architecture (EDA) is an architectural style in which state changes are published as discrete, immutable facts and consumers react to them asynchronously. In financial services it underpins three classes of system: real-time payments (faster payments, instant transfers), trading and risk (order events, position updates), and core banking modernisation (CDC streams from a system of record into downstream products).
Two brokers dominate this space: Apache Kafka for high-throughput, replayable event streams, and IBM MQ for assured, once-only message delivery with strong transactional semantics. They solve different problems and most banks run both.
An event is a fact about something that happened (PaymentSubmitted). It has no intended recipient and may be consumed many times. A message is a command or request directed at a specific consumer (SettleTransfer). Kafka is optimised for events; MQ is optimised for messages. Mixing the two in the same broker is the most common architectural mistake in this space.
Kafka vs MQ
The choice is not philosophical — it is determined by the durability model, consumer pattern, and replay requirement.
| Property | Kafka | IBM MQ |
|---|---|---|
| Storage model | Append-only log, retained for hours/days | Queue, message removed on consume |
| Consumer model | Pull, offset-based, multiple groups | Push or pull, one consumer per message |
| Replay | Native (rewind to any offset) | Not supported; need a separate audit store |
| Throughput | 100k+ msg/sec per broker | 10–30k msg/sec per queue manager |
| Transactions | Idempotent producer, EOS | XA, two-phase commit |
| Ordering | Per partition | Per queue (FIFO) |
| Best fit | Event sourcing, CDC, analytics streams | Command processing, point-to-point integration |
Topology
A production Kafka cluster has three logical layers: brokers (the data plane), the metadata service (KRaft, which replaces ZooKeeper; ZooKeeper mode is deprecated from Kafka 3.5 and removed in 4.0), and the surrounding ecosystem: Schema Registry, Kafka Connect, and Streams applications.
Event flow
Every event in a well-designed system traverses six stages. Understanding which stage handles which concern is what separates a system that scales from one that collapses under its own retries.
Building a flow
Getting from an empty cluster to a production event flow takes five steps. Topic provisioning, schema registration, and connector deployment should all be GitOps-managed; clicking through a UI is how production becomes irreproducible.
- Define the topic. Pick partitions for parallelism (16–64 is typical), a replication factor of 3 with `min.insync.replicas=2`, and a retention policy that matches the consumer SLA. Compaction is a separate decision: only compact topics that represent the latest state of a key.
- Register the schema. Avro or Protobuf both work. Register the schema in Schema Registry with `BACKWARD` compatibility for events that consumers must keep reading; `FULL` for canonical contracts. Never use `NONE` in production.
- Build the producer. Configure `acks=all`, `enable.idempotence=true`, and a deterministic key (account id, customer id, trade id). The key controls partitioning and therefore ordering; pick it once and never change it. A producer sketch follows this list.
- Build the consumer. Manual offset commit, idempotent processing logic, a structured DLQ on permanent failure, and a poll-loop pause/resume around backpressure. Always commit offsets after the side effect succeeds, never before.
- Wire observability. Lag metrics per consumer group, throughput per topic, request rate per producer, ISR shrink/expand events. Without lag alerts, a slow consumer silently rots until the disk fills. A lag-check sketch also follows this list.
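A minimal sketch of the producer configuration described above, assuming the payments.submitted.v1 topic from the provisioning example below, Confluent's Avro serializer, and a PaymentSubmitted class generated from the schema later in this article; the event variable and registry URL are illustrative:
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for the in-sync replicas
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");  // broker de-duplicates producer retries
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
props.put("schema.registry.url", "http://schema-registry:8081");

try (Producer<String, PaymentSubmitted> producer = new KafkaProducer<>(props)) {
    // Deterministic key: the same originator account always maps to the same
    // partition, which is what preserves per-account ordering.
    ProducerRecord<String, PaymentSubmitted> rec =
        new ProducerRecord<>("payments.submitted.v1", event.getOriginator(), event);
    producer.send(rec, (metadata, ex) -> {
        if (ex != null) log.error("publish failed for {}", event.getPaymentId(), ex);
    });
}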
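And a sketch of the lag check from the observability step, using the Kafka AdminClient; the group id reuses the risk-engine consumer from the idempotency example. In practice this is usually delegated to the kafka-consumer-groups CLI, Burrow, or exported JMX metrics rather than hand-rolled:
static void printConsumerLag() throws Exception {
    Map<String, Object> conf = Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    try (Admin admin = Admin.create(conf)) {
        // Offsets the group has committed...
        Map<TopicPartition, OffsetAndMetadata> committed =
            admin.listConsumerGroupOffsets("risk-engine").partitionsToOffsetAndMetadata().get();

        // ...versus the current end of each partition.
        Map<TopicPartition, OffsetSpec> latest = new HashMap<>();
        committed.keySet().forEach(tp -> latest.put(tp, OffsetSpec.latest()));
        var endOffsets = admin.listOffsets(latest).all().get();

        committed.forEach((tp, committedOffset) -> {
            long lag = endOffsets.get(tp).offset() - committedOffset.offset();
            System.out.printf("%s lag=%d%n", tp, lag); // alert when this grows without recovering
        });
    }
}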
Topic provisioning
# 32 partitions, RF=3, min.isr=2, 7-day retention
kafka-topics --bootstrap-server kafka:9092 --create \
--topic payments.submitted.v1 \
--partitions 32 \
--replication-factor 3 \
--config min.insync.replicas=2 \
--config retention.ms=604800000 \
--config compression.type=lz4 \
--config cleanup.policy=delete
Schemas & contracts
The schema is the API. In a synchronous service the API contract is the OpenAPI spec; in event-driven systems it is the schema registered against a topic. Schema evolution rules are the equivalent of versioning rules for a REST API — just enforced earlier.
{
  "type": "record",
  "name": "PaymentSubmitted",
  "namespace": "com.acmebank.payments.events",
  "fields": [
    { "name": "paymentId", "type": "string" },
    { "name": "originator", "type": "string" },
    { "name": "beneficiary", "type": "string" },
    { "name": "amount",
      "type": { "type": "bytes", "logicalType": "decimal",
                "precision": 18, "scale": 2 } },
    { "name": "currency", "type": "string" },
    { "name": "submittedAt", "type": { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "reference", "type": ["null", "string"], "default": null },
    { "name": "channel", "type": "string", "default": "unknown",
      "doc": "added in v2 with a default; backward compatible" }
  ]
}
Three rules to enforce in CI:
- No removed fields, even though `BACKWARD` mode allows them. Once a field is published, consumers depend on it.
- No new required fields. New fields must carry a default; that is the difference between an additive change and a breaking one.
- Use `logicalType: decimal` for money, never `double`. Floating point in financial events is an audit issue waiting to happen.
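These rules can be checked mechanically before a schema ever reaches the registry. A sketch using Avro's built-in compatibility checker; the .avsc paths and the surrounding test harness are assumptions, not part of the article's pipeline:
// Run as a unit test in CI; fail the build on a breaking change.
static void assertBackwardCompatible() throws IOException {
    Schema previous = new Schema.Parser().parse(new File("schemas/payment_submitted_v1.avsc"));
    Schema proposed = new Schema.Parser().parse(new File("schemas/payment_submitted_v2.avsc"));

    // BACKWARD: the proposed (reader) schema must still decode data written with the previous one.
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(proposed, previous);

    if (result.getType() != SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE) {
        throw new AssertionError("Breaking schema change: " + result.getDescription());
    }
}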
Idempotent consumers
At-least-once delivery is the default for both Kafka and MQ in any realistic configuration. The consumer must therefore tolerate duplicates. The standard pattern is a processed-events table keyed by event id, written in the same transaction as the side effect.
@KafkaListener(topics = "payments.submitted.v1", groupId = "risk-engine")
public void onPayment(ConsumerRecord<String, PaymentSubmitted> record,
                      Acknowledgment ack) {
    // Business-level event id: stable across redeliveries and duplicate publishes.
    String eventId = record.value().getPaymentId();
    jdbc.execute(conn -> {
        // Single transaction: dedupe + side effect
        try {
            conn.update("INSERT INTO processed_events(event_id) VALUES (?)", eventId);
        } catch (DuplicateKeyException e) {
            log.info("skip duplicate {}", eventId);
            return; // already processed; fall through to the offset commit
        }
        riskService.scorePayment(record.value());
    });
    ack.acknowledge(); // commit offset only after DB commit succeeds
}
Auto-commit (enable.auto.commit=true) advances the offset on a timer regardless of whether processing succeeded. A consumer crash after commit but before processing means silent data loss. Always use manual commit, always commit after the side effect persists, never before.
Exactly-once
True exactly-once across two systems requires either two-phase commit (slow, blocking, generally avoided in modern systems) or the transactional outbox pattern. Kafka’s exactly-once semantics (EOS) only cover Kafka-to-Kafka flows — produce, consume, and produce-to-output-topic in one transaction.
# Kafka Streams exactly-once configuration
application.id=risk-scoring
processing.guarantee=exactly_once_v2
producer.acks=all
producer.enable.idempotence=true
producer.transaction.timeout.ms=60000
consumer.isolation.level=read_committed
state.dir=/var/lib/kafka-streams
For Kafka-to-database flows (the common case), the transactional outbox is the correct pattern: write the event into an outbox table inside the same transaction as the business write, then a CDC connector (Debezium) ships outbox rows to Kafka. This is covered in detail in the Data Synchronisation article.
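A minimal sketch of the outbox write, assuming a Spring TransactionTemplate (transactions), a JdbcTemplate (jdbc), an illustrative payment/outbox table layout, and a hypothetical toJson helper; the Debezium connector itself is configuration rather than code:
// Business write and event write in one local transaction; Debezium tails the
// outbox table and publishes each row to Kafka. Delivery downstream is still
// at-least-once, so consumers keep deduplicating on the event id.
transactions.executeWithoutResult(status -> {
    jdbc.update(
        "INSERT INTO payment (id, originator, beneficiary, amount, currency) VALUES (?, ?, ?, ?, ?)",
        p.getPaymentId(), p.getOriginator(), p.getBeneficiary(), p.getAmount(), p.getCurrency());
    jdbc.update(
        "INSERT INTO outbox (id, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?)",
        UUID.randomUUID().toString(), p.getPaymentId(), "PaymentSubmitted", toJson(p));
});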
HA & DR
Production Kafka clusters in financial services typically run three brokers per data centre across two data centres. Each cluster stays within a single DC; replication across DCs uses MirrorMaker 2 or Cluster Linking for active/passive DR.
Three rules:
- Replicate within an AZ-aware cluster, not across DCs at the broker level. Stretching a cluster across data centres makes `min.insync.replicas` sensitive to inter-DC latency; one network hiccup and producers stall.
- Consumer offsets are not free in DR. MM2 translates offsets via a checkpoint topic, but failover still requires a deliberate offset reset. Build the runbook before you need it; a sketch follows this list.
- Schema Registry is a SPOF if you ignore it. Run it as a cluster, replicate the `_schemas` topic, and put the registry behind a load balancer with failover.
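A sketch of that deliberate offset step, assuming MM2's checkpoint topics are replicated to the DR cluster and the connect-mirror-client library is on the classpath; the cluster alias, addresses, and group id are illustrative:
static void seedDrOffsets() throws Exception {
    Map<String, Object> dr = Map.of("bootstrap.servers", "kafka-dr:9092");

    // Translate the group's committed offsets from the primary cluster ("dc1")
    // into their equivalents on the DR cluster, using MM2's checkpoint topic.
    Map<TopicPartition, OffsetAndMetadata> translated =
        RemoteClusterUtils.translateOffsets(dr, "dc1", "risk-engine", Duration.ofSeconds(30));

    // Seed the consumer group on the DR cluster before any consumer starts there.
    try (Admin admin = Admin.create(dr)) {
        admin.alterConsumerGroupOffsets("risk-engine", translated).all().get();
    }
}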
Common pitfalls
If 80% of traffic is for one customer (the bank’s biggest corporate client), keying by customerId sends 80% of events to one partition — one consumer thread — one CPU. Use a composite key (customerId + transactionId) when distribution matters more than per-customer ordering.
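A sketch of the composite key; the accessors and the producer instance are illustrative:
// Spreads a hot customer's events across partitions. Ordering is now guaranteed
// only per transaction, not per customer; that is the trade being made.
String key = payment.getCustomerId() + ":" + payment.getTransactionId();
producer.send(new ProducerRecord<>("payments.submitted.v1", key, payment));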
Default Kafka consumer auto-commits offsets every 5 seconds. A crash after commit but before processing means silent message loss. Disable auto-commit. Always.
One topic per microservice per event type sounds clean until you have 4 000 topics and the metadata refresh time on every broker is in the seconds. Group related events into a single topic with a type field; let consumers filter.
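A sketch of consumer-side filtering on a shared topic, assuming the producer stamps each record with an event-type header (an illustrative convention) and records is the batch returned by poll():
for (ConsumerRecord<String, byte[]> rec : records) {
    Header typeHeader = rec.headers().lastHeader("event-type");
    String type = typeHeader == null ? "" : new String(typeHeader.value(), StandardCharsets.UTF_8);
    if (!"PaymentSubmitted".equals(type)) {
        continue; // not this consumer's event type; skip without deserialising the payload
    }
    handlePaymentSubmitted(rec.value()); // hypothetical handler
}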
Hard-failing producer startup when Schema Registry is unreachable means a Schema Registry incident becomes a producer-side incident. Use a local schema cache and tolerate brief Registry unavailability.
When not to use
Event-driven architecture is not a default. It’s a deliberate trade: lower coupling and replay capability in exchange for harder debugging, eventual consistency, and operational complexity.
- Synchronous request/response with a single answer. A REST call is simpler. Don’t put a queue between two services that always need to talk to each other in lockstep.
- Strong cross-aggregate consistency. If two writes must succeed or fail together and the business cannot tolerate any window of inconsistency, use a database transaction, not events.
- Low traffic, low criticality. Running Kafka for 100 messages a day is overkill. Use a simpler queue (RabbitMQ, SQS, or even a database table) until volume justifies the operational cost.
- Audit-first command flows. When the business needs every command to be acknowledged synchronously and traced individually, IBM MQ with XA still beats Kafka. Kafka is a stream; MQ is a transaction.
Where event-driven wins: when the same fact must reach multiple consumers, when replay is a regulatory requirement, when downstream systems evolve at different speeds, and when peak throughput exceeds what a synchronous call chain can absorb. That’s most modern banking platforms — but not all of them.