Overview

Strimzi is the operator that turns OpenShift into a managed Kafka platform. You declare a Kafka custom resource; the operator reconciles brokers, ZooKeeper or KRaft controllers, listeners, certificates, users, topics, and connectors. In a regulated bank running on private (often air-gapped) OpenShift, this declarative model is what makes Kafka tractable: every change is a Git commit, every reconcile leaves an audit trail, and there is no console to click through that the regulator can’t reproduce.

This article covers the deployment as it actually runs in a regulated KSA bank: no internet egress, internal image registry, internal Keycloak as the IdP, internal CA for mTLS, FIPS-validated crypto, and SAMA-aligned audit logging. The patterns transfer to other private cloud and regulated environments — PSD2 banks in the EU, OSFI-regulated banks in Canada, federal-tier government clouds — but the examples are written from the bank perspective.

Air-gapped vs private

“Private OCP” in this article means a cluster with no public internet egress, internal image registry mirror (Quay or Harbor), and IdP/CA on-prem. Strictly air-gapped clusters add one more layer (no in-cluster mirror; images delivered via signed bundles) but the Kafka deployment patterns are identical.

Strimzi vs AMQ Streams

The same code runs under two product names. Pick by your support contract, not by features.

Property      | Strimzi                                | Red Hat AMQ Streams
Source        | Upstream CNCF project                  | Productised Strimzi by Red Hat
Kafka version | Latest community Kafka                 | Curated, validated Kafka version
Support       | Community                              | Red Hat enterprise support, SLA
FIPS          | Depends on OCP / JVM                   | FIPS-validated build available
Channel       | OperatorHub.io / Helm                  | Red Hat OperatorHub on OCP
Best fit      | Self-supported teams, dev environments | Regulated production, audit-required environments

For everything that follows, the YAML and the patterns are identical — the article uses strimzi.io APIs because they apply to both. Where the AMQ Streams build differs (image references, support channels), I’ll call it out.

Components

Strimzi installs three operators. Each owns a distinct concern:

  • Cluster Operator — watches Kafka, KafkaConnect, KafkaMirrorMaker2, and related CRs; reconciles brokers, controllers, listeners, and certificates.
  • Topic Operator — reconciles KafkaTopic CRs against the topics that actually exist in the cluster.
  • User Operator — reconciles KafkaUser CRs into credentials and ACLs.

Two operational implications fall out of this:

  • The CR is the truth, not the cluster. A broker pod that doesn’t match the spec is a reconciliation candidate; the operator will replace it. Don’t edit broker configs in a running pod — the next reconcile will undo it.
  • Topics and users live as resources. Creating a topic via kafka-topics --create works once but isn’t reproducible. Always create them as KafkaTopic CRs through Git, so the same topic exists identically in dev, test, and prod.

Private OCP topology

The deployment shape on a private OCP cluster has five distinct namespaces and a careful boundary between data plane (Kafka brokers) and shared services (operators, IdP, monitoring).

The boundaries that matter:

  • One Kafka cluster, one namespace. Mixing two Kafka clusters in the same namespace is technically possible and operationally a mistake — certificate rotations, network policies, and quotas all become harder to reason about.
  • Application namespaces never run brokers. Producers and consumers connect to the bootstrap Service in kafka-prod; they don’t share a namespace with the cluster.
  • The IdP is a hard dependency. If Keycloak is down, OAuth-authenticated clients can’t connect. Run Keycloak in HA, replicate to the DR cluster, and consider a small set of mTLS-only system clients (Connect, MirrorMaker 2) as a fallback path.

Deployment

  1. Mirror images to the internal registry

    The Strimzi or AMQ Streams images live on quay.io / registry.redhat.io. In a private cluster they need to be mirrored to the internal Quay or Harbor first. oc image mirror handles this; the resulting ImageContentSourcePolicy lives in Git.

  2. Install the Cluster Operator

    For AMQ Streams: a Subscription against the Red Hat OperatorHub channel. For Strimzi upstream: a Helm chart or the bundled YAML manifest. Either way, install it cluster-scoped so it can watch all namespaces.

  3. Provision the storage class

    Brokers need durable, fast block storage. On bare-metal OCP that is typically OpenShift Data Foundation (Ceph RBD); on cloud OCP it’s vendor block (gp3, premium, SSD). Pre-create the StorageClass with volumeBindingMode: WaitForFirstConsumer so PVCs land in the right zone.

  4. Apply the Kafka custom resource

    The Kafka CR (next section) drives the whole cluster. The first apply takes 5–10 minutes — the operator generates CAs, issues per-broker certs, and rolls out brokers one at a time.

  5. Create topics and users via CRs

    KafkaTopic for each topic; KafkaUser for each app identity. Both reconcile through the Topic and User Operators. ACLs go on the KafkaUser.

  6. Wire NetworkPolicies

    Default-deny in the namespace; explicit ingress for app namespaces, monitoring, MirrorMaker 2; explicit egress to the IdP. This is the work that takes 80% of the deployment time and is the load-bearing security control.
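
The mirror in step 1 can be sketched as follows. The internal registry host (registry.acme-bank.internal) and the image tag are assumptions — substitute your registry and the operator version your subscription pins:

```yaml
# Run once from a connected bastion, then commit the ICSP to Git:
#   oc image mirror \
#     registry.redhat.io/amq-streams/strimzi-rhel9-operator:2.7.0 \
#     registry.acme-bank.internal/amq-streams/strimzi-rhel9-operator:2.7.0
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: amq-streams-mirror
spec:
  repositoryDigestMirrors:
    - source: registry.redhat.io/amq-streams    # repository referenced by manifests
      mirrors:
        - registry.acme-bank.internal/amq-streams   # repository nodes actually pull from
```

Note that repositoryDigestMirrors only applies to by-digest image references, which is what operator-managed deployments use.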

Kafka custom resource

One YAML drives the whole cluster. Below is what runs in production: KRaft (no ZooKeeper), three brokers across three zones, three listeners (internal mTLS for system clients, OAuth for applications, external mTLS for partners), JBOD storage, authorization, and Prometheus metrics.

kafka-prod.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: events
  namespace: kafka-prod
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 3.7.0
    metadataVersion: 3.7-IV4
    replicas: 3        # superseded by KafkaNodePool resources when node pools are enabled

    listeners:
      # 1. Internal mTLS listener for system clients (Service DNS only)
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication: { type: tls }

      # 2. OAuth listener — bearer JWT validated against internal Keycloak
      - name: oauth
        port: 9094
        type: internal
        tls: true
        authentication:
          type: oauth
          validIssuerUri: https://idp.acme-bank.svc/realms/prod
          jwksEndpointUri: https://idp.acme-bank.svc/realms/prod/protocol/openid-connect/certs
          userNameClaim: preferred_username
          checkAudience: true
          tlsTrustedCertificates:
            - { secretName: internal-ca-bundle, certificate: ca.crt }

      # 3. External listener via OpenShift Routes — mTLS only, partner systems
      - name: partner
        port: 9095
        type: route
        tls: true
        authentication: { type: tls }
        configuration:
          bootstrap: { host: events.api.acme-bank.com }
          brokers:
            - { broker: 0, host: events-0.api.acme-bank.com }
            - { broker: 1, host: events-1.api.acme-bank.com }
            - { broker: 2, host: events-2.api.acme-bank.com }

    authorization:
      type: simple
      superUsers:
        - CN=cluster-admin
        - CN=mirrormaker

    storage:
      type: jbod
      volumes:
        - { id: 0, type: persistent-claim, size: 500Gi, class: ocs-storagecluster-ceph-rbd, deleteClaim: false }
        - { id: 1, type: persistent-claim, size: 500Gi, class: ocs-storagecluster-ceph-rbd, deleteClaim: false }

    resources:
      requests: { cpu: 2, memory: 8Gi }
      limits:   { cpu: 4, memory: 12Gi }

    jvmOptions:
      -Xms: 6g
      -Xmx: 6g
      javaSystemProperties:
        - { name: javax.net.ssl.trustStoreType, value: PKCS12 }

    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      auto.create.topics.enable: "false"
      delete.topic.enable: "true"
      log.retention.hours: 168
      num.partitions: 12

    rack:
      topologyKey: topology.kubernetes.io/zone

    template:
      pod:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule

    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef: { name: kafka-metrics, key: kafka.yaml }

  entityOperator:
    topicOperator: {}
    userOperator: {}

The configuration that always pays off in production:

  • auto.create.topics.enable: false — topics must come from KafkaTopic CRs in Git. Auto-creation hides governance gaps.
  • min.insync.replicas: 2 with RF=3 — the standard durability tradeoff. A single broker outage is tolerated; two are not.
  • rack.topologyKey — tells Kafka to spread partition replicas across zones. Without it, all three replicas can land in one zone — an AZ outage takes the partition offline.
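
With strimzi.io/node-pools: enabled on the CR above, replica count and storage are owned by KafkaNodePool resources rather than spec.kafka. A matching dual-role pool might look like this (pool name assumed):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: dual-role
  namespace: kafka-prod
  labels:
    strimzi.io/cluster: events     # binds the pool to the Kafka CR
spec:
  replicas: 3
  roles:                           # KRaft: each node is both controller and broker
    - controller
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 500Gi
        class: ocs-storagecluster-ceph-rbd
        deleteClaim: false
```

At larger scale, split controller-only and broker-only pools so broker restarts never touch the KRaft quorum.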

Auth & mTLS

Three authentication paths, each for a different consumer class. Don’t blur them.

Listener              | Auth                                        | Used by
tls (9093)            | mTLS, internal CA                           | System clients: MirrorMaker 2, Kafka Connect, Cruise Control
oauth (9094)          | JWT bearer, Keycloak introspection or JWKS  | Bank applications — producers and consumers
partner (9095, route) | mTLS, partner CA bundle                     | External B2B partners and TPPs over OpenShift Routes

App identity via KafkaUser

For mTLS clients, an application’s Kafka identity is a KafkaUser CR: the User Operator issues a client certificate from the Clients CA and writes it into a Secret the application mounts. OAuth identities live in Keycloak as clients rather than KafkaUser resources — though a KafkaUser with the authentication section omitted can still carry the ACLs for an OAuth principal.

kafkauser-payments.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: payments-producer
  namespace: kafka-prod
  labels:
    strimzi.io/cluster: events
spec:
  authentication:
    type: tls          # issue a client cert from the Clients CA
  authorization:
    type: simple
    acls:
      - resource: { type: topic, name: payments., patternType: prefix }
        operations: [ Write, Describe ]
      - resource: { type: topic, name: payments.dlq, patternType: literal }
        operations: [ Write, Read, Describe ]
      - resource: { type: cluster, name: "" }
        operations: [ DescribeConfigs ]

The User Operator creates a Secret payments-producer containing ca.crt, user.crt, user.key, plus a PKCS12 bundle (user.p12) and its user.password. The application namespace gets a copy via External Secrets or a secret-sync controller; the application mounts it and configures the Kafka client.
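
On the consuming side, the mount is plain Kubernetes. A Deployment fragment — image, namespace, and mount path are hypothetical — that exposes the synced Secret to the client:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-producer
  namespace: payments
spec:
  replicas: 2
  selector:
    matchLabels: { app: payments-producer }
  template:
    metadata:
      labels: { app: payments-producer }
    spec:
      containers:
        - name: app
          image: registry.acme-bank.internal/payments/producer:1.4.2
          env:
            - name: KAFKA_BOOTSTRAP_SERVERS
              value: events-kafka-bootstrap.kafka-prod.svc:9093   # mTLS listener
          volumeMounts:
            - name: kafka-client-cert
              mountPath: /etc/kafka/certs      # user.crt / user.key / user.p12
              readOnly: true
      volumes:
        - name: kafka-client-cert
          secret:
            secretName: payments-producer      # synced copy of the KafkaUser Secret
```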

Topic governance via KafkaTopic

kafkatopic-payments-submitted.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: payments.submitted.v1
  namespace: kafka-prod
  labels:
    strimzi.io/cluster: events
spec:
  partitions: 32
  replicas: 3
  config:
    retention.ms: "604800000"      # 7 days
    min.insync.replicas: "2"
    cleanup.policy: "delete"
    compression.type: "lz4"

CA rotation is the silent operational task

Strimzi auto-renews the Cluster CA and Clients CA before they expire (default 365 days, renewal at 30 days remaining). The renewal triggers a rolling restart of every broker and every connected client that uses the cert chain. Schedule the renewal window deliberately; don’t let it surprise you on a Sunday.
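
Both the CA lifetimes and the window in which the operator is allowed to roll pods are settable on the Kafka CR. A fragment, with the Quartz cron expression as an example window:

```yaml
# Fragment of the Kafka CR (spec level) — CA lifetimes and restart window
spec:
  clusterCa:
    validityDays: 365          # CA certificate lifetime
    renewalDays: 60            # begin renewal 60 days before expiry
  clientsCa:
    validityDays: 365
    renewalDays: 60
  maintenanceTimeWindows:      # cert-driven rolling restarts happen only here
    - "* * 0-2 ? * SAT,SUN"    # Quartz cron: 00:00–03:00 on weekends
```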

Storage

Three storage decisions determine whether the cluster will hold up under load.

  • StorageClass with volumeBindingMode: WaitForFirstConsumer — ensures the PV is provisioned in the same zone the broker pod schedules to. With Immediate binding, the PV may be in zone A while the pod ends up in zone B; you discover this when the pod sticks in ContainerCreating because the volume can’t attach.
  • JBOD over single disk — gives Kafka multiple log directories. Loss of one disk doesn’t take the broker offline; replication recovers the affected partitions.
  • Quotas on the storage class — a runaway producer can fill 500 GB in hours. Set per-namespace storage quotas; alert at 70% disk utilisation per broker.
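
A StorageClass sketch for the ODF/Ceph RBD case — the provisioner name is standard for ODF; the clusterID and pool values are the defaults an ODF install generates and may differ in yours:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-storagecluster-ceph-rbd
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  pool: ocs-storagecluster-cephblockpool
  csi.storage.k8s.io/fstype: ext4
  # CSI provisioner/node secret refs omitted for brevity — ODF generates them
reclaimPolicy: Retain                       # broker data survives PVC deletion
allowVolumeExpansion: true                  # lets you grow 500Gi volumes in place
volumeBindingMode: WaitForFirstConsumer     # bind in the zone the broker lands in
```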

NetworkPolicies

Default-deny in the namespace; explicit allows for app namespaces, monitoring, and the IdP. The pattern that holds:

networkpolicy-kafka.yaml
# Default deny everything in the kafka-prod namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: kafka-prod }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# Allow producers/consumers from labelled app namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-apps-to-brokers, namespace: kafka-prod }
spec:
  podSelector: { matchLabels: { strimzi.io/kind: Kafka } }
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels: { acme.bank/kafka-client: "true" }
      ports:
        - { port: 9094, protocol: TCP }   # OAuth listener only
---
# Allow brokers to reach the IdP for JWKS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-brokers-to-idp, namespace: kafka-prod }
spec:
  podSelector: { matchLabels: { strimzi.io/kind: Kafka } }
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector: { matchLabels: { name: identity } }
          podSelector: { matchLabels: { app: keycloak } }
      ports:
        - { port: 8443, protocol: TCP }

App namespaces are labelled acme.bank/kafka-client=true at provisioning time. A namespace without that label cannot reach the brokers — even if a developer hard-codes the bootstrap URL.

HA & multi-AZ

Three brokers spread across three zones is the baseline. The properties that have to be true:

  • topologySpreadConstraints with DoNotSchedule — not ScheduleAnyway. The latter allows two brokers in one zone if the third zone is unavailable, which silently breaks the rack-aware replica placement.
  • rack.topologyKey on the Kafka spec — Kafka assigns replicas across racks (zones); without it, all three replicas can land on brokers in the same zone.
  • PodDisruptionBudget of maxUnavailable: 1 — node drain doesn’t take more than one broker down at a time.
  • StorageClass per zone or topology-aware — the PV must be in the same zone as the pod that mounts it, or the pod won’t schedule.
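
The disruption budget comes from the Kafka CR’s template section rather than a hand-written PDB — a fragment:

```yaml
# Fragment of the Kafka CR — the operator generates the PodDisruptionBudget
spec:
  kafka:
    template:
      podDisruptionBudget:
        maxUnavailable: 1    # node drains evict at most one broker at a time
```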

DR with MirrorMaker 2

Active/passive DR uses KafkaMirrorMaker2 to replicate topics and consumer group offsets from the primary cluster to the DR cluster (ACL sync is disabled below — ACLs come from the same Git-managed KafkaUser CRs on both sides). Both clusters are full Strimzi deployments; only the MM2 instance lives on the DR side.

mirrormaker2.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: events-dr
  namespace: kafka-dr
spec:
  version: 3.7.0
  replicas: 3
  connectCluster: target
  clusters:
    - alias: source
      bootstrapServers: events-kafka-bootstrap.kafka-prod.svc:9093
      tls: { trustedCertificates: [{ secretName: prod-ca-cert, certificate: ca.crt }] }
      authentication: { type: tls, certificateAndKey: { secretName: mm2-source, certificate: user.crt, key: user.key } }
    - alias: target
      bootstrapServers: events-kafka-bootstrap.kafka-dr.svc:9093
      tls: { trustedCertificates: [{ secretName: dr-ca-cert, certificate: ca.crt }] }
      authentication: { type: tls, certificateAndKey: { secretName: mm2-target, certificate: user.crt, key: user.key } }
  mirrors:
    - sourceCluster: source
      targetCluster: target
      topicsPattern: "payments\\..*,risk\\..*"
      groupsPattern: ".*"
      sourceConnector:
        config:
          replication.factor: "3"
          offset-syncs.topic.replication.factor: "3"
          sync.topic.acls.enabled: "false"
          refresh.topics.interval.seconds: "30"
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: "3"
          sync.group.offsets.enabled: "true"

Failover is a runbook, not an automation. The DR cluster has the data; promotion to active requires updating the bootstrap DNS and resetting consumer group offsets via the MM2 checkpoint topic. Test this every quarter, not when the primary is down.

Observability

Strimzi runs the JMX Prometheus exporter as a Java agent inside the broker JVM (the metricsConfig block in the Kafka CR) and ships example PodMonitors and Grafana dashboards in the upstream repo. Five alerts that have to fire before anything else:

  • UnderReplicatedPartitions > 0 — durability is degraded.
  • OfflinePartitionsCount > 0 — data unavailable; this is a P1.
  • ActiveControllerCount != 1 — KRaft quorum issue.
  • Disk utilisation > 80% per broker — ahead of the cliff.
  • Cert expiry < 30 days — CA renewal needed.
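
The first two can be sketched as a PrometheusRule. The metric names assume the kafka-metrics JMX exporter rules from the Strimzi examples repo — verify them against your actual scrape before relying on the alerts:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-critical
  namespace: kafka-prod
spec:
  groups:
    - name: kafka.durability
      rules:
        - alert: KafkaUnderReplicatedPartitions
          expr: kafka_server_replicamanager_underreplicatedpartitions > 0
          for: 5m
          labels: { severity: warning }
          annotations:
            summary: "Under-replicated partitions — durability degraded"
        - alert: KafkaOfflinePartitions
          expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
          for: 1m
          labels: { severity: critical }
          annotations:
            summary: "Offline partitions — data unavailable, page now"
```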

Audit logs (auth events, ACL denies) are forwarded to Splunk or QRadar via a Vector sidecar in the broker pods. The SAMA audit trail is the SIEM, not the broker disk.

Common pitfalls

Single-zone deployment

The default Kafka CR replicas: 3 lands all three brokers wherever the scheduler likes. On a single-zone cluster, an AZ outage takes everything offline. Always set topologySpreadConstraints with DoNotSchedule and rack.topologyKey; verify by listing pod-to-node-to-zone mapping after rollout.

Auto-create topics in production

If auto.create.topics.enable is true, a typo in a producer’s topic name silently creates a new topic with default partitions and replication factor. By the time it’s noticed, weeks of traffic are in the wrong place. Disable it; let KafkaTopic CRs be the only path.

CA rotation surprise

The default Cluster and Clients CAs renew automatically before expiry, triggering a rolling restart. If clients pin the CA fingerprint (some Java clients do), the renewal breaks them. Switch clients to verifying via the published CA secret; never pin fingerprints in app code.

PVC ReadWriteOnce

Most block storage on OCP is RWO. When a node dies with the volume still attached, the replacement pod on another node can’t mount it until the attachment is released — and with Immediate binding, the PV may not even be in the replacement pod’s zone. Use topology-aware provisioning with volumeBindingMode: WaitForFirstConsumer — configure the StorageClass correctly the first time.

KRaft maturity

KRaft (no ZooKeeper) is production-ready as of Kafka 3.5+ and is the only path forward (ZooKeeper is removed in Kafka 4.x). For greenfield deployments use KRaft; for migrations from a ZooKeeper-backed cluster, plan a deliberate migration window with a tested rollback.

When not to use

  • You don’t already operate OCP. Strimzi’s value is the operator pattern; without an existing K8s/OCP investment, vanilla Kafka on VMs may be simpler. Don’t introduce OpenShift just to run Kafka.
  • Managed Kafka satisfies your data residency. If Confluent Cloud, MSK, or IBM Event Streams (managed) is allowed by your regulator and security stance, the operational savings are large. Strimzi shines when you can’t use managed.
  • Sub-three-broker scale. A small two-broker cluster doesn’t justify the operator footprint. RabbitMQ, NATS, or even a database queue may fit better.
  • The team has no Kafka operator experience. Strimzi simplifies day-2 only if the team understands Kafka itself. Without that, you trade one set of unknowns for another.

Where Strimzi on private OCP wins is the regulated bank case — data residency requirements, internal CA, internal IdP, FIPS-validated runtime, GitOps-managed configuration, audit-traceable changes — all the constraints that rule out managed Kafka and reward declarative platform engineering.