Overview
Integration platforms in financial services have CI/CD problems that vanilla microservice teams don’t face: BAR files for IBM ACE, Kafka topics that can’t be re-created without data loss, MQ queues that need clustering aware of network zones, OAuth clients that need IdP configuration synchronised with deployments, and a regulator who wants every change traced to an approver.
The pattern that holds: GitOps for state, Tekton for events. Git is the single source of truth for what should be deployed; Tekton runs the pipelines that build artifacts; ArgoCD reconciles the cluster to Git. Promotions happen by merging a PR, not by running kubectl from a developer’s laptop.
A platform engineering team owns the runtime, the pipeline, the dev experience, and the contracts with consumer teams. The deliverable is a self-service path: a service team can ship a new integration without raising a ticket. If teams still file tickets to deploy, you have a CI team, not a platform team.
Pipeline shape
The crucial property: nothing reaches a cluster except via git. No kubectl apply from a laptop, no manual oc rsh-and-edit. Every change is a commit; every commit has an author and a reviewer. This is what regulators care about and what most banks fail to implement consistently.
Tekton
Tekton is Kubernetes-native CI: pipelines and tasks are CRDs, runs are pods. The benefit over Jenkins is operational — the runtime is the cluster you already operate — and the cost is that Tekton is a primitive; you build the pipeline yourself.
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-ace-flow
spec:
  params:
    - { name: git-url }
    - { name: git-revision }
    - { name: image }
  workspaces:
    - name: source
  tasks:
    - name: clone
      taskRef: { name: git-clone }
      params:
        - { name: url, value: "$(params.git-url)" }
        - { name: revision, value: "$(params.git-revision)" }
      workspaces: [{ name: output, workspace: source }]
    - name: unit-test
      runAfter: [clone]
      taskRef: { name: ace-unit-test }
      workspaces: [{ name: source, workspace: source }]
    - name: build-bar
      runAfter: [unit-test]
      taskRef: { name: ace-build-bar }
      workspaces: [{ name: source, workspace: source }]
    - name: image-build
      runAfter: [build-bar]
      taskRef: { name: buildah }
      params:
        - { name: IMAGE, value: "$(params.image)" }
      workspaces: [{ name: source, workspace: source }]
    - name: trivy-scan
      runAfter: [image-build]
      taskRef: { name: trivy-scan }
      params:
        - { name: image, value: "$(params.image)" }
        - { name: severity, value: "HIGH,CRITICAL" }
    - name: sign-image
      runAfter: [trivy-scan]
      taskRef: { name: cosign-sign }
      params:
        - { name: image, value: "$(params.image)" }
    - name: bump-env-manifest
      runAfter: [sign-image]
      taskRef: { name: git-pr-bump }
      params:
        - { name: env-repo, value: "https://git/.../env-dev" }
        - { name: image, value: "$(params.image)" }
Each task is reusable; the pipeline is one team’s composition. Push reusable tasks to a shared catalog repo so platform updates apply everywhere.
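One way to consume tasks from a shared catalog repo is Tekton's git resolver, which fetches the task definition from Git when the run starts. The fragment below is a sketch of a single entry in a Pipeline's tasks list; the catalog URL, tag, and file path are hypothetical placeholders, and it assumes remote resolution is enabled on the cluster.

```yaml
# One entry under spec.tasks of a Pipeline.
# The url, revision and pathInRepo values are illustrative assumptions.
- name: build-bar
  runAfter: [unit-test]
  taskRef:
    resolver: git
    params:
      - name: url
        value: https://git.example.com/platform/tekton-catalog.git
      - name: revision
        value: v1.4.0            # pinned tag; bump it to adopt a platform update
      - name: pathInRepo
        value: tasks/ace-build-bar/ace-build-bar.yaml
  workspaces:
    - name: source
      workspace: source
```

Pinning a tag trades automatic rollout of catalog changes for reproducible builds; tracking a branch does the opposite. Which side to favour is a platform decision.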
GitOps with ArgoCD
Two kinds of repo per service: the app repo (source) and an env repo for each environment (manifests). The CI pipeline opens a PR against the env repo with the new image tag; ArgoCD watches the env repo and reconciles the cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod
  namespace: argocd
spec:
  project: integration
  source:
    repoURL: https://git/acme/env-prod
    targetRevision: main
    path: apps/payments
    helm:
      valueFiles: [values.prod.yaml]
  destination:
    server: https://prod-cluster
    namespace: payments
  syncPolicy:
    automated:
      selfHeal: true
      prune: false # prod: never auto-prune; humans approve deletes
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 3
      backoff: { duration: 10s, factor: 2, maxDuration: 2m }
  revisionHistoryLimit: 10
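The Application above points at a project named integration, which is not shown. A minimal sketch of what that AppProject might look like follows, assuming the project should only allow the prod env repo and the payments namespace; the description and the whitelist entries are illustrative choices, not taken from a real setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: integration
  namespace: argocd
spec:
  description: Integration platform applications (illustrative)
  # Only manifests from this repo may be deployed through the project.
  sourceRepos:
    - https://git/acme/env-prod
  # Only this cluster/namespace pair is a valid deploy target.
  destinations:
    - server: https://prod-cluster
      namespace: payments
  # Cluster-scoped kinds are denied unless listed; Namespace is allowed
  # here because the Application uses CreateNamespace=true.
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
```

Scoping the project this narrowly means a mistaken or malicious Application cannot target another namespace or pull manifests from an unreviewed repo.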
Environment promotion
Three environments are enough for almost any platform: dev (open to everyone, no restrictions), test (integration testing, fixed data), and prod.
| Environment | Sync | Approval | Data |
|---|---|---|---|
| dev | Automated on commit to main | None | Synthetic, freely reset |
| test | Automated on PR merge from dev branch | 1 platform reviewer | Anonymised prod-like |
| prod | PR with image tag + change record | Service owner + change manager | Real |
Promotion is image-tag-only. The same image that passed test is what runs in prod. Configuration that varies by environment lives in env-specific Helm values files; never rebuild between environments.
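As a rough illustration, a per-environment values file might look like the sketch below. The chart keys (image, replicaCount, mq) and every value are hypothetical; the point is only that a promotion PR changes the tag line and nothing else.

```yaml
# apps/payments/values.prod.yaml in the prod env repo.
# All keys and values are illustrative; a real chart defines its own schema.
image:
  repository: registry.example.com/integration/payments
  tag: "1.8.3"        # the only line a promotion PR should change
replicaCount: 3       # environment-specific sizing
mq:
  queueManager: QMPROD  # environment-specific endpoint, never baked into the image
```

The test environment would carry the same structure with its own sizing and endpoints, and the same tag once the image has been promoted there.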
Secrets
Three things must be true: secrets are not in git, secrets are versioned, and pods can read secrets at runtime without a human in the loop.
- Vault as source of truth. HashiCorp Vault or IBM Cloud Pak for Integration secrets; pods authenticate via Kubernetes service account JWT.
- External Secrets Operator. Pulls from Vault, creates Kubernetes Secrets in the namespace; ArgoCD doesn’t see the secrets, just ExternalSecret objects.
- Sealed Secrets as fallback. Encrypted secret in git, decrypted by a controller in the cluster. OK for low-risk env-specific configuration, not for production credentials.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-mq-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-prod
    kind: ClusterSecretStore
  target:
    name: payments-mq-creds
    creationPolicy: Owner
  data:
    - { secretKey: username, remoteRef: { key: payments/mq, property: user } }
    - { secretKey: password, remoteRef: { key: payments/mq, property: pwd } }
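The ExternalSecret refers to a ClusterSecretStore called vault-prod. A sketch of that store using Vault's Kubernetes auth method is shown below, matching the service account JWT approach described earlier. The Vault address, KV mount path, role, and service account names are assumptions for illustration.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-prod
spec:
  provider:
    vault:
      server: https://vault.example.com:8200   # placeholder address
      path: secret                             # assumed KV mount
      version: v2                              # assumed KV version 2
      auth:
        kubernetes:
          mountPath: kubernetes                # Vault auth mount for this cluster
          role: external-secrets               # hypothetical Vault role
          serviceAccountRef:                   # JWT of this account is sent to Vault
            name: external-secrets
            namespace: external-secrets
```

The Vault role would need a policy granting read access to the payments/mq path; that policy lives in Vault, not in the cluster.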
Observability
The platform itself needs the same telemetry it gives its consumers. Three signals at minimum:
- Pipeline duration & success rate. Slow pipelines erode developer trust faster than failing pipelines. Track p50/p95 by stage.
- Sync drift. ArgoCD’s out-of-sync status — track per app per environment. Persistent drift means manual changes to the cluster, which means GitOps is bypassed.
- Image freshness. Time since the image in production was built. An image older than 30 days probably has unpatched CVEs.
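The sync drift signal can be sketched as a Prometheus alert on the argocd_app_info metric that ArgoCD exports. This assumes the Prometheus Operator is installed and ArgoCD metrics are already scraped; the rule name, the 30 minute window, and the severity label are illustrative choices.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gitops-platform-alerts     # hypothetical name
  namespace: argocd
spec:
  groups:
    - name: argocd-sync-drift
      rules:
        - alert: ArgoCDAppOutOfSync
          # argocd_app_info is 1 per application; filter on the sync_status label.
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 30m                 # ignore the short drift of a normal rollout
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.name }} has been OutOfSync for 30 minutes"
```

Pipeline duration and image freshness would need their own rules; the metric names for those depend on how Tekton metrics and image build timestamps are exposed in a given installation.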
Release patterns
Three release patterns cover almost every integration deployment:
- Rolling. Default for stateless services; replace pods one at a time behind a load balancer. Fine for HTTP/REST.
- Blue/green. Stand up green alongside blue, switch traffic at the gateway. Use for releases that change message contracts where in-flight requests with the old contract must drain.
- Canary. Route a small percentage of traffic to the new version; ramp on metrics. Use for high-risk changes or anywhere business metrics are sensitive (fraud detection, risk scoring).
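For the rolling case, the behaviour is set on the Deployment itself. The sketch below shows one conservative setting under assumed names: the service name, image, port, and health path are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api              # hypothetical service name
  namespace: payments
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                 # add at most one extra pod during the rollout
      maxUnavailable: 0           # never drop below the desired replica count
  selector:
    matchLabels: { app: payments-api }
  template:
    metadata:
      labels: { app: payments-api }
    spec:
      containers:
        - name: app
          image: registry.example.com/integration/payments:1.8.3   # placeholder
          ports:
            - containerPort: 8080
          readinessProbe:         # an old pod is removed only after a new one is ready
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 5
```

Blue/green and canary need traffic control above the Deployment (a gateway, service mesh, or a rollout controller), so they are not expressible with this object alone.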
Kafka topic partition increases, MQ queue config changes, schema registry compatibility shifts — none of these are pod-level rollouts. Encode them as separate, idempotent platform tasks (Tekton or Ansible) that run before the pod-level deploy. Don’t bake them into the application Helm chart.
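As one possible shape for such a task, the Tekton Task below raises a topic's partition count only when the current count is lower, so re-running it is harmless. It is a sketch under assumptions: the task name and image are hypothetical, the image is assumed to have kafka-topics.sh on its PATH, and the parsing relies on the PartitionCount field printed by the describe command.

```yaml
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: kafka-ensure-partitions        # hypothetical task name
spec:
  params:
    - { name: bootstrap-server, type: string }
    - { name: topic, type: string }
    - { name: partitions, type: string }
  steps:
    - name: ensure-partitions
      # Placeholder image: anything with kafka-topics.sh on the PATH.
      image: registry.example.com/tools/kafka-cli:latest
      env:                             # pass params as env vars, not inline in the script
        - { name: BOOTSTRAP, value: "$(params.bootstrap-server)" }
        - { name: TOPIC, value: "$(params.topic)" }
        - { name: WANTED, value: "$(params.partitions)" }
      script: |
        #!/bin/sh
        set -eu
        # Read the current partition count from the describe output.
        current=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --describe --topic "$TOPIC" \
          | sed -n 's/.*PartitionCount: *\([0-9][0-9]*\).*/\1/p' | head -n 1)
        if [ -z "$current" ]; then
          echo "could not read partition count for $TOPIC" >&2
          exit 1
        fi
        # Partitions can only grow; do nothing if the target is already met.
        if [ "$current" -ge "$WANTED" ]; then
          echo "$TOPIC already has $current partitions; nothing to do"
          exit 0
        fi
        kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --alter --topic "$TOPIC" --partitions "$WANTED"
```

Running this as its own pipeline step before the application sync keeps the broker change reviewable and retryable independently of the pod rollout.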
Audit & compliance
The regulator’s question is the same in every audit: “show me who approved this change.” GitOps answers that question by construction: every change is a commit with an author; every prod commit was a PR with a reviewer; every reviewer maps to an identity. Make sure the chain is unbroken.
- Branch protection on prod env repo: require 2 reviewers, 1 from a different team, no force-pushes, signed commits.
- Image provenance: sign images with cosign; an admission controller policy in the prod cluster refuses unsigned images. ArgoCD does not verify signatures itself, so the check has to sit at admission.
- SBOM per artifact: attach a Software Bill of Materials to every image; the next CVE response is a query against SBOMs, not a reverse-engineering exercise.
- Change record link: require a ServiceNow / Jira change ticket reference in the prod PR title; auditors want to trace from change ticket to commit to deploy.
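The signature check in the list above needs an admission controller to enforce it. The article does not name one; the sketch below assumes Kyverno and uses its verifyImages rule. The registry pattern, namespace, and key are placeholders, and newer Kyverno releases may prefer a per-rule failure action over the spec-level field used here.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images          # hypothetical policy name
spec:
  validationFailureAction: Enforce     # reject, rather than only audit
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [payments]
      verifyImages:
        - imageReferences:
            - "registry.example.com/integration/*"   # placeholder registry path
          attestors:
            - entries:
                - keys:
                    # Replace with the public half of the cosign signing key.
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <cosign public key>
                      -----END PUBLIC KEY-----
```

Because the policy matches Pods, it also catches images introduced by a manual apply, which closes the gap between what the pipeline signed and what the cluster runs.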
Common pitfalls
One platform engineer with cluster admin can fix anything in 30 seconds. They can also undo six months of GitOps discipline. Restrict cluster admin to break-glass; require justification per use; alert on every non-ArgoCD apply.
The single most common cause of credential leaks is debug output in a CI pipeline. Configure Tekton tasks with strict log filtering; reject any task that prints environment variables; rotate secrets immediately if any leak is suspected.
If every team has its own pipeline, you have N pipelines to maintain and audit. Define a small number (1–3) of paved paths; allow teams to deviate only with explicit approval. The flexibility cost is real but the audit and operational savings are larger.
Rollback in GitOps is a revert PR. That sounds easy — until ArgoCD has auto-pruned a resource and the revert recreates it with a different IP, breaking downstream service-mesh routing. Test rollback regularly; never assume it works.
Production checklist
- App repo and env repo separated; only env repo drives deploys.
- Branch protection on prod env repo; 2 reviewers; signed commits.
- ArgoCD with selfHeal on, auto-prune off in prod.
- Tekton tasks signed with cosign; only signed tasks run in prod.
- Trivy/Clair scan on every image; HIGH/CRITICAL fail the pipeline.
- Cosign image signing; admission policy enforces signatures in prod.
- SBOM attached to every image; queryable by CVE.
- Vault + External Secrets; no plaintext secrets in git.
- One golden path per service shape; deviations need approval.
- Pipeline duration p95 alert; sync drift alert; image freshness alert.
- Documented and quarterly-tested rollback runbook.
- Cluster admin restricted; non-ArgoCD apply alerts wired to oncall.