Forward-Looking Platform Architecture

Platform Evolution — Crossplane, Temporal & AI-Augmented DevEx

The Backstage IDP closed the gap from idea to running app. The next gap is from running app to self-service cloud resources, durable workflows, and AI-amplified engineering velocity — the operating model regulated enterprises need to scale developer productivity without giving up control.

  • 5 min: form → running app
  • 14: service breakeven
  • 4: DORA metrics tracked
  • Zero: tickets to onboard

📐 The Operating Model

A platform isn't a product; it's a product organization. The platform team's job is to make the path of least resistance the path of greatest compliance, observability, and reliability. Everything below is built around that principle.

Platform-as-Product

Internal developers are customers. Treat the platform like a product line: golden paths as the primary SKU, paved roads as supported variants, escape hatches when the abstraction breaks down. Measure adoption, not just availability.

Team Topology

Stream-aligned product teams consume the platform. The platform team builds golden paths and operates shared services. Enabling teams (security, SRE, data) embed temporarily to coach adoption, not own it permanently.

Funding Model

Central funding for the platform substrate (EKS, ArgoCD, observability) — chargeback for variable resources via Crossplane Composition cost tags. Showback dashboards in Backstage make spend legible to engineering leadership without billing surprises.

Compliance Integration

Controls live in the platform, not in checklists. Crossplane Compositions enforce encryption, private subnet placement, and backup retention by construction. Kyverno policies validate every Claim. Every change flows through Git — audit trail is automatic.

🏗️ Layered IaC — Terraform for Substrate, Crossplane for Self-Service

Pure-Terraform IDPs hit a wall: developers can scaffold an app in five minutes but still file a ticket for the database that app needs. The bottleneck moves but doesn't disappear. Crossplane closes the loop — cloud resources become Kubernetes APIs, exposed through Backstage golden paths, reconciled by ArgoCD, governed by Compositions written once.

Layer 1 · Substrate (Terraform — change-controlled by platform team)

VPC, EKS, IAM baselines, Karpenter node pools, Crossplane install, ArgoCD install, network policies, observability collectors. Changes go through pull request with plan review. Slow on purpose.

Layer 2 · Cloud Resource Catalog (Crossplane Compositions — platform team authored)

XPostgreSQLInstance, XS3Bucket, XSQSQueue, XElastiCacheCluster — each Composition encodes the company's compliance defaults: encryption at rest, private subnet placement, backup retention, cost tags, IAM least-privilege. Versioned. Tested. Reviewed.
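
Each catalog entry starts from a CompositeResourceDefinition (XRD) that declares what developers are allowed to ask for. The following is a minimal sketch for XPostgreSQLInstance, assuming the Crossplane v1 apiextensions API and the platform.example.com group used by the worked example on this page. The size limits and the list of allowed instance classes are illustrative assumptions, not the actual catalog.

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  # XRD names must be <plural>.<group>
  name: xpostgresqlinstances.platform.example.com
spec:
  group: platform.example.com
  names:
    kind: XPostgreSQLInstance        # composite resource, managed by the platform
    plural: xpostgresqlinstances
  claimNames:
    kind: PostgreSQLInstance         # namespaced Claim that developers commit
    plural: postgresqlinstances
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [parameters]
              properties:
                parameters:
                  type: object
                  required: [storageGB, instanceClass]
                  properties:
                    storageGB:
                      type: integer
                      minimum: 20      # assumed floor
                      maximum: 1000    # assumed ceiling
                    instanceClass:
                      type: string
                      enum: [db.t3.medium, db.m6g.large]   # assumed allow-list
                    multiAZ:
                      type: boolean
                      default: true
```

Constraining the schema here means an out-of-policy request (an oversized disk, an unapproved instance class) is rejected by the API server before any Composition runs.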

Layer 3 · Golden Paths (Backstage scaffolder templates)

Developer fills out a form: "I need a Node.js service with a PostgreSQL database." Scaffolder commits the App Repo, the GitOps Repo, and the Crossplane Claim. ArgoCD reconciles all three. Five minutes later: running service, provisioned database, secrets injected, observability dashboards live.
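
A form like that is defined by a scaffolder Template. The sketch below uses the built-in fetch:template, publish:github, and catalog:register actions; the template name, the ./skeleton directory, the example-org GitHub organization, and the storageGB parameter are assumptions made for illustration.

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: nodejs-service-with-postgres     # hypothetical template name
  title: Node.js service with PostgreSQL
spec:
  owner: platform-team                   # hypothetical owning group
  type: service
  parameters:
    - title: Service details
      required: [name, owner]
      properties:
        name:
          type: string
          description: Service name, reused for the repo and the database Claim
        owner:
          type: string
          ui:field: OwnerPicker
        storageGB:
          type: integer
          default: 50
  steps:
    - id: fetch
      name: Render skeleton
      action: fetch:template
      input:
        url: ./skeleton                  # app code, manifests, and Claim templates
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
          storageGB: ${{ parameters.storageGB }}
    - id: publish
      name: Publish app repo
      action: publish:github
      input:
        repoUrl: github.com?owner=example-org&repo=${{ parameters.name }}
    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
  output:
    links:
      - title: Repository
        url: ${{ steps.publish.output.remoteUrl }}
```

The commits to the GitOps repo and the Crossplane Claim would be additional publish steps (for example a pull-request action against the GitOps repo); they are omitted here to keep the sketch short.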

Layer 4 · Workload Patterns (Temporal workers, ML services, event consumers)

Stateless apps are the easy case. The IDP also supports durable workflow services on Temporal, batch ML inference workloads, and Kafka consumers — same scaffolding flow, same GitOps reconciliation, same observability baseline.

Why this layering matters

Terraform owns what changes monthly; Crossplane owns what changes daily. Compliance defaults are written once in a Composition and inherited by every subsequent Claim — not copy-pasted into every Terraform module. Platform team capacity shifts from "ticket queue for cloud resources" to "improving the catalog of resources developers can self-serve."

Worked example — XPostgreSQLInstance

A developer requests a database through Backstage. The scaffolder commits this Claim:

apiVersion: platform.example.com/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: claims-service-db
  namespace: claims-service
spec:
  parameters:
    storageGB: 50
    instanceClass: db.t3.medium
    multiAZ: true
  compositionRef:
    name: postgresql-aws-standard
  writeConnectionSecretToRef:
    name: claims-service-db-connection

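The Composition (authored once by the platform team) materializes the database with the Layer 2 compliance defaults already applied: encryption at rest, private subnet placement, backup retention, cost tags, and least-privilege IAM.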

The developer never sees the AWS console, never writes Terraform, never files a ticket. The platform team writes the Composition once and every future PostgreSQL request inherits the controls.
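
For orientation, the Composition behind that Claim might look like the sketch below. It assumes Crossplane's Pipeline mode with function-patch-and-transform installed and the Upbound AWS provider's RDS Instance resource; field names follow that provider as best understood and should be checked against the installed version. The region, subnet group, retention period, and tag values are placeholders, and credential handling and connection-secret mapping are deliberately left out.

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: postgresql-aws-standard
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1alpha1
    kind: XPostgreSQLInstance
  mode: Pipeline
  pipeline:
    - step: patch-and-transform
      functionRef:
        name: function-patch-and-transform   # assumes this Function is installed
      input:
        apiVersion: pt.fn.crossplane.io/v1beta1
        kind: Resources
        resources:
          - name: rds-instance
            base:
              apiVersion: rds.aws.upbound.io/v1beta1
              kind: Instance
              spec:
                forProvider:
                  region: us-east-1                    # placeholder
                  engine: postgres
                  # Compliance defaults: fixed here, not exposed to developers
                  storageEncrypted: true
                  publiclyAccessible: false
                  backupRetentionPeriod: 14            # placeholder, in days
                  dbSubnetGroupName: private-db-subnets  # placeholder
                  tags:
                    cost-center: platform              # placeholder cost tag
            patches:
              # Only these three developer-facing parameters flow through
              - type: FromCompositeFieldPath
                fromFieldPath: spec.parameters.storageGB
                toFieldPath: spec.forProvider.allocatedStorage
              - type: FromCompositeFieldPath
                fromFieldPath: spec.parameters.instanceClass
                toFieldPath: spec.forProvider.instanceClass
              - type: FromCompositeFieldPath
                fromFieldPath: spec.parameters.multiAZ
                toFieldPath: spec.forProvider.multiAz
```

The design point is the split between base and patches: anything in base is a control the developer cannot override, and anything patched is a choice the platform has decided to expose.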

⏱️ Temporal — Durable Workflows for Stateful Domains

Insurance, financial services, and other regulated domains run on long-lived stateful processes: underwriting decisions, claims adjudication, settlement reconciliation, policy lifecycle events. These workflows span hours, days, or weeks; they retry; they involve human-in-the-loop steps; they need full audit trails. REST endpoints and cron jobs aren't the right primitive. Temporal is.

Why Temporal belongs in the platform catalog

  • Durable execution: workflows survive worker restarts, deployments, and partial failures by construction
  • Retry & compensation: built-in primitives for the patterns every regulated workflow needs
  • Audit trail: every workflow step is an event in history; compliance teams query it directly
  • Polyglot: SDKs for Java, Go, Python, and TypeScript; workflows are written as ordinary code in the worker's language rather than in a separate definition language
  • Operationally mature: deployed via Helm chart, ArgoCD-friendly, runs on the same EKS substrate as everything else

Where it slots into the IDP

  • Backstage scaffolder template: "Temporal Worker Service" alongside Node.js / Python / Go templates
  • Crossplane Claim: a TemporalNamespace Claim (backed by the XTemporalNamespace composite) provisions the namespace with retention policies and access controls
  • GitOps repo: worker deployment, service monitor, structured logging — same pattern as any other service
  • Observability: Temporal SDK metrics flow into Prometheus → Grafana Cloud alongside app RED metrics
  • Same security posture: GitHub OAuth in Backstage, IRSA for worker pods, External Secrets for credentials
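
As a rough illustration of the Claim and service-monitor items in this list, a Temporal worker's GitOps repo might contain something like the following. The TemporalNamespace kind and its retentionDays parameter are hypothetical (they mirror the shape of the PostgreSQL Claim and would be defined by the platform's own XRD), and the worker label and metrics port name are assumptions. The ServiceMonitor is the standard Prometheus Operator resource.

```yaml
# Hypothetical platform API: namespaced Claim for a Temporal namespace
apiVersion: platform.example.com/v1alpha1
kind: TemporalNamespace
metadata:
  name: claims-adjudication
  namespace: claims-service
spec:
  parameters:
    retentionDays: 30          # assumed parameter for workflow history retention
---
# Scrapes the SDK metrics endpoint exposed by the worker's Service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: claims-adjudication-worker
  namespace: claims-service
spec:
  selector:
    matchLabels:
      app: claims-adjudication-worker   # assumed Service label
  endpoints:
    - port: metrics                     # assumed named port on the Service
      interval: 30s
```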

The architectural narrative for regulated workloads

"Stateless services run on the IDP today — Node.js, Python, Go scaffolded through Backstage, deployed via ArgoCD. The same flow extends to durable workflow workers on Temporal. A claims adjudication workflow that spans seven days with three approval steps and four external API calls runs on the same platform substrate, with the same observability, the same security baseline, and the same developer experience as a stateless API."

This is the layer most platform teams under-invest in — and it's the layer that matters most for insurance, banking, and any domain where business processes don't fit in a request-response cycle. Building it as part of the platform from day one prevents the inevitable proliferation of bespoke BPM tools, hand-rolled state machines, and brittle cron-job orchestration.

🤖 AI-Augmented Developer Experience

AI tooling is the largest single lever on engineering productivity available right now. The platform team's job isn't to choose whether developers use AI — it's to make sure they use it safely, consistently, and measurably. That means provisioning the tools, integrating them into the golden paths, and instrumenting the outcomes.

Tooling strategy

Tool | Where it fits | Platform integration
GitHub Copilot Suite (IDE) | Inline code completion, chat, code review, PR summaries inside developer IDEs | Org-level license, SSO via Entra/Okta, audit logs to SIEM, content exclusion policies for sensitive repos, custom instructions enforcing org coding standards
Claude Code (Agentic) | Terminal-native agentic coding for complex multi-file refactors, migrations, test generation, scaffolding work | Anthropic API via enterprise agreement, MCP servers exposing internal tools (Backstage catalog, Jira, GitLab, Datadog), policy controls on which repos and tools agents can touch
Backstage AI Plugin (Catalog) | Surface AI-recommended templates, owners, and similar services directly in the developer portal | Catalog metadata + entity relationships feed retrieval; LLM proposes the right golden path for the developer's request
AI Code Review Bots (CI) | Automated review on every PR: security, accessibility, performance, test coverage gaps | GitLab CI / GitHub Actions integration; results posted as PR comments alongside human review; never blocking without human override

The two tools complement each other

Copilot Suite — the constant companion

Always-on inside the IDE. Tab-complete, inline chat, code-aware suggestions during the moment-to-moment work of writing software. The right tool for the 80% of developer time that's incremental: a function here, a test case there, a refactor in scope.

Where it shines: reducing keystrokes, surfacing API patterns, generating boilerplate, accelerating debugging, summarizing PRs. The compounding small wins.

Claude Code — the agentic specialist

Terminal-native, designed for the 20% of work that's larger in scope: a multi-file migration, a security upgrade across 50 services, generating a test suite for a legacy module, scaffolding a new pattern across the codebase.

Where it shines: tasks that span files, require iteration with tools, benefit from the agent reading the whole codebase before acting. The set-it-and-review-it work.

Platform team responsibilities

The honest framing

"AI tooling adoption isn't a productivity hack; it's a platform capability. The companies that win the next five years won't be the ones with the most licenses; they'll be the ones whose platforms make AI tools the default, safest, most measured path. Procurement is the easy part. Integration, guardrails, and measurement are the platform team's job."

📊 Developer Productivity Metrics — DORA + SPACE

A platform's success isn't measured by how cleverly it's architected. It's measured by whether engineers ship faster, more safely, and more happily because of it. DORA gives the delivery view; SPACE adds the human view; together they tell the truth.

The DORA four — operational delivery health

Metric | What it measures | How the platform improves it
Deployment Frequency | How often code reaches production | Golden-path CI/CD, automated GitOps deploys, ephemeral preview environments via scaffolder
Lead Time for Change | Time from commit to running in production | Trivy + npm audit gating in pipeline; ArgoCD auto-sync; Crossplane Claims removing infra wait time
Change Failure Rate | Percentage of deploys that cause incidents | Property-based tests, Kyverno policies, default-deny network policies, pre-merge security scans
Mean Time to Recovery | Time to restore service after an incident | Grafana Cloud dashboards auto-provisioned per service, ArgoCD rollback in seconds, runbooks linked from Backstage entity pages
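
The "ArgoCD auto-sync" entry corresponds to an Application with an automated sync policy, so a merge to the GitOps repo is the deploy. A minimal sketch, assuming a hypothetical GitOps repo URL, path, and AppProject name:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: claims-service
  namespace: argocd
spec:
  project: claims-service                  # assumed per-app AppProject
  source:
    repoURL: https://github.com/example-org/claims-service-gitops.git  # hypothetical
    targetRevision: main
    path: overlays/prod                    # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: claims-service
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert manual drift back to the Git state
```

One trade-off worth knowing: with automated sync enabled, recovery is normally a Git revert that ArgoCD then syncs, because ArgoCD does not allow its own rollback operation on auto-synced Applications.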

The SPACE five — developer experience health

Satisfaction

Quarterly developer survey. NPS for the platform. Friction reports captured in 1:1s and routed back to the platform backlog.

Performance

DORA metrics by team. Time-to-first-deploy for new hires. Service reliability per SLO.

Activity

PRs merged, deploys, scaffolder usage, Backstage page views. Activity alone is noise — paired with outcomes it's signal.

Communication

Cross-team PR review velocity, documentation freshness in Backstage TechDocs, Slack/Teams response times for the platform team.

Efficiency

Context-switching cost, time-in-flow signals, build/test wait times. The metric that surfaces friction the others miss.

AI augmentation overlay

Copilot acceptance rates, Claude Code task completion times, AI-assisted PR throughput by team. New dimension; same framework.

What good looks like — directional targets

  • Lead time for change: < 1 day. Elite DORA tier; achievable with golden paths and ArgoCD auto-sync
  • Deployment frequency: daily per service. Multiple deploys per service per day for active streams
  • Time to first deploy (new hire): < 5 days. From laptop to production via Backstage scaffolder
  • Platform NPS: > +30. Surveyed quarterly; tied to platform team OKRs
  • Change failure rate: < 15%. Elite DORA tier; achieved via pipeline gates and test rigor
  • Copilot acceptance rate: 30–40%. Industry benchmark; track per-team to surface coaching opportunities

🛡️ Compliance & Security by Construction

Regulated industries can't choose between speed and control. The platform is how you get both — by encoding controls in the substrate so the fast path is also the compliant path.

Identity & access

  • SSO (Entra/Okta) for human access to Backstage, ArgoCD, Grafana, AWS console
  • IRSA / Pod Identity for workload-to-AWS authentication — no static credentials
  • External Secrets Operator for app-layer credentials, sourced from AWS Secrets Manager with rotation
  • Backstage entity ownership tied to GitHub team membership — single source of truth
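
The workload-identity and secret-sync items above come down to two small manifests per service. A sketch, assuming IRSA and External Secrets Operator's v1 API with a ClusterSecretStore; the account ID, role name, store name, and secret key are placeholders:

```yaml
# IRSA: pods using this ServiceAccount assume the IAM role, no static keys
apiVersion: v1
kind: ServiceAccount
metadata:
  name: claims-service
  namespace: claims-service
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/claims-service  # placeholder ARN
---
# Syncs every key of one Secrets Manager secret into a Kubernetes Secret
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: claims-service-credentials
  namespace: claims-service
spec:
  refreshInterval: 1h                  # re-read so rotated values propagate
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager          # placeholder store name
  target:
    name: claims-service-credentials
  dataFrom:
    - extract:
        key: claims-service/app        # placeholder Secrets Manager secret name
```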

Network & isolation

  • Calico default-deny on every namespace; explicit allow-rules per service pattern
  • VPC endpoints for AWS service traffic — STS, ECR, Secrets Manager, EKS — keeping data on the AWS backbone
  • Per-app ServiceAccount with read-only Role; ArgoCD AppProject scoped to namespace and source repo
  • Tier 2 production path: dedicated workload clusters per environment; Tier 3: AWS Organizations multi-account isolation
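
Two of these controls are short enough to show in full. The sketch below uses the standard Kubernetes NetworkPolicy API, which Calico enforces (Calico's own policy CRDs are an alternative not shown here), and an ArgoCD AppProject; the namespace and repository URL are placeholders.

```yaml
# Default-deny: selects every pod in the namespace and allows nothing,
# so each service must add explicit allow rules
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: claims-service
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Applications in this project may only deploy from one repo into one namespace
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: claims-service
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/example-org/claims-service-gitops.git   # placeholder
  destinations:
    - server: https://kubernetes.default.svc
      namespace: claims-service
```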

Policy & admission control

  • Kyverno admission policies validate every Crossplane Claim against compliance rules at admission, as ArgoCD applies it and before it is persisted
  • Pipeline gates (Trivy, npm audit) block known-bad dependencies and CVEs before build
  • SBOM generation per build, attached to container image via OCI artifacts
  • Image signing via Sigstore + cosign for production deploys
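
As an example of the first item, a Kyverno policy that rejects non-compliant database Claims could look like this. It is a sketch under assumptions: the rule (production databases must be multi-AZ) is invented for illustration, and newer Kyverno releases prefer a rule-level failure action over the policy-level field used here.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-multiaz-postgres-claims    # hypothetical policy name
spec:
  validationFailureAction: Enforce         # reject, rather than only audit
  background: true                         # also report on existing Claims
  rules:
    - name: postgres-claims-must-be-multi-az
      match:
        any:
          - resources:
              kinds:
                - platform.example.com/v1alpha1/PostgreSQLInstance
      validate:
        message: "PostgreSQLInstance claims must set spec.parameters.multiAZ to true."
        pattern:
          spec:
            parameters:
              multiAZ: true
```

Policy and Composition overlap on purpose: the Composition fixes what developers cannot set, while the policy constrains the values they can.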

Audit & evidence

  • Every change to the platform substrate or to an app deployment flows through Git, giving an immutable history
  • ArgoCD reconciliation logs are the deployment audit trail
  • Crossplane Composition versions are reviewable evidence of compliance defaults
  • Temporal workflow histories provide step-by-step audit trail for business processes

The SOX / regulatory framing

Compliance teams ask three questions: Who can change production? How is change reviewed? Where's the evidence? The answer in this operating model is the same for every question: Git. Production changes are pull requests. Reviews are PR approvals from CODEOWNERS-defined groups. Evidence is the immutable Git history plus ArgoCD reconciliation logs plus Temporal workflow histories. Auditors learn to read PRs instead of ticket queues — and the audit gets faster every quarter as the platform matures.

🚀 Migration Path — How to Get There from Where You Are

No team gets to greenfield this. The realistic path is incremental: capture quick wins, build the substrate in parallel, migrate workload-by-workload as teams pull rather than push.

Phase 1 · Foundation (months 1–3)

Phase 2 · Self-Service Cloud Resources (months 3–6)

Phase 3 · Durable Workflows & AI Integration (months 6–9)

Phase 4 · Optimization & Scale (months 9–12)

📚 What This Builds On

Every concept on this page is grounded in production-shaped work I've already shipped or scars I've already collected.

Concept | Existing evidence
Backstage golden paths, scaffolder, GitOps integration | Backstage IDP portfolio: 329 tests, 11 Terraform modules, $218/mo run cost, 5-min scaffold time
Modular Terraform, ArgoCD App-of-Apps, Karpenter cost optimization | GitOps EKS portfolio: 352+ tests, 70% SPOT savings, dual-cluster pattern
External Secrets Operator + AWS Secrets Manager pattern | Backstage IDP V2 rebuild: dataFrom.extract sync, ClusterSecretStore, v1beta1→v1 migration battle-tested
Property-based testing for platform invariants | 14 PBT tests for Separation of Duties + 329 platform tests via fast-check
Cross-functional driver across many teams | Northwestern Mutual Culture of Quality framework: 14+ teams, 90→45 min assessments, 1:6–8 coordinator model
AI tooling fluency & evangelism | Q Developer beta partner since Dec 2022, AWS GenAI evangelism with VP Carpenter, 90% delivery time reduction documented
Regulated environment delivery | Mizuho Bank (front office equities, current), Penguin Random House PCI pipeline (since 2019), Northwestern Mutual SDLC standardization

💡 Lessons Carried Forward

Spec-driven development scales beyond POCs

The Backstage IDP project formalized requirements, design, and correctness properties before any code. That discipline scales: a platform serving 500 engineers needs the same rigor on every Composition, every golden path, every policy. Spec-first means the test suite writes itself and the audit trail is built into the workflow.

The platform team's job is to pay back time

Every minute spent on the platform should pay back many minutes of stream-aligned developer time. The Backstage IDP's 14-service breakeven calculation is the right framing: platform investments are quantifiable, and platform leaders should be willing to defend that math. If a feature can't justify its cost in developer hours saved, it doesn't ship.

AI tooling is a platform capability, not a productivity hack

Provisioning Copilot licenses isn't enough. Integrating AI into golden paths, defining the guardrails, measuring the outcomes, building a prompt library — that's platform work. The companies that get this right will out-ship the ones that just buy seats.

Crossplane and Terraform aren't competitors

Terraform owns the substrate; Crossplane owns the catalog. Trying to do everything in one tool is the classic "when all you have is a hammer" trap. Layering the IaC strategy is the same insight as layering the team topology — different rates of change deserve different tools.

Temporal earns its place by removing brittle alternatives

Every regulated enterprise has cron jobs running multi-day workflows, hand-rolled state machines, and BPM tools nobody understands. Temporal isn't a "new toy" — it's the consolidation of three categories of fragility into one well-operated platform service. The business case is the cron jobs you delete.