πŸ—οΈ Backstage IDP — Self-Service Developer Platform

Proof of Concept — Internal Developer Platform on AWS EKS

Enable developers to scaffold production-ready Node.js applications in minutes — not days — through a single self-service portal, with automated CI/CD, infrastructure, observability, and deployment built in.

50 Jira Tasks
11 Terraform Modules
6 Pipeline Stages
~$218 Monthly AWS Cost
329 Automated Tests
219 Files in Monorepo

📋 The Problem

Today, spinning up a new service requires coordination across multiple teams and tools — request a Git repo, configure CI/CD pipelines manually, set up Kubernetes manifests, request DNS entries and TLS certificates, wait for the platform team to wire up deployments.

This process takes days to weeks and creates bottlenecks that slow down delivery.

✅ The Solution

A Backstage-powered Internal Developer Platform that gives developers a single "Create" button to go from idea to running application on AWS EKS — fully automated, fully standardized.

👩‍💻 Developer fills out a form; everything else is automated.

🚀 How It Works

When a developer clicks "Create" in Backstage, the entire workflow is automated end-to-end:

Workflow

From scaffolding form to GitOps deployment:

  1. 👩‍💻 Developer chooses a template and fills out the Backstage form (name, owner, service data)
  2. The scaffolder generates the repo + code, a catalog entry, and GitOps metadata
  3. GitLab CI runs lint, test, security, and build + push, then updates the GitOps repo
  4. ArgoCD detects the change and deploys to an EKS namespace (Ingress, Service, Deployment, secrets, metrics)

Time from form submission to running application: ~5 minutes

🧪 Testing Your Scaffolded App

After scaffolding a new app, here's how to verify everything is working. Replace <app-name> with the name you chose during scaffolding.

  • App URL — https://<app-name>.vibe-coder.biz → Default Express response
  • Health Check — curl https://<app-name>.vibe-coder.biz/health → {"status":"ok"}
  • Metrics — curl https://<app-name>.vibe-coder.biz/metrics → Prometheus-format metrics
  • Pods Running — kubectl get pods -n <app-name> → 2 pods in Running state
  • ArgoCD — visit argocd.vibe-coder.biz → Synced + Healthy on the app tile
  • GitLab CI — gitlab.com/devop212/nodejs-apps/<app-name>/-/pipelines → all stages passed (lint, test, security, build, deploy)
  • Backstage Catalog — visit backstage.vibe-coder.biz Catalog → app listed as Component with repo, pipeline, and deployment links

If all seven checks pass, your scaffolded app is fully deployed and production-ready.
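The checks above can be scripted. A minimal sketch, assuming curl and an authenticated kubectl context for the cluster (this script is illustrative and not part of the generated repo):

```shell
#!/usr/bin/env sh
# Usage: ./smoke.sh <app-name>
set -eu
APP="${1:?usage: smoke.sh <app-name>}"
BASE="https://${APP}.vibe-coder.biz"

curl -fsS "${BASE}/health"             # expect {"status":"ok"}
curl -fsS "${BASE}/metrics" | head -5  # expect Prometheus-format metrics
kubectl get pods -n "${APP}"           # expect 2 pods in Running state
```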

📦 What Gets Generated

Every scaffolded application creates two repositories with separation of duties:

App Repo (devop212/nodejs-apps/<app-name>)

  • Source Code — Node.js Express app with health endpoint, Dockerfile, package.json
  • CI/CD Pipeline — GitLab CI with lint, test, security scan, build, push, update-gitops stages
  • Container Images — Stored in AWS ECR, tagged with commit SHA
  • Service Catalog Entry — catalog-info.yaml tracked in Backstage
  • Security Scanning — npm audit + Trivy container scanning before every build

GitOps Repo (devop212/gitops/<app-name>)

  • Kubernetes Manifests — Deployment, Service, Ingress, Namespace, ExternalSecret
  • Network Policies — Default-deny + allow ALB ingress, DNS egress, HTTPS egress, Prometheus metrics
  • RBAC — Scoped ServiceAccount, Role (read-only), RoleBinding per app
  • ArgoCD Resources — Application YAML + AppProject (destination restricted to app namespace, sourceRepos scoped to GitOps repo)
  • Observability — ServiceMonitor + Grafana dashboard ConfigMap

Platform-Wide Features

  • HTTPS — Automatic TLS via AWS ACM wildcard certificate
  • DNS — <app-name>.vibe-coder.biz via Route53
  • GitOps Deployment — ArgoCD watches the GitOps repo, auto-syncs on every merge to main
  • Monitoring — Prometheus metrics + Grafana Cloud dashboards (RED metrics)
  • Logging — Promtail ships container logs to Grafana Cloud Loki

Separation of Duties: Developers own the App Repo (source code, CI pipeline). The Platform/SRE team owns the GitOps Repo (K8s manifests, ArgoCD resources, network policies). The CI pipeline in the App Repo triggers a cross-repo commit to update the image tag in the GitOps Repo. A group-level GITOPS_DEPLOY_TOKEN enables automatic authentication for all scaffolded apps.

πŸ—οΈ Architecture Overview

Platform layers

High-level architecture split into portal, delivery, cloud runtime, and hosted observability.

BACKSTAGE PORTAL Web UI Service Catalog Template Engine DELIVERY LAYER GitLab CI pipeline: lint β†’ test β†’ security β†’ build β†’ deploy AWS CLOUD Β· us-east-2 Infrastructure VPC + networking IAM roles Route53 wildcard DNS ACM TLS Runtime + GitOps EKS cluster ECR registry ALB ingress ArgoCD auto-sync Cluster observability Prometheus remote-write Promtail log shipping Metrics + logs exported Grafana Cloud Β· dashboards Β· alerts Β· logs

🚧 Deployment Operations

A dedicated deployment operations spec covers the full journey from code-complete to a running, validated platform. Deployment follows a strict four-phase dependency order:

Deployment phases

  • Phase 1 — Infrastructure: Terraform apply (VPC, EKS, ECR, DNS, ACM, LB controller, observability)
  • Phase 2 — Platform services: ArgoCD install, repo credentials, GitLab CI variables
  • Phase 3 — Application deploy: build and push Backstage, apply manifests, verify secrets and telemetry
  • Phase 4 — Validation + handoff: end-to-end smoke test, README + Jira validation, operational sign-off

Backstage on EKS — Deployment Topology

Cluster topology

Cloud edge services feed into a shared ALB and EKS namespaces with clear operational boundaries:

  • Cloud edge — Route53 (backstage.vibe-coder.biz), ACM wildcard TLS, shared ALB via the AWS Load Balancer Controller
  • EKS cluster backstage-idp-poc, namespace backstage — Ingress → Service :7007 → Deployment; ExternalSecret → backstage-secrets; ECR image pinned by SHA
  • namespace argocd — ArgoCD server and applications; GitOps sync for platform + apps; HTTPS via shared ALB
  • namespace external-secrets — ESO v2.1.0, ClusterSecretStore, AWS Secrets Manager sync
  • namespace monitoring — Prometheus remote-write, Promtail to Loki, Grafana Cloud dashboards
  • namespace kube-system — AWS Load Balancer Controller, Ingress reconciliation, shared ALB group routing

Deployment Artifacts Produced

  • terraform/terraform.tfvars.example — Template with all required Terraform variable assignments
  • scripts/bootstrap-tf-backend.sh — Creates the S3 bucket + DynamoDB lock table for Terraform state
  • k8s/backstage/*.yaml — Deployment, Service, Ingress, and Secret manifests for Backstage on EKS
  • scripts/deploy-backstage.sh — Build, push to ECR, substitute manifests, kubectl apply, verify rollout; accepts --app-repo (required) and --app-branch (default: main) for the multi-repo workflow
  • README.md Deployment Guide — Step-by-step instructions covering all 10 deployment requirements
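A sketch of what bootstrap-tf-backend.sh does (bucket and table names here are illustrative; the real script defines its own):

```shell
# Create a versioned S3 bucket for Terraform state (us-east-2, per this stack)
aws s3api create-bucket \
  --bucket backstage-idp-tf-state \
  --region us-east-2 \
  --create-bucket-configuration LocationConstraint=us-east-2
aws s3api put-bucket-versioning \
  --bucket backstage-idp-tf-state \
  --versioning-configuration Status=Enabled

# DynamoDB table for state locking (LockID is the key name Terraform expects)
aws dynamodb create-table \
  --table-name backstage-idp-tf-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```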

🛠️ Technology Stack

πŸ—οΈ Portal

  • Backstage (CNCF) — Developer portal & service catalog
  • GitHub OAuth — Identity verification mapped to catalog entities
  • Software Templates — Golden path scaffolding
  • ArgoCD & GitLab plugins — Entity page integrations

🔄 CI/CD

  • GitLab CI — 6-stage automated pipeline
  • npm audit — Dependency vulnerability scanning
  • Trivy — Container image scanning

☸️ Orchestration

  • AWS EKS — Managed Kubernetes
  • ArgoCD — GitOps continuous delivery
  • AWS ALB — HTTPS ingress with TLS termination

πŸ—οΈ Infrastructure

  • Terraform — 11 modular IaC modules
  • S3 + DynamoDB — Remote state with locking
  • AWS ACM — Free auto-renewing TLS certificates

📊 Observability

  • Prometheus — Cluster-side metric collection
  • Promtail — Container log shipping to Loki
  • Grafana Cloud — Hosted dashboards & alerting

🔒 Security

  • Calico — NetworkPolicy enforcement (namespace isolation)
  • npm audit — Dependency scanning (all branches)
  • Trivy — Container scanning (CRITICAL, HIGH)
  • RBAC — Per-app ServiceAccount with least-privilege Role

πŸ—οΈ Infrastructure Stack

All infrastructure is provisioned via Terraform — reproducible, version-controlled, and auditable.

Terraform module layout

All 11 modules at a glance:

  • VPC — Private/public subnets, routing, NAT gateway, security groups
  • IAM — EKS cluster role, node role, ECR access, LB controller IRSA
  • EKS — Managed Kubernetes control plane with auto-scaling node group
  • ECR — Container image registry with 30-image retention policy and immutable tags
  • DNS — Route53 zone for vibe-coder.biz + wildcard *.vibe-coder.biz
  • ACM — Free AWS-managed wildcard TLS certificate, auto-renewing
  • EKS Addons — AWS Load Balancer Controller + External Secrets Operator (ESO v2.1.0)
  • Calico — Tigera Operator for Kubernetes NetworkPolicy enforcement (EKS CNI mode)
  • Secrets Manager — Centralized secret storage synced to K8s via ESO ClusterSecretStore
  • Observability — Prometheus (remote-write) + Promtail (log shipping) to Grafana Cloud
  • Backend — S3 bucket + DynamoDB for Terraform remote state and locking

🔧 CI/CD Pipeline

Every scaffolded app includes a 6-stage GitLab CI pipeline with security scanning and cross-repo GitOps updates:

  1. LINT — ESLint code quality
  2. TEST — Jest unit tests
  3. SECURITY — npm audit + Trivy
  4. BUILD — Docker image build
  5. PUSH — Push to AWS ECR
  6. UPDATE-GITOPS — Cross-repo image tag update
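A skeleton of a .gitlab-ci.yml in this shape (job commands are illustrative, not the template's exact definitions; the build, push, and update-gitops jobs follow the same pattern):

```yaml
stages: [lint, test, security, build, push, update-gitops]

lint:
  stage: lint
  image: node:22-alpine
  script:
    - npm install
    - npx eslint .

dependency-audit:
  stage: security
  image: node:22-alpine
  script:
    # Fails the pipeline on high/critical findings; JSON report kept as artifact
    - npm audit --audit-level=high --json > audit-report.json
  artifacts:
    paths: [audit-report.json]

container-scan:
  stage: security
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  allow_failure: true  # POC mode
  script:
    - trivy image --severity CRITICAL,HIGH --exit-code 0 "$IMAGE"
```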

🔍 Security Scanning — Stage 3

  • dependency-audit — npm audit --audit-level=high
  • Runs on all branches (MR + main)
  • Produces JSON report artifact
  • Pipeline halts if high/critical vulnerabilities found

🐳 Container Scanning — Stage 3

  • container-scan — Trivy --severity CRITICAL,HIGH
  • Runs on main branch only
  • Scans built Docker image for OS-level CVEs
  • --exit-code 0 with allow_failure: true (POC mode)

🔄 Cross-Repo GitOps Update — Stage 6

  • Clones GitOps repo (devop212/gitops/<app-name>)
  • Updates image tag in k8s/deployment.yaml
  • Commits with [skip ci] to prevent CI loops
  • Uses group-level GITOPS_DEPLOY_TOKEN for authentication
  • All scaffolded apps inherit token automatically — no per-app setup
Dual-repo workflow: App Repo CI builds and pushes the image, then triggers a cross-repo commit to the GitOps Repo. ArgoCD watches the GitOps Repo and auto-syncs the deployment when the image tag changes.
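The update-gitops step boils down to a tag substitution plus a [skip ci] commit. A minimal sketch of the substitution itself (app name, registry URL, and file layout are illustrative; the clone/push steps are shown as comments so the core logic can run standalone):

```shell
#!/usr/bin/env sh
set -eu
APP="demo-app"     # illustrative app name
NEW_TAG="abc1234"  # commit SHA from the CI build

# In the real job:
# git clone "https://gitlab-ci-token:${GITOPS_DEPLOY_TOKEN}@gitlab.com/devop212/gitops/${APP}.git"
mkdir -p "${APP}/k8s"
printf 'image: 123456789012.dkr.ecr.us-east-2.amazonaws.com/%s:old\n' "${APP}" \
  > "${APP}/k8s/deployment.yaml"

# Update the image tag in k8s/deployment.yaml (everything after the last colon)
sed -i "s|\(image: .*:\).*|\1${NEW_TAG}|" "${APP}/k8s/deployment.yaml"
cat "${APP}/k8s/deployment.yaml"

# In the real job:
# git commit -am "chore: update image tag to ${NEW_TAG} [skip ci]" && git push
```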

🧪 Testing Strategy

The project uses a dual testing approach combining unit tests with property-based tests (PBT) using fast-check. This ensures both specific correctness and universal invariants hold across randomly generated inputs.

🔬 Property-Based Tests (fast-check)

  • K8s manifest correctness — For any valid deployment parameters, manifests produce correct Deployment, Service, Ingress, and Secret
  • tfvars template completeness — For any set of required Terraform variables, the template contains all assignments
  • README guide completeness — For any set of deployment topics, the guide covers all of them
  • ArgoCD sync policy — Scaffolded apps always have auto-sync enabled
  • Pipeline definition — All 6 CI stages present for any app name
  • Scaffolding completeness — All required files generated for any valid input
  • Duplicate name rejection — Existing names always rejected
  • Scaffolding rollback — Failed scaffolds clean up all artifacts
  • Auth provider config consistency — Production has GitHub provider (no guest); development has both
  • Deploy script no heredoc — No EOF/heredoc syntax in any line of the deploy script
  • Missing GitHub credentials — Validator rejects incomplete credential subsets
  • ExternalSecret auto-sync — Uses dataFrom.extract pattern for automatic key sync
  • Backend module registration — GitHub provider present, guest provider absent
  • Repository content partitioning — Every file maps to exactly one of three target repos
  • Minimum 100 iterations per property, tagged with design references
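By way of illustration only (this is not the project's actual suite), a property in this style might look as follows; sanitizeAppName is a hypothetical helper:

```typescript
import fc from "fast-check";

// Hypothetical sanitizer: coerce arbitrary input into a DNS-1123-ish app name
const sanitizeAppName = (raw: string): string =>
  raw
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-")
    .slice(0, 63)
    .replace(/^-+|-+$/g, "");

// Property: sanitizing is idempotent; a second pass never changes the result
fc.assert(
  fc.property(fc.string(), (raw) => {
    const once = sanitizeAppName(raw);
    return sanitizeAppName(once) === once;
  }),
  { numRuns: 100 } // matches the project's minimum of 100 iterations per property
);
```

When a property fails, fast-check shrinks the input to a minimal counterexample, which is the debugging benefit called out in the lessons below.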

📝 Unit Tests

  • K8s manifests — Image substitution, envFrom, ports, probes, ALB annotations, secret keys
  • Deploy script — Missing ACM_CERTIFICATE_ARN exits, ECR login format, image tag substitution
  • README content — Deployment guide section exists, all 10 steps referenced, prerequisites complete
  • Config validator — Missing config keys reported, valid configs pass
  • Scaffolder actions — Duplicate check, rollback behavior, e2e workflow
329 automated tests across 17 test suites — all passing. Property-based tests catch edge cases that hand-written examples miss.

📊 Observability

Full observability powered by Grafana Cloud — no self-hosted Grafana on the cluster.

Observability

Metrics and logs leave the cluster through dedicated collectors and land in a hosted control plane:

  • EKS cluster — Prometheus (RED metrics, ServiceMonitor discovery) remote-writes metrics; Promtail (container logs) ships to Loki
  • Grafana Cloud — dashboards (RED views), Loki LogQL, hosted alert rules

📈 Metrics (Prometheus)

  • prom-client in every scaffolded Node.js app
  • RED metrics: Request rate, Error rate, Duration
  • Remote-write to Grafana Cloud Prometheus
  • ServiceMonitor auto-discovers app endpoints
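A minimal sketch of RED-style instrumentation with prom-client in an Express app (metric names and routes are illustrative, not the template's exact code):

```javascript
const express = require("express");
const client = require("prom-client");

const app = express();
client.collectDefaultMetrics();

// Duration histogram labeled by method/route/status covers rate, errors, duration
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status_code"],
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on("finish", () =>
    end({ method: req.method, route: req.path, status_code: res.statusCode })
  );
  next();
});

app.get("/health", (_req, res) => res.json({ status: "ok" }));

// Scraped via the ServiceMonitor, then remote-written to Grafana Cloud
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);
```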

πŸ“ Logs (Promtail → Loki)

  • DaemonSet collects all container logs
  • Ships to Grafana Cloud Loki
  • Structured JSON logging
  • LogQL for querying and filtering

📊 Dashboards

  • Auto-provisioned per scaffolded app
  • Linked from Backstage entity page
  • Grafana Cloud free tier (10k metrics, 50 GB logs)
  • No Grafana pod on cluster — zero overhead

⚡ Development Approach

This project follows a spec-driven development methodology, using AI tooling to accelerate implementation. Architecture decisions, design reviews, and operational checkpoints remained human-driven throughout.

📋 Spec-Driven Workflow

  • Three formal specs — self-service platform, deployment operations, and separation of duties
  • Requirements → Design → Tasks — each phase reviewed and approved before proceeding
  • 12 formal requirements with acceptance criteria per spec
  • Correctness properties defined before implementation began

🛠️ Tooling & Acceleration

  • AI-assisted code generation for boilerplate-heavy artifacts (Terraform modules, K8s manifests, CI pipelines)
  • Property-based testing with fast-check to validate invariants across random inputs
  • All generated code reviewed, tested, and iterated on before deployment
  • Manual operator checkpoints for infrastructure provisioning, DNS, and secrets management

💰 Cost Estimate & ROI

Monthly AWS Infrastructure Cost

  • EKS Control Plane — 1 cluster (standard support) — $73
  • EC2 Worker Nodes — 3x t3.medium (on-demand) — $91
  • NAT Gateway — 1 AZ + ~10 GB data processing — $33
  • ALB — 1 load balancer + minimal LCUs — $18
  • ECR — ~5 GB stored images — $1
  • Route53 — 1 hosted zone + queries — $1
  • ACM — Wildcard certificate — Free
  • S3 + DynamoDB — Terraform state backend — $1
  • Total AWS — ~$218/mo

Platform Tooling Costs

  • GitLab (gitlab.com Free tier) — $0
  • Grafana Cloud (Free tier — 10k metrics, 50 GB logs) — $0
  • Backstage (open source, CNCF) — $0
  • ArgoCD (open source, CNCF) — $0
  • Trivy (open source) — $0
  • Total Platform Tooling — $0
Total monthly run cost: ~$218/mo (~$2,616/yr). Can be reduced to ~$150/mo with Reserved Instances or Savings Plans on EC2.

Break-Even Calculation

Manual cost per service:  ~1.5 days x $600/day (loaded eng cost) = $900
IDP cost per service:     ~0 (automated, 7 minutes of dev time)
IDP build cost:           ~20 days x $600/day = $12,000 (one-time)
IDP run cost:             ~$218/mo = $2,616/yr

Break-even:               12,000 / 900 ≈ 13.3 → ~14 services
The platform pays for itself after ~14 scaffolded services.
At 5 new services/month, ROI is positive within 3 months.
At 2 new services/month, ROI is positive within 7 months.

🎯 Key Benefits

👩‍💻 For Developers

  • Self-service — No tickets, no waiting — scaffold a new service in minutes
  • Standardized — Every app follows the same structure, CI/CD, and deployment pattern
  • Visibility — See build status, deployment health, and repo links in one place

πŸ—οΈ For Platform Engineering

  • Consistency — Golden path templates enforce best practices across all services
  • Reduced toil — No more manual repo setup, pipeline config, or K8s manifest creation
  • Governance — All services cataloged, tracked, and follow the same security baseline

📊 For Leadership

  • Faster time-to-market — Days → minutes for new service creation
  • Lower operational cost — Automated infrastructure reduces manual effort
  • Scalability — Add new templates (Python, Go, etc.) without changing the platform
  • Audit trail — Every service tracked in the catalog with full lineage

📚 Lessons Learned

💡 Spec-Driven Development Pays Off

  • Writing formal requirements and design docs before code forced clarity on scope and dependencies
  • The deployment ops spec revealed a strict four-phase dependency chain that would have been discovered painfully late without upfront design
  • Correctness properties defined in the design phase became the property-based tests — no guessing what to test
  • Separating code generation from manual operator checkpoints kept the project grounded in reality

⚡ AI Tooling Accelerates Delivery

  • AI-assisted development handled boilerplate-heavy work: K8s manifests, Terraform modules, test scaffolding, CI pipeline definitions
  • Every design decision and architecture choice was human-reviewed and iterated on
  • Operational context (DNS propagation, cluster state, secret rotation) still requires hands-on engineering judgment
  • The biggest time savings came from automating the tedious parts while focusing human effort on architecture and integration

🚧 Infrastructure Has a Long Feedback Loop

  • EKS clusters take 15-20 minutes to provision — fast iteration isn't possible at the infrastructure layer
  • DNS delegation depends on registrar propagation (up to 48 hours) — this is the longest blocking dependency in the entire deployment
  • ACM certificate validation is chained to DNS — you can't parallelize these steps
  • Having the deployment dependency graph documented upfront (Phase 1 → 2 → 3 → 4) prevented wasted time trying to deploy before infrastructure was ready

📦 Monorepo Was the Right Call for a POC

  • Keeping Terraform, Backstage, K8s manifests, deploy scripts, and tests in one repo simplified everything
  • Cross-cutting changes (e.g., adding a new Terraform output that's referenced in a K8s manifest) are atomic commits
  • A single CI/CD variable set at the GitLab group level covers all components
  • For production, you'd likely split into separate repos per concern — but for a POC, the overhead isn't worth it

🔬 Property-Based Testing Catches What Unit Tests Miss

  • Unit tests verify specific examples; property tests verify invariants across random inputs
  • The K8s manifest property test caught an edge case where certain ACM ARN formats broke YAML parsing — a hand-written test wouldn't have found this
  • The tfvars completeness property ensures that adding a new Terraform variable without updating the template is immediately caught
  • fast-check's shrinking feature pinpoints the minimal failing input, making debugging straightforward

🛠️ Operational Readiness Is a Separate Concern

  • The first spec (self-service platform) produced all the code. The second spec (deployment ops) was needed to actually run it.
  • Code-complete is not deploy-ready — there's a significant gap between "tests pass locally" and "running on EKS with DNS, TLS, and secrets configured"
  • Manual checkpoint tasks (Terraform apply, DNS delegation, ArgoCD install) can't be automated away — they require human judgment and access to external systems
  • A deployment runbook (README Deployment Guide) is as important as the code itself for reproducibility

V1 Deployment — Teardown & Rebuild Decision (March 2026)

After successfully provisioning infrastructure (EKS, VPC, ECR, Route53, ACM, ArgoCD, observability stack), the first deployment attempt hit several issues that warranted a clean rebuild rather than incremental fixes:

🐳 Docker ARM vs x86 Mismatch

  • Building on Apple Silicon (M-series Mac) produced ARM64 images by default
  • EKS nodes run x86_64 (t3.medium) — pods crashed with exec format error
  • Even --platform linux/amd64 rebuild didn't resolve due to Docker layer caching
  • V2 fix: Multi-stage, multi-platform Dockerfile using docker buildx with explicit --platform linux/amd64 and --no-cache
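The V2 build invocation, roughly (the image URL is illustrative):

```shell
# Build for the EKS node architecture regardless of the host (e.g. Apple Silicon),
# and bypass stale cached ARM64 layers with --no-cache
docker buildx build \
  --platform linux/amd64 \
  --no-cache \
  -t 123456789012.dkr.ecr.us-east-2.amazonaws.com/backstage:v2 \
  --push .
```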

🔐 Plain K8s Secrets Are Fragile

  • Manually creating K8s secrets via kubectl create secret is error-prone and not GitOps-friendly
  • Secrets aren't version-controlled, can't be audited, and are lost on namespace deletion
  • No rotation mechanism — tokens expire and require manual intervention
  • V2 fix: AWS Secrets Manager + External Secrets Operator (ESO) for automatic sync from a central secret store into K8s
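A sketch of the V2 pattern: an ExternalSecret using dataFrom.extract, so every key in the Secrets Manager entry syncs automatically (names are illustrative):

```yaml
apiVersion: external-secrets.io/v1   # ESO v2.1.0 dropped v1beta1
kind: ExternalSecret
metadata:
  name: backstage-secrets
  namespace: backstage
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager        # ClusterSecretStore name (illustrative)
    kind: ClusterSecretStore
  target:
    name: backstage-secrets          # resulting K8s Secret
  dataFrom:
    - extract:
        key: backstage/production    # illustrative Secrets Manager entry
```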

🌐 Duplicate Route53 Zones

  • Terraform created a new Route53 zone while a pre-existing zone (from the registrar) already existed
  • ACM validation CNAME was placed in the wrong zone — validation hung for 1+ hour
  • Resolution: removed Terraform zone from state, imported pre-existing zone, ACM validated in 13 seconds
  • V2 fix: Import pre-existing zones before first apply, or use data source to reference existing zones
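The import flow, roughly (the resource address is an assumption about the repo's module layout; the zone ID is the one listed under "What Was Preserved"):

```shell
# Stop managing the duplicate zone Terraform created
terraform state rm module.dns.aws_route53_zone.main   # address is illustrative

# Adopt the pre-existing registrar-delegated zone instead
terraform import module.dns.aws_route53_zone.main Z085463018BZ8K52UEDMW
terraform plan   # should now show no zone recreation
```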

🔄 What Was Preserved

  • Route53 hosted zone (Z085463018BZ8K52UEDMW) — registrar NS records already point here
  • ACM wildcard certificate (*.vibe-coder.biz) — already validated and issued
  • All code artifacts, tests, specs, and documentation
  • GitLab repo, CI/CD variables, and Grafana Cloud configuration

📋 V2 Architecture Changes

  • Multi-platform Docker builddocker buildx targeting linux/amd64 explicitly
  • Multi-stage Dockerfile — build stage + slim runtime stage for smaller images
  • AWS Secrets Manager — centralized secret storage with IAM-based access control
  • External Secrets Operator — Kubernetes operator that syncs secrets from AWS into K8s automatically
  • Terraform zone import — pre-existing Route53 zone imported into state, not recreated

⏱️ Time Impact

  • V1 infrastructure provisioning: ~2 hours (including Route53 debugging)
  • V1 ArgoCD + secrets setup: ~30 minutes
  • V1 Docker build + failed deploy: ~45 minutes
  • Teardown decision: immediate — fixing in-place would take longer than rebuilding clean
  • Key insight: When multiple foundational issues compound, a clean rebuild is faster than patching
The key takeaway: Spec-driven development with AI-assisted tooling works best when the engineer provides structure (requirements, design, correctness properties) and leverages automation for velocity. The V1 → V2 rebuild demonstrates that real-world deployment surfaces issues that no amount of local testing can predict — and that's expected. Starting from a solid spec means rebuilds start from a foundation, not from scratch.

V2 Deployment Progress (March 8–10, 2026)

V2 infrastructure was rebuilt from scratch with architectural improvements. All seven phases are complete. ArgoCD installed and configured. Backstage is live at https://backstage.vibe-coder.biz with full catalog, scaffolder, and ArgoCD integration. E2E smoke test passed — scaffolded app hellow-world-node-v03 running at https://hellow-world-node-v03.vibe-coder.biz with full GitLab CI pipeline, ArgoCD GitOps sync, shared ALB, HTTPS via ACM, and Separation of Duties policies active.

🌐 VPC Endpoints — Hybrid Networking

  • Private subnets need access to AWS services (ECR, STS, Secrets Manager) without routing through NAT
  • Added 11 VPC endpoints: S3, ECR API, ECR DKR, STS, Secrets Manager, EKS, EKS-auth, EC2, Logs, ELB
  • Hybrid approach: NAT Gateway for internet access + VPC endpoints for AWS service traffic
  • Result: Faster API calls, lower NAT data processing costs, nodes join cluster reliably

πŸ” CRD Chicken-and-Egg Problem

  • ClusterSecretStore CRD doesn't exist until ESO Helm chart is installed
  • Terraform can't create a ClusterSecretStore resource if the CRD isn't registered yet
  • Removed ClusterSecretStore from Terraform, applied via kubectl apply -f after ESO is running
  • Lesson: CRD-dependent resources should be applied post-Helm, not in the same Terraform run
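The post-Helm apply, sketched (store name and IRSA service-account wiring are illustrative):

```yaml
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-2
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets      # illustrative IRSA-bound service account
            namespace: external-secrets
```

Applied with kubectl apply -f once the ESO Helm release is healthy, so the CRD exists before the resource referencing it.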

⏱️ Helm Race Conditions

  • Observability module (Prometheus + Promtail) tried to deploy before EKS addons were ready
  • kube-prometheus-stack CRDs take time to register — subsequent Helm releases fail if CRDs aren't ready
  • Added depends_on = [module.eks, module.eks_addons] and 900s timeout
  • Lesson: Helm chart ordering matters — explicit dependencies prevent intermittent failures
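In Terraform terms, roughly (module names as used in this stack):

```hcl
module "observability" {
  source = "./modules/observability"

  # Wait for the cluster and its addons before installing Helm releases;
  # kube-prometheus-stack CRDs need the addons to be reconciled first
  depends_on = [module.eks, module.eks_addons]
}
# Inside the module, helm_release resources set timeout = 900
```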

🔄 Terraform State Recovery

  • Helm releases created outside Terraform (manual debugging) weren't tracked in state
  • Subsequent terraform apply tried to recreate existing resources, causing conflicts
  • Used terraform import for Helm releases, K8s namespace, and K8s secret
  • Lesson: Always import manually-created resources back into state before continuing

🔗 Node Join Failures

  • EKS nodes in private subnets couldn't reach the EKS API server endpoint
  • Without VPC endpoints for EKS and EKS-auth, nodes can't authenticate or register
  • Adding EKS + EKS-auth VPC endpoints resolved the issue immediately
  • Lesson: Private-only EKS clusters need VPC endpoints for the control plane

📦 ESO API Version Breaking Change

  • ESO v2.1.0 only serves external-secrets.io/v1 API, dropped v1beta1
  • Both ClusterSecretStore and ExternalSecret manifests referenced v1beta1
  • Updated both to external-secrets.io/v1 — ExternalSecret synced immediately
  • Lesson: Always check CRD API versions after Helm chart upgrades

🔀 Shared ALB via Ingress Group

  • Each scaffolded app created its own ALB — wasteful ($18/mo each) and broke wildcard DNS routing
  • Added alb.ingress.kubernetes.io/group.name: backstage-idp to all ingress files
  • All services (Backstage, ArgoCD, scaffolded apps) now share a single ALB
  • Lesson: ALB Ingress Controller groups are essential for multi-service clusters with wildcard DNS
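The annotation in context (host, namespace, and service names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-app
  namespace: demo-app
  annotations:
    alb.ingress.kubernetes.io/group.name: backstage-idp   # all apps share one ALB
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
spec:
  ingressClassName: alb
  rules:
    - host: demo-app.vibe-coder.biz
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-app
                port:
                  number: 80
```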

🔒 Express 5 + Node 22 Security Hardening

  • Express 4.x had high-severity CVEs in transitive dependencies flagged by npm audit
  • Upgraded scaffold template: Express ^5.1.0, prom-client ^15.1.3, Node 22 Alpine
  • Added apk update && apk upgrade --no-cache to Dockerfile for OS-level patching
  • Lesson: Security scanning in CI is only useful if the template starts clean — fix the source, not the scan

🔧 TypeScript Schema Compatibility

  • Backstage scaffolder actions used Zod-style lambda schemas that aren't compatible with the JSON Schema format expected by @backstage/plugin-scaffolder-node
  • 17 TypeScript errors across 5 files blocked yarn tsc and the deploy script
  • Converted all schemas to JSON Schema format with as const type assertions
  • Lesson: Backstage plugin APIs evolve — verify schema format against the installed version

🌐 Stale DNS Wildcard Record

  • V1 deployment left a *.vibe-coder.biz A record pointing to an old ALB in us-west-1
  • Terraform tried to create the record but it already existed — InvalidChangeBatch error
  • Used terraform import to adopt the existing record, then terraform apply updated the alias to the V2 ALB
  • Lesson: Always check for stale DNS records from previous deployments before applying

⚙️ Config Validator Crash Loop

  • Backstage pod entered CrashLoopBackOff after deploy — config validator rejected missing keys
  • Missing: aws.accountId, aws.ecrRegistryUrl, argocd.baseUrl, argocd.authToken
  • Added required keys to app-config.production.yaml, rebuilt and redeployed
  • Lesson: Custom config validators are great for catching misconfig early, but ensure production configs are complete before first deploy

🔐 Guest Auth 401 in Production — Resolved

  • All API calls returned 401 after deploy — catalog, permissions, notifications all broken
  • Root cause: Guest auth provider needs dangerouslyAllowOutsideDevelopment: true in production mode
  • Without it, /api/auth/guest/refresh returns 403, cascading to 401 on all endpoints
  • Permanently resolved in Phase 8: Guest auth replaced entirely with GitHub OAuth — no more dangerouslyAllowOutsideDevelopment flag needed

📦 ArgoCD CRD Annotation Size Limit

  • kubectl apply failed: applicationsets.argoproj.io CRD annotations exceed 256KB limit
  • Known issue with ArgoCD on K8s < 1.30 — client-side apply stores full manifest in annotations
  • Fix: kubectl apply --server-side --force-conflicts bypasses the annotation size limit
  • Lesson: Always use server-side apply for large CRDs like ArgoCD, Istio, or Crossplane

🔗 ArgoCD Repo URL Needs .git Suffix

  • ArgoCD returned 400 when registering a scaffolded app — repository not found
  • Root cause: publish:gitlab output remoteUrl doesn't include .git suffix
  • ArgoCD's ls-remote validation requires the .git suffix for HTTPS repos
  • Fix: Normalize repoUrl to append .git if missing, plus 3s delay and 3-attempt retry for race conditions
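The normalization itself is small; a sketch (function name is illustrative, and the retry logic mentioned above is omitted):

```typescript
// ArgoCD's ls-remote validation requires the .git suffix for HTTPS repos.
// publish:gitlab's remoteUrl output omits it, so normalize before registering.
export function normalizeRepoUrl(remoteUrl: string): string {
  const trimmed = remoteUrl.replace(/\/+$/, ""); // drop trailing slashes
  return trimmed.endsWith(".git") ? trimmed : `${trimmed}.git`;
}
```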

🔐 Backstage Service-to-Service Auth

  • Catalog API returned 401 when called from a custom scaffolder action via raw fetch()
  • Backstage's new backend system requires service tokens for internal API calls
  • Injected discovery + auth services to get a proper plugin-to-plugin token
  • Lesson: Never use raw HTTP for Backstage internal calls — use the SDK's auth services
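With the new backend system, the pattern looks roughly like this inside a custom action, assuming the auth (AuthService) and discovery (DiscoveryService) core services have been injected (the function is a hypothetical sketch, with loose types for brevity):

```typescript
// Mint a plugin-to-plugin token instead of calling the catalog with raw fetch()
async function fetchCatalogEntities(auth: any, discovery: any) {
  const { token } = await auth.getPluginRequestToken({
    onBehalfOf: await auth.getOwnServiceCredentials(),
    targetPluginId: "catalog",
  });
  const baseUrl = await discovery.getBaseUrl("catalog");
  const res = await fetch(`${baseUrl}/entities`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  return res.json();
}
```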

πŸ“ YAML Colons Break GitLab CI Scripts

  • GitLab CI rejected the pipeline: script config should be a string
  • Lines like sed -i "s|image:.*|image: $TAG|" contain colons that YAML parses as key-value mappings
  • Also hit by git commit -m "chore: update image" — the chore: was a YAML key
  • Fix: Wrap script lines containing colons in single quotes in the YAML template
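The fix in template form (commands taken from the bullets above; job shape is illustrative):

```yaml
update-gitops:
  stage: update-gitops
  script:
    # Unquoted, "image: $TAG" and "chore: ..." parse as YAML key-value mappings.
    # Single quotes keep each line a plain string.
    - 'sed -i "s|image:.*|image: $TAG|" k8s/deployment.yaml'
    - 'git commit -am "chore: update image [skip ci]"'
```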

📦 npm ci Requires a Lockfile

  • GitLab CI lint stage failed: npm ci requires package-lock.json
  • Scaffolded repos only have package.json — no lockfile is generated by the template
  • Same issue in the Dockerfile: RUN npm ci --only=production also failed
  • Fix: Changed to npm install in both CI template and Dockerfile for POC simplicity

🔒 Separation of Duties

Tier 1 namespace isolation enforces least-privilege access and network segmentation across all scaffolded workloads and platform namespaces. Every scaffolded app is isolated by default with explicit allow-rules for required traffic.

Namespace isolation

  • Workload namespace (per app) — NetworkPolicies: default deny, allow ALB ingress to the app port, allow DNS egress, HTTPS egress, and metrics scrape. RBAC: ServiceAccount, read-only Role, RoleBinding. ArgoCD scope: destination = app namespace, source = app repo only (least-privilege sync)
  • Platform namespaces — backstage (ALB ingress, DNS + HTTPS egress); argocd (namespace-local traffic, API server + HTTPS egress); monitoring (metrics scraping, Grafana HTTPS egress); kube-system (DNS service target, system traffic boundary)

🛡️ Network Isolation

  • Calico (Tigera Operator) installed via Terraform for NetworkPolicy enforcement
  • Default-deny on every namespace — no traffic flows unless explicitly allowed
  • Namespace labels (idp-role) drive policy selectors for cross-namespace rules
  • Service CIDR 172.20.0.0/16 included in platform egress rules for ClusterIP access
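A minimal sketch of the default-deny baseline plus one explicit allow rule, assuming kube-system carries an idp-role label (the namespace name and label value shown are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: my-app              # scaffolded app namespace (placeholder)
spec:
  podSelector: {}                # selects every pod in the namespace
  policyTypes: [Ingress, Egress] # no rules listed => deny all traffic
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-app
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              idp-role: system   # assumed label value on kube-system
      ports:
        - protocol: UDP
          port: 53
```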

πŸ” RBAC & Least Privilege

  • Per-app ServiceAccount with read-only Role (no secrets, no write verbs)
  • platform-admin ClusterRole for security resource management
  • developer-readonly ClusterRole for safe cluster visibility
  • ArgoCD AppProject scopes each app to its own namespace and repo
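The ArgoCD scoping described above can be sketched as an AppProject (app and group names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: my-app
  namespace: argocd
spec:
  # Sync is only allowed from this app's repo...
  sourceRepos:
    - https://gitlab.com/devop212/my-group/my-app.git
  # ...and only into this app's namespace on the local cluster
  destinations:
    - namespace: my-app
      server: https://kubernetes.default.svc
```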

πŸ”‘ CI/CD Credential Scoping

  • Project-scoped AWS credentials (APP_AWS_ACCESS_KEY_ID) instead of shared group-level keys
  • Per-app ECR IAM policy template restricting push/pull to app-specific repository
  • Blast radius limited: compromised credentials only affect one app's ECR repo

πŸ“ˆ Production Path (Tier 2/3)

  • Tier 1 (current): Namespace isolation within single cluster
  • Tier 2: Multi-cluster with dedicated workload clusters
  • Tier 3: Multi-account with AWS Organizations and separate blast radius
  • Architecture documentation covers migration paths between tiers
14 property-based tests validate SoD correctness properties across randomly generated inputs — ensuring NetworkPolicies, RBAC, and ArgoCD scoping are structurally correct for any valid app name.

πŸ” GitHub Authentication

Phase 8 replaced the insecure guest auth provider with GitHub OAuth, eliminating unauthenticated access to the production portal. Guest auth with dangerouslyAllowOutsideDevelopment is a known security anti-pattern — GitHub OAuth provides identity verification, audit trail, and maps authenticated users to Backstage catalog entities.

Auth flow

Secrets sync and GitHub OAuth are shown as two connected control paths rather than a terminal diagram.

Secrets delivery: AWS Secrets Manager (OAuth + platform tokens) → External Secrets (dataFrom.extract sync) → Backstage pod (env vars at runtime)

User sign-in journey: user browser → Backstage sign-in (redirect + callback handler) → GitHub OAuth → session

βš™οΈ Configuration Changes

  • Backend: plugin-auth-backend-module-guest-provider replaced with plugin-auth-backend-module-github-provider in packages/backend/src/index.ts
  • Frontend: providers={['guest']} changed to providers={['github']} in packages/app/src/App.tsx
  • Production config: auth.providers.guest section removed; auth.providers.github.production added with ${GITHUB_CLIENT_ID} and ${GITHUB_CLIENT_SECRET}
  • Dev config: Retains guest: {} for local development convenience alongside the new GitHub provider
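The resulting production auth block, sketched per the standard Backstage GitHub provider config shape (the resolver matches the one used for catalog mapping):

```yaml
auth:
  environment: production
  providers:
    github:
      production:
        clientId: ${GITHUB_CLIENT_ID}
        clientSecret: ${GITHUB_CLIENT_SECRET}
        signIn:
          resolvers:
            # Maps the GitHub profile email to a catalog User entity
            - resolver: emailMatchingUserEntityProfileEmail
```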

πŸ”‘ Secrets Pipeline

  • GitHub OAuth App registered with callback URL https://backstage.vibe-coder.biz/api/auth/github/handler/frame
  • Credentials stored in AWS Secrets Manager at backstage-idp-poc/backstage-secrets
  • ExternalSecret auto-syncs via existing dataFrom.extract pattern — no manifest changes needed
  • Config validator checks for GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET at startup
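A sketch of the ExternalSecret using the dataFrom.extract pattern (the store name and refresh interval are assumptions):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: backstage-secrets
  namespace: backstage
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager    # assumed store name
  target:
    name: backstage-secrets      # K8s Secret consumed as pod env vars
  dataFrom:
    # Pulls every key/value pair from the Secrets Manager entry, so new
    # keys like GITHUB_CLIENT_ID sync without any manifest changes
    - extract:
        key: backstage-idp-poc/backstage-secrets
```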

πŸ›‘οΈ Why This Matters

  • Eliminates unauthenticated access to the production portal
  • Provides identity verification and audit trail for all portal actions
  • Maps GitHub users to Backstage catalog entities via emailMatchingUserEntityProfileEmail resolver
  • Removes the dangerouslyAllowOutsideDevelopment: true security anti-pattern from production
Terraform update: Secrets Manager module secret_keys updated to include GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET alongside existing credentials.

πŸ”€ Repository Split

The monorepo is being split into three purpose-specific repositories, driven by differing change cadences: app code changes frequently, Terraform rarely, and platform ops at its own pace. Independent CI/CD, versioning, and access control per concern reduce the blast radius of any single change.

Repository split

The new layout emphasizes three concerns with distinct change cadence and ownership.

Current monorepo (backstage · terraform · k8s · scripts · tests · docs) splits into:

  • App repo: packages / plugins / templates; Backstage config + Dockerfile; independent build lifecycle
  • IaC repo: Terraform modules + root config; backend bootstrap + tfvars example; infrastructure-only workflows
  • Platform ops repo: k8s / deploy scripts / docs / tests; runbooks, policies, migration scripts; operational ownership boundary

πŸ“¦ App Repository

  • Backstage source code: packages/, plugins/, templates/, examples/
  • Configuration: app-config.yaml, app-config.production.yaml, app-config.local.yaml
  • Build artifacts: package.json, yarn.lock, .yarnrc.yml, tsconfig.json, Dockerfile
  • Independently buildable: yarn install && yarn build:backend

πŸ—οΈ IaC Repository

  • Terraform modules: VPC, IAM, EKS, ECR, DNS, ACM, Addons, Calico, Secrets Manager, Observability, Backend
  • Root configs: main.tf, variables.tf, outputs.tf, providers.tf, backend.tf
  • Includes terraform.tfvars.example and scripts/bootstrap-tf-backend.sh
  • Independently executable: terraform init && terraform plan

βš™οΈ Platform Ops Repository

  • K8s manifests: k8s/backstage/ (deployment, service, ingress, external-secret, cluster-secret-store)
  • Deploy scripts: deploy-backstage.sh (updated with --app-repo parameter)
  • Tests, docs, specs, and project documentation
  • No Backstage source code or Terraform after split

Migration Scripts

  • scripts/split-app-repo.sh: extracts backstage/ contents into a standalone App Repository
  • scripts/split-iac-repo.sh: extracts terraform/ contents + bootstrap script into a standalone IaC Repository
  • scripts/split-platform-repo.sh: documents which directories to remove and retain for the Platform Ops Repository
Deploy script updated: The --app-repo (required) and --app-branch (default: main) parameters enable the deploy script to clone the App Repository for Docker builds after the monorepo split. Error handling covers clone failures and missing branches.

🌐 Domain & Access

  • Backstage Portal: https://backstage.vibe-coder.biz
  • ArgoCD Dashboard: https://argocd.vibe-coder.biz
  • Scaffolded Apps: https://<app-name>.vibe-coder.biz
  • Grafana Cloud: https://bedigital.grafana.net
  • GitLab Repos: https://gitlab.com/devop212/<group>/<app-name>
  • AWS Region: us-east-2 (Ohio)

πŸš€ What's Next

This POC demonstrates the core self-service workflow. Future iterations could include:

πŸ“¦ Additional Templates

  • Python, Go, Java — same self-service experience

πŸ’° Cost Tagging

  • Automatic AWS cost allocation tags per scaffolded service

πŸ”€ Repository Split Execution

  • Execute the three-repo migration using split-app-repo.sh, split-iac-repo.sh, split-platform-repo.sh

πŸ”” Alerting Rules

  • Auto-provisioned Grafana Cloud alert rules per service

🌍 Multi-Environment

  • Dev → Staging → Production promotion workflows

πŸ›‘οΈ SAST/DAST

  • Static and dynamic application security testing in the pipeline

πŸ“‹ Project Tracking

This project is tracked in Jira under the BDB project with four epics:

Epic: Backstage Self-Service Developer Platform

  • 25 implementation tasks across Terraform, Backstage, CI/CD, and Kubernetes
  • Covers the self-service platform spec (Requirements 1-12)
  • All tasks labeled with backstage-idp

Epic: IDP Deployment Operations

  • 13 tasks covering infrastructure provisioning through end-to-end validation
  • Includes manual checkpoint tasks for operator-driven steps
  • Story points: infrastructure (5), DNS/ACM (3), ArgoCD (3), GitLab vars (2), Backstage deploy (5), Grafana (2), smoke test (3), README (3)

Epic: IDP Separation of Duties

  • 13 tasks covering Calico install, namespace labels, NetworkPolicies, RBAC, ArgoCD scoping, CI/CD credential scoping
  • 14 property-based tests + unit tests validating correctness properties
  • Includes platform policies, ClusterRoles, and production-path architecture docs

Epic: GitHub Auth & Repository Split

  • 12 tasks covering GitHub OAuth configuration, backend/frontend auth changes, deploy script updates, and three-repo migration scripts
  • 6 new property-based tests: auth config consistency, no heredoc syntax, missing credentials detection, ExternalSecret auto-sync, backend module registration, repository content partitioning
  • Eliminates insecure guest auth and prepares for independent repo CI/CD

🀝 Let's Connect

Interested in discussing this project or exploring how these patterns can be applied to your infrastructure?

Brian Uckert

Cloud Infrastructure Architect

Be-Digital.biz

brian.uckert@be-digital.biz