πŸ—οΈ Backstage IDP — Self-Service Developer Platform

Proof of Concept — Internal Developer Platform on AWS EKS

Enable developers to scaffold production-ready Node.js applications in minutes — not days — through a single self-service portal, with automated CI/CD, infrastructure, observability, and deployment built in.

50 Jira Tasks
11 Terraform Modules
6 Pipeline Stages
~$218 Monthly AWS Cost
329 Automated Tests
219 Files in Monorepo

📋 The Problem

Today, spinning up a new service requires coordination across multiple teams and tools — request a Git repo, configure CI/CD pipelines manually, set up Kubernetes manifests, request DNS entries and TLS certificates, wait for the platform team to wire up deployments.

This process takes days to weeks and creates bottlenecks that slow down delivery.

✅ The Solution

A Backstage-powered Internal Developer Platform that gives developers a single "Create" button to go from idea to running application on AWS EKS — fully automated, fully standardized.

👩‍💻 Developer fills out a form; everything else is automated.

🚀 How It Works

When a developer clicks "Create" in Backstage, the entire workflow is automated end-to-end:

Workflow

From scaffolding form to GitOps deployment:

  1. 👩‍💻 Developer chooses a template and fills out the Backstage form (name, owner, service data)
  2. The scaffolder generates the repo + code, a catalog entry, and GitOps metadata
  3. GitLab CI runs lint, test, security, and build + push, then updates the GitOps repo
  4. ArgoCD detects the change and deploys to an EKS namespace (Ingress, Service, Deployment, secrets, metrics)

Time from form submission to running application: ~5 minutes

🧪 Testing Your Scaffolded App

After scaffolding a new app, here's how to verify everything is working. Replace <app-name> with the name you chose during scaffolding.

  • App URL — https://<app-name>.vibe-coder.biz → Default Express response
  • Health Check — curl https://<app-name>.vibe-coder.biz/health → {"status":"ok"}
  • Metrics — curl https://<app-name>.vibe-coder.biz/metrics → Prometheus-format metrics
  • Pods Running — kubectl get pods -n <app-name> → 2 pods in Running state
  • ArgoCD — visit argocd.vibe-coder.biz → Synced + Healthy on the app tile
  • GitLab CI — gitlab.com/devop212/nodejs-apps/<app-name>/-/pipelines → all stages passed (lint, test, security, build, deploy)
  • Backstage Catalog — visit backstage.vibe-coder.biz Catalog → app listed as Component with repo, pipeline, and deployment links

If all seven checks pass, your scaffolded app is fully deployed and production-ready.
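The checks above can be scripted. A minimal sketch, assuming curl and an authenticated kubectl context for the cluster (this script is illustrative and not part of the generated repo):

```shell
#!/usr/bin/env sh
# Usage: ./smoke.sh <app-name>
set -eu
APP="${1:?usage: smoke.sh <app-name>}"
BASE="https://${APP}.vibe-coder.biz"

curl -fsS "${BASE}/health"             # expect {"status":"ok"}
curl -fsS "${BASE}/metrics" | head -5  # expect Prometheus-format metrics
kubectl get pods -n "${APP}"           # expect 2 pods in Running state
```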

📦 What Gets Generated

Every scaffolded application creates two repositories with separation of duties:

App Repo (devop212/nodejs-apps/<app-name>)

  • Source Code — Node.js Express app with health endpoint, Dockerfile, package.json
  • CI/CD Pipeline — GitLab CI with lint, test, security scan, build, push, update-gitops stages
  • Container Images — Stored in AWS ECR, tagged with commit SHA
  • Service Catalog Entry — catalog-info.yaml tracked in Backstage
  • Security Scanning — npm audit + Trivy container scanning before every build

GitOps Repo (devop212/gitops/<app-name>)

  • Kubernetes Manifests — Deployment, Service, Ingress, Namespace, ExternalSecret
  • Network Policies — Default-deny + allow ALB ingress, DNS egress, HTTPS egress, Prometheus metrics
  • RBAC — Scoped ServiceAccount, Role (read-only), RoleBinding per app
  • ArgoCD Resources — Application YAML + AppProject (destination restricted to app namespace, sourceRepos scoped to GitOps repo)
  • Observability — ServiceMonitor + Grafana dashboard ConfigMap

Platform-Wide Features

  • HTTPS — Automatic TLS via AWS ACM wildcard certificate
  • DNS — <app-name>.vibe-coder.biz via Route53
  • GitOps Deployment — ArgoCD watches the GitOps repo, auto-syncs on every merge to main
  • Monitoring — Prometheus metrics + Grafana Cloud dashboards (RED metrics)
  • Logging — Promtail ships container logs to Grafana Cloud Loki

Separation of Duties: Developers own the App Repo (source code, CI pipeline). The Platform/SRE team owns the GitOps Repo (K8s manifests, ArgoCD resources, network policies). The CI pipeline in the App Repo triggers a cross-repo commit to update the image tag in the GitOps Repo. A group-level GITOPS_DEPLOY_TOKEN enables automatic authentication for all scaffolded apps.

πŸ—οΈ Architecture Overview

Platform layers

High-level architecture split into portal, delivery, cloud runtime, and hosted observability.

BACKSTAGE PORTAL Web UI Service Catalog Template Engine DELIVERY LAYER GitLab CI pipeline: lint β†’ test β†’ security β†’ build β†’ deploy AWS CLOUD Β· us-east-2 Infrastructure VPC + networking IAM roles Route53 wildcard DNS ACM TLS Runtime + GitOps EKS cluster ECR registry ALB ingress ArgoCD auto-sync Cluster observability Prometheus remote-write Promtail log shipping Metrics + logs exported Grafana Cloud Β· dashboards Β· alerts Β· logs

🚧 Deployment Operations

A dedicated deployment operations spec covers the full journey from code-complete to a running, validated platform. Deployment follows a strict four-phase dependency order:

Deployment phases

  • Phase 1 — Infrastructure: Terraform apply (VPC, EKS, ECR, DNS, ACM, LB controller, observability)
  • Phase 2 — Platform services: ArgoCD install, repo credentials, GitLab CI variables
  • Phase 3 — Application deploy: build and push Backstage, apply manifests, verify secrets and telemetry
  • Phase 4 — Validation + handoff: end-to-end smoke test, README + Jira validation, operational sign-off

Backstage on EKS — Deployment Topology

Cluster topology

Cloud edge services feed into a shared ALB and EKS namespaces with clear operational boundaries:

  • Cloud edge — Route53 (backstage.vibe-coder.biz), ACM wildcard TLS, shared ALB via the AWS Load Balancer Controller
  • EKS cluster backstage-idp-poc, namespace backstage — Ingress → Service :7007 → Deployment; ExternalSecret → backstage-secrets; ECR image pinned by SHA
  • namespace argocd — ArgoCD server and applications; GitOps sync for platform + apps; HTTPS via shared ALB
  • namespace external-secrets — ESO v2.1.0, ClusterSecretStore, AWS Secrets Manager sync
  • namespace monitoring — Prometheus remote-write, Promtail to Loki, Grafana Cloud dashboards
  • namespace kube-system — AWS Load Balancer Controller, Ingress reconciliation, shared ALB group routing

Deployment Artifacts Produced

  • terraform/terraform.tfvars.example — Template with all required Terraform variable assignments
  • scripts/bootstrap-tf-backend.sh — Creates the S3 bucket + DynamoDB lock table for Terraform state
  • k8s/backstage/*.yaml — Deployment, Service, Ingress, and Secret manifests for Backstage on EKS
  • scripts/deploy-backstage.sh — Build, push to ECR, substitute manifests, kubectl apply, verify rollout; accepts --app-repo (required) and --app-branch (default: main) for the multi-repo workflow
  • README.md Deployment Guide — Step-by-step instructions covering all 10 deployment requirements
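A sketch of what bootstrap-tf-backend.sh does (bucket and table names here are illustrative; the real script defines its own):

```shell
# Create a versioned S3 bucket for Terraform state (us-east-2, per this stack)
aws s3api create-bucket \
  --bucket backstage-idp-tf-state \
  --region us-east-2 \
  --create-bucket-configuration LocationConstraint=us-east-2
aws s3api put-bucket-versioning \
  --bucket backstage-idp-tf-state \
  --versioning-configuration Status=Enabled

# DynamoDB table for state locking (LockID is the key name Terraform expects)
aws dynamodb create-table \
  --table-name backstage-idp-tf-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```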

🛠️ Technology Stack

πŸ—οΈ Portal

  • Backstage (CNCF) — Developer portal & service catalog
  • GitHub OAuth — Identity verification mapped to catalog entities
  • Software Templates — Golden path scaffolding
  • ArgoCD & GitLab plugins — Entity page integrations

🔄 CI/CD

  • GitLab CI — 6-stage automated pipeline
  • npm audit — Dependency vulnerability scanning
  • Trivy — Container image scanning

☸️ Orchestration

  • AWS EKS — Managed Kubernetes
  • ArgoCD — GitOps continuous delivery
  • AWS ALB — HTTPS ingress with TLS termination

πŸ—οΈ Infrastructure

  • Terraform — 11 modular IaC modules
  • S3 + DynamoDB — Remote state with locking
  • AWS ACM — Free auto-renewing TLS certificates

📊 Observability

  • Prometheus — Cluster-side metric collection
  • Promtail — Container log shipping to Loki
  • Grafana Cloud — Hosted dashboards & alerting

🔒 Security

  • Calico — NetworkPolicy enforcement (namespace isolation)
  • npm audit — Dependency scanning (all branches)
  • Trivy — Container scanning (CRITICAL, HIGH)
  • RBAC — Per-app ServiceAccount with least-privilege Role

πŸ—οΈ Infrastructure Stack

All infrastructure is provisioned via Terraform — reproducible, version-controlled, and auditable.

Terraform module layout

All 11 modules at a glance:

  • VPC — Private/public subnets, routing, NAT gateway, security groups
  • IAM — EKS cluster role, node role, ECR access, LB controller IRSA
  • EKS — Managed Kubernetes control plane with auto-scaling node group
  • ECR — Container image registry with 30-image retention policy and immutable tags
  • DNS — Route53 zone for vibe-coder.biz + wildcard *.vibe-coder.biz
  • ACM — Free AWS-managed wildcard TLS certificate, auto-renewing
  • EKS Addons — AWS Load Balancer Controller + External Secrets Operator (ESO v2.1.0)
  • Calico — Tigera Operator for Kubernetes NetworkPolicy enforcement (EKS CNI mode)
  • Secrets Manager — Centralized secret storage synced to K8s via ESO ClusterSecretStore
  • Observability — Prometheus (remote-write) + Promtail (log shipping) to Grafana Cloud
  • Backend — S3 bucket + DynamoDB for Terraform remote state and locking

🔧 CI/CD Pipeline

Every scaffolded app includes a 6-stage GitLab CI pipeline with security scanning and cross-repo GitOps updates:

  1. LINT — ESLint code quality
  2. TEST — Jest unit tests
  3. SECURITY — npm audit + Trivy
  4. BUILD — Docker image build
  5. PUSH — Push to AWS ECR
  6. UPDATE-GITOPS — Cross-repo image tag update
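A skeleton of a .gitlab-ci.yml in this shape (job commands are illustrative, not the template's exact definitions; the build, push, and update-gitops jobs follow the same pattern):

```yaml
stages: [lint, test, security, build, push, update-gitops]

lint:
  stage: lint
  image: node:22-alpine
  script:
    - npm install
    - npx eslint .

dependency-audit:
  stage: security
  image: node:22-alpine
  script:
    # Fails the pipeline on high/critical findings; JSON report kept as artifact
    - npm audit --audit-level=high --json > audit-report.json
  artifacts:
    paths: [audit-report.json]

container-scan:
  stage: security
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  allow_failure: true  # POC mode
  script:
    - trivy image --severity CRITICAL,HIGH --exit-code 0 "$IMAGE"
```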

🔍 Security Scanning — Stage 3

  • dependency-audit — npm audit --audit-level=high
  • Runs on all branches (MR + main)
  • Produces JSON report artifact
  • Pipeline halts if high/critical vulnerabilities found

🐳 Container Scanning — Stage 3

  • container-scan — Trivy --severity CRITICAL,HIGH
  • Runs on main branch only
  • Scans built Docker image for OS-level CVEs
  • --exit-code 0 with allow_failure: true (POC mode)

🔄 Cross-Repo GitOps Update — Stage 6

  • Clones GitOps repo (devop212/gitops/<app-name>)
  • Updates image tag in k8s/deployment.yaml
  • Commits with [skip ci] to prevent CI loops
  • Uses group-level GITOPS_DEPLOY_TOKEN for authentication
  • All scaffolded apps inherit token automatically — no per-app setup
Dual-repo workflow: App Repo CI builds and pushes the image, then triggers a cross-repo commit to the GitOps Repo. ArgoCD watches the GitOps Repo and auto-syncs the deployment when the image tag changes.
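The update-gitops step boils down to a tag substitution plus a [skip ci] commit. A minimal sketch of the substitution itself (app name, registry URL, and file layout are illustrative; the clone/push steps are shown as comments so the core logic can run standalone):

```shell
#!/usr/bin/env sh
set -eu
APP="demo-app"     # illustrative app name
NEW_TAG="abc1234"  # commit SHA from the CI build

# In the real job:
# git clone "https://gitlab-ci-token:${GITOPS_DEPLOY_TOKEN}@gitlab.com/devop212/gitops/${APP}.git"
mkdir -p "${APP}/k8s"
printf 'image: 123456789012.dkr.ecr.us-east-2.amazonaws.com/%s:old\n' "${APP}" \
  > "${APP}/k8s/deployment.yaml"

# Update the image tag in k8s/deployment.yaml (everything after the last colon)
sed -i "s|\(image: .*:\).*|\1${NEW_TAG}|" "${APP}/k8s/deployment.yaml"
cat "${APP}/k8s/deployment.yaml"

# In the real job:
# git commit -am "chore: update image tag to ${NEW_TAG} [skip ci]" && git push
```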

🧪 Testing Strategy

The project uses a dual testing approach combining unit tests with property-based tests (PBT) using fast-check. This ensures both specific correctness and universal invariants hold across randomly generated inputs.

🔬 Property-Based Tests (fast-check)

  • K8s manifest correctness — For any valid deployment parameters, manifests produce correct Deployment, Service, Ingress, and Secret
  • tfvars template completeness — For any set of required Terraform variables, the template contains all assignments
  • README guide completeness — For any set of deployment topics, the guide covers all of them
  • ArgoCD sync policy — Scaffolded apps always have auto-sync enabled
  • Pipeline definition — All 6 CI stages present for any app name
  • Scaffolding completeness — All required files generated for any valid input
  • Duplicate name rejection — Existing names always rejected
  • Scaffolding rollback — Failed scaffolds clean up all artifacts
  • Auth provider config consistency — Production has GitHub provider (no guest); development has both
  • Deploy script no heredoc — No EOF/heredoc syntax in any line of the deploy script
  • Missing GitHub credentials — Validator rejects incomplete credential subsets
  • ExternalSecret auto-sync — Uses dataFrom.extract pattern for automatic key sync
  • Backend module registration — GitHub provider present, guest provider absent
  • Repository content partitioning — Every file maps to exactly one of three target repos
  • Minimum 100 iterations per property, tagged with design references
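By way of illustration only (this is not the project's actual suite), a property in this style might look as follows; sanitizeAppName is a hypothetical helper:

```typescript
import fc from "fast-check";

// Hypothetical sanitizer: coerce arbitrary input into a DNS-1123-ish app name
const sanitizeAppName = (raw: string): string =>
  raw
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-")
    .slice(0, 63)
    .replace(/^-+|-+$/g, "");

// Property: sanitizing is idempotent; a second pass never changes the result
fc.assert(
  fc.property(fc.string(), (raw) => {
    const once = sanitizeAppName(raw);
    return sanitizeAppName(once) === once;
  }),
  { numRuns: 100 } // matches the project's minimum of 100 iterations per property
);
```

When a property fails, fast-check shrinks the input to a minimal counterexample, which is the debugging benefit called out in the lessons below.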

📝 Unit Tests

  • K8s manifests — Image substitution, envFrom, ports, probes, ALB annotations, secret keys
  • Deploy script — Missing ACM_CERTIFICATE_ARN exits, ECR login format, image tag substitution
  • README content — Deployment guide section exists, all 10 steps referenced, prerequisites complete
  • Config validator — Missing config keys reported, valid configs pass
  • Scaffolder actions — Duplicate check, rollback behavior, e2e workflow
329 automated tests across 17 test suites — all passing. Property-based tests catch edge cases that hand-written examples miss.

📊 Observability

Full observability powered by Grafana Cloud — no self-hosted Grafana on the cluster.

Observability

Metrics and logs leave the cluster through dedicated collectors and land in a hosted control plane:

  • EKS cluster — Prometheus (RED metrics, ServiceMonitor discovery) remote-writes metrics; Promtail (container logs) ships to Loki
  • Grafana Cloud — dashboards (RED views), Loki LogQL, hosted alert rules

📈 Metrics (Prometheus)

  • prom-client in every scaffolded Node.js app
  • RED metrics: Request rate, Error rate, Duration
  • Remote-write to Grafana Cloud Prometheus
  • ServiceMonitor auto-discovers app endpoints
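A minimal sketch of RED-style instrumentation with prom-client in an Express app (metric names and routes are illustrative, not the template's exact code):

```javascript
const express = require("express");
const client = require("prom-client");

const app = express();
client.collectDefaultMetrics();

// Duration histogram labeled by method/route/status covers rate, errors, duration
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status_code"],
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on("finish", () =>
    end({ method: req.method, route: req.path, status_code: res.statusCode })
  );
  next();
});

app.get("/health", (_req, res) => res.json({ status: "ok" }));

// Scraped via the ServiceMonitor, then remote-written to Grafana Cloud
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);
```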

πŸ“ Logs (Promtail → Loki)

  • DaemonSet collects all container logs
  • Ships to Grafana Cloud Loki
  • Structured JSON logging
  • LogQL for querying and filtering

📊 Dashboards

  • Auto-provisioned per scaffolded app
  • Linked from Backstage entity page
  • Grafana Cloud free tier (10k metrics, 50 GB logs)
  • No Grafana pod on cluster — zero overhead

⚡ Development Approach

This project follows a spec-driven development methodology, using AI tooling to accelerate implementation. Architecture decisions, design reviews, and operational checkpoints remained human-driven throughout.

📋 Spec-Driven Workflow

  • Three formal specs — self-service platform, deployment operations, and separation of duties
  • Requirements → Design → Tasks — each phase reviewed and approved before proceeding
  • 12 formal requirements with acceptance criteria per spec
  • Correctness properties defined before implementation began

🛠️ Tooling & Acceleration

  • AI-assisted code generation for boilerplate-heavy artifacts (Terraform modules, K8s manifests, CI pipelines)
  • Property-based testing with fast-check to validate invariants across random inputs
  • All generated code reviewed, tested, and iterated on before deployment
  • Manual operator checkpoints for infrastructure provisioning, DNS, and secrets management

💰 Cost Estimate & ROI

Monthly AWS Infrastructure Cost

  • EKS Control Plane — 1 cluster (standard support) — $73
  • EC2 Worker Nodes — 3x t3.medium (on-demand) — $91
  • NAT Gateway — 1 AZ + ~10 GB data processing — $33
  • ALB — 1 load balancer + minimal LCUs — $18
  • ECR — ~5 GB stored images — $1
  • Route53 — 1 hosted zone + queries — $1
  • ACM — Wildcard certificate — Free
  • S3 + DynamoDB — Terraform state backend — $1
  • Total AWS — ~$218/mo

Platform Tooling Costs

  • GitLab (gitlab.com Free tier) — $0
  • Grafana Cloud (Free tier — 10k metrics, 50 GB logs) — $0
  • Backstage (open source, CNCF) — $0
  • ArgoCD (open source, CNCF) — $0
  • Trivy (open source) — $0
  • Total Platform Tooling — $0
Total monthly run cost: ~$218/mo (~$2,616/yr). Can be reduced to ~$150/mo with Reserved Instances or Savings Plans on EC2.

Break-Even Calculation

Manual cost per service:  ~1.5 days x $600/day (loaded eng cost) = $900
IDP cost per service:     ~0 (automated, 7 minutes of dev time)
IDP build cost:           ~20 days x $600/day = $12,000 (one-time)
IDP run cost:             ~$218/mo = $2,616/yr

Break-even:               12,000 / 900 ≈ 13.3 → ~14 services
The platform pays for itself after ~14 scaffolded services.
At 5 new services/month, ROI is positive within 3 months.
At 2 new services/month, ROI is positive within 7 months.

🎯 Key Benefits

👩‍💻 For Developers

  • Self-service — No tickets, no waiting — scaffold a new service in minutes
  • Standardized — Every app follows the same structure, CI/CD, and deployment pattern
  • Visibility — See build status, deployment health, and repo links in one place

πŸ—οΈ For Platform Engineering

  • Consistency — Golden path templates enforce best practices across all services
  • Reduced toil — No more manual repo setup, pipeline config, or K8s manifest creation
  • Governance — All services cataloged, tracked, and follow the same security baseline

📊 For Leadership

  • Faster time-to-market — Days → minutes for new service creation
  • Lower operational cost — Automated infrastructure reduces manual effort
  • Scalability — Add new templates (Python, Go, etc.) without changing the platform
  • Audit trail — Every service tracked in the catalog with full lineage

📚 Lessons Learned

💡 Spec-Driven Development Pays Off

  • Writing formal requirements and design docs before code forced clarity on scope and dependencies
  • The deployment ops spec revealed a strict four-phase dependency chain that would have been discovered painfully late without upfront design
  • Correctness properties defined in the design phase became the property-based tests — no guessing what to test
  • Separating code generation from manual operator checkpoints kept the project grounded in reality

⚡ AI Tooling Accelerates Delivery

  • AI-assisted development handled boilerplate-heavy work: K8s manifests, Terraform modules, test scaffolding, CI pipeline definitions
  • Every design decision and architecture choice was human-reviewed and iterated on
  • Operational context (DNS propagation, cluster state, secret rotation) still requires hands-on engineering judgment
  • The biggest time savings came from automating the tedious parts while focusing human effort on architecture and integration

🚧 Infrastructure Has a Long Feedback Loop

  • EKS clusters take 15-20 minutes to provision — fast iteration isn't possible at the infrastructure layer
  • DNS delegation depends on registrar propagation (up to 48 hours) — this is the longest blocking dependency in the entire deployment
  • ACM certificate validation is chained to DNS — you can't parallelize these steps
  • Having the deployment dependency graph documented upfront (Phase 1 → 2 → 3 → 4) prevented wasted time trying to deploy before infrastructure was ready

📦 Monorepo Was the Right Call for a POC

  • Keeping Terraform, Backstage, K8s manifests, deploy scripts, and tests in one repo simplified everything
  • Cross-cutting changes (e.g., adding a new Terraform output that's referenced in a K8s manifest) are atomic commits
  • A single CI/CD variable set at the GitLab group level covers all components
  • For production, you'd likely split into separate repos per concern — but for a POC, the overhead isn't worth it

🔬 Property-Based Testing Catches What Unit Tests Miss

  • Unit tests verify specific examples; property tests verify invariants across random inputs
  • The K8s manifest property test caught an edge case where certain ACM ARN formats broke YAML parsing — a hand-written test wouldn't have found this
  • The tfvars completeness property ensures that adding a new Terraform variable without updating the template is immediately caught
  • fast-check's shrinking feature pinpoints the minimal failing input, making debugging straightforward

🛠️ Operational Readiness Is a Separate Concern

  • The first spec (self-service platform) produced all the code. The second spec (deployment ops) was needed to actually run it.
  • Code-complete is not deploy-ready — there's a significant gap between "tests pass locally" and "running on EKS with DNS, TLS, and secrets configured"
  • Manual checkpoint tasks (Terraform apply, DNS delegation, ArgoCD install) can't be automated away — they require human judgment and access to external systems
  • A deployment runbook (README Deployment Guide) is as important as the code itself for reproducibility

V1 Deployment — Teardown & Rebuild Decision (March 2026)

After successfully provisioning infrastructure (EKS, VPC, ECR, Route53, ACM, ArgoCD, observability stack), the first deployment attempt hit several issues that warranted a clean rebuild rather than incremental fixes:

🐳 Docker ARM vs x86 Mismatch

  • Building on Apple Silicon (M-series Mac) produced ARM64 images by default
  • EKS nodes run x86_64 (t3.medium) — pods crashed with exec format error
  • Even --platform linux/amd64 rebuild didn't resolve due to Docker layer caching
  • V2 fix: Multi-stage, multi-platform Dockerfile using docker buildx with explicit --platform linux/amd64 and --no-cache
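The V2 build invocation, roughly (the image URL is illustrative):

```shell
# Build for the EKS node architecture regardless of the host (e.g. Apple Silicon),
# and bypass stale cached ARM64 layers with --no-cache
docker buildx build \
  --platform linux/amd64 \
  --no-cache \
  -t 123456789012.dkr.ecr.us-east-2.amazonaws.com/backstage:v2 \
  --push .
```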

🔐 Plain K8s Secrets Are Fragile

  • Manually creating K8s secrets via kubectl create secret is error-prone and not GitOps-friendly
  • Secrets aren't version-controlled, can't be audited, and are lost on namespace deletion
  • No rotation mechanism — tokens expire and require manual intervention
  • V2 fix: AWS Secrets Manager + External Secrets Operator (ESO) for automatic sync from a central secret store into K8s
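A sketch of the V2 pattern: an ExternalSecret using dataFrom.extract, so every key in the Secrets Manager entry syncs automatically (names are illustrative):

```yaml
apiVersion: external-secrets.io/v1   # ESO v2.1.0 dropped v1beta1
kind: ExternalSecret
metadata:
  name: backstage-secrets
  namespace: backstage
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager        # ClusterSecretStore name (illustrative)
    kind: ClusterSecretStore
  target:
    name: backstage-secrets          # resulting K8s Secret
  dataFrom:
    - extract:
        key: backstage/production    # illustrative Secrets Manager entry
```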

🌐 Duplicate Route53 Zones

  • Terraform created a new Route53 zone while a pre-existing zone (from the registrar) already existed
  • ACM validation CNAME was placed in the wrong zone — validation hung for 1+ hour
  • Resolution: removed Terraform zone from state, imported pre-existing zone, ACM validated in 13 seconds
  • V2 fix: Import pre-existing zones before first apply, or use data source to reference existing zones
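The import flow, roughly (the resource address is an assumption about the repo's module layout; the zone ID is the one listed under "What Was Preserved"):

```shell
# Stop managing the duplicate zone Terraform created
terraform state rm module.dns.aws_route53_zone.main   # address is illustrative

# Adopt the pre-existing registrar-delegated zone instead
terraform import module.dns.aws_route53_zone.main Z085463018BZ8K52UEDMW
terraform plan   # should now show no zone recreation
```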

🔄 What Was Preserved

  • Route53 hosted zone (Z085463018BZ8K52UEDMW) — registrar NS records already point here
  • ACM wildcard certificate (*.vibe-coder.biz) — already validated and issued
  • All code artifacts, tests, specs, and documentation
  • GitLab repo, CI/CD variables, and Grafana Cloud configuration

📋 V2 Architecture Changes

  • Multi-platform Docker builddocker buildx targeting linux/amd64 explicitly
  • Multi-stage Dockerfile — build stage + slim runtime stage for smaller images
  • AWS Secrets Manager — centralized secret storage with IAM-based access control
  • External Secrets Operator — Kubernetes operator that syncs secrets from AWS into K8s automatically
  • Terraform zone import — pre-existing Route53 zone imported into state, not recreated

⏱️ Time Impact

  • V1 infrastructure provisioning: ~2 hours (including Route53 debugging)
  • V1 ArgoCD + secrets setup: ~30 minutes
  • V1 Docker build + failed deploy: ~45 minutes
  • Teardown decision: immediate — fixing in-place would take longer than rebuilding clean
  • Key insight: When multiple foundational issues compound, a clean rebuild is faster than patching
The key takeaway: Spec-driven development with AI-assisted tooling works best when the engineer provides structure (requirements, design, correctness properties) and leverages automation for velocity. The V1 → V2 rebuild demonstrates that real-world deployment surfaces issues that no amount of local testing can predict — and that's expected. Starting from a solid spec means rebuilds start from a foundation, not from scratch.

V2 Deployment Progress (March 8–10, 2026)

V2 infrastructure was rebuilt from scratch with architectural improvements. All seven phases are complete. ArgoCD installed and configured. Backstage is live at https://backstage.vibe-coder.biz with full catalog, scaffolder, and ArgoCD integration. E2E smoke test passed — scaffolded app hellow-world-node-v03 running at https://hellow-world-node-v03.vibe-coder.biz with full GitLab CI pipeline, ArgoCD GitOps sync, shared ALB, HTTPS via ACM, and Separation of Duties policies active.

🌐 VPC Endpoints — Hybrid Networking

  • Private subnets need access to AWS services (ECR, STS, Secrets Manager) without routing through NAT
  • Added 11 VPC endpoints: S3, ECR API, ECR DKR, STS, Secrets Manager, EKS, EKS-auth, EC2, Logs, ELB
  • Hybrid approach: NAT Gateway for internet access + VPC endpoints for AWS service traffic
  • Result: Faster API calls, lower NAT data processing costs, nodes join cluster reliably

πŸ” CRD Chicken-and-Egg Problem

  • ClusterSecretStore CRD doesn't exist until ESO Helm chart is installed
  • Terraform can't create a ClusterSecretStore resource if the CRD isn't registered yet
  • Removed ClusterSecretStore from Terraform, applied via kubectl apply -f after ESO is running
  • Lesson: CRD-dependent resources should be applied post-Helm, not in the same Terraform run
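The post-Helm apply, sketched (store name and IRSA service-account wiring are illustrative):

```yaml
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-2
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets      # illustrative IRSA-bound service account
            namespace: external-secrets
```

Applied with kubectl apply -f once the ESO Helm release is healthy, so the CRD exists before the resource referencing it.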

⏱️ Helm Race Conditions

  • Observability module (Prometheus + Promtail) tried to deploy before EKS addons were ready
  • kube-prometheus-stack CRDs take time to register — subsequent Helm releases fail if CRDs aren't ready
  • Added depends_on = [module.eks, module.eks_addons] and 900s timeout
  • Lesson: Helm chart ordering matters — explicit dependencies prevent intermittent failures
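In Terraform terms, roughly (module names as used in this stack):

```hcl
module "observability" {
  source = "./modules/observability"

  # Wait for the cluster and its addons before installing Helm releases;
  # kube-prometheus-stack CRDs need the addons to be reconciled first
  depends_on = [module.eks, module.eks_addons]
}
# Inside the module, helm_release resources set timeout = 900
```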

🔄 Terraform State Recovery

  • Helm releases created outside Terraform (manual debugging) weren't tracked in state
  • Subsequent terraform apply tried to recreate existing resources, causing conflicts
  • Used terraform import for Helm releases, K8s namespace, and K8s secret
  • Lesson: Always import manually-created resources back into state before continuing

🔗 Node Join Failures

  • EKS nodes in private subnets couldn't reach the EKS API server endpoint
  • Without VPC endpoints for EKS and EKS-auth, nodes can't authenticate or register
  • Adding EKS + EKS-auth VPC endpoints resolved the issue immediately
  • Lesson: Private-only EKS clusters need VPC endpoints for the control plane

📦 ESO API Version Breaking Change

  • ESO v2.1.0 only serves external-secrets.io/v1 API, dropped v1beta1
  • Both ClusterSecretStore and ExternalSecret manifests referenced v1beta1
  • Updated both to external-secrets.io/v1 — ExternalSecret synced immediately
  • Lesson: Always check CRD API versions after Helm chart upgrades

🔀 Shared ALB via Ingress Group

  • Each scaffolded app created its own ALB — wasteful ($18/mo each) and broke wildcard DNS routing
  • Added alb.ingress.kubernetes.io/group.name: backstage-idp to all ingress files
  • All services (Backstage, ArgoCD, scaffolded apps) now share a single ALB
  • Lesson: ALB Ingress Controller groups are essential for multi-service clusters with wildcard DNS
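The annotation in context (host, namespace, and service names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-app
  namespace: demo-app
  annotations:
    alb.ingress.kubernetes.io/group.name: backstage-idp   # all apps share one ALB
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
spec:
  ingressClassName: alb
  rules:
    - host: demo-app.vibe-coder.biz
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-app
                port:
                  number: 80
```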

🔒 Express 5 + Node 22 Security Hardening

  • Express 4.x had high-severity CVEs in transitive dependencies flagged by npm audit
  • Upgraded scaffold template: Express ^5.1.0, prom-client ^15.1.3, Node 22 Alpine
  • Added apk update && apk upgrade --no-cache to Dockerfile for OS-level patching
  • Lesson: Security scanning in CI is only useful if the template starts clean — fix the source, not the scan

🔧 TypeScript Schema Compatibility

  • Backstage scaffolder actions used Zod-style lambda schemas that aren't compatible with the JSON Schema format expected by @backstage/plugin-scaffolder-node
  • 17 TypeScript errors across 5 files blocked yarn tsc and the deploy script
  • Converted all schemas to JSON Schema format with as const type assertions
  • Lesson: Backstage plugin APIs evolve — verify schema format against the installed version

🌐 Stale DNS Wildcard Record

  • V1 deployment left a *.vibe-coder.biz A record pointing to an old ALB in us-west-1
  • Terraform tried to create the record but it already existed — InvalidChangeBatch error
  • Used terraform import to adopt the existing record, then terraform apply updated the alias to the V2 ALB
  • Lesson: Always check for stale DNS records from previous deployments before applying

⚙️ Config Validator Crash Loop

  • Backstage pod entered CrashLoopBackOff after deploy — config validator rejected missing keys
  • Missing: aws.accountId, aws.ecrRegistryUrl, argocd.baseUrl, argocd.authToken
  • Added required keys to app-config.production.yaml, rebuilt and redeployed
  • Lesson: Custom config validators are great for catching misconfig early, but ensure production configs are complete before first deploy

🔐 Guest Auth 401 in Production — Resolved

  • All API calls returned 401 after deploy — catalog, permissions, notifications all broken
  • Root cause: Guest auth provider needs dangerouslyAllowOutsideDevelopment: true in production mode
  • Without it, /api/auth/guest/refresh returns 403, cascading to 401 on all endpoints
  • Permanently resolved in Phase 8: Guest auth replaced entirely with GitHub OAuth — no more dangerouslyAllowOutsideDevelopment flag needed

📦 ArgoCD CRD Annotation Size Limit

  • kubectl apply failed: applicationsets.argoproj.io CRD annotations exceed 256KB limit
  • Known issue with ArgoCD on K8s < 1.30 — client-side apply stores full manifest in annotations
  • Fix: kubectl apply --server-side --force-conflicts bypasses the annotation size limit
  • Lesson: Always use server-side apply for large CRDs like ArgoCD, Istio, or Crossplane

🔗 ArgoCD Repo URL Needs .git Suffix

  • ArgoCD returned 400 when registering a scaffolded app — repository not found
  • Root cause: publish:gitlab output remoteUrl doesn't include .git suffix
  • ArgoCD's ls-remote validation requires the .git suffix for HTTPS repos
  • Fix: Normalize repoUrl to append .git if missing, plus 3s delay and 3-attempt retry for race conditions
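The normalization itself is small; a sketch (function name is illustrative, and the retry logic mentioned above is omitted):

```typescript
// ArgoCD's ls-remote validation requires the .git suffix for HTTPS repos.
// publish:gitlab's remoteUrl output omits it, so normalize before registering.
export function normalizeRepoUrl(remoteUrl: string): string {
  const trimmed = remoteUrl.replace(/\/+$/, ""); // drop trailing slashes
  return trimmed.endsWith(".git") ? trimmed : `${trimmed}.git`;
}
```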

🔐 Backstage Service-to-Service Auth

  • Catalog API returned 401 when called from a custom scaffolder action via raw fetch()
  • Backstage's new backend system requires service tokens for internal API calls
  • Injected discovery + auth services to get a proper plugin-to-plugin token
  • Lesson: Never use raw HTTP for Backstage internal calls — use the SDK's auth services
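With the new backend system, the pattern looks roughly like this inside a custom action, assuming the auth (AuthService) and discovery (DiscoveryService) core services have been injected (the function is a hypothetical sketch, with loose types for brevity):

```typescript
// Mint a plugin-to-plugin token instead of calling the catalog with raw fetch()
async function fetchCatalogEntities(auth: any, discovery: any) {
  const { token } = await auth.getPluginRequestToken({
    onBehalfOf: await auth.getOwnServiceCredentials(),
    targetPluginId: "catalog",
  });
  const baseUrl = await discovery.getBaseUrl("catalog");
  const res = await fetch(`${baseUrl}/entities`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  return res.json();
}
```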

πŸ“ YAML Colons Break GitLab CI Scripts

  • GitLab CI rejected the pipeline: script config should be a string
  • Lines like sed -i "s|image:.*|image: $TAG|" contain colons that YAML parses as key-value mappings
  • Also hit by git commit -m "chore: update image" — the chore: was a YAML key
  • Fix: Wrap script lines containing colons in single quotes in the YAML template
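The fix in template form (commands taken from the bullets above; job shape is illustrative):

```yaml
update-gitops:
  stage: update-gitops
  script:
    # Unquoted, "image: $TAG" and "chore: ..." parse as YAML key-value mappings.
    # Single quotes keep each line a plain string.
    - 'sed -i "s|image:.*|image: $TAG|" k8s/deployment.yaml'
    - 'git commit -am "chore: update image [skip ci]"'
```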

📦 npm ci Requires a Lockfile

  • GitLab CI lint stage failed: npm ci requires package-lock.json
  • Scaffolded repos only have package.json — no lockfile is generated by the template
  • Same issue in the Dockerfile: RUN npm ci --only=production also failed
  • Fix: Changed to npm install in both CI template and Dockerfile for POC simplicity

🔒 Separation of Duties

Tier 1 namespace isolation enforces least-privilege access and network segmentation across all scaffolded workloads and platform namespaces. Every scaffolded app is isolated by default with explicit allow-rules for required traffic.

Namespace isolation

  • Workload namespace (per app) — NetworkPolicies: default deny, allow ALB ingress to the app port, allow DNS egress, HTTPS egress, and metrics scrape. RBAC: ServiceAccount, read-only Role, RoleBinding. ArgoCD scope: destination = app namespace, source = app repo only (least-privilege sync)
  • Platform namespaces — backstage (ALB ingress, DNS + HTTPS egress); argocd (namespace-local traffic, API server + HTTPS egress); monitoring (metrics scraping, Grafana HTTPS egress); kube-system (DNS service target, system traffic boundary)

🛡️ Network Isolation

  • Calico (Tigera Operator) installed via Terraform for NetworkPolicy enforcement
  • Default-deny on every namespace — no traffic flows unless explicitly allowed
  • Namespace labels (idp-role) drive policy selectors for cross-namespace rules
  • Service CIDR 172.20.0.0/16 included in platform egress rules for ClusterIP access
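A minimal sketch of the default-deny baseline plus one explicit allow rule, assuming kube-system carries an idp-role label (the namespace name and label value shown are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: my-app              # scaffolded app namespace (placeholder)
spec:
  podSelector: {}                # selects every pod in the namespace
  policyTypes: [Ingress, Egress] # no rules listed => deny all traffic
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-app
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              idp-role: system   # assumed label value on kube-system
      ports:
        - protocol: UDP
          port: 53
```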

πŸ” RBAC & Least Privilege

  • Per-app ServiceAccount with read-only Role (no secrets, no write verbs)
  • platform-admin ClusterRole for security resource management
  • developer-readonly ClusterRole for safe cluster visibility
  • ArgoCD AppProject scopes each app to its own namespace and repo
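The ArgoCD scoping described above can be sketched as an AppProject (app and group names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: my-app
  namespace: argocd
spec:
  # Sync is only allowed from this app's repo...
  sourceRepos:
    - https://gitlab.com/devop212/my-group/my-app.git
  # ...and only into this app's namespace on the local cluster
  destinations:
    - namespace: my-app
      server: https://kubernetes.default.svc
```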

πŸ”‘ CI/CD Credential Scoping

  • Project-scoped AWS credentials (APP_AWS_ACCESS_KEY_ID) instead of shared group-level keys
  • Per-app ECR IAM policy template restricting push/pull to app-specific repository
  • Blast radius limited: compromised credentials only affect one app's ECR repo

πŸ“ˆ Production Path (Tier 2/3)

  • Tier 1 (current): Namespace isolation within single cluster
  • Tier 2: Multi-cluster with dedicated workload clusters
  • Tier 3: Multi-account with AWS Organizations and separate blast radius
  • Architecture documentation covers migration paths between tiers
14 property-based tests validate SoD correctness properties across randomly generated inputs — ensuring NetworkPolicies, RBAC, and ArgoCD scoping are structurally correct for any valid app name.

πŸ” GitHub Authentication

Phase 8 replaced the insecure guest auth provider with GitHub OAuth, eliminating unauthenticated access to the production portal. Guest auth with dangerouslyAllowOutsideDevelopment is a known security anti-pattern — GitHub OAuth provides identity verification, audit trail, and maps authenticated users to Backstage catalog entities.

Auth flow

Secrets sync and GitHub OAuth are shown as two connected control paths rather than a terminal diagram.

Secrets delivery: AWS Secrets Manager (OAuth + platform tokens) → External Secrets (dataFrom.extract sync) → Backstage pod (env vars at runtime)

User sign-in journey: user browser → Backstage sign-in (redirect + callback handler) → GitHub OAuth → session

βš™οΈ Configuration Changes

  • Backend: plugin-auth-backend-module-guest-provider replaced with plugin-auth-backend-module-github-provider in packages/backend/src/index.ts
  • Frontend: providers={['guest']} changed to providers={['github']} in packages/app/src/App.tsx
  • Production config: auth.providers.guest section removed; auth.providers.github.production added with ${GITHUB_CLIENT_ID} and ${GITHUB_CLIENT_SECRET}
  • Dev config: Retains guest: {} for local development convenience alongside the new GitHub provider
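The resulting production auth block, sketched per the standard Backstage GitHub provider config shape (the resolver matches the one used for catalog mapping):

```yaml
auth:
  environment: production
  providers:
    github:
      production:
        clientId: ${GITHUB_CLIENT_ID}
        clientSecret: ${GITHUB_CLIENT_SECRET}
        signIn:
          resolvers:
            # Maps the GitHub profile email to a catalog User entity
            - resolver: emailMatchingUserEntityProfileEmail
```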

πŸ”‘ Secrets Pipeline

  • GitHub OAuth App registered with callback URL https://backstage.vibe-coder.biz/api/auth/github/handler/frame
  • Credentials stored in AWS Secrets Manager at backstage-idp-poc/backstage-secrets
  • ExternalSecret auto-syncs via existing dataFrom.extract pattern — no manifest changes needed
  • Config validator checks for GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET at startup
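A sketch of the ExternalSecret using the dataFrom.extract pattern (the store name and refresh interval are assumptions):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: backstage-secrets
  namespace: backstage
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager    # assumed store name
  target:
    name: backstage-secrets      # K8s Secret consumed as pod env vars
  dataFrom:
    # Pulls every key/value pair from the Secrets Manager entry, so new
    # keys like GITHUB_CLIENT_ID sync without any manifest changes
    - extract:
        key: backstage-idp-poc/backstage-secrets
```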

πŸ›‘οΈ Why This Matters

  • Eliminates unauthenticated access to the production portal
  • Provides identity verification and audit trail for all portal actions
  • Maps GitHub users to Backstage catalog entities via emailMatchingUserEntityProfileEmail resolver
  • Removes the dangerouslyAllowOutsideDevelopment: true security anti-pattern from production
Terraform update: Secrets Manager module secret_keys updated to include GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET alongside existing credentials.

πŸ”€ Repository Split

The monorepo is being split into three purpose-specific repositories, driven by differing change cadences: app code changes frequently, Terraform rarely, and platform ops at its own pace. Independent CI/CD, versioning, and access control per concern reduce the blast radius of any single change.

Repository split

The new layout emphasizes three concerns with distinct change cadence and ownership.

Current monorepo (backstage · terraform · k8s · scripts · tests · docs) splits into:

  • App repo: packages / plugins / templates; Backstage config + Dockerfile; independent build lifecycle
  • IaC repo: Terraform modules + root config; backend bootstrap + tfvars example; infrastructure-only workflows
  • Platform ops repo: k8s / deploy scripts / docs / tests; runbooks, policies, migration scripts; operational ownership boundary

πŸ“¦ App Repository

  • Backstage source code: packages/, plugins/, templates/, examples/
  • Configuration: app-config.yaml, app-config.production.yaml, app-config.local.yaml
  • Build artifacts: package.json, yarn.lock, .yarnrc.yml, tsconfig.json, Dockerfile
  • Independently buildable: yarn install && yarn build:backend

πŸ—οΈ IaC Repository

  • Terraform modules: VPC, IAM, EKS, ECR, DNS, ACM, Addons, Calico, Secrets Manager, Observability, Backend
  • Root configs: main.tf, variables.tf, outputs.tf, providers.tf, backend.tf
  • Includes terraform.tfvars.example and scripts/bootstrap-tf-backend.sh
  • Independently executable: terraform init && terraform plan

βš™οΈ Platform Ops Repository

  • K8s manifests: k8s/backstage/ (deployment, service, ingress, external-secret, cluster-secret-store)
  • Deploy scripts: deploy-backstage.sh (updated with --app-repo parameter)
  • Tests, docs, specs, and project documentation
  • No Backstage source code or Terraform after split

Migration Scripts

  • scripts/split-app-repo.sh: extracts backstage/ contents into a standalone App Repository
  • scripts/split-iac-repo.sh: extracts terraform/ contents + bootstrap script into a standalone IaC Repository
  • scripts/split-platform-repo.sh: documents which directories to remove and retain for the Platform Ops Repository
Deploy script updated: The --app-repo (required) and --app-branch (default: main) parameters enable the deploy script to clone the App Repository for Docker builds after the monorepo split. Error handling covers clone failures and missing branches.

🌐 Domain & Access

  • Backstage Portal: https://backstage.vibe-coder.biz
  • ArgoCD Dashboard: https://argocd.vibe-coder.biz
  • Scaffolded Apps: https://<app-name>.vibe-coder.biz
  • Grafana Cloud: https://bedigital.grafana.net
  • GitLab Repos: https://gitlab.com/devop212/<group>/<app-name>
  • AWS Region: us-east-2 (Ohio)

πŸš€ What's Next

This POC demonstrates the core self-service workflow. Future iterations could include:

πŸ“¦ Additional Templates

  • Python, Go, Java — same self-service experience

πŸ’° Cost Tagging

  • Automatic AWS cost allocation tags per scaffolded service

πŸ”€ Repository Split Execution

  • Execute the three-repo migration using split-app-repo.sh, split-iac-repo.sh, split-platform-repo.sh

πŸ”” Alerting Rules

  • Auto-provisioned Grafana Cloud alert rules per service

🌍 Multi-Environment

  • Dev → Staging → Production promotion workflows

πŸ›‘οΈ SAST/DAST

  • Static and dynamic application security testing in the pipeline

πŸ“‹ Project Tracking

This project is tracked in Jira under the BDB project with four epics:

Epic: Backstage Self-Service Developer Platform

  • 25 implementation tasks across Terraform, Backstage, CI/CD, and Kubernetes
  • Covers the self-service platform spec (Requirements 1-12)
  • All tasks labeled with backstage-idp

Epic: IDP Deployment Operations

  • 13 tasks covering infrastructure provisioning through end-to-end validation
  • Includes manual checkpoint tasks for operator-driven steps
  • Story points: infrastructure (5), DNS/ACM (3), ArgoCD (3), GitLab vars (2), Backstage deploy (5), Grafana (2), smoke test (3), README (3)

Epic: IDP Separation of Duties

  • 13 tasks covering Calico install, namespace labels, NetworkPolicies, RBAC, ArgoCD scoping, CI/CD credential scoping
  • 14 property-based tests + unit tests validating correctness properties
  • Includes platform policies, ClusterRoles, and production-path architecture docs

Epic: GitHub Auth & Repository Split

  • 12 tasks covering GitHub OAuth configuration, backend/frontend auth changes, deploy script updates, and three-repo migration scripts
  • 6 new property-based tests: auth config consistency, no heredoc syntax, missing credentials detection, ExternalSecret auto-sync, backend module registration, repository content partitioning
  • Eliminates insecure guest auth and prepares for independent repo CI/CD

🀝 Let's Connect

Interested in discussing this project or exploring how these patterns can be applied to your infrastructure?

Brian Uckert

Cloud Infrastructure Architect

Be-Digital.biz

brian.uckert@be-digital.biz