Proof of Concept — Internal Developer Platform on AWS EKS
Enable developers to scaffold production-ready Node.js applications in minutes — not days — through a single self-service portal, with automated CI/CD, infrastructure, observability, and deployment built in.
50 Jira Tasks
11 Terraform Modules
6 Pipeline Stages
~$218 Monthly AWS Cost
329 Automated Tests
219 Files in Monorepo
The Problem
Today, spinning up a new service requires coordination across multiple teams and tools — request a Git repo, configure CI/CD pipelines manually, set up Kubernetes manifests, request DNS entries and TLS certificates, wait for the platform team to wire up deployments.
This process takes days to weeks and creates bottlenecks that slow down delivery.
The Solution
A Backstage-powered Internal Developer Platform that gives developers a single "Create" button to go from idea to running application on AWS EKS — fully automated, fully standardized.
Promtail ships container logs to Grafana Cloud Loki
Separation of Duties: Developers own the App Repo (source code, CI pipeline). Platform/SRE team owns the GitOps Repo (K8s manifests, ArgoCD resources, network policies). CI pipeline in App Repo triggers cross-repo commit to update image tag in GitOps Repo. A group-level GITOPS_DEPLOY_TOKEN enables automatic authentication for all scaffolded apps.
Architecture Overview
Platform layers
High-level architecture split into portal, delivery, cloud runtime, and hosted observability.
Deployment Operations
A dedicated deployment operations spec covers the full journey from code-complete to a running, validated platform. The deployment follows a strict four-phase dependency order:
Deployment phases
A cleaner four-phase sequence with explicit handoffs and validation gates.
Backstage on EKS — Deployment Topology
Cluster topology
Cloud edge services feed into a shared ALB and EKS namespaces with clear operational boundaries.
Deployment Artifacts Produced
terraform/terraform.tfvars.example: Template with all required Terraform variable assignments
scripts/bootstrap-tf-backend.sh: Creates the S3 bucket + DynamoDB lock table for Terraform state
k8s/backstage/*.yaml: Deployment, Service, Ingress, and Secret manifests for Backstage on EKS
scripts/deploy-backstage.sh: Builds, pushes to ECR, substitutes manifests, runs kubectl apply, and verifies rollout. Accepts --app-repo (required) and --app-branch (default: main) for the multi-repo workflow
README.md Deployment Guide: Step-by-step instructions covering all 10 deployment requirements
Technology Stack
Portal
Backstage (CNCF) — Developer portal & service catalog
GitHub OAuth — Identity verification mapped to catalog entities
Tigera Operator for Kubernetes NetworkPolicy enforcement (EKS CNI mode)
Secrets Manager
Centralized secret storage synced to K8s via ESO ClusterSecretStore
Observability
Prometheus (remote-write) + Promtail (log shipping) to Grafana Cloud
Backend
S3 bucket + DynamoDB for Terraform state locking
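The S3 + DynamoDB state backend that the bootstrap script prepares would be wired into Terraform with a backend block along these lines (bucket, key, and table names here are illustrative, not the project's actual values):

```hcl
terraform {
  backend "s3" {
    bucket         = "idp-poc-tf-state"      # created by scripts/bootstrap-tf-backend.sh
    key            = "eks/terraform.tfstate" # state object path within the bucket
    region         = "us-east-2"
    dynamodb_table = "idp-poc-tf-lock"       # DynamoDB table used for state locking
    encrypt        = true
  }
}
```

The DynamoDB table prevents two concurrent `terraform apply` runs from corrupting shared state.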
CI/CD Pipeline
Every scaffolded app includes a 6-stage GitLab CI pipeline with security scanning and cross-repo GitOps updates:
1. LINT (ESLint code quality) → 2. TEST (Jest unit tests) → 3. SECURITY (npm audit + Trivy) → 4. BUILD (Docker image build) → 5. PUSH (push to AWS ECR) → 6. UPDATE-GITOPS (cross-repo image tag update)
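A minimal `.gitlab-ci.yml` skeleton for the six stages might look like this (the `node:20-alpine` image and the job bodies are illustrative, not the template's exact contents):

```yaml
stages: [lint, test, security, build, push, update-gitops]

lint:
  stage: lint
  image: node:20-alpine
  script:
    - npm install   # npm install rather than npm ci: the scaffolded template ships no lockfile
    - npx eslint .

test:
  stage: test
  image: node:20-alpine
  script:
    - npm install
    - npx jest --ci
```

The remaining stages follow the same pattern, with the security, build, and push stages running Docker-in-Docker jobs.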
Security Scanning — Stage 3
dependency-audit — npm audit --audit-level=high
Runs on all branches (MR + main)
Produces JSON report artifact
Pipeline halts if high/critical vulnerabilities found
Container Scanning — Stage 3
container-scan — Trivy --severity CRITICAL,HIGH
Runs on main branch only
Scans built Docker image for OS-level CVEs
--exit-code 0 with allow_failure: true (POC mode)
Cross-Repo GitOps Update — Stage 6
Clones GitOps repo (devop212/gitops/<app-name>)
Updates image tag in k8s/deployment.yaml
Commits with [skip ci] to prevent CI loops
Uses group-level GITOPS_DEPLOY_TOKEN for authentication
All scaffolded apps inherit token automatically — no per-app setup
Dual-repo workflow: App Repo CI builds and pushes the image, then triggers a cross-repo commit to the GitOps Repo. ArgoCD watches the GitOps Repo and auto-syncs the deployment when the image tag changes.
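Sketched as a GitLab CI job, the cross-repo update could look like this (the `oauth2` token user, the `IMAGE_TAG` variable, and the git identity are assumptions for illustration; the real template's job differs in detail):

```yaml
update-gitops:
  stage: update-gitops
  image: alpine:3.19
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  script:
    - apk add --no-cache git
    # group-level GITOPS_DEPLOY_TOKEN is inherited by every scaffolded app
    - git clone "https://oauth2:${GITOPS_DEPLOY_TOKEN}@gitlab.com/devop212/gitops/${CI_PROJECT_NAME}.git" gitops
    - cd gitops
    # single quotes keep the colon-containing lines valid YAML
    - 'sed -i "s|image:.*|image: ${IMAGE_TAG}|" k8s/deployment.yaml'
    - git config user.email "ci@gitlab"
    - git config user.name "CI"
    - 'git commit -am "chore: bump image to ${IMAGE_TAG} [skip ci]"'
    - git push origin main
```

ArgoCD then detects the changed image tag in the GitOps repo and syncs the deployment.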
Testing Strategy
The project uses a dual testing approach combining unit tests with property-based tests (PBT) using fast-check. This ensures both specific correctness and universal invariants hold across randomly generated inputs.
Property-Based Tests (fast-check)
K8s manifest correctness — For any valid deployment parameters, manifests produce correct Deployment, Service, Ingress, and Secret
tfvars template completeness — For any set of required Terraform variables, the template contains all assignments
README guide completeness — For any set of deployment topics, the guide covers all of them
ArgoCD sync policy — Scaffolded apps always have auto-sync enabled
Pipeline definition — All 6 CI stages present for any app name
Scaffolding completeness — All required files generated for any valid input
Duplicate name rejection — Existing names always rejected
Scaffolding rollback — Failed scaffolds clean up all artifacts
Auth provider config consistency — Production has GitHub provider (no guest); development has both
Deploy script no heredoc — No EOF/heredoc syntax in any line of the deploy script
329 automated tests across 17 test suites — all passing. Property-based tests catch edge cases that hand-written examples miss.
Observability
Full observability powered by Grafana Cloud — no self-hosted Grafana on the cluster.
Observability
Metrics and logs leave the cluster through dedicated collectors and land in a hosted control plane.
Metrics (Prometheus)
prom-client in every scaffolded Node.js app
RED metrics: Request rate, Error rate, Duration
Remote-write to Grafana Cloud Prometheus
ServiceMonitor auto-discovers app endpoints
Logs (Promtail → Loki)
DaemonSet collects all container logs
Ships to Grafana Cloud Loki
Structured JSON logging
LogQL for querying and filtering
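A typical LogQL query against these structured JSON logs might look like the following (the namespace name and the `level`/`msg` field names are illustrative assumptions about the log schema):

```
{namespace="my-app"} | json | level="error" | line_format "{{.msg}}"
```

This selects one namespace's log stream, parses each line as JSON, keeps only error-level entries, and prints just the message field.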
Dashboards
Auto-provisioned per scaffolded app
Linked from Backstage entity page
Grafana Cloud free tier (10k metrics, 50 GB logs)
No Grafana pod on cluster — zero overhead
Development Approach
This project follows a spec-driven development methodology, using AI tooling to accelerate implementation. Architecture decisions, design reviews, and operational checkpoints remained human-driven throughout.
Spec-Driven Workflow
Three formal specs — self-service platform, deployment operations, and separation of duties
Requirements → Design → Tasks — each phase reviewed and approved before proceeding
12 formal requirements with acceptance criteria per spec
Correctness properties defined before implementation began
Tooling & Acceleration
AI-assisted code generation for boilerplate-heavy artifacts (Terraform modules, K8s manifests, CI pipelines)
Property-based testing with fast-check to validate invariants across random inputs
All generated code reviewed, tested, and iterated on before deployment
Manual operator checkpoints for infrastructure provisioning, DNS, and secrets management
Total monthly run cost: ~$218/mo (~$2,616/yr). Can be reduced to ~$150/mo with Reserved Instances or Savings Plans on EC2.
Break-Even Calculation
Manual cost per service: ~1.5 days × $600/day (loaded eng cost) = $900
IDP cost per service: ~$0 (automated; ~7 minutes of developer time)
IDP build cost: ~20 days × $600/day = $12,000 (one-time)
IDP run cost: ~$218/mo = $2,616/yr
Break-even: $12,000 / $900 ≈ 13.3, i.e. ~14 services
The platform pays for itself after ~14 scaffolded services.
At 5 new services/month, ROI is positive within 3 months.
At 2 new services/month, ROI is positive within 7 months.
The biggest time savings came from automating the tedious parts while focusing human effort on architecture and integration.
Infrastructure Has a Long Feedback Loop
EKS clusters take 15-20 minutes to provision — fast iteration isn't possible at the infrastructure layer
DNS delegation depends on registrar propagation (up to 48 hours) — this is the longest blocking dependency in the entire deployment
ACM certificate validation is chained to DNS — you can't parallelize these steps
Having the deployment dependency graph documented upfront (Phase 1 → 2 → 3 → 4) prevented wasted time trying to deploy before infrastructure was ready
Monorepo Was the Right Call for a POC
Keeping Terraform, Backstage, K8s manifests, deploy scripts, and tests in one repo simplified everything
Cross-cutting changes (e.g., adding a new Terraform output that's referenced in a K8s manifest) are atomic commits
A single CI/CD variable set at the GitLab group level covers all components
For production, you'd likely split into separate repos per concern — but for a POC, the overhead isn't worth it
Property-Based Testing Catches What Unit Tests Miss
Unit tests verify specific examples; property tests verify invariants across random inputs
The K8s manifest property test caught an edge case where certain ACM ARN formats broke YAML parsing — a hand-written test wouldn't have found this
The tfvars completeness property ensures that adding a new Terraform variable without updating the template is immediately caught
fast-check's shrinking feature pinpoints the minimal failing input, making debugging straightforward
Operational Readiness Is a Separate Concern
The first spec (self-service platform) produced all the code. The second spec (deployment ops) was needed to actually run it.
Code-complete is not deploy-ready — there's a significant gap between "tests pass locally" and "running on EKS with DNS, TLS, and secrets configured"
Manual checkpoint tasks (Terraform apply, DNS delegation, ArgoCD install) can't be automated away — they require human judgment and access to external systems
A deployment runbook (README Deployment Guide) is as important as the code itself for reproducibility
After successfully provisioning infrastructure (EKS, VPC, ECR, Route53, ACM, ArgoCD, observability stack), the first deployment attempt hit several issues that warranted a clean rebuild rather than incremental fixes:
Docker ARM vs x86 Mismatch
Building on Apple Silicon (M-series Mac) produced ARM64 images by default
EKS nodes run x86_64 (t3.medium) — pods crashed with exec format error
Even --platform linux/amd64 rebuild didn't resolve due to Docker layer caching
V2 fix: Multi-stage, multi-platform Dockerfile using docker buildx with explicit --platform linux/amd64 and --no-cache
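The V2 build command would be roughly the following (the ECR repository name and tag are placeholders; the real deploy script wraps this with login and manifest substitution):

```shell
# Force an x86_64 image even when building on Apple Silicon,
# and skip the cache so no stale ARM64 layers are reused.
docker buildx build --platform linux/amd64 --no-cache \
  -t "<account>.dkr.ecr.us-east-2.amazonaws.com/my-app:v2" --push .
```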
Plain K8s Secrets Are Fragile
Manually creating K8s secrets via kubectl create secret is error-prone and not GitOps-friendly
Secrets aren't version-controlled, can't be audited, and are lost on namespace deletion
No rotation mechanism — tokens expire and require manual intervention
V2 fix: AWS Secrets Manager + External Secrets Operator (ESO) for automatic sync from a central secret store into K8s
Duplicate Route53 Zones
Terraform created a new Route53 zone while a pre-existing zone (from the registrar) already existed
ACM validation CNAME was placed in the wrong zone — validation hung for 1+ hour
Resolution: removed Terraform zone from state, imported pre-existing zone, ACM validated in 13 seconds
V2 fix: Import pre-existing zones before first apply, or use data source to reference existing zones
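The data-source approach could be sketched like this (resource names here are illustrative; the point is referencing the registrar-created zone instead of declaring a new one):

```hcl
# Look up the pre-existing hosted zone rather than creating a duplicate.
data "aws_route53_zone" "primary" {
  name         = "vibe-coder.biz."
  private_zone = false
}

resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.primary.zone_id
  # ... record attributes ...
}
```

With this in place, ACM validation CNAMEs land in the zone the registrar's NS records actually delegate to.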
What Was Preserved
Route53 hosted zone (Z085463018BZ8K52UEDMW) — registrar NS records already point here
ACM wildcard certificate (*.vibe-coder.biz) — already validated and issued
All code artifacts, tests, specs, and documentation
GitLab repo, CI/CD variables, and Grafana Cloud configuration
Teardown decision: immediate — fixing in-place would take longer than rebuilding clean
Key insight: When multiple foundational issues compound, a clean rebuild is faster than patching
The key takeaway: Spec-driven development with AI-assisted tooling works best when the engineer provides structure (requirements, design, correctness properties) and leverages automation for velocity. The V1 → V2 rebuild demonstrates that real-world deployment surfaces issues that no amount of local testing can predict — and that's expected. Starting from a solid spec means rebuilds start from a foundation, not from scratch.
V2 Deployment Progress (March 8–10, 2026)
V2 infrastructure was rebuilt from scratch with architectural improvements. All seven phases are complete. ArgoCD installed and configured. Backstage is live at https://backstage.vibe-coder.biz with full catalog, scaffolder, and ArgoCD integration. E2E smoke test passed — scaffolded app hellow-world-node-v03 running at https://hellow-world-node-v03.vibe-coder.biz with full GitLab CI pipeline, ArgoCD GitOps sync, shared ALB, HTTPS via ACM, and Separation of Duties policies active.
VPC Endpoints — Hybrid Networking
Private subnets need access to AWS services (ECR, STS, Secrets Manager) without routing through NAT
Base Image Patching in the Template Dockerfile
Added apk update && apk upgrade --no-cache to the template Dockerfile for OS-level patching
Lesson: Security scanning in CI is only useful if the template starts clean — fix the source, not the scan
TypeScript Schema Compatibility
Backstage scaffolder actions used Zod-style lambda schemas that aren't compatible with the JSON Schema format expected by @backstage/plugin-scaffolder-node
17 TypeScript errors across 5 files blocked yarn tsc and the deploy script
Converted all schemas to JSON Schema format with as const type assertions
Lesson: Backstage plugin APIs evolve — verify schema format against the installed version
Stale DNS Wildcard Record
V1 deployment left a *.vibe-coder.biz A record pointing to an old ALB in us-west-1
Terraform tried to create the record but it already existed — InvalidChangeBatch error
Used terraform import to adopt the existing record, then terraform apply updated the alias to the V2 ALB
Lesson: Always check for stale DNS records from previous deployments before applying
Config Validator Crash Loop
Backstage pod entered CrashLoopBackOff after deploy — config validator rejected missing keys
ArgoCD Manifests Exceed the Annotation Size Limit
Known issue with ArgoCD on K8s < 1.30 — client-side apply stores the full manifest in annotations
Fix: kubectl apply --server-side --force-conflicts bypasses the annotation size limit
Lesson: Always use server-side apply for large CRDs like ArgoCD, Istio, or Crossplane
ArgoCD Repo URL Needs .git Suffix
ArgoCD returned 400 when registering a scaffolded app — repository not found
Root cause: publish:gitlab output remoteUrl doesn't include .git suffix
ArgoCD's ls-remote validation requires the .git suffix for HTTPS repos
Fix: Normalize repoUrl to append .git if missing, plus 3s delay and 3-attempt retry for race conditions
Backstage Service-to-Service Auth
Catalog API returned 401 when called from a custom scaffolder action via raw fetch()
Backstage's new backend system requires service tokens for internal API calls
Injected discovery + auth services to get a proper plugin-to-plugin token
Lesson: Never use raw HTTP for Backstage internal calls — use the SDK's auth services
YAML Colons Break GitLab CI Scripts
GitLab CI rejected the pipeline: script config should be a string
Lines like sed -i "s|image:.*|image: $TAG|" contain colons that YAML parses as key-value mappings
Also hit by git commit -m "chore: update image" — the chore: was a YAML key
Fix: Wrap script lines containing colons in single quotes in the YAML template
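The fix in .gitlab-ci.yml terms (the `$TAG` variable and file path are illustrative):

```yaml
# Unquoted, YAML splits the line at ": " and rejects the script entry:
#   - sed -i "s|image:.*|image: $TAG|" k8s/deployment.yaml
# Single-quoting the whole line keeps it a plain scalar string:
script:
  - 'sed -i "s|image:.*|image: $TAG|" k8s/deployment.yaml'
  - 'git commit -m "chore: update image [skip ci]"'
```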
npm ci Requires a Lockfile
GitLab CI lint stage failed: npm ci requires package-lock.json
Scaffolded repos only have package.json — no lockfile is generated by the template
Same issue in the Dockerfile: RUN npm ci --only=production also failed
Fix: Changed to npm install in both CI template and Dockerfile for POC simplicity
Separation of Duties
Tier 1 namespace isolation enforces least-privilege access and network segmentation across all scaffolded workloads and platform namespaces. Every scaffolded app is isolated by default with explicit allow-rules for required traffic.
Namespace isolation
Security controls are easier to scan when grouped into workload and platform boundaries.
Network Isolation
Calico (Tigera Operator) installed via Terraform for NetworkPolicy enforcement
Default-deny on every namespace — no traffic flows unless explicitly allowed
Namespace labels (idp-role) drive policy selectors for cross-namespace rules
Service CIDR 172.20.0.0/16 included in platform egress rules for ClusterIP access
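The default-deny baseline is a small manifest applied to every scaffolded namespace; a sketch with an illustrative namespace name:

```yaml
# Deny all ingress and egress until an explicit allow-rule matches.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app
spec:
  podSelector: {}               # empty selector = every pod in the namespace
  policyTypes: [Ingress, Egress]
  # no ingress/egress rules listed -> all traffic denied by default
```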
RBAC & Least Privilege
Per-app ServiceAccount with read-only Role (no secrets, no write verbs)
platform-admin ClusterRole for security resource management
developer-readonly ClusterRole for safe cluster visibility
ArgoCD AppProject scopes each app to its own namespace and repo
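The per-app AppProject scoping can be sketched as follows (app name and namespace are illustrative):

```yaml
# Each scaffolded app gets an AppProject restricted to one repo and one namespace.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: my-app
  namespace: argocd
spec:
  sourceRepos:
    - https://gitlab.com/devop212/gitops/my-app.git   # only this GitOps repo
  destinations:
    - namespace: my-app                               # only this namespace
      server: https://kubernetes.default.svc
```

An Application under this project cannot deploy from any other repo or into any other namespace.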
CI/CD Credential Scoping
Project-scoped AWS credentials (APP_AWS_ACCESS_KEY_ID) instead of shared group-level keys
Per-app ECR IAM policy template restricting push/pull to app-specific repository
Blast radius limited: compromised credentials only affect one app's ECR repo
Production Path (Tier 2/3)
Tier 1 (current): Namespace isolation within single cluster
Tier 2: Multi-cluster with dedicated workload clusters
Tier 3: Multi-account with AWS Organizations and separate blast radius
Architecture documentation covers migration paths between tiers
14 property-based tests validate SoD correctness properties across randomly generated inputs — ensuring NetworkPolicies, RBAC, and ArgoCD scoping are structurally correct for any valid app name.
GitHub Authentication
Phase 8 replaced the insecure guest auth provider with GitHub OAuth, eliminating unauthenticated access to the production portal. Guest auth with dangerouslyAllowOutsideDevelopment is a known security anti-pattern — GitHub OAuth provides identity verification, audit trail, and maps authenticated users to Backstage catalog entities.
Auth flow
Secrets sync and GitHub OAuth are shown as two connected control paths rather than a terminal diagram.
Configuration Changes
Backend: plugin-auth-backend-module-guest-provider replaced with plugin-auth-backend-module-github-provider in packages/backend/src/index.ts
Frontend: providers={['guest']} changed to providers={['github']} in packages/app/src/App.tsx
Production config: auth.providers.guest section removed; auth.providers.github.production added with ${GITHUB_CLIENT_ID} and ${GITHUB_CLIENT_SECRET}
Dev config: Retains guest: {} for local development convenience alongside the new GitHub provider
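The resulting production config fragment would look roughly like this (shape per Backstage's GitHub provider configuration; the resolver name matches the one used by the platform):

```yaml
auth:
  environment: production
  providers:
    github:
      production:
        clientId: ${GITHUB_CLIENT_ID}
        clientSecret: ${GITHUB_CLIENT_SECRET}
        signIn:
          resolvers:
            - resolver: emailMatchingUserEntityProfileEmail
```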
Secrets Pipeline
GitHub OAuth App registered with callback URL https://backstage.vibe-coder.biz/api/auth/github/handler/frame
Credentials stored in AWS Secrets Manager at backstage-idp-poc/backstage-secrets
ExternalSecret auto-syncs via existing dataFrom.extract pattern — no manifest changes needed
Config validator checks for GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET at startup
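The ExternalSecret that performs the sync could be sketched like this (the ClusterSecretStore name and namespace are illustrative; the Secrets Manager path matches the one above):

```yaml
# Sync the OAuth credentials from AWS Secrets Manager into a K8s Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: backstage-secrets
  namespace: backstage
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: backstage-secrets        # resulting K8s Secret consumed by the pod
  dataFrom:
    - extract:
        key: backstage-idp-poc/backstage-secrets   # all keys pulled in one extract
```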
Why This Matters
Eliminates unauthenticated access to the production portal
Provides identity verification and audit trail for all portal actions
Maps GitHub users to Backstage catalog entities via emailMatchingUserEntityProfileEmail resolver
Removes the dangerouslyAllowOutsideDevelopment: true security anti-pattern from production
Terraform update: Secrets Manager module secret_keys updated to include GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET alongside existing credentials.
Repository Split
The monorepo is being split into three purpose-specific repositories, driven by differing change cadences: app code changes frequently, Terraform rarely, and platform ops at its own pace. Independent CI/CD, versioning, and access control per concern reduces blast radius of changes.
Repository split
The new layout emphasizes three concerns with distinct change cadence and ownership.
Deploy scripts: deploy-backstage.sh (updated with --app-repo parameter)
Tests, docs, specs, and project documentation
No Backstage source code or Terraform after split
Migration Scripts
scripts/split-app-repo.sh: Extracts backstage/ contents into a standalone App Repository
scripts/split-iac-repo.sh: Extracts terraform/ contents + bootstrap script into a standalone IaC Repository
scripts/split-platform-repo.sh: Documents which directories to remove and retain for the Platform Ops Repository
Deploy script updated: The --app-repo (required) and --app-branch (default: main) parameters enable the deploy script to clone the App Repository for Docker builds after the monorepo split. Error handling covers clone failures and missing branches.
Domain & Access
Backstage Portal: https://backstage.vibe-coder.biz
ArgoCD Dashboard: https://argocd.vibe-coder.biz
Scaffolded Apps: https://<app-name>.vibe-coder.biz
Grafana Cloud: https://bedigital.grafana.net
GitLab Repos: https://gitlab.com/devop212/<group>/<app-name>
AWS Region: us-east-2 (Ohio)
What's Next
This POC demonstrates the core self-service workflow. Future iterations could include:
Additional Templates
Python, Go, Java — same self-service experience
Cost Tagging
Automatic AWS cost allocation tags per scaffolded service
Repository Split Execution
Execute the three-repo migration using split-app-repo.sh, split-iac-repo.sh, split-platform-repo.sh
Alerting Rules
Auto-provisioned Grafana Cloud alert rules per service
Multi-Environment
Dev → Staging → Production promotion workflows
SAST/DAST
Static and dynamic application security testing in the pipeline
Project Tracking
This project is tracked in Jira under the BDB project with four epics:
Epic: Backstage Self-Service Developer Platform
25 implementation tasks across Terraform, Backstage, CI/CD, and Kubernetes
Covers the self-service platform spec (Requirements 1-12)
All tasks labeled with backstage-idp
Epic: IDP Deployment Operations
13 tasks covering infrastructure provisioning through end-to-end validation
Includes manual checkpoint tasks for operator-driven steps
Story points: infrastructure (5), DNS/ACM (3), ArgoCD (3), GitLab vars (2), Backstage deploy (5), Grafana (2), smoke test (3), README (3)