Current Architecture ✅
What You Have (Excellent!)
- Multi-Cluster Isolation - Separate DEV and PROD EKS clusters
- High Availability in PROD - Dual NAT gateways, 2 managed nodes + Karpenter SPOT across 2 AZs
- GitOps Workflow - ArgoCD with auto-sync in DEV and manual sync in PROD
- Secrets Management - External Secrets Operator + AWS Secrets Manager
- Infrastructure as Code - Terraform with modular structure
- CI/CD Pipeline - GitLab CI with automated testing and deployment
- Monitoring - Grafana Cloud with Alloy agents
- Security - Spring Security, IRSA, no secrets in Git
- Policy Enforcement - Kyverno for environment isolation
- Load Balancing - Application Load Balancers for each app
- Karpenter-First Architecture - SPOT nodes for workloads (~70% cost savings)
What's Missing (For Production-Grade) ⚠️
1. Disaster Recovery & Backup
Current State: No backup strategy visible
What to Add:
- EKS Cluster Backups: Velero for Kubernetes resource backups, backup to S3 with cross-region replication, automated daily backups with retention policy
- Database Backups: Automated RDS snapshots, cross-region snapshot copies, point-in-time recovery enabled
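A daily Velero backup with a retention policy could be sketched as below. This is a minimal illustration, not the project's actual config: the schedule name `prod-daily`, the `apps` namespace, the 02:00 cron time, and the 30-day retention are all assumptions. Cross-region replication would be configured on the S3 bucket itself, not in this manifest.

```python
import json


def velero_daily_schedule(name, namespaces, retention_days=30, cron="0 2 * * *"):
    """Build a Velero Schedule manifest as a plain dict.

    The name, namespaces, cron time, and retention are illustrative
    assumptions, not values taken from the existing clusters.
    """
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,  # standard cron expression
            "template": {
                "includedNamespaces": list(namespaces),
                # Velero expresses retention as a Go-style duration
                "ttl": f"{retention_days * 24}h0m0s",
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical usage: print JSON that could be committed to the GitOps repo
    print(json.dumps(velero_daily_schedule("prod-daily", ["apps"]), indent=2))
```

Committing the rendered manifest to Git would keep backups under the same ArgoCD workflow as everything else.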
Interview Talking Point:
"For production, I'd add Velero for cluster backups to S3 with cross-region replication. This gives us disaster recovery capability - we can restore the entire cluster state if needed."
2. Multi-Region Deployment
Current State: Single region (us-east-2)
What to Add:
- Active-Passive Setup: Primary cluster in us-east-2, standby cluster in us-west-2, Route53 health checks with failover routing
- Active-Active Setup: Clusters in multiple regions, global load balancing with Route53 latency-based or geolocation routing
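The active-passive failover decision can be sketched as a small function. This is a simplified model of failover routing, not Route53's implementation. The fallback to the primary when both sides are unhealthy reflects my understanding of Route53's documented behavior and is worth verifying before quoting it.

```python
def resolve_failover(primary_healthy: bool, secondary_healthy: bool,
                     primary: str = "us-east-2",
                     secondary: str = "us-west-2") -> str:
    """Return the region that should receive traffic.

    Simplified model of active-passive failover routing based on
    health check results for each region.
    """
    if primary_healthy:
        return primary
    if secondary_healthy:
        return secondary
    # Both health checks failing: answer with the primary anyway
    # (assumption about Route53 behavior, see the note above).
    return primary


if __name__ == "__main__":
    print(resolve_failover(primary_healthy=False, secondary_healthy=True))
```

In practice the speed of failover also depends on the health check interval and the DNS record TTL, which this sketch ignores.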
Interview Talking Point:
"For Gong's scale, I'd implement multi-region with Route53 health checks. If us-east-2 goes down, traffic automatically fails over to us-west-2. This gives us true high availability at the regional level."
3. Database Layer
Current State: No persistent database shown
What to Add:
- RDS for Relational Data: Multi-AZ RDS (Aurora PostgreSQL recommended), read replicas for scaling, automated backups and point-in-time recovery
- ElastiCache for Caching: Redis cluster for session storage, reduces database load, improves response times
4. Auto-Scaling
Current State: ✅ Karpenter for node auto-scaling, Metrics Server for HPA
Implementation:
- Karpenter: Automatic node provisioning based on pod demand
- Metrics Server: EKS managed addon for CPU/memory metrics
- HPA Ready: Can scale pods on CPU/memory today; custom metrics need a metrics adapter (e.g., Prometheus Adapter or KEDA)
- Cost Optimization: Spin-down scripts for nights/weekends
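The HPA scaling decision follows the documented formula desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), with a default tolerance of 10% around the target. A minimal sketch, where the min/max bounds are illustrative assumptions and stabilization windows are ignored:

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric.

    min_replicas and max_replicas are illustrative defaults; the real
    controller also applies stabilization windows and scaling policies.
    """
    ratio = current_metric / target_metric
    # Within tolerance of the target: no scaling action
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods at 90% CPU against a 60% target
    print(desired_replicas(4, 90, 60))
```

When the new pods no longer fit on existing nodes they go Pending, which is the signal Karpenter uses to provision more capacity.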
Interview Talking Point:
"For 3-4x scale, I'd implement Cluster Autoscaler and HPA. During traffic spikes, HPA scales pods, and Cluster Autoscaler adds nodes if needed. During off-hours, it scales down to save costs - this is how I saved $1.6M at NM."
5. Service Mesh (Optional but Valuable)
Current State: Direct pod-to-pod communication
What to Add:
- Istio or Linkerd for mutual TLS between services
- Traffic management (canary deployments, A/B testing)
- Observability (distributed tracing)
- Circuit breaking and retries
6. WAF (Web Application Firewall)
Current State: ALB without WAF
What to Add:
- AWS WAF on ALBs for protection against SQL injection, XSS
- Rate-based rules to throttle application-layer (HTTP) floods; network-layer DDoS is handled by AWS Shield
- Geo-blocking if needed
- Custom rules for application-specific threats
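The idea behind a rate-based rule can be sketched as a per-IP sliding window. This illustrates the concept only and is not how AWS WAF is implemented internally. The class name, the limit, and the window size are assumptions.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Per-client sliding-window request limiter (conceptual sketch)."""

    def __init__(self, limit, window_seconds=300):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client IP -> request timestamps

    def allow(self, client_ip, now):
        """Return True if the request at time `now` (seconds) is allowed."""
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: block
        hits.append(now)
        return True


if __name__ == "__main__":
    rule = RateBasedRule(limit=3, window_seconds=60)
    # Hypothetical client sending four requests in four seconds
    print([rule.allow("203.0.113.7", t) for t in range(4)])
```

Each client is tracked independently, so one noisy IP gets blocked without affecting other callers.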
7. Cost Optimization
Current State: ✅ Karpenter-first architecture with SPOT instances
Implementation:
- Karpenter SPOT Nodes: Workloads run on SPOT instances (~70% cost savings)
- 2 Managed Nodes: Reserved for the Karpenter controller in HA, which should not run on nodes it provisions itself
- Auto-Consolidation: Karpenter terminates underutilized nodes automatically
- Spin-Down Scripts: Scale to 0 for nights/weekends (~$107/mo baseline)
8. Compliance & Audit
Current State: Basic logging
What to Add:
- AWS CloudTrail: Management API calls logged; enable log file validation for a tamper-evident audit trail
- VPC Flow Logs: Network traffic analysis
- EKS Audit Logs: All Kubernetes API calls logged
- GuardDuty: Threat detection for AWS accounts
9. Network Policies
Current State: ✅ Zero-trust networking implemented via GitOps
Implementation:
- 12 Kubernetes NetworkPolicies: Default deny ingress, allow VPC CIDR, DNS, monitoring
- 4 Calico GlobalNetworkPolicies: CIDR-based ALB ingress, AWS APIs egress, Grafana Cloud egress
- Deployed via ArgoCD: Pure GitOps, no kubectl in production
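The default-deny plus explicit-allow pattern can be sketched with two manifests built as plain dicts. The namespace, the VPC CIDR `10.0.0.0/16`, and port 8080 are illustrative assumptions, not values from the existing 12 policies.

```python
import json


def default_deny_ingress(namespace):
    """NetworkPolicy that selects every pod and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        # Empty podSelector matches all pods; no ingress rules means deny all
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress_from_cidr(namespace, cidr, port):
    """NetworkPolicy that admits TCP traffic from one CIDR on one port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-vpc-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"ipBlock": {"cidr": cidr}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace, CIDR, and port
    for policy in (default_deny_ingress("apps"),
                   allow_ingress_from_cidr("apps", "10.0.0.0/16", 8080)):
        print(json.dumps(policy, indent=2))
```

NetworkPolicies are additive, so the allow rule opens exactly one path on top of the default deny.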
10. Certificate Management
Current State: ✅ HTTPS with ACM wildcard certificate (*.thunk-it.com)
Implementation:
- AWS Certificate Manager (ACM): Wildcard certificate for *.thunk-it.com
- Route 53 DNS: Custom domains for all applications
- HTTPS on ALBs: TLS termination at ALB, automatic certificate renewal
Priority Order for Implementation
Phase 1 (Critical for Production):
- HTTPS/TLS - ✅ DONE: ACM wildcard certificate (*.thunk-it.com)
- Database Layer (RDS) - Persistent storage
- Backups (Velero) - Disaster recovery
- Auto-Scaling (HPA + Karpenter) - ✅ DONE: Karpenter + Metrics Server
Phase 2 (High Value):
- WAF - Security layer
- Cost Optimization (Spot/RI) - Spot ✅ DONE via Karpenter; Reserved Instances for the baseline still open
- Audit Logging - Compliance
- Network Policies - ✅ DONE: 12 NetworkPolicies + 4 GlobalNetworkPolicies
Phase 3 (Scale & Advanced):
- Multi-Region - Regional failover
- Service Mesh - Advanced traffic management
Estimated Costs
| Configuration | Monthly Cost | Notes |
|---|---|---|
| Spun Down (Baseline) | ~$107 | EKS control plane, NAT gateways, ALBs only |
| Running (Karpenter HA) | ~$180 | 2 managed + Karpenter SPOT nodes |
| Old Setup (4x ON_DEMAND) | ~$230 | Before Karpenter optimization |
| Full Production Features | ~$400 | Includes RDS, ElastiCache, WAF, backups |
Still very cost-effective for a production-grade, highly available system!
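The arithmetic behind these figures can be sketched as below, using the monthly costs from the table. The 60 active hours per week (12 hours on weekdays) and the assumption that cost scales linearly with running hours are illustrative, not measured values.

```python
# Monthly costs in USD, taken from the table above
SPUN_DOWN = 107
RUNNING = 180
OLD_ON_DEMAND = 230


def savings_pct(before, after):
    """Percentage saved when moving from `before` to `after`."""
    return round((before - after) / before * 100, 1)


def blended_monthly(baseline, running, active_hours_per_week):
    """Estimate monthly cost when the cluster only runs part of the week.

    Assumes the cost above baseline scales linearly with running hours.
    """
    fraction = active_hours_per_week / 168  # 168 hours in a week
    return round(baseline + (running - baseline) * fraction, 2)


if __name__ == "__main__":
    print(savings_pct(OLD_ON_DEMAND, RUNNING))      # Karpenter vs old setup
    print(blended_monthly(SPUN_DOWN, RUNNING, 60))  # hypothetical schedule
```

Under those assumptions the Karpenter setup is about 21.7% cheaper than the old one, and a weekday-only schedule lands near $133/month.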
Interview Strategy
When asked "What would you add?":
"This demo shows the core GitOps architecture. For production at Gong's scale, I'd add:
- Immediate priorities: RDS for persistence, Velero for backups, and HPA policies for the workloads on top of the existing Karpenter setup
- Security layer: AWS WAF on ALBs, plus audit logging with CloudTrail and GuardDuty
- Cost optimization: Reserved Instances for the baseline nodes, on top of the existing Spot capacity
- Scale: Multi-region with Route53 failover, service mesh for advanced traffic management
The beauty of this architecture is it's extensible - we can add these components incrementally without disrupting the core GitOps workflow."
What You're NOT Missing (Already Great!)
- ✅ GitOps - ArgoCD with proper workflow
- ✅ Secrets Management - External Secrets Operator
- ✅ Multi-Cluster - DEV/PROD isolation
- ✅ IaC - Terraform with modules
- ✅ CI/CD - GitLab with automated testing
- ✅ Monitoring - Grafana Cloud
- ✅ Security - IRSA, no secrets in Git
- ✅ Policy Enforcement - Kyverno
- ✅ High Availability - Multi-AZ, dual NAT in PROD
- ✅ Testing - 352 tests including property-based tests
Conclusion
What You Have (Excellent POC):
Your current implementation demonstrates strong understanding of GitOps principles, multi-cluster architecture, Infrastructure as Code, CI/CD automation, secrets management, monitoring and observability, policy enforcement, and testing practices.
What's Missing (For Production):
The remaining gaps are expected for a POC and show you understand the difference between a demo and production: security hardening (WAF, GuardDuty, audit logging), data persistence (RDS, ElastiCache), disaster recovery (Velero, multi-region), and advanced features (service mesh). HTTPS, node auto-scaling, Spot-based cost optimization, and network policies are already in place.
Key Message:
"This POC demonstrates the core GitOps architecture that everything else builds on. The foundation is solid - multi-cluster isolation, proper secrets management, automated testing, and monitoring. Phase 2 adds the production hardening - security, persistence, auto-scaling, and DR. The beauty of this modular approach is we can add these incrementally without disrupting the core workflow."