Architecture Review

What's Missing from the GitOps Demo?

Current Architecture

What You Have (Excellent!)

  1. Multi-Cluster Isolation - Separate DEV and PROD EKS clusters
  2. High Availability in PROD - Dual NAT gateways, 2 managed nodes + Karpenter SPOT across 2 AZs
  3. GitOps Workflow - ArgoCD with auto-sync in DEV and manual sync in PROD
  4. Secrets Management - External Secrets Operator + AWS Secrets Manager
  5. Infrastructure as Code - Terraform with modular structure
  6. CI/CD Pipeline - GitLab CI with automated testing and deployment
  7. Monitoring - Grafana Cloud with Alloy agents
  8. Security - Spring Security, IRSA, no secrets in Git
  9. Policy Enforcement - Kyverno for environment isolation
  10. Load Balancing - Application Load Balancers for each app
  11. Karpenter-First Architecture - SPOT nodes for workloads (~70% cost savings)

What's Missing (For Production-Grade) ⚠️

1. Disaster Recovery & Backup

Current State: No backup strategy visible

What to Add:

  • EKS Cluster Backups: Velero for Kubernetes resource backups, backup to S3 with cross-region replication, automated daily backups with retention policy
  • Database Backups: Automated RDS snapshots, cross-region snapshot copies, point-in-time recovery enabled
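As a rough illustration of the "automated daily backups with retention policy" item, the sketch below builds a Velero Schedule manifest as a plain Python dict. The schedule name, the 02:00 cron expression, the `velero` namespace, and the 30-day retention are assumptions chosen for the example, not values from this demo. Cross-region replication is configured on the S3 bucket itself and is not shown here.

```python
import json


def velero_daily_schedule(name, namespaces, retention_days, cron="0 2 * * *"):
    """Build a Velero Schedule manifest (velero.io/v1) as a plain dict.

    All argument values are illustrative; adjust to the real cluster.
    """
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            # Standard cron syntax: run once a day at 02:00.
            "schedule": cron,
            "template": {
                "includedNamespaces": list(namespaces),
                # Velero expresses retention as a duration string (TTL).
                "ttl": f"{retention_days * 24}h0m0s",
            },
        },
    }


# Hypothetical schedule: back up every namespace, keep backups for 30 days.
manifest = velero_daily_schedule("daily-prod", ["*"], retention_days=30)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be committed to the GitOps repo (converted to YAML if preferred) so that ArgoCD manages the backup schedule like any other resource.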

Interview Talking Point:

"For production, I'd add Velero for cluster backups to S3 with cross-region replication. This gives us disaster recovery capability - we can restore Kubernetes resources and persistent volume data into a rebuilt cluster if needed."

2. Multi-Region Deployment

Current State: Single region (us-east-2)

What to Add:

  • Active-Passive Setup: Primary cluster in us-east-2, standby cluster in us-west-2, Route 53 health checks with failover routing
  • Active-Active Setup: Clusters in multiple regions, global load balancing with Route 53 latency-based or geolocation routing
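The active-passive behaviour can be pictured with a small decision function. This is a simplified model of failover routing, not Route 53 itself. The fallback to the primary when both health checks fail is an assumption of this sketch, so verify it against the Route 53 documentation before relying on it.

```python
# Regions taken from the active-passive bullet above.
REGIONS = {"primary": "us-east-2", "secondary": "us-west-2"}


def failover_region(primary_healthy: bool, secondary_healthy: bool) -> str:
    """Return the region that should receive traffic under failover routing."""
    if primary_healthy:
        return REGIONS["primary"]
    if secondary_healthy:
        return REGIONS["secondary"]
    # Assumption: with nothing healthy, answer with the primary anyway
    # rather than returning no record at all.
    return REGIONS["primary"]


print(failover_region(True, True))    # us-east-2
print(failover_region(False, True))   # us-west-2
```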

Interview Talking Point:

"For Gong's scale, I'd implement multi-region with Route 53 health checks and failover routing. If us-east-2 goes down, traffic automatically fails over to us-west-2. This gives us availability that survives a regional outage, not just an AZ failure."

3. Database Layer

Current State: No persistent database shown

What to Add:

  • RDS for Relational Data: Multi-AZ RDS (Aurora PostgreSQL recommended), read replicas for scaling, automated backups and point-in-time recovery
  • ElastiCache for Caching: Redis cluster for session storage, reduces database load, improves response times

4. Auto-Scaling

Current State: ✅ Karpenter for node auto-scaling, Metrics Server for HPA

Implementation:

  • Karpenter: Automatic node provisioning based on pod demand
  • Metrics Server: EKS managed addon for CPU/memory metrics
  • HPA Ready: Can scale pods on CPU/memory via Metrics Server; custom metrics need an additional adapter (e.g. Prometheus Adapter or KEDA)
  • Cost Optimization: Spin-down scripts for nights/weekends
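To make the HPA behaviour concrete, here is a minimal sketch of the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas x current metric / target metric), with no change while the ratio stays within a tolerance band (10% by default). The replica bounds and utilization figures are made-up inputs.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas / maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods at 90% CPU against a 60% target -> scale out.
print(desired_replicas(4, 90, 60))   # 6
# 4 pods at 63% CPU: ratio 1.05 is inside the tolerance -> no change.
print(desired_replicas(4, 63, 60))   # 4
# 4 pods at 30% CPU -> scale in.
print(desired_replicas(4, 30, 60))   # 2
```

When the extra pods cannot be scheduled on existing nodes, Karpenter provisions capacity for them, which is how the pod and node layers fit together.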

Interview Talking Point:

"For 3-4x scale, I'd add HPA on top of the existing Karpenter setup. During traffic spikes, HPA scales pods, and Karpenter provisions nodes for any pods that can't be scheduled. During off-hours, it all scales back down to save costs - this is how I saved $1.6M at NM."

5. Service Mesh (Optional but Valuable)

Current State: Direct pod-to-pod communication

What to Add:

  • Istio or Linkerd for mutual TLS between services
  • Traffic management (canary deployments, A/B testing)
  • Observability (distributed tracing)
  • Circuit breaking and retries
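A service mesh applies circuit breaking in the sidecar proxy rather than in application code, but the idea is easy to show in a few lines. The sketch below is a generic illustration with hypothetical thresholds; it is not how Istio or Linkerd implement it.

```python
class CircuitBreaker:
    """Open after N consecutive failures, allow a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # cooldown in seconds
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def allow(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a trial request through.
        return now - self.opened_at >= self.reset_after

    def record(self, success, now):
        """Record the outcome of a request made at time `now`."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now


breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0)
for t in (0.0, 1.0, 2.0):
    breaker.record(success=False, now=t)
print(breaker.allow(now=10.0))   # False: circuit is open
print(breaker.allow(now=32.0))   # True: cooldown elapsed, trial request allowed
```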

6. WAF (Web Application Firewall)

Current State: ALB without WAF

What to Add:

  • AWS WAF on ALBs for protection against SQL injection, XSS
  • Rate-based rules to throttle abusive clients and blunt HTTP floods (volumetric DDoS is handled by AWS Shield)
  • Geo-blocking if needed
  • Custom rules for application-specific threats
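The rate-limiting idea can be sketched as a per-client sliding window. AWS WAF evaluates its rate-based rules internally, so this only illustrates the concept; the limit of 3 requests per 300 seconds and the client addresses (documentation-range IPs) are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)   # client IP -> timestamps of allowed hits

    def allow(self, client_ip, now):
        q = self.hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window_seconds=300)
decisions = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3)]
print(decisions)   # [True, True, True, False]
```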

7. Cost Optimization

Current State: ✅ Karpenter-first architecture with SPOT instances

Implementation:

  • Karpenter SPOT Nodes: Workloads run on SPOT instances (~70% cost savings)
  • 2 Managed Nodes: Host the Karpenter controller itself with HA (it should not run on nodes it provisions)
  • Auto-Consolidation: Karpenter terminates underutilized nodes automatically
  • Spin-Down Scripts: Scale to 0 for nights/weekends (~$107/mo baseline)

8. Compliance & Audit

Current State: Basic logging

What to Add:

  • AWS CloudTrail: Management API calls logged; enable log file validation for a tamper-evident audit trail
  • VPC Flow Logs: Network traffic analysis
  • EKS Audit Logs: All Kubernetes API calls logged
  • GuardDuty: Threat detection for AWS accounts

9. Network Policies

Current State: ✅ Zero-trust networking implemented via GitOps

Implementation:

  • 12 Kubernetes NetworkPolicies: Default deny ingress, allow VPC CIDR, DNS, monitoring
  • 4 Calico GlobalNetworkPolicies: CIDR-based ALB ingress, AWS APIs egress, Grafana Cloud egress
  • Deployed via ArgoCD: Pure GitOps, no kubectl in production
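For reference, a default-deny ingress policy and a CIDR allow rule look roughly like the manifests built below. The namespace `demo-app`, the policy names, and the `10.0.0.0/16` VPC CIDR are placeholders, not the values used in the repo.

```python
import json


def default_deny_ingress(namespace):
    """NetworkPolicy that selects every pod and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {},            # empty selector = all pods in the namespace
            "policyTypes": ["Ingress"],   # no ingress rules listed = deny all ingress
        },
    }


def allow_ingress_from_cidr(namespace, cidr, name="allow-vpc-cidr"):
    """NetworkPolicy that admits ingress from one CIDR block."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"ipBlock": {"cidr": cidr}}]}],
        },
    }


# Hypothetical namespace and VPC CIDR.
policies = [
    default_deny_ingress("demo-app"),
    allow_ingress_from_cidr("demo-app", "10.0.0.0/16"),
]
print(json.dumps(policies, indent=2))
```

NetworkPolicies are additive, so the allow rule opens a path through the default deny without needing any ordering between the two.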

10. Certificate Management

Current State: ✅ HTTPS with ACM wildcard certificate (*.thunk-it.com)

Implementation:

  • AWS Certificate Manager (ACM): Wildcard certificate for *.thunk-it.com
  • Route 53 DNS: Custom domains for all applications
  • HTTPS on ALBs: TLS termination at ALB, automatic certificate renewal
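One detail worth keeping in mind with a wildcard certificate: `*.thunk-it.com` covers exactly one label to the left of the domain, so it does not cover the apex or nested subdomains. The sketch below shows that matching rule in simplified form; the hostnames are hypothetical examples.

```python
def wildcard_covers(pattern, hostname):
    """Return True if a certificate name such as '*.example.com' covers hostname."""
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname
    suffix = pattern[1:]                  # e.g. ".thunk-it.com"
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]      # the part matched by "*"
    # The wildcard stands for exactly one non-empty label.
    return bool(label) and "." not in label


print(wildcard_covers("*.thunk-it.com", "app.thunk-it.com"))       # True
print(wildcard_covers("*.thunk-it.com", "thunk-it.com"))           # False
print(wildcard_covers("*.thunk-it.com", "api.dev.thunk-it.com"))   # False
```

If the apex domain or a nested name such as the hypothetical `api.dev.thunk-it.com` ever needs HTTPS, it has to be added to the certificate as an additional name.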

Priority Order for Implementation

Phase 1 (Critical for Production):

  1. HTTPS/TLS - ✅ DONE: ACM wildcard certificate (*.thunk-it.com)
  2. Database Layer (RDS) - Persistent storage
  3. Backups (Velero) - Disaster recovery
  4. Auto-Scaling (HPA + Karpenter) - ✅ DONE: Karpenter + Metrics Server

Phase 2 (High Value):

  1. WAF - Security layer
  2. Cost Optimization (Spot/RI) - ✅ Spot DONE via Karpenter; Reserved Instances still to evaluate
  3. Audit Logging - Compliance
  4. Network Policies - ✅ DONE: 12 NetworkPolicies + 4 GlobalNetworkPolicies

Phase 3 (Scale & Advanced):

  1. Multi-Region - Regional failover
  2. Service Mesh - Advanced traffic management

Estimated Costs

Configuration               Monthly Cost   Notes
Spun Down (Baseline)        ~$107          EKS control plane, NAT gateways, ALBs only
Running (Karpenter HA)      ~$180          2 managed + Karpenter SPOT nodes
Old Setup (4x ON_DEMAND)    ~$230          Before Karpenter optimization
Full Production Features    ~$400          Includes RDS, ElastiCache, WAF, backups

Still very cost-effective for a production-grade, highly available system!
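The relative savings implied by these estimates can be checked with a few lines of arithmetic. The figures are the approximate monthly costs listed above. The spin-down number is an upper bound, since it assumes the environment stays spun down for the entire month.

```python
# Approximate monthly costs in USD, taken from the estimates above.
MONTHLY_COST_USD = {
    "spun_down": 107,
    "running_karpenter": 180,
    "old_on_demand": 230,
    "full_production": 400,
}


def percent_saved(before, after):
    """Percentage reduction going from `before` to `after`, one decimal place."""
    return round(100 * (before - after) / before, 1)


karpenter_saving = percent_saved(MONTHLY_COST_USD["old_on_demand"],
                                 MONTHLY_COST_USD["running_karpenter"])
spin_down_saving = percent_saved(MONTHLY_COST_USD["running_karpenter"],
                                 MONTHLY_COST_USD["spun_down"])

print(karpenter_saving)   # 21.7
print(spin_down_saving)   # 40.6
```

The total bill drops by about 22% after moving to Karpenter, which is lower than the ~70% SPOT figure because fixed items (control plane, NAT gateways, ALBs) are unaffected by instance pricing.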

Interview Strategy

When asked "What would you add?":

"This demo shows the core GitOps architecture. For production at Gong's scale, I'd add:

  1. Immediate priorities: RDS for persistence, Velero for backups, and HPA policies on top of the existing Karpenter setup
  2. Security layer: AWS WAF on ALBs, plus CloudTrail and GuardDuty for audit and threat detection
  3. Cost optimization: Reserved Instances for the baseline alongside the Spot nodes already in place
  4. Scale: Multi-region with Route 53 failover, service mesh for advanced traffic management

The beauty of this architecture is it's extensible - we can add these components incrementally without disrupting the core GitOps workflow."

What You're NOT Missing (Already Great!)

  • GitOps - ArgoCD with proper workflow
  • Secrets Management - External Secrets Operator
  • Multi-Cluster - DEV/PROD isolation
  • IaC - Terraform with modules
  • CI/CD - GitLab with automated testing
  • Monitoring - Grafana Cloud
  • Security - IRSA, no secrets in Git
  • Policy Enforcement - Kyverno
  • High Availability - Multi-AZ, dual NAT in PROD
  • Testing - 352 tests including property-based tests

Conclusion

What You Have (Excellent POC):

Your current implementation demonstrates strong understanding of GitOps principles, multi-cluster architecture, Infrastructure as Code, CI/CD automation, secrets management, monitoring and observability, policy enforcement, and testing practices.

What's Missing (For Production):

The gaps are expected for a POC and show you understand the difference between a demo and production: security hardening (WAF, GuardDuty), data persistence (RDS, ElastiCache), disaster recovery (Velero, multi-region), audit logging (CloudTrail, VPC Flow Logs, EKS audit logs), and advanced features (service mesh). HTTPS, node auto-scaling, Spot-based cost optimization, and network policies are already in place.

Key Message:

"This POC demonstrates the core GitOps architecture that everything else builds on. The foundation is solid - multi-cluster isolation, proper secrets management, automated testing, and monitoring. The remaining phases add the production hardening - persistence, DR, security, and multi-region. The beauty of this modular approach is we can add these incrementally without disrupting the core workflow."