How did we minimize the risk of outages during the k8s upgrade in a system that contained over 100 microservices?
Implementation
GCP & AWS
Technology
GKE / EKS
Tooling
Golang, Terraform, eksctl
Team
4 Engineers
Scale
6000 production containers
Benefits
Maintenance cost reduction
Summary:
To minimize the risk of outages during k8s upgrades or maintenance, it's best to have multiple Kubernetes clusters in production. Even with a sizable number of nodes, relying on just one cluster can be risky. To ensure high availability, we've adopted the paradigm of immutable infrastructure and established a fleet of independent Kubernetes clusters.
Challenges:
The production system contained over 100 microservices
During peak hours there were over 6000 containers in the cluster
Standard auto-discovery was too slow to keep up with 3000 changes
Solution:
Streamline cluster creation with an internal CLI (cli eks create [role])
Automated deployment with a GitOps model to get clusters up and running
Autoscaling capabilities to ensure clusters can independently handle traffic
Redesign service discovery across the infrastructure to avoid 502 errors while sunsetting clusters
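To illustrate the last point, here is a minimal sketch in Go of one way to retire a cluster without returning 502s: stop routing new requests to it first, wait for in-flight requests to finish, and only then remove it. This is not our production service discovery. The Registry type, its methods, and the cluster names "blue" and "green" are hypothetical and exist only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// State says whether a cluster may receive new requests.
type State int

const (
	Active State = iota
	Draining
)

type cluster struct {
	state    State
	inFlight int
}

// Registry is a hypothetical in-memory view of the cluster fleet.
type Registry struct {
	mu       sync.Mutex
	order    []string
	clusters map[string]*cluster
	next     int
}

var ErrNoActiveCluster = errors.New("no active cluster")

func NewRegistry(names ...string) *Registry {
	r := &Registry{clusters: make(map[string]*cluster)}
	for _, n := range names {
		r.order = append(r.order, n)
		r.clusters[n] = &cluster{state: Active}
	}
	return r
}

// Acquire picks the next Active cluster round-robin and counts the request as in flight.
func (r *Registry) Acquire() (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for i := 0; i < len(r.order); i++ {
		name := r.order[(r.next+i)%len(r.order)]
		if c := r.clusters[name]; c.state == Active {
			r.next = (r.next + i + 1) % len(r.order)
			c.inFlight++
			return name, nil
		}
	}
	return "", ErrNoActiveCluster
}

// Release marks one request to the named cluster as finished.
func (r *Registry) Release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.clusters[name]; ok && c.inFlight > 0 {
		c.inFlight--
	}
}

// Drain stops new requests from reaching the cluster; in-flight ones continue.
func (r *Registry) Drain(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.clusters[name]; ok {
		c.state = Draining
	}
}

// Remove deletes a cluster only once it is draining and has no in-flight requests.
func (r *Registry) Remove(name string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.clusters[name]
	if !ok {
		return errors.New("unknown cluster")
	}
	if c.state != Draining || c.inFlight > 0 {
		return errors.New("cluster still serving traffic")
	}
	delete(r.clusters, name)
	for i, n := range r.order {
		if n == name {
			r.order = append(r.order[:i], r.order[i+1:]...)
			break
		}
	}
	r.next = 0
	return nil
}

func main() {
	r := NewRegistry("blue", "green")
	old, _ := r.Acquire() // a request lands on the old cluster
	r.Drain(old)          // sunset begins: no new traffic to it
	fmt.Println("early removal refused:", r.Remove(old) != nil)
	r.Release(old) // the in-flight request completes
	fmt.Println("removal error after drain:", r.Remove(old))
}
```

The ordering is the point of the sketch: a cluster leaves the rotation before it is torn down, so no request is sent to a backend that is about to disappear.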