How did we minimize the risk of outages during the k8s upgrade in a system that contained over 100 microservices?

Implementation

GCP & AWS

Technology

GKE / EKS

Tooling

Golang, Terraform, eksctl

Team

4 Engineers

Scale

6000 production containers

Benefits

Maintenance cost reduction

Summary:

To minimize the risk of outages during k8s upgrades or maintenance, it's best to have multiple Kubernetes clusters in production. Even with a sizable number of nodes, relying on just one cluster can be risky. To ensure high availability, we've adopted the paradigm of immutable infrastructure and established a fleet of independent Kubernetes clusters.

Challenges:

The production system contained over 100 microservices

During peak hours there are over 6000 containers in the cluster

Standard auto-discovery is too slow to keep up with 3000 changes

Solution:

Streamline cluster creation with an internal CLI (cli eks create [role])

Automate deployment with a GitOps model to get clusters up and running

Add autoscaling so that each cluster can handle traffic independently

Redesign service discovery across the infrastructure to avoid 502 errors while sunsetting clusters


2023 - Let’s Go DevOps - All rights reserved
