Good, Bad, and Ugly: The Art of Load Balancing at Scale

Jacek Marmuszewski

Load Balancing at Scale: Algorithms, Health Checks, and Production Architecture

In this session at Devoxx Poland 2024, Jacek Marmuszewski, founder of Let's Go DevOps, unravels the complexities of load balancing at scale—from simple round robin algorithms to production architectures handling 6,000 containers on 100% AWS spot instances.

How load balancing reveals your experience

Load balancing, and the way traffic actually reaches your application, is one of my favorite interview questions. Not because there's a good or bad answer to it, but because it reveals a lot about your experience and how big the systems you've worked with really are.

Usually the answers go this way:

The fresh college graduate

When I interview someone fresh out of college, they have usually only built their own side projects. They describe a very simple flow:

We have a user, the user resolves DNS, and the request reaches "my server." "My server" is the literal name, because the candidate typically followed a tutorial that says "this is my server" or "my app." They bought a single server from some hosting provider, started the process there, and pointed DNS directly at the VM.

The big organization developer

Then you have someone who has a little bit more experience but works in big organizations. For this fellow, it starts the same: We have a user, we have DNS, but then we go to some kind of load balancer.

Developers are usually not very familiar with how this load balancing works in bigger corporations, especially with the fact that underneath there are servers, servers you have to order. Very often those are bare-metal machines managed by administrators in some magical way: sometimes with Excel, sometimes with something fancier.

The DevOps culture engineer

Then there are the people I would place in the DevOps culture, not necessarily infrastructure engineers. They describe the whole system in one breath, and it starts not with a user but with users.

Then we have:

  • DNS

  • CDN

  • Load balancers

  • A lot of servers (most of them dynamic)

  • Even more dynamic servers that are coming and going

This is how you spot someone working in a genuinely modern environment that uses the cloud and cloud-native applications. This is how the story begins.

This short diagram gives me a lot of places to dig in: how they do things, and what we can learn from it.

My background

I work as a DevOps engineer and I'm the founder of Let's Go DevOps. We manage infrastructure for different companies, and load balancing is one of the key tools in our toolchain.

Today I want to tell you a little bit about load balancing, including the ugly part: the cases where the load balancing you rely on doesn't work correctly.

The agenda

We will start with:

  • How it works—the algorithms between you and the load balancers

  • Why it doesn't work (corner cases that break load balancing with bigger traffic)

  • How to make it work

  • Some perks of using proper balancing

  • How it's all put together in one of our infrastructures that handles pretty big traffic and is extremely dynamic

This is a company where 2,000 containers can disappear within minutes. At the scale of the organization, 6,000 to 8,000 containers, that means losing around 25% of the servers, and that's normal. We don't get any errors from it.

Putting it all together lets you build a really dynamic environment that can save you a lot of money, especially if you are in the cloud.

Load balancing algorithms: The basics

Round robin

The round robin algorithm is really simple. We have a couple of servers: we send the first request to the first server, then to the second, third, fourth, fifth, and then we start over.

This is probably a flashback from college. I believe it powers 90% of the load balancers you will find on the market, and it's the default algorithm.
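As a quick reference, here is a minimal round-robin picker sketched in Go (the server names are made up for the example):

```go
// Minimal round robin: each call returns the next server in order,
// wrapping around at the end of the list.
package main

import (
	"fmt"
	"sync/atomic"
)

type RoundRobin struct {
	servers []string
	next    atomic.Uint64
}

func (rr *RoundRobin) Pick() string {
	n := rr.next.Add(1) // atomic counter, safe for concurrent requests
	return rr.servers[(n-1)%uint64(len(rr.servers))]
}

func main() {
	rr := &RoundRobin{servers: []string{"srv-1", "srv-2", "srv-3"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.Pick()) // srv-1, srv-2, srv-3, srv-1, ...
	}
}
```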

However, it has some weird behaviors.

When round robin fails: Unequal request times

If your traffic doesn't have similar response times, for example a ping request that is extremely simple and fast next to a request that does a lot of computation and needs a lot of time on the server, you can end up in a situation where the load is not shared equally between the servers.

The first server will do a lot less work than the second one.

With big traffic you'll probably never see this distribution, but we found that small apps, especially UIs, sometimes have exactly this workload: with two servers, the UI first asks for configuration or whether a feature flag is enabled, and then it does the actual work.

You can actually reach this kind of distribution, especially in small apps. In bigger apps it's a little bit more tricky to get something like this.

When round robin fails: Different server sizes

There are also other issues. Let's say our application is okay—the requests are pretty much the same when it comes to resources on the servers. But for some reason we don't have unified servers and we get a machine that is twice as big or three times as big.

This often happens in companies running on bare metal. Whenever they change server generations, the new machines get a lot more resources. You end up with this disproportion between servers, yet round robin keeps pushing the same amount of traffic to all of them.

It may not be ideal.

Weighted round robin: The solution

The way to fix it is to use something called weighted round robin.

An administrator who has different kinds of servers can specify that the weight for server number one is three and the weight for server number two is one.

Now whenever round robin kicks in, it will send three requests to server number one and then one request to server number two. The bigger server will get 75% of the traffic while the smaller server will only get 25%.
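A minimal way to sketch this in Go is to expand the weights into a schedule and round robin over it; real load balancers usually interleave the picks more smoothly, but the resulting proportions are the same:

```go
// Weighted round robin by expanding weights into a slot list: a server
// with weight 3 appears three times, so it gets 3 of every 4 requests
// (75%), matching the example above. The same trick with weights 95 and
// 5 gives a canary split.
package main

import "fmt"

type weightedServer struct {
	name   string
	weight int
}

func buildSchedule(servers []weightedServer) []string {
	var schedule []string
	for _, s := range servers {
		for i := 0; i < s.weight; i++ {
			schedule = append(schedule, s.name)
		}
	}
	return schedule
}

func main() {
	schedule := buildSchedule([]weightedServer{
		{name: "big-server", weight: 3},
		{name: "small-server", weight: 1},
	})
	for i := 0; i < 8; i++ {
		fmt.Println(schedule[i%len(schedule)])
	}
}
```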

Weighted round robin for canary releases

Weighted round robin is also really nice for canary releases. When you have an old application and want to roll out new code, test it in production, and send only a portion of the traffic there, you can set the weights to something like 95% for the old release and 5% for the new one.

The round robin algorithm will make sure that only 5% reaches your new code so you can test it out, see the errors, see it in real life without impacting too much traffic.

Off-topic #1: The canary release disaster with Traefik and Kubernetes

I decided to include a few off-topics in this presentation, because I want to show you real scenarios where things actually got out of hand in corner cases.

The setup

Let's assume we have an application deployed on Kubernetes. The main application has around 20 containers, 20 pods. It's a pretty big application. At higher traffic we have quite a lot of instances here.

The company decided we wanted to test it with a canary release; it's one of the critical applications. So we created a service named "service-canary." Because we are pushing only a portion of the traffic there, we didn't want 20 instances, and for some reason we decided on just one.

Traefik is our load balancer here, configured so that 95% of the traffic goes to the regular service A and 5% goes to the canary release.

We get exactly that distribution of responses: 95% plus 5%. That's what we were expecting.

What went wrong

Now something happened in the infrastructure. We lost one of the containers—probably one of 8,000 containers. One just died. Unfortunately, it was the canary one.

We have this situation:

  • Service A: 20 pods, all working

  • Service A canary: 1 pod, not working

Now I have a question for you: how will the traffic behave?

Option number one: we get 100% successful responses, because service A is still working.

Option number two: we get 95% successes and 5% failures, because we keep pushing those 5% down to the canary release.

The shocking answer

Here is the right answer, the real one, not the one from the documentation: the success rate is 0%. The service returns 404 errors for 100% of the requests.

This is really interesting stuff.

What happened?

First of all, the documentation tells us this scenario should end with a 100% success rate: the canary release should be removed from rotation and no traffic should go there. That is what the documentation says, and everything was set up according to it.

However, there are two small errors in Traefik itself.

First error: during service discovery, Traefik fetches an object from Kubernetes. That object contains a map, and Traefik checks the length of the map; if the map is empty, it concludes "okay, there is nothing in this service."

However, the map doesn't contain only healthy endpoints; it also holds all the unhealthy ones. So if every pod is unhealthy, the map still looks non-empty and Traefik registers the service, but because there are no healthy pods it doesn't register any IPs inside it. We end up with an empty service in the configuration.

That would be okay.

Second bug: most of the load balancer implementations in Traefik expect to work directly on IPs. The weighted load balancer is implemented a little differently: it works on services, which are themselves load balancers, an interface over load balancers doing round robin between the IPs.

Because of this abstraction layer, one of those classes throws an error that there are no IPs inside. The error propagates up and breaks the whole chain of computing the routing for that service.

The result

So what actually happened in Traefik: one canary release died, and the entire company was really sad because the canary died.

We only wanted to try the new code out, but we dropped all the traffic for the entire service because of two really small bugs in Traefik.

The takeaway: always test the corner cases. The setup wasn't designed to handle this kind of situation; the documentation says it will work, but the complexity inside Traefik ended up breaking the entire production service.

Least connections algorithm

Getting back to the original topic: one way to address the problem of different server sizes and different request costs is to use a different load balancing algorithm: least connections.

How it works

Whenever you are thinking about any network traffic, it's not that we are sending data and that's it. We are actually creating a TCP connection and we are keeping the TCP connection alive.

If we have three connections opened, it means we have three requests that were sent to the server and we are waiting for the response on the same TCP connection that we opened.

This means the load balancer actually knows how many requests are currently in progress. If server number one and server number three each have three requests in flight, each of them has three open connections.

We can count open connections that are currently active. We can make a decision to send the next request to the server that is doing the least work.

We can send the next request to server number two because it only had one connection open. Now it will have two.
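Here is a minimal sketch of the idea in Go; the counts are illustrative, and a real balancer would also handle retries, timeouts, and server removal:

```go
// Least connections: track in-flight requests per server and always
// pick the one with the fewest open connections. The count goes up when
// a request starts and down when it finishes.
package main

import (
	"fmt"
	"sync"
)

type LeastConnections struct {
	mu     sync.Mutex
	active map[string]int // server -> currently open connections
}

func (lc *LeastConnections) Acquire() string {
	lc.mu.Lock()
	defer lc.mu.Unlock()
	var best string
	for server, n := range lc.active {
		if best == "" || n < lc.active[best] {
			best = server
		}
	}
	lc.active[best]++
	return best
}

func (lc *LeastConnections) Release(server string) {
	lc.mu.Lock()
	defer lc.mu.Unlock()
	lc.active[server]--
}

func main() {
	lc := &LeastConnections{active: map[string]int{"srv-1": 3, "srv-2": 1, "srv-3": 3}}
	s := lc.Acquire() // picks srv-2: it has the fewest open connections
	fmt.Println(s)
	lc.Release(s)
}
```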

Why I thought this was brilliant (and why I was wrong)

This looks like a really great algorithm. I stumbled across it right after college, in my first job, and I thought: "This is super awesome, this is how we should load balance, because the load balancer has all the details and you always talk to the least occupied server."

However, I learned quickly that it doesn't work.

The zombie server problem

How it fails is actually an interesting story. We had a setup with a handful of production servers, 10 or 15, something like that. During a deployment we accidentally broke one of them.

I'm not really sure how we broke it, a few years have passed, but we had issues with the database connection. Either we missed some configuration or there were network issues on the way to the database.

When the server was building the object factory, the DAO factory, it already knew it couldn't connect to the database. So whenever you tried to get anything from the database, you instantly got an error.

This is interesting because:

When we sent requests to a working server, it had a really huge XML to build, and that took between 500 and 800 milliseconds. Pretty long.

However, when we sent requests to the broken server, it instantly knew there was a connection error and responded with a 500 within a couple of milliseconds.

The disaster

What happened with this algorithm is that server number two was dead—this was our zombie in the system. The load balancer pushed most of the traffic to the service that was responding the fastest.

Producing empty pages and errors is something you can do extremely fast. We ended up with a 90% error rate or even more: 10 servers, one of them broken, and most of the traffic going to the broken one.

The least connections algorithm is not the solution for all the evil.

There are also more complex algorithms that require installing agents and take many resource metrics on the target machine into account, but in my opinion they are a bit too complex to be worth implementing in most cases.

The round robin, the weighted round robin, the least connections—those are the algorithms that are most commonly used. However, they have some issues and you might run into some weird scenarios when using them.

Service discovery: Call me maybe

During my talk about load balancing algorithms, I mentioned service discovery. Service discovery is especially important when we are talking about environments that are dynamic—we are adding servers and removing them and this is normal operation.

When you run Kubernetes, every new deployment actually creates a completely new set of instances, which from our perspective are new servers or services.

Now a question for you: what happens in this dynamic environment when a brand-new service wants to ask the load balancer to include it in the rotation?

The best explanation

I have the best explanation here: The new service is going to service discovery and it says: "Hey, I just met you, and this is crazy, but here's my number, so call me maybe."

This is how it works.

There's a little bit of an off-topic here because those are the lyrics of a song called "Call Me Maybe." I believe this is the only song about distributed computing that's available on the market. If you have some time after this talk, go ahead and listen to it.

It's not only my opinion, because there is a consulting company called Jepsen (Jepsen is the surname of the artist who sings "Call Me Maybe"), and they also found that the lyrics correspond with distributed computing.

In their blog, they are actually testing how distributed systems are behaving with network issues. If we have distributed databases, will we lose the data? Will we have a split brain? Will we be able to use the distributed database during some kind of network event?

Distributed systems, the "Call Me Maybe" song: go check out the Jepsen blog, especially if you work with distributed databases. It will show you what kinds of issues they can have.

Service discovery tools

Getting back to the topic: there are plenty of service discovery tools already on the market, starting with HashiCorp Consul and ZooKeeper, and many of the tools for managing containers and workloads embed service discovery in the system itself.

There's a really huge difference when it comes to timing and how much performance you can get out of those.

Consul: Great for small, problematic for large

Back when there was no Kubernetes, we used Consul a lot. Consul is a great tool for smaller environments, but in bigger and especially more dynamic environments you will run into performance issues.

How Consul works:

We have a leader. This is the only node that will be acknowledging that service is up or down. This will be our main source of truth.

We need a quorum—we need to have a cluster of master servers, at least three of those. They elect a leader and this leader is the one that's giving you a stamp.

Consul implements a really nice protocol for moving data around called gossip. With gossip, when one of the nodes receives a change to the database, it pushes the same change to all the nodes it knows about. We are not transferring the entire database; we send small messages about what changed.

The catch with Consul is that every server you have needs a Consul client installed. The client connects either to one of the Consul servers it can find on the network or to other clients, and everything uses gossip to propagate the changes.

This works really well in small environments: a client sends something to a server, maybe it goes to the leader, then back out to 10 or maybe even 100 servers. Just 800 messages that need to be passed around.

Consul at scale: The problem

Now imagine this situation: we have 6,000 of those clients and 2,000 of them are dying at the same moment, so they all start deregistering.

That's 2,000 agents that need to push deregistration information out to the network. With the gossip protocol, those changes have to reach the server nodes, get a stamp from the leader, and then travel back out through the network, all the way around to those 6,000 clients, to propagate the change into the database.

That gets tedious in large setups with a lot of changes happening in the network at the same time.

Kubernetes solves the problem

What solves the problem? Like every DevOps engineer, I have to say: if you have a problem, Kubernetes is your answer. It doesn't matter what the problem is.

And here Kubernetes really does solve some of the issues, by using Kubernetes events.

I know this is not the logo of events itself, but I found it interesting that there's a website about all the Kubernetes events. Today I will not bore you about Kubernetes because there are plenty of conferences about Kubernetes itself.

What I wanted to mention is that Kubernetes uses those events. There is also one thing I really like about the way Kubernetes answers a key question: the main thing to consider in service discovery is how to actually know that something is wrong with our service.

It's not that obvious—how can a machine know that the service is not working correctly?

Kubernetes health checks

Kubernetes implements a couple of health checks. The names are not that important, but if you think about the lifecycle of your application, you actually have a couple of things you need to consider.

Startup probe

We have application boot time. This means we need to:

  • Download the image

  • Start application

  • Load all the dependencies

  • Maybe contact third-party servers

  • Maybe get configuration

This is done in the very beginning of the application. It is not ready to do anything, it's just booting up.

Kubernetes has the startup probe, which asks whether the application has finished booting and is ready to start working. This is the first probe.

Liveness probe

The second probe answers the question of whether the application process is alive. Kubernetes uses it internally: whenever the application is not alive, it gets restarted.

Readiness probe: The most important for load balancing

Now we get to the probe that matters most from the load balancing perspective: can the application serve traffic?

The readiness probe answers exactly this question: whether or not the application should be added to the load balancer.

This is really interesting stuff because we need to make sure that those probes are correctly configured so we don't get into weird situations.

When I tell you about weird situations, be really aware of zombies in your infrastructure, especially when it comes to load balancing.

The zombie service: Misconfigured readiness probes

One of the configurations we stumbled across looked like this:

We have a service whose readiness probe has:

  • Failure threshold of 4

  • Period seconds (how often the readiness probe is executed) set to 15

Why this is dangerous

In a dynamic environment, adding a new service as fast as possible is maybe not that important, but removing dead services quickly, instead of continuing to send traffic to them, is extremely important.

If we know that something failed, then we need to remove it from the balancer.

The scenario

Our service is playing nicely with all the tooling and the environment; everything is okay and we are pushing traffic to it.

But then we hit the situation where the server, and with it the service, is shutting down.

Now we have this weird situation:

  • The service is down

  • It is not accepting any traffic

  • It is throwing back errors

  • Even our health check is telling us the service is not available

But it can still take up to 15 seconds from the last successful probe before we even ask again. The probe will tell us the service is down, but in the worst case that first signal takes 15 seconds.

Now it has told us it is down. But we have a failure threshold of 4. So we wait 15 more seconds to make sure it's dead. Then another 15 seconds to really, really make sure it's dead. And only after the next 15 seconds do we decide it's dead enough to stop sending traffic there.

The result

This means that for up to 60 seconds we keep pushing traffic while knowing that every request sent there will fail.

That is not a great scenario. Only after those 60 seconds do we remove the service from load balancing and everything goes back to normal.

The fix

This is really important when setting up health checks, especially the ones used for load balancing: make them really lightweight endpoints that you can call as often and as fast as you want without troubling the application itself. Then react to every single failure and remove the application from load balancing immediately.
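As an illustration, here is a minimal sketch in Go of what such a lightweight readiness endpoint could look like, assuming a background goroutine keeps an atomic flag up to date (checkDependencies is a stand-in for a cheap check such as a database ping):

```go
// The handler itself does no real work, so the load balancer can probe
// it as often as it likes; the expensive checks run in the background.
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

var ready atomic.Bool

// Placeholder for a cheap dependency check (e.g. db.PingContext).
func checkDependencies() bool { return true }

func main() {
	go func() {
		for {
			ready.Store(checkDependencies())
			time.Sleep(1 * time.Second)
		}
	}()

	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable) // remove from balancing right away
	})
	http.ListenAndServe(":8080", nil)
}
```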

But what about flip-flopping?

You may run into complaints that applications will flip-flop. Maybe the health check has a really short timeout and the application is just in the middle of garbage collection or something similar; it won't respond in time and we will remove it from load balancing.

But the question here is: Is it really a problem?

If the application is doing a heavy GC because it has a lot of memory to reclaim, removing it from load balancing for a couple of seconds may actually give it time to clean up and come back in much better shape, instead of being constantly hit by new traffic.

Flip-flopping on the readiness probe is not as bad as you might think. Don't be afraid of setting really strict boundaries on those health checks; it's not a big deal if an application flip-flops once in a while.

Graceful shutdown

The next big problem related to health checks is how to actually shut down an application.

You might think it's easy: you walk in, push the power button on the server, and the application is shut down. That is indeed one way to do it. But if you don't want to produce any errors, you need something called graceful shutdown.

How graceful shutdown works

Kubernetes, or whatever else drives your workloads, sends a SIGTERM signal to the application. The question is: what happens then?

The good approach is that the application will:

  1. Capture it

  2. Deregister from load balancer

  3. Wait for all the active connections that are there to finish the work, send their responses

  4. Then shut down

This is the cleanest way we can actually end application life.
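A minimal sketch of this flow in Go might look like the following; the exact wait times are assumptions, and a real service would tie the readiness flip to its actual health endpoint:

```go
// Graceful shutdown: catch SIGTERM, flip the readiness endpoint to "not
// ready" so the balancer deregisters this instance, then let in-flight
// requests finish before exiting.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var draining atomic.Bool

	mux := http.NewServeMux()
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			w.WriteHeader(http.StatusServiceUnavailable) // tell the balancer to stop sending traffic
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go srv.ListenAndServe()

	// Wait for SIGTERM (what Kubernetes sends when it terminates a pod).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	draining.Store(true)
	time.Sleep(5 * time.Second) // give the balancer time to notice the failing probe

	// Stop accepting new connections and wait for active ones to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```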

The bad approach: Hard shutdown

There is also a second option: We are not doing anything and we are waiting for this big bang at the very end of a timeout.

When I say a big bang, this means:

First, on our side—the infrastructure side—we will see a lot of errors coming from the application because all the active connections were just terminated mid-flight. Whatever, just gone.

The second problem: those requests were interrupted mid-flight. I don't want to say mid-transaction, because a transaction can be rolled back. Here you can have a huge request that queries a lot of systems or updates a lot of data in different systems, and you just crashed it in the middle of execution.

This means there may be data inconsistency whenever you are doing this hard shutdown and you are not finishing all your work before you shut down the application.

That's not a cool scenario.

Perks of load balancing: Embrace boring technology

When it comes to the perks of load balancing, I also want to start with an off-topic.

There is a cool site called boringtechnology.club. This is a presentation from I believe 2018 about choosing boring technology.

The author tells a story about his company and how, after every conference, everyone was hyped to use this or that, adopting totally new stuff even when it really wasn't necessary.

Why I love REST

I really like that site and I really love using boring technology. Whenever somebody comes back, especially from KubeCon, they tell me "from now on it's only service mesh, we'll be using gRPC" and a lot of other shiny stuff.

I really love REST communication. It's boring. It's old. There are a lot of tutorials. It's not sexy. gRPC is a lot better.

But REST has one really, really good property: we know how to cache it, and it works well with all the load balancers and caches along the way.

To be honest, I can put developers into two groups (some manage to be in both at the same time): either "let's cache everything we have" or "just clear the cache every once in a while."

Yes, that is a bit of the trouble with caching: you have to handle both sides of it.

Caching advice

Some advice from one wise man:

Not everything should be cached

Not everything should be cached. A lot of data contains personal information, or should not be available without authentication, auditing, and so on. Not everything is suitable for caching.

But do think about where you actually can use caches.

Must-revalidate flag

The next thing: if you are afraid that cached data will go stale and that you will have problems keeping it up to date (we know naming things and purging caches are among the biggest problems in IT), there is a cool feature: the must-revalidate flag.

When you set up the cache, you can tell your NGINX, your load balancer, that for every request that comes in, before serving the cached version, it should go to the application backend and send it a hash of the version it currently has in cache.

The application can do some kind of magic to determine whether the data has changed since that version. The value can be, for example, the date of the last change to the data in the database.

You just look up that single value, the date, and compare. If it's the same, fine: the cached response can be served. If it's not, the backend processes the request again and responds with the updated data.

This is how you can take a lot of headache and a lot of traffic off your application and still use the cache, knowing that you will always serve up-to-date versions.
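In plain HTTP terms this revalidation flow maps onto ETags and conditional requests; here is a rough Go sketch under that assumption, where lastChangedAt and buildExpensiveJSON are made-up helpers standing in for a cheap "when did this row last change?" lookup and the heavy work:

```go
// Revalidation sketch: the cache asks "is my copy still fresh?" via
// If-None-Match; the backend answers with a cheap version check and only
// builds the expensive payload when the cached copy is actually stale.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Assumed helpers, not from the talk.
func lastChangedAt(id string) time.Time   { return time.Now().Truncate(time.Hour) }
func buildExpensiveJSON(id string) []byte { return []byte(`{"id":"` + id + `"}`) }

func handler(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")

	// Cheap version check: a single timestamp from the database.
	etag := fmt.Sprintf(`"%d"`, lastChangedAt(id).Unix())
	w.Header().Set("ETag", etag)
	w.Header().Set("Cache-Control", "max-age=60, must-revalidate")

	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified) // the cache keeps serving its copy
		return
	}

	w.Write(buildExpensiveJSON(id)) // only now do the heavy work
}

func main() {
	http.HandleFunc("/resource", handler)
	http.ListenAndServe(":8080", nil)
}
```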

The MD5 mistake

With must-revalidate there is a small catch I found at one of the companies: the developers were using an MD5 hash as the version information for the cached object.

What the backend did: whenever it got a revalidation request, it went to the database, built the entire JSON, and computed its MD5. If the MD5 was different, the validation returned false, so it ran the "do the work" method, which went to the database again, pulled the JSON, and returned it.

When doing must-revalidate, remember that it needs to be fast. You are trying to take away burden from your backend and push it on the caches. Must-revalidate should be fast and the value that you are using to compare data should be something that you can easily retrieve from the database.

Dark art of caching: Serving stale data

A little bit of the dark art of caching and load balancing: you can use caches even when your service completely breaks down.

This will not work for PII or other sensitive data, but if you have an internal system with a lot of information that you query constantly, you can use the cache to return values even when the service behind it is down.

Yes, the data may not be up to date. Updates will not work. But you may keep your application running on a little bit older data just utilizing the caches that you have across the infrastructure.

Real example

We work with a company that uses a lot of weather data and status information about different things. Those records are usually appended and never changed, so they cache extremely well.

Those systems are built so that even if the underlying services are down, we can serve responses to the internal applications from the cached values.

Leaky bucket for throttling

Another thing to consider alongside caching is the leaky bucket method, which makes sure you are not overwhelmed by the amount of traffic coming into your application.

It's a pattern for throttling your APIs, especially the ones that are computationally heavy.

How leaky bucket works

I think the name explains a lot. Let's imagine a bucket (I actually couldn't find a better icon, so it's not a database this time, it's a bucket). This bucket has a hole in the bottom.

The goal is that if you have a stream of messages or stream of requests that's smaller than the hole in the bucket, it will just go through without any issues.

However, when more requests come in than the hole can drain, they slowly fill up the bucket until the buffer is full. Once the buffer is full, everything else spills over, and that spill-over is the errors you will see.

Leaky bucket is a nice algorithm for softening the curve. If, for example, your applications send a burst of messages to the backend during boot and you don't want that spike to drive backend resource utilization through the roof, you can apply a leaky bucket at the load balancing level so the curve is flattened and the excess is held in the bucket until you can process it.
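A minimal sketch of the idea in Go, with made-up capacity and drain rate:

```go
// Leaky bucket: a buffered channel is the bucket, a single worker is the
// hole in the bottom. Requests that arrive while the bucket is full
// spill over and are rejected instead of overwhelming the backend.
package main

import (
	"errors"
	"fmt"
	"time"
)

type LeakyBucket struct {
	bucket chan func()
}

func NewLeakyBucket(capacity int, drainEvery time.Duration) *LeakyBucket {
	lb := &LeakyBucket{bucket: make(chan func(), capacity)}
	go func() {
		for job := range lb.bucket {
			job()                  // process one queued request...
			time.Sleep(drainEvery) // ...at a fixed, sustainable rate
		}
	}()
	return lb
}

func (lb *LeakyBucket) Submit(job func()) error {
	select {
	case lb.bucket <- job:
		return nil
	default:
		return errors.New("bucket full: request rejected") // the spill-over
	}
}

func main() {
	lb := NewLeakyBucket(100, 10*time.Millisecond) // drains roughly 100 jobs per second
	if err := lb.Submit(func() { fmt.Println("handled") }); err != nil {
		fmt.Println(err)
	}
	time.Sleep(50 * time.Millisecond) // let the worker drain the job
}
```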

Off-topic #2: The more of you, the less (Polish song reference)

There's also a part of a Polish song: "The more of you, the less."

Here is the problem with load balancers, their caches, and their leaky buckets: we usually try to keep load balancers as isolated from each other as possible. We don't want a crash of one load balancer to affect the rest of them.

The cache hit rate problem

If you picture your application behind load balancers, let's assume at least two of them, each one has its own individual cache and its own individual set of leaky buckets.

Yes, we could unify them by creating a Redis cache and keeping the cached data somewhere central in the infrastructure, but then every lookup is another network call to a service that can crash, and we may lose everything at once.

So here we have two load balancers. When a request comes in, it hits either the first or the second one, and each of them has to forward it to the app once before the response can be cached.

If we have 10 GET requests before the data is updated, two of them (one per load balancer) go through to the app and the other eight are served from the caches. That's roughly an 80% hit rate on our caches.

Adding more load balancers reduces cache efficiency

However, if you think "okay, load balancers are super cool, the presentation was super cool, and I want to have more load balancers," then you can add those to the picture. Now we have eight load balancers and each of them is an individual being.

This means that out of those same 10 requests, roughly eight will each land on a load balancer with a cold cache and only two will be served from cache. That's about a 20% hit rate when pulling data out of the cache.
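As a back-of-the-envelope check of that arithmetic (assuming identical, cacheable requests spread evenly across balancers, each of which misses once per object):

```go
// With R identical requests and N independent per-balancer caches,
// roughly N requests miss (one per cache), so about (R - N) / R are
// served from cache.
package main

import "fmt"

func approxHitRate(requests, balancers int) float64 {
	return float64(requests-balancers) / float64(requests)
}

func main() {
	fmt.Println(approxHitRate(10, 2)) // 0.8 -> ~80% with two balancers
	fmt.Println(approxHitRate(10, 8)) // 0.2 -> ~20% with eight balancers
}
```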

So you don't necessarily want a separate cache, or ten of them, for every system. You of course want to survive the crash of half your infrastructure or half your load balancers, but don't be too aggressive with the number of load balancers.

Where should you start with caching?

Now a question for you: where should we start?

If you have never used a cache and want to introduce one, this is a question I often ask people: you are not using any cache, every request goes to your backend. Where should you start?

My answer usually is that I want to start as far away from the application as possible. This means CDN.

You have a lot of resources that are probably public: some HTML, and most likely single-page applications.

You wouldn't believe how many people push those requests all the way into their infrastructure and serve static content from their backend applications.

This is one of the things that is extremely easy to outsource. Use the free plan on Cloudflare, put it in front, make sure your JS files are versioned, and you have a CDN that just works for you.

Where should we start? With the CDN.

The production architecture

Now, with about 10 minutes of the presentation left, let's look at what we did with all this knowledge and how you can build an infrastructure that makes use of it. What architecture do we need?

What we did for one of our customers:

Cloudflare as single entry point

We started with Cloudflare because it's a really good CDN. We also added a lot of security modules there.

The goal and the rule of thumb is that all the traffic to the system comes to Cloudflare.

You may ask "why Cloudflare?" A single entry point certainly looks good in this kind of presentation, but the real point is that all the traffic, whether you run on-premise, on AWS, on GCP, or use third-party services, points at Cloudflare from an attacker's perspective.

I know that more skilled attackers will still manage to do reconnaissance, but it's a bit trickier, so you don't casually expose where you are hosted or which services you are using.

Cloudflare is really good for this. You can also enable bot protection and similar features, so it gives you a lot of tools you can apply at the edge, on the traffic coming in.

AWS Network Load Balancers

The next thing we are using is AWS Network Load Balancers.

Why network, not application load balancers?

First of all, Network Load Balancers are a little bit faster. The second thing is that with Application Load Balancers in AWS, we were hitting the limits so hard that it was not good enough for the scale that we had.

We needed to push down to the Network Load Balancer—the lower layer—to make sure that we can accommodate for the applications.

Multiple Kubernetes clusters

Then we are using Kubernetes clusters. But not one cluster—multiple clusters.

This was a huge discussion within the team because before Kubernetes we were really good at making sure that if a server dies, we can boot it up really quickly. We know how to do it. Infrastructure as code.

Then Kubernetes came along and told us about cattle and pets: the nodes are cattle, and whatever happens to them, Kubernetes will manage it.

But at some point we reached the level where you really have to care about the cluster itself. If you break the production cluster, you break all of production at once, and people make changes to that single cluster every day. That's a little scary.

We decided to keep the old way alive. We have a couple of clusters that are all fully active, and at any point in time we can remove any of them.

We can take one of them out and replace it with a new one. All the heavy changes we make to the clusters, we make on a new instance that is then added to the traffic. In effect we have immutable infrastructure at the Kubernetes cluster level. That's the slightly different approach we picked here.

NGINX on Kubernetes with smart design

This brings its own challenges. To do the load balancing, we install the load balancers, in this case NGINX, directly on the Kubernetes clusters.

Each NGINX keeps its cache locally. They run as a StatefulSet, so whenever one restarts it gets the same volume back, and a lot of the cache is already pre-populated after a simple NGINX restart.

However, if you know Network Load Balancers, you know that adding a new target comes with roughly a 5-minute penalty: you cannot push traffic to it for about 5 minutes while it boots and registers.

The warm standby solution

So our idea was: let's have more NGINXes by deploying them as a DaemonSet. Every server we have runs an NGINX that is pre-registered with the Network Load Balancer, and then we play with the readiness probe.

We mark most of those NGINXes as not ready to serve traffic, so only two or three active NGINXes actually serve traffic while the rest just sit there waiting.

If one of the active NGINXes goes down, we don't have to sit through a 5-minute registration window for a new one. The standby is already warmed up and already registered; we just flip the readiness flag and the new NGINX starts taking traffic.
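To illustrate the warm-standby idea (this is not the exact mechanism from the talk, just a sketch of the pattern), a standby instance can expose a readiness endpoint that only starts returning 200 once an "active" flag is flipped by an operator or a controller:

```go
// Warm standby: the instance is registered with the balancer the whole
// time, but reports "not ready" until it is promoted, so promotion is
// instant instead of waiting for a fresh registration.
package main

import (
	"net/http"
	"sync/atomic"
)

var active atomic.Bool // standbys start with this set to false

func main() {
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if active.Load() {
			w.WriteHeader(http.StatusOK) // balancer sends traffic here
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable) // registered and warm, but idle
	})

	// Hypothetical admin hook used to promote this instance when an
	// active one goes down.
	http.HandleFunc("/admin/activate", func(w http.ResponseWriter, r *http.Request) {
		active.Store(true)
	})

	http.ListenAndServe(":8080", nil)
}
```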

Kubernetes events for fast updates

To make sure that we get all the information about all the changes in the infrastructure, we are listening to Kubernetes events.

Every time a service goes down or is removed, or one of the NGINXes goes down, each NGINX individually refreshes its configuration to stay up to date.

We don't get too many errors whenever we are losing like 20% of the infrastructure because the events are pretty fast.

Cross-cluster backup routing

And in case a request lands on a cluster where one of the services is not working, we have a backup scenario where that traffic is pushed out to one of the neighboring clusters.

You can do this easily with NGINX: the upstream block has a backup flag, which means the backup is used only when none of the regular servers is available. In our case the backup entries point at the other clusters, not at services inside the same one.

It's more like: If the request comes into the cluster, it will be computed on the cluster. But if for some reason we cannot complete the computation there, we'll just pop out the request to another cluster and it will be computed somewhere else.

Yeah, that's the overall architecture of the solution.

Statistics: The numbers that matter

When it comes to statistics:

The company has over 100 microservices. It's a pretty big company when it comes to stuff that you are deploying.

This translates to over 6,000 containers running on the cluster.

At peak this corresponds to between 100 and 200 requests per second coming from external parties, traffic arriving from the internet.

On top of that there are between 200 and 600 requests per second made internally between services.

Maybe not a massive scale, but I think around 800 requests per second at peak is an interesting case to experiment on.

The real feature: 100% spot instances

The real feature (and it is a feature, not an issue): the company runs production on 100% spot instances.

For those of you not familiar with AWS, spot instances come from a server spot market: you can buy capacity for 20%, 30%, 40% of the original price. It's essentially leftover server capacity on AWS.

The only catch is that they are not permanent. You may get a request from AWS to move out of the server within 120 seconds.

Because the company was designed from the very beginning to be cloud native, to move applications quickly between servers, and because the load balancers are built to accommodate this, it can run 100% on spot instances, saving around 60% of its AWS bill just by using these methods.

We don't get errors even when multiple servers are evicted at once. We are effectively doing chaos engineering on production on a daily basis.

I would say we lose about 15% of the infrastructure every day, and that's okay, because the way it's built we can absorb it.

Carbon savings

When it comes to Cloudflare, there's one more thing worth mentioning: in 2022 alone we saved 514 kilograms of carbon, which translates to keeping one light bulb switched off for a year.

I know it's a small step, but bigger companies, more light bulbs—it's definitely a step forward.

Wrap up

Before I finish, I have two things to share with you:

I will be around if you have any questions after the talk. If for some reason you can't find me here, just ping me on LinkedIn: go to the Let's Go DevOps profile and find me there. You can also write to devo@letsgodevops.pl.

We also have some cool gopher t-shirts to win. To win one of them, there are two things you need to do:

First (the tricky one): you need to come up here, because I have some flyers on the table. On the back of the flyer you will find the email address. You don't need to read the whole flyer, just send an email saying you want a t-shirt.

I have five of them with me. Unfortunately there's a disclaimer: I only have fixed sizes and designs with me today, so it's first come, first served.

I think that's all from my side. We still have two minutes, so if you have any questions, we can go now or use the break to talk more about it.

FAQ

What load balancing algorithm should I use?
Round robin is the default in most load balancers but misbehaves with unequal request times or different server sizes. Use weighted round robin for different server capacities or canary releases. Be careful with least connections: it can route most of the traffic to a broken server that responds fastest, with errors.

How should I configure Kubernetes readiness probes for load balancing?
Make probes lightweight and fast with low period seconds (not 15). React immediately to failures—don't use high failure thresholds. Misconfigured probes with failure threshold of 4 and 15-second periods can send traffic to dead services for 60 seconds. Flip-flopping is okay—it gives applications time to recover.

What's the difference between Consul and Kubernetes for service discovery?
Consul works well for small environments but struggles at scale (6,000+ clients with 2,000 simultaneous changes). Kubernetes uses events which are faster for dynamic environments. At large scale with frequent changes, Kubernetes service discovery performs better than Consul's gossip protocol.

How do I implement graceful shutdown?
Applications should capture SIGTERM signals, deregister from load balancer, wait for all active connections to finish sending responses, then shut down. Hard shutdowns cause errors (terminated mid-flight connections) and potential data inconsistency when requests updating multiple systems crash mid-execution.

Where should I start with caching?
Start as far from the application as possible: CDN first. Many companies serve static content (HTML, JavaScript, single-page applications) from backend applications. Use free Cloudflare plans, version your JS files, and get immediate caching benefits without application changes.

How does adding more load balancers affect cache hit rates?
With 2 load balancers and 10 requests, you get ~80% cache hit rate. With 8 load balancers, hit rate drops to ~20% because each load balancer has individual cache. Don't be too aggressive with load balancer count—balance redundancy against cache efficiency.

What's the must-revalidate flag and how should I use it?
Must-revalidate tells load balancers to check with the backend (using a hash of cached data) before serving cached content. Backend compares a simple value like last-modified date—if unchanged, serve from cache; if changed, generate new response. Don't use MD5 of entire JSON (too slow)—use easily retrievable database timestamps.

Why use Network Load Balancers instead of Application Load Balancers?
Network Load Balancers are faster and handle higher scale. With Application Load Balancers in AWS, we hit limits so hard they weren't good enough for our scale. We needed the lower layer (Network Load Balancer) to accommodate the applications.

How do you handle the 5-minute Network Load Balancer penalty?
Deploy NGINX as DaemonSet with multiple instances pre-registered to Network Load Balancer but marked as not ready (readiness probe). Keep 2-3 active NGINXes serving traffic. When one fails, flip the readiness flag on a warm standby—it's already registered, no 5-minute wait.

Can you really run 100% spot instances in production?
Yes, if designed for it from the beginning. Our client runs 6,000+ containers on 100% spot instances, saving 60% on AWS billing. They lose 15% of infrastructure daily without errors. Requires cloud-native design, proper load balancing, fast service discovery (Kubernetes events), and cross-cluster backup routing.

Event: Devoxx Poland 2024
Speaker: Jacek Marmuszewski, Founder of Let's Go DevOps
Company: Let's Go DevOps
Topics: Load balancing algorithms, Kubernetes health checks, service discovery, caching strategies, production architecture, AWS cost optimization, spot instances

Architecture: Cloudflare → AWS Network Load Balancers → Multiple Kubernetes Clusters → NGINX (StatefulSet with local cache) → 100+ microservices, 6,000+ containers

Results: 800 req/s peak, 100% spot instances, 60% AWS cost savings, 15% daily infrastructure loss with zero errors
