
Jacek Marmuszewski
500TB of Audit Logs for $500/Month: A Cloud-Native Migration Story
In this clip from Jacek Marmuszewski's Infoshare 2024 presentation, discover how his team managed to store over 500TB of audit data at a cost of under $500 per month—all while keeping it searchable in real time!
The evening that started an RPG quest
I was settling in for an evening of gaming after doing system upgrades when I got a DM from one of our clients asking if I had two minutes to jump on a call. "This is absolutely nothing urgent," they said.
I didn't have anything else to do, so I jumped in. They told me they had a problem with their system that stores all the audit logs.
That first call kicked off a journey that could be the scenario for a pretty good RPG game.
What we discovered
First, we learned it's not some small system. When we counted all the data, it was half a petabyte of audit logs stored on 10 different servers.
Those 10 servers were in different locations. Some were physical servers. We had at least three versions of their auditing system—not something coherent across the entire environment.
Some servers were actually offline. If you have audit data you keep for auditors, you probably don't want to pay much for internet access or power. The guys were shutting servers down.
For some of those servers, we had a side quest: calling support who had to go to the server room floor, find the server, and boot it up so we could actually access the data.
Some were stored in really remote locations. If you remember the dial-up time, the speed of internet connection was similar. We weren't expecting such slow internet speeds.
The floor is lava. While gathering requirements, we learned that most audit data is kept for compliance and legal reasons. You'll see a lot of decisions driven by some really odd restrictions we had to respect.
Not everything was accessible for us. Sometimes we just needed to be compliant with the law.
Would we do it again?
After half a year or a year of making the project happen and migrating all the data, would we do it again?
Yes, we would. This was a really awesome project with really challenging things that happened along the way.
The flawed initial design
Let me walk you through the story and all the technical details.
The first thing we got from the client was an initial design from their team. The manager told me they needed initial results in about two weeks, and they had already spent two months on the initial design.
The design was prepared for something more like an on-premise setup than a cloud setup. They figured out: "Okay, if we have only MySQL databases, let's use a technology called replication. Everyone uses it. Let's just push it to some servers somewhere and have all the audit data in a single place."
This might be an interesting concept. However, the nature of the client's business is that they have servers in practically every country in Europe, and they wanted to push this data into one or two European locations.
If you think about cross-country replication, including replication to some remote European islands, it's not that simple to maintain. Managing replication over such distances would be a pain in the butt in the long run.
We told them: "Okay, if we take on this project, we want to do it the cloud-native way, and we'll come back with a different design."
They agreed.
Understanding the data
Before any project can start, we need to understand what we are dealing with. We started with gathering and analyzing the nature of the data.
The ledger discovery
First, we learned the data is created as a ledger. This is financial data. Every record that comes in will never be changed.
This changes the game dramatically because now we're not thinking about doing real-time replication. We're thinking about the stream of data we need to move across different databases. There will be no updates, no deletes. If it's in the ledger, then it's in the ledger. That's it.
Finding stakeholders and usage patterns
The second thing was finding stakeholders to get information on how it's used.
It's used in two ways:
Support cases from clients. Clients of our client claim: "I withdrew some money or I added some money or I want some money and I don't see it in my account." Support goes through the data of a certain day or certain game they played, figuring out what happened and if the client is right or not.
Auditors (the more challenging use case). Every once in a while an auditor comes along and says he needs to check that we have the ledger.
If the auditor is new, he'll probably come and say "I want the ledger from this particular day." He looks at this day and says "Okay, if you have a ledger, then it's okay."
The more annoying auditors are the ones who actually played the system or played the game, wrote down all the events, and are asking for their particular game from a year ago, two years ago, five years ago. They're making sure the data is consistent as well.
If you think about those two use cases, they're pretty much the same. We need to get a date range or we need to get information about certain events in the system. Those are the use cases.
Testing storage formats
We started analyzing what the data actually is and how it's stored.
All of the auditing systems they had (or most of them) were running MySQL with the InnoDB engine.
The first thought: if we don't want to change much, we can use a paid solution from Percona. Percona offers compressed tables, so with a single query we can switch the storage engine and let Percona do the magic that shrinks the data.
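The "simple query" here is essentially an engine swap. A minimal sketch using TokuDB (the Percona engine referenced later in this talk); the database and table names are made up, and the exact row format option may differ between Percona versions:

```bash
#!/usr/bin/env bash
# Hypothetical engine swap on a Percona server: rewrite an InnoDB audit table
# into TokuDB with zlib compression. This rewrites the whole table, so expect
# a long-running operation on large tables.
mysql audit_db -e "
  ALTER TABLE audit_ledger
    ENGINE=TokuDB
    ROW_FORMAT=TOKUDB_ZLIB;
"
```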
But half a petabyte of data is not small data, so we also looked toward the big data end of the spectrum: BigQuery, Google's solution, since the client is already on Google.
As I told you, the floor is lava, so BigQuery was one of the few cloud big data options we actually had.
And if you're using BigQuery, it would be a shame to run only small queries on it. This is clearly big data, so we need a proper big data format.
Everyone uses CSVs for this, and where there's CSV, there's its partner in crime: gzip. That gave us a set of formats to try, to see how far we could optimize the initial dataset.
Storage size comparison
You can probably guess the biggest one would be InnoDB: the half petabyte, the 500 terabytes of data, they were already storing.
We estimated how much space the data would take in the different formats:
BigQuery: Just by reloading the data from InnoDB (which carries a lot of indexes and is not optimized for compact storage) into BigQuery, we could cut the volume nearly in half.
CSV: Also not the most efficient way of storing data, but only about 10-20% larger than what we'd get in BigQuery. Interesting.
Percona (TokuDB): Really interesting, because we could reduce the size by around 80% or more just by changing the engine. If we're willing to pay for Percona, we simply switch the engine and get a lot of volume back.
CSV plus gzip: Only about 10% of the original volume. That is the really interesting part of this equation.
Why data size means nothing
What does this chart tell us? It tells us actually nothing. Nothing at all.
Data size is not important in this case. What we need to actually get from it is the price of the storage and the price of the solution we'll be using.
If BigQuery is super cheap and CSVs are super expensive, it doesn't matter how much data is there. But it's a start.
The cost analysis
When thinking about the prices, we have a couple of options in Google that we should consider when it comes to storing the data.
I prepared a short table: each technology, the estimated size in that format, the price per terabyte, the total price for the entire storage, and any additional fees that might come with each of those solutions.
MySQL and Percona on block storage
For these we need block storage: we need servers, and block storage attached to them.
Google's options here are pretty complicated: you can have magnetic volumes, then something more expensive but faster, and faster still. For this particular case we needed something cheap, so we took the first one.
At $41 per terabyte per month, that gives roughly $20,500 for InnoDB (500 TB × $41/TB) and $3,500 for Percona.
On top of that you must add extra free space, because nobody runs databases on the very edge of their storage; count on adding at least 10%.
You also need to pay for compute just to run MySQL, and in the Percona case you need a really beefy machine to get the data out at all.
So the real cost would be even higher. Our estimates were $20,500 and $3,500; the client's invoice for the initial setup was around $27,000, so we were clearly in the same ballpark.
BigQuery native storage
If you think Google's pricing is hard to read in general, with BigQuery I think you need BigQuery just to understand how BigQuery is billed.
There are two options. First, you can use external storage, which I'll talk about in a little while. If we go with BigQuery native storage, it's not as simple as just loading the data.
Your data is measured both as logical bytes and as physical (compressed) bytes, and the two are billed differently.
If you store the data for at least three months without changing it, you get a 50% discount on storage, and since our ledger never changes, that discount applies to almost all of it.
The question is: How much will our data use of this logical and physical storage?
We did some experiments. We took a one-terabyte gzip file and loaded it into BigQuery; we were billed for 4.5 terabytes of logical storage and 1.5 terabytes of physical storage.
Doing that kind of math and estimation, we ended up with an average of about $25.50 per terabyte. Adding it to the table, with roughly 285 terabytes to store, that comes to something like $7,000 a month.
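If you want to reproduce this kind of measurement, BigQuery exposes logical and physical byte counts in its INFORMATION_SCHEMA.TABLE_STORAGE view. A minimal sketch with the bq CLI; the region qualifier, dataset, and table names below are placeholders, not the client's real ones:

```bash
#!/usr/bin/env bash
# Hypothetical check of how a loaded table is accounted for in BigQuery.
bq query --use_legacy_sql=false '
SELECT
  table_name,
  ROUND(total_logical_bytes  / POW(1024, 4), 2) AS logical_tb,   -- uncompressed accounting
  ROUND(total_physical_bytes / POW(1024, 4), 2) AS physical_tb   -- compressed on-disk accounting
FROM `region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = "audit_poc" AND table_name = "ledger_sample"'
```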
Something really interesting here: if you look at the discounted price ($3,600) and compare it with TokuDB, there is absolutely no reason to put anything into TokuDB, where you have to maintain it, upgrade it, and do a lot of magic around the databases, when BigQuery costs maybe 10% more and that's it.
Unfortunately in BigQuery you will also pay for queries. I will come to this later on.
Cloud Storage (CSV and gzip)
For CSVs and gzip versions, we can use Cloud Storage. You also have different options—you can use something cheap or expensive.
The problem with cheap stuff is you need to pay for every read from this drive.
We decided to use a mix of the two:
Standard for data used by support team to answer questions coming in on a regular basis
Archive (the really cheap one) to store data rarely used, used only by auditors
We'll need to pay for reading from Archive; however, looking at the usage statistics, we won't be reading it very often.
Those are the two storage classes we'll be using. Adding them to our table, the bottom line is $130: that's roughly what we would pay per month for the initial half petabyte stored as gzipped CSV on Google's most cost-optimized storage.
This is pretty astonishing. This is how low we could get with the entire setup.
Of course, we ended up playing with it a bit, so the real setup is slightly more expensive. But back to the topic.
For those two formats we'll use BigQuery again as the query engine, with Cloud Storage acting as external storage. We'll pay per query, and sometimes we'll also pay for reading the data from storage.
The proof of concept
Let's pick two for doing an actual POC:
BigQuery with native storage
BigQuery with external storage
Understanding file format performance
First, we read the documentation about how BigQuery handles external storage and different types of files.
You have a note about CSVs and gzips. The main difference: CSVs can be read in parallel while gzips are single-threaded.
This means our queries will be a lot faster on CSVs. However, the size of CSV is quite big.
When we dove deeper into the documentation, it turned out it's not a single thread for all the files you have, but one per file. Having lots of small files means more threads analyzing them in parallel.
The story is not even that simple, because parallelism really works per chunk of work, and it helps if the files are all roughly the same size. Definitely something we could play with during the POC.
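As a rough illustration of what an external table over those gzipped CSVs could look like (a sketch, not the team's actual setup; the bucket, dataset, table, and schema names are made up):

```bash
#!/usr/bin/env bash
# Hypothetical external table over gzipped CSV chunks sitting in Cloud Storage.

# Build a table definition pointing at the compressed chunks.
# Depending on your setup you may need to set "compression": "GZIP" in the JSON.
bq mkdef \
  --source_format=CSV \
  "gs://example-audit-archive/ledger/2024-01-15/*.csv.gz" \
  event_id:INTEGER,event_time:TIMESTAMP,account_id:STRING,amount:NUMERIC \
  > ledger_day_def.json

# Create the external table; BigQuery scans the files in place, nothing is loaded.
bq mk --table --external_table_definition=ledger_day_def.json audit_poc.ledger_2024_01_15

# Query it like any other table; you pay for the query, not for native storage.
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) FROM audit_poc.ledger_2024_01_15 WHERE account_id = "12345"'
```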
Testing file sizes for optimal performance
We started analyzing what file size would be best from a performance versus cost perspective.
We took one day of data and started cutting it into files of different sizes.
4 GB gzip files on the drive:
Slot time: 80 minutes (the compute time we're billed for)
User waiting time: around 11 minutes to get the response
We did a lot of tests with different file sizes.
32 megabyte files:
Slot time: 30 minutes
Response time: a constant 11 seconds from the system
At some point we could see the curve flatten out; this was the sweet spot for the file size.
What does this mean for cost?
Again, the question is what does it mean to us? Again, the answer is absolutely nothing.
What we're interested in is how much will be billed for it.
We'll be billed about 2 cents for going through one day of data. Half an hour of compute time is 2 cents in BigQuery for external storage.
Native storage performance and costs
So what do we get with native storage? Native storage, if you remember, is at least three times more expensive in our case, and realistically more like 6 to 12 times more expensive.
I would assume the queries will be cheaper over there. However, that's not the point here.
We did the same test with native storage.
First, with native storage we need to load the data, and we are billed for the loading.
How much? For this particular example, we were billed for 2 hours and 30 minutes of slot time.
So first of all, by not loading the data into BigQuery at all, we effectively get five query runs on external storage for the same slot budget. That's one of the things that changes here.
The next thing: For any query you run, you can be billed in one of two ways:
Option 1: Buy upfront capacity for running queries (slots)
In this case, we would be billed for 56 minutes of slot time, which comes to roughly 4 cents for running the query. So twice as expensive as running it on external storage.
As I mentioned in the very beginning, the guys needed to wait for at least 6 hours in the current system to get data. I think 11 seconds is a really good bargain here.
Option 2: If you run out of slots, you'll be billed by query size
This is interesting: billed by query size, we'd pay for about 40 GB scanned, which at BigQuery's on-demand rate (around $5 per terabyte at the time) comes to 20 cents for running exactly the same query.
When we're paying upfront, we'll be paying twice as much as for external storage. When we run out of our prepaid version, we'll be paying 10 times as much as we pay for external storage.
Of course, this is the trade-off between performance and cost. But again, we decided to go with external storage and the gzipped files.
Building tools for gzip optimization
After doing the POC, gathering all the data, and presenting the initial results, we decided to go with the gzip approach.
But this is not that simple. We needed to build some tools because gzip is challenging.
The gzip challenge
Gzip is a stream: you push data in, and it outputs a compressed stream without knowing where the input will end.
Second, compression ratios are dynamic, so we cannot know upfront how big the output file will be.
The solution
We did some magic and wrote tools that are:
Pushing data into the gzip stream
Estimating the size of the compressed stream
Recalculating that estimate every so often
Cutting and saving the stream as separate files along the way
We added around 2% of overhead to the data, but in return we get files of pretty consistent size to point BigQuery at.
Then BigQuery says: "Okay, I'll give you one thread per file."
Sometimes you're running with thousands of threads, sometimes with 10,000, 100,000 and so on.
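A minimal sketch of the idea (not the team's actual tooling): export a day of ledger rows, cut the stream into row-count batches as a rough proxy for compressed size, and write each batch as its own self-contained gzip file so BigQuery can scan the files independently. The database, table, and bucket names are made up:

```bash
#!/usr/bin/env bash
# Hypothetical export: one day of ledger data -> many similarly sized gzip files.
set -euo pipefail

DAY="2024-01-15"
ROWS_PER_CHUNK=200000   # tune so compressed chunks land near the ~32 MB sweet spot
PREFIX="ledger-${DAY}-"

# Stream rows out of MySQL as tab-separated text, split by row count, and gzip
# every chunk into its own independent file. (The external table definition
# would need a matching field delimiter, or a conversion step to real CSV.)
mysql --batch --quick --skip-column-names audit_db \
  -e "SELECT * FROM audit_ledger WHERE event_date = '${DAY}'" \
  | split -l "${ROWS_PER_CHUNK}" -d --filter='gzip -9 > "$FILE.csv.gz"' - "${PREFIX}"

# Upload the chunks; lifecycle rules on the bucket take it from here.
gsutil -m cp "${PREFIX}"*.csv.gz "gs://example-audit-archive/${DAY}/"
```

The real tool estimated the compressed stream size on the fly rather than counting rows, which is where the roughly 2% overhead and the consistent file sizes come from.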
The final architecture
If you step back and look at the whole project, this is the final solution, the final architecture:
Step 1: Source extraction
We start at the source: all of the data lives in MySQL databases.
However, instead of running replication across countries, we created a set of really simple scripts that are:
Dumping the data
Zipping the data
Splitting the data into similar chunks
Step 2: Google Cloud Storage
This data is sent to Google Cloud Storage, which we don't have to maintain and which is available around the globe. It's essentially maintenance-free.
What's also interesting: we can attach a lifecycle policy to it. The policy says that after a certain period of time the data is automatically migrated to cheaper storage, and after another period it is removed altogether.
If the data is no longer needed for legal reasons, it simply gets flushed out. That happens automatically; there is no rotation job and nobody doing manual work on the data.
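As an illustration, a lifecycle configuration of this shape can be applied with gsutil; the bucket name and the age thresholds below are made up, not the client's real retention rules:

```bash
#!/usr/bin/env bash
# Hypothetical lifecycle policy: move objects to Archive after 90 days,
# delete them after roughly 7 years. Adjust ages to whatever compliance requires.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
      "condition": {"age": 90}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 2555}
    }
  ]
}
EOF

gsutil lifecycle set lifecycle.json gs://example-audit-archive
```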
Step 3: Two use cases
For internal teams: we use Google BigQuery as the engine for their analytical queries, and its console is the UI they work in.
For compliance: sometimes we need to ship data to remote locations where the regulations say it has to be kept.
With the previous design, that would have meant yet more replication streams pointed in yet more directions.
Here, all we need is a Raspberry Pi with a simple bash script that uses Cloud Storage's sync tooling to pull the data down.
There is no weird setup. It's an extremely simple solution that will just pull the data from the cloud to the location where we need the data.
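The talk doesn't show the script itself, but its shape is roughly this, assuming gsutil rsync as the sync tool; the bucket path and local directory are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical compliance mirror: pull new audit files down to local disk.
# Run from cron on the Raspberry Pi, e.g. once a night.
set -euo pipefail

BUCKET="gs://example-audit-archive/ledger"
LOCAL_DIR="/mnt/audit-mirror/ledger"

mkdir -p "${LOCAL_DIR}"

# -r recurses into the date-based prefixes; rsync only copies what is new.
gsutil -m rsync -r "${BUCKET}" "${LOCAL_DIR}"
```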
We simplified a lot in this process.
The results that matter
Looking at some numbers for the very end:
Cost reduction
We started with billing of $27,500 a month for the entire infrastructure for the data.
With this solution running on BigQuery, the team currently pays around $500 a month.
We reduced the pricing of the solution by $27,000 a month. This is a pretty big saving.
Performance improvement
The second thing that was not in the initial scope: those 11 seconds.
They had never thought about the performance of this system; the data was used rarely, so the internal team could just wait for it.
We reduced it from 6 to 8 hours per query to a constant 11 seconds.
And it's not just the queries they ran in the old system, which were daily analyses. Now they can run analysis across the entire dataset and still get the response in about 11 seconds.
Bigger queries simply get more workers; they just cost a bit more.
I talked with the team that is doing a lot more now with the data. Aren't they afraid of additional cost?
They just laughed and reminded me that one slide earlier I had told them we were saving them $27,000 a month; that's money they can now spend on querying the data.
This is a huge change in the way the company is actually analyzing the data. Those audit logs have a lot of insights that they are starting to use in their day-to-day process.
Infrastructure reduction
Finally, something I'm most proud of: We didn't add any new infrastructure. We actually removed like a dozen servers from the setup.
Absolutely nothing to maintain. Everything in the cloud. Everything managed by the cloud.
We just have a couple of setups in Terraform and everything works seamlessly and smoothly.
Get the full presentation
As I mentioned, there is a longer version of this talk that goes deeper into all of these topics.
If you're interested, or if you liked the article, you can find me on LinkedIn, either through my personal account or my company account, or send an email to contact@letsgodevops.pl.
FAQ
What was the original infrastructure setup?
Half a petabyte (500TB) of audit logs stored across 10 different servers in different European locations, running at least three versions of auditing systems. Some servers were offline to save costs, and some had dial-up-speed internet connections.
Why didn't MySQL replication work?
Cross-country replication across every European country to one or two locations, including remote islands, would be extremely difficult to maintain long-term over such distances.
What made the data easier to migrate?
The audit logs are ledger data—financial records that never change. No updates, no deletes. This eliminated the need for real-time replication and allowed treating it as a data stream instead of a live database.
How much did storage costs decrease?
From $27,500/month to $500/month, a reduction of $27,000/month (roughly 98%). The setup went from MySQL on block storage to gzipped CSV files on Google Cloud Storage queried through BigQuery external tables.
How much did query performance improve?
From 6-8 hours per query to 11 seconds constant response time. The team can now analyze the entire dataset in 11 seconds instead of just running daily queries.
What's the optimal file size for BigQuery external storage?
Through testing, 32MB gzip files were the sweet spot, reducing slot time from 80 minutes (with 4GB files) to 30 minutes and response time from roughly 11 minutes to a constant 11 seconds. Querying one day of data costs about 2 cents.
How much does it cost to query the data?
External storage: ~2 cents for one day of data. Native storage with pre-purchased slots: 4 cents (2x more expensive). Native storage on-demand: 20 cents (10x more expensive). Plus loading data into native storage costs 2.5 hours of compute time upfront.
How is compliance data shipped to required locations?
A Raspberry Pi with a simple bash script using GCP sync for Cloud Storage pulls data from the cloud to wherever compliance requires it. No complex replication setup needed.
How is old data managed?
Automated lifecycle policies on Google Cloud Storage migrate data to cheaper archive storage after a certain period and automatically delete it when no longer legally required. No manual work needed.
Event: InfoShare 2024
Speaker: Jacek Marmuszewski, Co-founder of Let's Go DevOps
Topics: Petabyte-scale migration, BigQuery external storage, audit log optimization, cloud cost reduction, compliance data management


