How Let's Go DevOps Built a Cloud Infrastructure Business During COVID-19
In this episode of Cyber Diaries, host Marius Poskus talks with Jacek Marmuszewski, co-founder of Let's Go DevOps. They discuss real challenges in cloud infrastructure, from migration strategies to Kubernetes security and cost optimization.
Starting a DevOps company with clients on day one
My co-founder, Krzysztof Szarek, and I are infrastructure engineers. Before founding Let's Go DevOps, we worked at a company heavily dependent on revenue from the travel industry. When COVID-19 stopped travel, the company faced financial problems.
Instead of layoffs, the company offered reduced hours and pay. Krzysztof and I agreed. With extra time during lockdown and nowhere to go, we decided to help small companies and startups build infrastructure.
The pandemic lasted longer than expected. When it ended, the companies we'd helped wanted to hire us full-time. We received four full-time offers each.
Instead of picking one, we decided to call the people we'd enjoyed working with over the years and build a team.
We launched Let's Go DevOps with two clients already committed. As engineers, we figured: how hard can running a company be? What could go wrong?
What Let's Go DevOps does differently
The goal isn't just building infrastructure. Our approach is about long-term relationships with every client we work with. We want to build, maintain, troubleshoot, and support all of your infrastructure, especially cloud infrastructure, so that we're with you along the way as you scale up.
Someone once called us an "in-source DevOps team" - functioning as an internal team but hired externally.
Why 70% of engineers commit infrastructure changes in one company
At one company we work with, which has about 100 engineers in total, we recently checked the licensing for a tool. This required counting how many people commit to the infrastructure repository.
The result: nearly 70 people. That's 70% of the engineering staff actively working on infrastructure.
This means that most of the company is actually building infrastructure for themselves. And we, as the infrastructure team, are helping them with modules and abstraction.
This represents real DevOps culture - but it's not uniform across all developers.
Some developers just want to file a Jira ticket and not touch infrastructure at all, while others in the same company want to do everything on their own - they understand infrastructure and sometimes only need a little guidance.
The lift-and-shift cloud migration mistake
Many companies follow the same failed pattern:
The board reads cloud vendor brochures promising agility and cost savings. They decide to migrate. But they just move everything as-is - same setup, same architecture, just on AWS instead of on-premises servers.
Then reality hits: This is not the best strategy. We are overspending, we are spending more time on development, and the infrastructure is extremely tricky.
Cloud-native for new companies
Companies without legacy infrastructure have an advantage. They can build expecting machines to disappear daily, with full automation and self-healing applications.
They are so used to those ideas that when they are doing technical design, it just clicks into the cloud-native way.
Legacy companies need years, not weeks
For companies with existing infrastructure, the transformation is harder. Applications might share state between clustered nodes or require constant connections. Going cloud-native often requires slow code changes.
Whenever you're thinking about cloud migration and you already have existing infrastructure, it's not a couple of weeks; it will probably be a couple of years before you're fully cloud-native.
The strategy: focus on high-value components first to see immediate benefits, while acknowledging that you've brought a lot of technical baggage with you and at some point you'll need to pay down the tech debt.
Infrastructure as Code: Why it's not optional
I still encounter resistance from engineers who find clicking in AWS easier than writing code.
My analogy: would you run a tech company saying you don't need source code, the binaries are okay, you'll just run them and patch them? That's what managing infrastructure without code is like.
Configuration drift: 400 differences between environments
At my first job at a large corporation, we had database issues. We requested a configuration report comparing production and certification environments.
We had like 400 differences in the configuration. So this means that the environments were not comparable to each other whatsoever.
With Infrastructure as Code, this becomes impossible. Staging and production share identical core infrastructure code.
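Infrastructure as Code removes the need to ever run an audit like the one below, but it shows what such a drift check looks like in practice. A minimal sketch, assuming boto3 and hypothetical parameter-group names: diff the database configuration of two environments and count the differences.

```python
# Sketch of a drift audit: compare two RDS parameter groups.
# The group names are hypothetical placeholders.
import boto3

rds = boto3.client("rds")

def parameters(group_name: str) -> dict:
    """Collect ParameterName -> ParameterValue for a DB parameter group."""
    params = {}
    for page in rds.get_paginator("describe_db_parameters") \
                   .paginate(DBParameterGroupName=group_name):
        for p in page["Parameters"]:
            params[p["ParameterName"]] = p.get("ParameterValue")
    return params

prod = parameters("prod-db-params")           # hypothetical name
cert = parameters("certification-db-params")  # hypothetical name

drift = {k: (prod.get(k), cert.get(k))
         for k in prod.keys() | cert.keys()
         if prod.get(k) != cert.get(k)}

print(f"{len(drift)} differences between environments")
```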
Security benefits of Infrastructure as Code
Tools like tfsec scan infrastructure code before deployment, catching security issues early. Infrastructure changes go through pull request reviews. Testing happens in pre-production environments that match production exactly.
You can do a lot of stuff during your QA to make sure that you are pushing a secure infrastructure upward.
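As a concrete illustration of that QA gate - a minimal sketch assuming a Python-based CI step, a tfsec binary on the runner, and a hypothetical repo layout - the scanner's non-zero exit code is what blocks the merge:

```python
# Sketch of a pre-deployment gate: scan the Terraform directory and
# fail the pipeline on findings. Paths are assumptions.
import subprocess
import sys

result = subprocess.run(["tfsec", "infrastructure/"],  # hypothetical layout
                        capture_output=True, text=True)
print(result.stdout)

if result.returncode != 0:
    # tfsec exits non-zero when it finds problems, so the CI job fails
    # before any insecure infrastructure reaches an environment.
    sys.exit("IaC scan failed - fix the findings before merging")
```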
Preventing chaos when 70 engineers change the infrastructure
How do you maintain control with most of your engineering team modifying the infrastructure?
View-only cloud console access
From the very beginning, we don't give engineers permission to change anything in the console UI. I know clicking is tempting, but what we start with is a view-only or observer policy.
Engineers might get slightly more than view-only - perhaps the ability to restart servers - but not to make changes that create configuration drift.
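In AWS terms, the starting point can be as simple as attaching the managed ViewOnlyAccess policy. A minimal sketch with boto3; the group name is a hypothetical placeholder:

```python
# Give a team read-only console access via AWS's managed policy.
import boto3

iam = boto3.client("iam")

iam.attach_group_policy(
    GroupName="engineers",  # hypothetical group
    PolicyArn="arn:aws:iam::aws:policy/job-function/ViewOnlyAccess",
)
```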
Smart module design prevents security issues
If you are building any module or kind of abstraction for the company and you expose some feature, you will find people using this feature.
Example: If modules allow opening arbitrary ports, engineers will open arbitrary ports. But if modules only expose HTTPS options, nobody requests HTTP-only applications. The entire company naturally builds with encrypted traffic.
The modules and the developer experience around the infrastructure tools should mean that, from the very beginning, you are adding those layers and building things the right way. And the right way doesn't mean my way - it means the company's way.
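Our modules are built with Infrastructure as Code tooling, but the idea translates to any language. Here's a sketch in Python (not our actual module) of an abstraction that simply has no HTTP option - the only knobs engineers get are the certificate and the target group. All ARNs are placeholders.

```python
# An abstraction that only exposes encrypted traffic: the protocol is
# deliberately not a parameter, so every caller gets HTTPS.
import boto3

elbv2 = boto3.client("elbv2")

def expose_service(load_balancer_arn: str, certificate_arn: str,
                   target_group_arn: str) -> None:
    """Expose a service to the world - HTTPS is the only protocol offered."""
    elbv2.create_listener(
        LoadBalancerArn=load_balancer_arn,
        Protocol="HTTPS",          # not configurable by design
        Port=443,
        Certificates=[{"CertificateArn": certificate_arn}],
        DefaultActions=[{"Type": "forward",
                         "TargetGroupArn": target_group_arn}],
    )
```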
Security policies need executive approval
Bottom-up policy creation doesn't work.
The policy needs to come from the top, not from the bottom. Usually, at the bottom, we try to draft the set of rules, but at some point we need someone at C-level - a CSO - who will put their stamp on the policy.
Without formal executive approval, it's more like good recommendations from your fellow engineers; until you have that formal agreement, there is no policy whatsoever.
Implementing realistic policies
Dropping strict policies on a startup doing everything manually will fail.
If you create a really restrictive policy in a company that's still a startup, doing everything by hand without any processes, you can't just start with the policy - nobody will actually be able to fulfill it.
Success requires a plan for gradual implementation with reasonable deadlines. Once the policy is approved and followed, you usually also need some kind of auditing process to make sure all the engineers keep following it.
Why serverless doesn't live up to the hype
I'm not sure I ever saw serverless as the true future.
At Let's Go DevOps, we've built AWS Lambdas and used serverless options like Redshift Serverless and Aurora Serverless. None met expectations.
Lambda maintenance problems
When it comes to Lambda as the driver for your business logic, it's usually extremely hard to maintain. It's fine for really small applications, but as the code base grows bigger and bigger, it's a lot easier to use the techniques we've had for 50 years for deploying your code to servers.
Eventually, people are scared to do maintenance of Lambdas. Some companies freeze development for days when releasing Lambda updates.
Serverless security risks
A friend building an application with serverless tools noticed strange AWS traffic. Investigation revealed the tool was opening everything up because you want the stuff to work, not necessarily be secure.
IAM keys were embedded in the mobile application. If you were good enough at decomposing the Android package, you could extract IAM keys that allowed you a little bit more than you were expecting.
This is bad practice, but this is how those tools usually work. They give you a really quick boost in how fast you can go to production, but usually it's the cost of security.
The serverless cost problem
The more requests you have, the more you'll be paying for the infrastructure, and at some point, this costs like 10 or even 100 times more than you'd be paying for a classical setup with a working application deployed on, for example, Kubernetes.
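The arithmetic behind that claim is easy to sketch. The numbers below are approximate us-east-1 list prices and a hypothetical workload, not a client's real bill; the gap widens further as traffic, memory, and duration grow.

```python
# Back-of-the-envelope comparison, not a real bill. Prices are
# approximate list prices; the workload is hypothetical.
requests_per_month = 500_000_000
avg_duration_s = 0.2   # 200 ms per invocation
memory_gb = 1.0

# Lambda: per-request fee plus GB-seconds of compute.
lambda_cost = (requests_per_month / 1_000_000) * 0.20 \
            + requests_per_month * avg_duration_s * memory_gb * 0.0000166667

# Same load on a small fixed fleet (four m5.large, ~730 h/month).
ec2_cost = 4 * 0.096 * 730

print(f"Lambda: ${lambda_cost:,.0f}/month")      # ~$1,767
print(f"EC2:    ${ec2_cost:,.0f}/month")         # ~$280
print(f"ratio:  {lambda_cost / ec2_cost:.1f}x")  # ~6x, grows with traffic
```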
AWS cost optimization strategy
At Let's Go DevOps, every engagement starts by examining the billing dashboard to understand spending patterns.
Quick wins: databases and storage
Databases rarely scale dynamically, making them ideal for savings plans. It's usually a really easy task to do a savings plan for your databases.
Storage accumulates on S3 because you never get a "storage full" message. Enabling access logs and monitoring usage helps identify data that can be deleted or archived.
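A minimal sketch of the archiving side, assuming boto3 and a hypothetical bucket: a lifecycle rule that moves objects to Glacier after 90 days and expires them after a year, so the bucket stops accumulating forever.

```python
# Lifecycle rule: archive to Glacier after 90 days, delete after a year.
# Bucket name and thresholds are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-logs-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```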
Spot instances save 50-70% on compute
Spot instances offer dramatic savings but require preparation. Your applications need to be prepared to shut down within the 120-second interruption notice.
The approach: find the right balance between instances using spots versus those needing savings plans or reserved capacity.
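What "prepared" means in practice: watch for the interruption notice and drain gracefully. A minimal sketch that polls the EC2 instance metadata endpoint (assuming IMDSv1 is enabled; IMDSv2 requires a session token first):

```python
# Poll the spot interruption endpoint and drain when AWS announces
# the two-minute warning.
import time
import requests

URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def drain_and_exit() -> None:
    """Placeholder: stop accepting work, flush state, deregister."""

while True:
    try:
        # Returns 404 normally; 200 once the 120-second notice is issued.
        if requests.get(URL, timeout=1).status_code == 200:
            drain_and_exit()
            break
    except requests.RequestException:
        pass  # metadata service hiccup - keep polling
    time.sleep(5)
```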
Real client results: 5-10x cost savings
We recently calculated what one client would pay without a cloud-native architecture.
We concluded that it would be five times the money they are paying currently.
But I believe that's conservative: when looking at the different companies we help at different stages, I would say the 5x is only a theoretical value, because usually it's somewhere between 8 and 10x.
The reason for overspending? In the cloud you have this urge: if something is not performing the way you want, you just click and ask for a bigger machine. The costs usually grow beyond any boundary.
Without cloud-native practices and cost controls from the start, you will be dramatically overspending.
Kubernetes security: layered approach
Managed Kubernetes reduces operational burden
First of all, we try to use Kubernetes as a service. That decreases the burden on our side when it comes to setting it up properly.
Self-managed Kubernetes is challenging, especially for upgrades. It's a lot of hassle... making sure that everything is up to date and ready for updates is a little bit tricky.
Cloud provider network security
At Let's Go DevOps, we prefer AWS network-based load balancing over internal Kubernetes networking, enabling security groups for pods.
We can use the same mechanisms that were previously available - firewalling done by security groups - and we can control the traffic between pods, because it flows through the virtual AWS network, not necessarily a Kubernetes one.
Policy enforcement with Kyverno
Tools like Kyverno prevent insecure configurations such as running containers as root.
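Kyverno policies themselves are written as YAML ClusterPolicies; as a language-neutral sketch, this is the kind of rule a "require runAsNonRoot" policy encodes, applied to the pod manifest at admission time:

```python
# Sketch of the check a "require runAsNonRoot" admission policy
# performs on a pod manifest (a dict parsed from YAML).
def violates_non_root(pod: dict) -> bool:
    spec = pod.get("spec", {})
    pod_level = spec.get("securityContext", {}).get("runAsNonRoot", False)
    for container in spec.get("containers", []):
        ctx = container.get("securityContext", {})
        # A container passes only if it, or the pod, pins runAsNonRoot.
        if not ctx.get("runAsNonRoot", pod_level):
            return True
    return False
```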
CI/CD pipeline security gates
If you are building Docker images, we can use tools like Trivy, for example, that will find a lot of potential issues, not only with the libraries that you have, but also with Docker configuration.
If someone configures Docker to run as root, the pipeline catches it. The pipeline will tell you: hey, you are doing weird stuff with your Dockerfile. Fix it, because it will not go to production whatsoever.
The principle: Having the same set of rules on production and staging actually makes the deployment process at least predictable. Because if you were not able to deploy to staging, you'll definitely not be able to deploy to production.
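A minimal sketch of that image gate, assuming Trivy is installed on the CI runner and using a placeholder image tag - the non-zero exit code is what stops the deployment:

```python
# Scan the freshly built image and fail on HIGH/CRITICAL findings.
import subprocess
import sys

image = "registry.example.com/app:candidate"  # hypothetical tag

result = subprocess.run(
    ["trivy", "image", "--exit-code", "1",
     "--severity", "HIGH,CRITICAL", image],
)
if result.returncode != 0:
    sys.exit("Image has HIGH/CRITICAL findings - it will not be deployed")
```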
Container vulnerability management at scale
Managing vulnerabilities across hundreds of microservices requires systematic approaches.
Automated rebuild policy
One company implemented a policy requiring regular rebuilds: your container should be rebuilt once in a while - every week, two weeks, or a month or so. And during that rebuild, even if you don't have code changes, you are pulling new packages.
Benefits include confidence in deployments (they do them regularly), current libraries, and passing tests. When critical vulnerabilities appear, pushing to production is routine rather than exceptional.
Automated alerts for outdated containers
The system sends notifications when your application image is too old: for some reason it was not rebuilt or redeployed, so you should take a look, because something is not okay with the CI/CD pipeline.
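A minimal sketch of such an alert, assuming the images live in ECR; the repository name and the 30-day rebuild window are hypothetical:

```python
# Alert if a repository's newest image is older than the rebuild window.
from datetime import datetime, timedelta, timezone
import boto3

ecr = boto3.client("ecr")
threshold = datetime.now(timezone.utc) - timedelta(days=30)  # rebuild window

pages = ecr.get_paginator("describe_images").paginate(
    repositoryName="payments-service")  # hypothetical repository
newest = max(
    (img["imagePushedAt"] for page in pages for img in page["imageDetails"]),
    default=None,
)
if newest is None or newest < threshold:
    print("ALERT: payments-service not rebuilt in 30+ days - "
          "check the CI/CD pipeline")
```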
The in-source DevOps team model
Traditional consulting companies assign people to projects who follow documentation. When requirements change mid-project - common at agile companies - those companies stop delivering.
We operate differently: Maybe we can be the internal team, but just hired from an external company.
We want to be with you, work with your engineers, go have a beer with your engineers, to be part of your organization, to understand your business, and to build all the infrastructure components around it.
Let's work with Let's Go DevOps
If you have a challenge, we are really open to challenges. I have a lot of Navy SEALs on my team, so we are looking for something fun to play with.
FAQ
How much can companies save with cloud-native architecture?
Based on our client work, at Let's Go DevOps, companies typically save 5-10 times compared to non-cloud-native setups. One client operates at 20% of what traditional infrastructure would cost.
How long does cloud migration take?
For companies with legacy infrastructure, full cloud-native transformation takes years, not weeks. However, focusing on high-value components first delivers benefits from day one.
Why do lift-and-shift migrations fail?
Moving infrastructure as-is to the cloud doesn't capture cloud benefits. Companies end up overspending while getting the same or worse performance because they haven't adopted cloud-native practices.
What's the biggest mistake companies make with cloud costs?
When performance issues arise, clicking to upgrade to bigger machines without understanding the root causes. This creates exponential cost growth without solving underlying problems.
How does Infrastructure as Code improve security?
IaC enables security scanning before deployment, eliminates configuration drift between environments, and makes infrastructure changes reviewable through pull requests like application code.
Why doesn't serverless work for most companies?
Serverless becomes difficult to maintain as codebases grow, can cost 10-100x more at scale than traditional deployments, and quick-start tools often sacrifice security for ease of use.
Podcast: Cyber Diaries Episode 15
Host: Marius Poskus
Guest: Jacek Marmuszewski, Co-founder of Let's Go DevOps
Company: Let's Go DevOps
Topics: Cloud migration, Infrastructure as Code, DevOps culture, Kubernetes security, AWS cost optimization