How Let's Go DevOps Built a Cloud Infrastructure Business During COVID-19
In this episode of Cyber Diaries, host Marius Poskus talks with Jacek Marmuszewski, co-founder of Let's Go DevOps. They discuss real challenges in cloud infrastructure, from migration strategies to Kubernetes security and cost optimization.
Starting a DevOps company with clients on day one
My co-founder, Krzysztof Szarek, and I are infrastructure engineers. Before founding Let's Go DevOps, we worked at a company heavily dependent on revenue from the travel industry. When COVID-19 stopped travel, the company faced financial problems.
Instead of layoffs, the company offered reduced hours and pay. Krzysztof and I agreed. With extra time during lockdown and nowhere to go, we decided to help small companies and startups build infrastructure.
The pandemic lasted longer than expected. When it ended, the companies we'd helped wanted to hire us full-time. We received four full-time offers each.
Instead of picking one, we decided to call the people we'd enjoyed working with over the years and build a team.
We launched Let's Go DevOps with two clients already committed. As engineers, we figured: how hard can running a company be? What could go wrong?
What Let's Go DevOps does differently
The goal isn't just building infrastructure. Our approach is about long-term relationships with every client we work with. We want to build, maintain, troubleshoot, and support all of your infrastructure, especially cloud infrastructure, so that we're with you along the way as you scale up.
Someone once called us an "in-source DevOps team" - functioning as an internal team but hired externally.
Why 70% of engineers commit infrastructure changes in one company
At one company we work with, which has about 100 engineers in total, we recently checked the licensing for a tool. This required counting how many people commit to the infrastructure repository.
The result: nearly 70 people. That's 70% of the engineering staff actively working on infrastructure.
This means that most of the company is actually building infrastructure for themselves. And we, as the infrastructure team, are helping them with modules and abstraction.
This represents real DevOps culture - but it's not uniform across all developers.
Some developers just want to file a Jira ticket and not touch infrastructure at all, while others in the same company want to do everything on their own - they understand infrastructure and sometimes only need a little guidance.
The lift-and-shift cloud migration mistake
Many companies follow the same failed pattern:
The board reads cloud vendor brochures promising agility and cost savings. They decide to migrate. But they just move everything as-is - same setup, same architecture, just on AWS instead of on-premises servers.
Then reality hits: This is not the best strategy. We are overspending, we are spending more time on development, and the infrastructure is extremely tricky.
Cloud-native for new companies
Companies without legacy infrastructure have an advantage. They can build expecting machines to disappear daily, with full automation and self-healing applications.
They are so used to those ideas that when they are doing technical design, it just clicks into the cloud-native way.
Legacy companies need years, not weeks
For companies with existing infrastructure, the transformation is harder. Applications might share state between clustered nodes or require constant connections. Going cloud-native often requires slow code changes.
Whenever you're thinking about cloud migration and you already have existing infrastructure, it's not a couple of weeks; it will probably be a couple of years before you're fully cloud-native.
The strategy: focus on high-value components first to see immediate benefits, while acknowledging that you've brought a lot of technical baggage with you and at some point you'll need to pay down the tech debt.
Infrastructure as Code: Why it's not optional
I still encounter resistance from engineers who find clicking in AWS easier than writing code.
My analogy: would you run a tech company saying you don't need source code, the binaries are okay, you'll just run them and patch them? That's what managing infrastructure without code is like.
Configuration drift: 400 differences between environments
At my first job at a large corporation, we had database issues. We requested a configuration report comparing production and certification environments.
We had like 400 differences in the configuration. So this means that the environments were not comparable to each other whatsoever.
With Infrastructure as Code, this becomes impossible. Staging and production share identical core infrastructure code.
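Infrastructure as Code removes the need to ever run an audit like the one below, but it shows what such a drift check looks like in practice. A minimal sketch, assuming boto3 and hypothetical parameter-group names: diff the database configuration of two environments and count the differences.

```python
# Sketch of a drift audit: compare two RDS parameter groups.
# The group names are hypothetical placeholders.
import boto3

rds = boto3.client("rds")

def parameters(group_name: str) -> dict:
    """Collect ParameterName -> ParameterValue for a DB parameter group."""
    params = {}
    for page in rds.get_paginator("describe_db_parameters") \
                   .paginate(DBParameterGroupName=group_name):
        for p in page["Parameters"]:
            params[p["ParameterName"]] = p.get("ParameterValue")
    return params

prod = parameters("prod-db-params")           # hypothetical name
cert = parameters("certification-db-params")  # hypothetical name

drift = {k: (prod.get(k), cert.get(k))
         for k in prod.keys() | cert.keys()
         if prod.get(k) != cert.get(k)}

print(f"{len(drift)} differences between environments")
```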
Security benefits of Infrastructure as Code
Tools like tfsec scan infrastructure code before deployment, catching security issues early. Infrastructure changes go through pull request reviews. Testing happens in pre-production environments that match production exactly.
You can do a lot of stuff during your QA to make sure that you are pushing a secure infrastructure upward.
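As a concrete illustration of that QA gate - a minimal sketch assuming a Python-based CI step, a tfsec binary on the runner, and a hypothetical repo layout - the scanner's non-zero exit code is what blocks the merge:

```python
# Sketch of a pre-deployment gate: scan the Terraform directory and
# fail the pipeline on findings. Paths are assumptions.
import subprocess
import sys

result = subprocess.run(["tfsec", "infrastructure/"],  # hypothetical layout
                        capture_output=True, text=True)
print(result.stdout)

if result.returncode != 0:
    # tfsec exits non-zero when it finds problems, so the CI job fails
    # before any insecure infrastructure reaches an environment.
    sys.exit("IaC scan failed - fix the findings before merging")
```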
Preventing chaos when 70 engineers change the infrastructure
How do you maintain control with most of your engineering team modifying the infrastructure?
View-only cloud console access
From the very beginning, we don't give engineers permission to change anything in the console UI. I know clicking is tempting, but what we start with is a view-only or observer policy.
Engineers might get slightly more than view-only - perhaps the ability to restart servers - but not to make changes that create configuration drift.
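In AWS terms, the starting point can be as simple as attaching the managed ViewOnlyAccess policy. A minimal sketch with boto3; the group name is a hypothetical placeholder:

```python
# Give a team read-only console access via AWS's managed policy.
import boto3

iam = boto3.client("iam")

iam.attach_group_policy(
    GroupName="engineers",  # hypothetical group
    PolicyArn="arn:aws:iam::aws:policy/job-function/ViewOnlyAccess",
)
```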
Smart module design prevents security issues
If you are building any module or kind of abstraction for the company and you expose some feature, you will find people using this feature.
Example: If modules allow opening arbitrary ports, engineers will open arbitrary ports. But if modules only expose HTTPS options, nobody requests HTTP-only applications. The entire company naturally builds with encrypted traffic.
The modules and the developer experience around the infrastructure tools should mean that, from the very beginning, you are adding those layers and building things the right way. And the right way doesn't mean my way - it means the company's way.
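Our modules are built with Infrastructure as Code tooling, but the idea translates to any language. Here's a sketch in Python (not our actual module) of an abstraction that simply has no HTTP option - the only knobs engineers get are the certificate and the target group. All ARNs are placeholders.

```python
# An abstraction that only exposes encrypted traffic: the protocol is
# deliberately not a parameter, so every caller gets HTTPS.
import boto3

elbv2 = boto3.client("elbv2")

def expose_service(load_balancer_arn: str, certificate_arn: str,
                   target_group_arn: str) -> None:
    """Expose a service to the world - HTTPS is the only protocol offered."""
    elbv2.create_listener(
        LoadBalancerArn=load_balancer_arn,
        Protocol="HTTPS",          # not configurable by design
        Port=443,
        Certificates=[{"CertificateArn": certificate_arn}],
        DefaultActions=[{"Type": "forward",
                         "TargetGroupArn": target_group_arn}],
    )
```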
Security policies need executive approval
Bottom-up policy creation doesn't work.
The policy needs to come from the top, not from the bottom. Usually, at the bottom, we try to draft the set of rules, but at some point we need someone at C-level - a CSO - who will put their stamp on the policy.
Without formal executive approval, it's more like good recommendations from your fellow engineers; until you have that formal agreement, there is no policy whatsoever.
Implementing realistic policies
Dropping strict policies on a startup doing everything manually will fail.
If you create a really restrictive policy in a company that's still a startup, doing everything by hand without any processes, you can't just start with the policy - nobody will actually be able to fulfill it.
Success requires a plan for gradual implementation with reasonable deadlines. Once the policy is approved and followed, you usually also need some kind of auditing process to make sure all the engineers keep following it.
Why serverless doesn't live up to the hype
I'm not sure I ever saw serverless as the true future.
At Let's Go DevOps, we've built AWS Lambdas and used serverless options like Redshift Serverless and Aurora Serverless. None met expectations.
Lambda maintenance problems
When it comes to Lambda as the driver for your business logic, it's usually extremely hard to maintain. It's fine for really small applications, but as the code base grows bigger and bigger, it's a lot easier to use the techniques we've had for 50 years for deploying your code to servers.
Eventually, people are scared to do maintenance of Lambdas. Some companies freeze development for days when releasing Lambda updates.
Serverless security risks
A friend building an application with serverless tools noticed strange AWS traffic. Investigation revealed the tool was opening everything up because you want the stuff to work, not necessarily be secure.
IAM keys were embedded in the mobile application. If you were good enough at decomposing the Android package, you could extract IAM keys that allowed you a little bit more than you were expecting.
This is bad practice, but this is how those tools usually work. They give you a really quick boost in how fast you can go to production, but usually it's the cost of security.
The serverless cost problem
The more requests you have, the more you'll be paying for the infrastructure, and at some point, this costs like 10 or even 100 times more than you'd be paying for a classical setup with a working application deployed on, for example, Kubernetes.
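The arithmetic behind that claim is easy to sketch. The numbers below are approximate us-east-1 list prices and a hypothetical workload, not a client's real bill; the gap widens further as traffic, memory, and duration grow.

```python
# Back-of-the-envelope comparison, not a real bill. Prices are
# approximate list prices; the workload is hypothetical.
requests_per_month = 500_000_000
avg_duration_s = 0.2   # 200 ms per invocation
memory_gb = 1.0

# Lambda: per-request fee plus GB-seconds of compute.
lambda_cost = (requests_per_month / 1_000_000) * 0.20 \
            + requests_per_month * avg_duration_s * memory_gb * 0.0000166667

# Same load on a small fixed fleet (four m5.large, ~730 h/month).
ec2_cost = 4 * 0.096 * 730

print(f"Lambda: ${lambda_cost:,.0f}/month")      # ~$1,767
print(f"EC2:    ${ec2_cost:,.0f}/month")         # ~$280
print(f"ratio:  {lambda_cost / ec2_cost:.1f}x")  # ~6x, grows with traffic
```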
AWS cost optimization strategy
At Let's Go DevOps, every engagement starts by examining the billing dashboard to understand spending patterns.
Quick wins: databases and storage
Databases rarely scale dynamically, making them ideal for savings plans. It's usually a really easy task to do a savings plan for your databases.
Storage accumulates on S3 because you never get a "storage full" message. Enabling access logs and monitoring usage helps identify data that can be deleted or archived.
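A minimal sketch of the archiving side, assuming boto3 and a hypothetical bucket: a lifecycle rule that moves objects to Glacier after 90 days and expires them after a year, so the bucket stops accumulating forever.

```python
# Lifecycle rule: archive to Glacier after 90 days, delete after a year.
# Bucket name and thresholds are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-logs-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```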
Spot instances save 50-70% on compute
Spot instances offer dramatic savings but require preparation. Your applications need to be prepared to shut down within the 120-second interruption notice.
The approach: find the right balance between instances using spots versus those needing savings plans or reserved capacity.
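What "prepared" means in practice: watch for the interruption notice and drain gracefully. A minimal sketch that polls the EC2 instance metadata endpoint (assuming IMDSv1 is enabled; IMDSv2 requires a session token first):

```python
# Poll the spot interruption endpoint and drain when AWS announces
# the two-minute warning.
import time
import requests

URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def drain_and_exit() -> None:
    """Placeholder: stop accepting work, flush state, deregister."""

while True:
    try:
        # Returns 404 normally; 200 once the 120-second notice is issued.
        if requests.get(URL, timeout=1).status_code == 200:
            drain_and_exit()
            break
    except requests.RequestException:
        pass  # metadata service hiccup - keep polling
    time.sleep(5)
```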
Real client results: 5-10x cost savings
We recently calculated what one client would pay without a cloud-native architecture.
We concluded that it would be five times the money they are paying currently.
But I believe that's conservative: when looking at the different companies we help at different stages, I would say the 5x is only a theoretical value, because usually it's somewhere between 8 and 10x.
The reason for overspending? In the cloud you have this urge: if something is not performing the way you want, you just click and ask for a bigger machine. The costs usually grow beyond any boundary.
Without cloud-native practices and cost controls from the start, you will be dramatically overspending.
Kubernetes security: layered approach
Managed Kubernetes reduces operational burden
First of all, we try to use Kubernetes as a service. That decreases the burden on our side when it comes to setting it up properly.
Self-managed Kubernetes is challenging, especially for upgrades. It's a lot of hassle... making sure that everything is up to date and ready for updates is a little bit tricky.
Cloud provider network security
At Let's Go DevOps, we prefer AWS network-based load balancing over internal Kubernetes networking, enabling security groups for pods.
We can use the same mechanisms that were previously available - firewalling done by security groups - and we can control the traffic between pods, because it flows through the virtual AWS network, not necessarily a Kubernetes one.
Policy enforcement with Kyverno
Tools like Kyverno prevent insecure configurations such as running containers as root.
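Kyverno policies themselves are written as YAML ClusterPolicies; as a language-neutral sketch, this is the kind of rule a "require runAsNonRoot" policy encodes, applied to the pod manifest at admission time:

```python
# Sketch of the check a "require runAsNonRoot" admission policy
# performs on a pod manifest (a dict parsed from YAML).
def violates_non_root(pod: dict) -> bool:
    spec = pod.get("spec", {})
    pod_level = spec.get("securityContext", {}).get("runAsNonRoot", False)
    for container in spec.get("containers", []):
        ctx = container.get("securityContext", {})
        # A container passes only if it, or the pod, pins runAsNonRoot.
        if not ctx.get("runAsNonRoot", pod_level):
            return True
    return False
```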
CI/CD pipeline security gates
If you are building Docker images, we can use tools like Trivy, for example, that will find a lot of potential issues, not only with the libraries that you have, but also with Docker configuration.
If someone configures Docker to run as root, the pipeline catches it. The pipeline will tell you: hey, you are doing weird stuff with your Dockerfile. Fix it, because it will not go to production whatsoever.
The principle: Having the same set of rules on production and staging actually makes the deployment process at least predictable. Because if you were not able to deploy to staging, you'll definitely not be able to deploy to production.
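A minimal sketch of that image gate, assuming Trivy is installed on the CI runner and using a placeholder image tag - the non-zero exit code is what stops the deployment:

```python
# Scan the freshly built image and fail on HIGH/CRITICAL findings.
import subprocess
import sys

image = "registry.example.com/app:candidate"  # hypothetical tag

result = subprocess.run(
    ["trivy", "image", "--exit-code", "1",
     "--severity", "HIGH,CRITICAL", image],
)
if result.returncode != 0:
    sys.exit("Image has HIGH/CRITICAL findings - it will not be deployed")
```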
Container vulnerability management at scale
Managing vulnerabilities across hundreds of microservices requires systematic approaches.
Automated rebuild policy
One company implemented a policy requiring regular rebuilds: your container should be rebuilt once in a while - every week, two weeks, or a month or so. And during that rebuild, even if you don't have code changes, you are pulling new packages.
Benefits include confidence in deployments (they do them regularly), current libraries, and passing tests. When critical vulnerabilities appear, pushing to production is routine rather than exceptional.
Automated alerts for outdated containers
The system sends notifications when your application image is too old: for some reason it was not rebuilt or redeployed, so you should take a look, because something is not okay with the CI/CD pipeline.
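A minimal sketch of such an alert, assuming the images live in ECR; the repository name and the 30-day rebuild window are hypothetical:

```python
# Alert if a repository's newest image is older than the rebuild window.
from datetime import datetime, timedelta, timezone
import boto3

ecr = boto3.client("ecr")
threshold = datetime.now(timezone.utc) - timedelta(days=30)  # rebuild window

pages = ecr.get_paginator("describe_images").paginate(
    repositoryName="payments-service")  # hypothetical repository
newest = max(
    (img["imagePushedAt"] for page in pages for img in page["imageDetails"]),
    default=None,
)
if newest is None or newest < threshold:
    print("ALERT: payments-service not rebuilt in 30+ days - "
          "check the CI/CD pipeline")
```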
The in-source DevOps team model
Traditional consulting companies assign people to projects who follow documentation. When requirements change mid-project - common at agile companies - those companies stop delivering.
We operate differently: Maybe we can be the internal team, but just hired from an external company.
We want to be with you, work with your engineers, go have a beer with your engineers, to be part of your organization, to understand your business, and to build all the infrastructure components around it.
Let's work with Let's Go DevOps
If you have a challenge, we are really open to challenges. I have a lot of Navy SEALs on my team, so we are looking for something fun to play with.
FAQ
How much can companies save with cloud-native architecture?
Based on our client work, at Let's Go DevOps, companies typically save 5-10 times compared to non-cloud-native setups. One client operates at 20% of what traditional infrastructure would cost.
How long does cloud migration take?
For companies with legacy infrastructure, full cloud-native transformation takes years, not weeks. However, focusing on high-value components first delivers benefits from day one.
Why do lift-and-shift migrations fail?
Moving infrastructure as-is to the cloud doesn't capture cloud benefits. Companies end up overspending while getting the same or worse performance because they haven't adopted cloud-native practices.
What's the biggest mistake companies make with cloud costs?
When performance issues arise, clicking to upgrade to bigger machines without understanding the root causes. This creates exponential cost growth without solving underlying problems.
How does Infrastructure as Code improve security?
IaC enables security scanning before deployment, eliminates configuration drift between environments, and makes infrastructure changes reviewable through pull requests like application code.
Why doesn't serverless work for most companies?
Serverless becomes difficult to maintain as codebases grow, can cost 10-100x more at scale than traditional deployments, and quick-start tools often sacrifice security for ease of use.
Podcast: Cyber Diaries Episode 15
Host: Marius Poskus
Guest: Jacek Marmuszewski, Co-founder of Let's Go DevOps
Company: Let's Go DevOps
Topics: Cloud migration, Infrastructure as Code, DevOps culture, Kubernetes security, AWS cost optimization