
Jacek Marmuszewski
From Java Developer to Cloud Infrastructure Expert: Real Talk on Scaling, Security, and Cloud Costs
Listen to the full episode above or read the article below:
In this episode of TestGuild's DevOps Toolchain Show, host Joe Colantonio sits down with Jacek Marmuszewski, co-founder of Let's Go DevOps, to discuss the real challenges of building scalable cloud infrastructure—from costly mistakes to AI security threats.
The unexpected path to DevOps
My journey into DevOps wasn't planned. Like many engineers, I expected to be a programmer after college.
I was hired as a Java developer for my college internship. The amount of boilerplate code required to build anything in early Java was overwhelming. I quickly realized Java wasn't for me.
I switched to C++, thinking it would be faster and more interesting. After a year or two, I realized programming wasn't my thing either. But during that time, I was exposed to system-level components—mainframes, operating systems, and hardware.
That's when it clicked. I could delegate coding tasks through Jira tickets while working directly with low-level infrastructure.
I transitioned to system administration and system engineering. When the DevOps culture emerged, I was one of the first to adopt it. I had a programming background and system engineering experience—the transition was seamless.
Joining companies that showed me how to use the cloud for building scalable applications brought everything together. That programming and system engineering knowledge just clicked into place.
The cloud scalability myth
The biggest misconception about cloud infrastructure? That it scales infinitely.
Cloud providers only claim to be infinitely scalable. There are limits. We've actually broken the cloud in a couple of scenarios when we reached those limits.
But that's a rare problem. Most companies will never hit cloud provider limits.
The real scalability challenge is application design itself.
When your application isn't ready to scale
We frequently work with startups that have built systems rapidly, then tried to launch and scale. The problems we discover:
You cannot run a second server because everything—including user sessions—is kept in memory. You're stuck with sticky sessions and architectural workarounds.
Applications don't boot correctly when running multiple servers. Or worse, shutdown processes lose data. When you're scaling up and down automatically, your application must behave correctly during both startup and shutdown.
Database logic prevents horizontal scaling. You'd be surprised how many developers still write hardcore SQL queries or use database functions for business logic. That doesn't scale.
The solution? Follow the 12-factor application methodology. If your application adheres to these principles, it's ready to be super scalable.
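To make the first problem above concrete, here is a minimal Python sketch of the difference between in-memory sessions and an externalized session store. The class and variable names are illustrative only, and the shared backend is a plain dict standing in for something like Redis.

```python
# Minimal sketch: why in-memory sessions block horizontal scaling,
# and how an external store (per the 12-factor "backing services"
# principle) fixes it. Names here are illustrative, not a framework API.

class InMemorySessions:
    """Sessions live inside this process only -- a second server can't see them."""
    def __init__(self):
        self._data = {}
    def save(self, session_id, payload):
        self._data[session_id] = payload
    def load(self, session_id):
        return self._data.get(session_id)

class SharedSessions:
    """Same interface, but backed by a store every server can reach."""
    def __init__(self, backend):
        self.backend = backend  # in production: a Redis/Memcached/DB client
    def save(self, session_id, payload):
        self.backend[session_id] = payload
    def load(self, session_id):
        return self.backend.get(session_id)

# Simulate two app servers sharing one external backend.
backend = {}  # stand-in for Redis
server_a = SharedSessions(backend)
server_b = SharedSessions(backend)

server_a.save("user-42", {"cart": ["book"]})
# Server B can serve the same user -- no sticky sessions needed:
assert server_b.load("user-42") == {"cart": ["book"]}
```

With the shared store, any server can handle any request, which is exactly what automatic scaling up and down requires.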
Real-world scaling: 2 servers to 80 in three days
One client perfectly illustrates cloud-native scaling benefits.
Most of the month, they operate on two servers. For three or four days each month, they automatically launch 60 to 80 servers. They pay only for those few days of heavy usage, then scale back down to cheaper operations.
You cannot do this with physical hardware.
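The arithmetic behind that setup is worth spelling out. This back-of-the-envelope sketch uses a hypothetical $0.10/hour server price (not a real AWS quote) to compare bursting for a few days against provisioning for peak all month.

```python
# Back-of-the-envelope burst-scaling cost with a hypothetical
# $0.10/hour server price (not a real cloud quote).
HOURLY = 0.10
HOURS_PER_DAY = 24

baseline = 2 * HOURLY * HOURS_PER_DAY * 30      # 2 servers, all month
burst    = 78 * HOURLY * HOURS_PER_DAY * 3      # ~80 servers for 3 days
elastic_total = baseline + burst

always_peak = 80 * HOURLY * HOURS_PER_DAY * 30  # sized for peak, 24/7

print(f"elastic: ${elastic_total:.2f}  vs  always-peak: ${always_peak:.2f}")
# Paying for the burst only when it happens is a fraction of peak sizing:
assert elastic_total < always_peak / 5
```

The exact ratio depends on your prices and burst duration, but the shape of the result is the point: with physical hardware you pay `always_peak` every month.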
Building scalability from day one
Scalability comes with developer experience. At some point, you just write scalable software and build scalable systems naturally.
For developers starting out, this is where we come in at Let's Go DevOps. Our goal is to join companies as early as possible—during the design phase.
We discuss initial architecture and identify bottlenecks before they're built: "This database component won't scale well. It's a hardcore component. But these application layers can scale up and down. Let's move business logic from the database to applications."
This is DevOps culture in practice. Get infrastructure engineers and business stakeholders in the same room from the very first meeting. Brainstorm together about what to build and how.
Testing scalability before it breaks
At Let's Go DevOps, we use tools like K6 to generate load—sometimes a little, sometimes a lot—to understand application behavior.
But here's our strategy: We don't just test with your target numbers.
Disable autoscaling completely and find which component breaks first. Understand the flow. Fix or upgrade that component. Then find what breaks next.
This process gives you a deep understanding of how your application behaves under load. It enables us to build monitoring and alerting around the application that triggers automatic scaling and downscaling.
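The "ramp until something breaks, fix it, repeat" loop can be sketched as a toy simulation. In practice you would drive real load (for example with k6) and watch real metrics; the per-component capacity numbers below are made up for illustration.

```python
# Toy simulation of the "ramp load until a component breaks" loop.
# The capacities dict is hypothetical; in reality you'd generate load
# with a tool like k6 and observe which component saturates first.

def find_breaking_points(capacities, max_rps=10_000, step=50):
    """Ramp request rate; report components in the order they saturate."""
    broken = []
    remaining = dict(capacities)
    for rps in range(step, max_rps + 1, step):
        for name, limit in sorted(remaining.items(), key=lambda kv: kv[1]):
            if rps > limit:
                broken.append((name, limit))  # bottleneck found at this rate
                del remaining[name]
        if not remaining:
            break
    return broken

# Hypothetical saturation points (requests/second) per component:
capacities = {"database": 400, "app-server": 1200, "load-balancer": 9000}
order = find_breaking_points(capacities)
print(order)
assert [name for name, _ in order] == ["database", "app-server", "load-balancer"]
```

Each entry in the result is a bottleneck to fix before the next ramp, and each limit you discover becomes a threshold for monitoring and autoscaling alerts.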
For day-to-day operations, we implement extensive metrics. You don't need constant synthetic performance testing. You can identify potential issues from production metrics, develop theories, validate them, and upgrade the system accordingly.
The awareness problem: When companies don't know they have issues
Getting companies to recognize scalability problems before a crisis hits is challenging.
There are mainly two scenarios where we engage:
Scenario 1: Early-stage prevention (my favorite)
We reach out to early-stage companies with a simple proposition: We'll manage your infrastructure and all the boring stuff—infrastructure, IT support, security, compliance, everything that's not custom-designed for your business.
Usually in a startup, you have two or three founders. If one is focused on building infrastructure, that's wasted time and company potential.
They can launch to production faster if we handle everything else—and it's usually a robust setup. We learn about the business and build a platform for them, and they get our scalability experience as an add-on to the initial services.
Scenario 2: The midnight fire call (not fun, but common)
Someone calls me in the middle of the night: "I just figured out we have scalability issues."
I ask what happened. The answer: "It doesn't work anymore. We hit some kind of limit."
These companies feel the pain of a non-functioning system and want immediate help. We've had several clients come to us with fires burning, asking for emergency assistance.
When everything is on fire: The emergency response
When clients call with production emergencies, we jump in immediately with our best engineers.
By design, we get extensive permissions to many systems. The NDA comes first. Then we figure out what we can do with the issue.
We've been called to live production incidents where all the paperwork was signed in minutes. Usually, paperwork takes months. In emergencies, it's minutes.
We sit with their team and dig through metrics, logs, servers—whatever is necessary to fix the issue. We get our hands dirty until the problem is resolved.
Balancing speed, security, and compliance
This is a tough question, especially for young companies.
Here's the reality: If you have an investor and a runway of three to six months, you need to deliver product. Security is not the top priority. I probably shouldn't say that on the record, but it's the truth for young startups.
Feature delivery is more important when you have just a couple of users or no real user base yet.
Building security from the beginning (even when it's not the priority)
Our approach: Build mechanisms to control and secure infrastructure from day one, but with configurable security levels.
All our designs—infrastructure, CI/CD pipelines—have knobs you can adjust to enable or disable security features.
For the first few months, these settings are at a medium level. When the company matures and realizes they have exposure, valuable data, and real traction, we don't need to rebuild everything.
We just enable the switches and tune security settings up. We can go from medium to high in a matter of days because it's more about education than technical setup.
You're cutting corners, but cutting them smartly—small wounds, not deep flesh cuts. The fact that we've launched several startups gives us the knowledge to build it better each time.
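The "knobs" idea can be sketched as a single config object with level presets. The feature names below are illustrative, not any real product's settings; the point is that tuning up later is a preset switch, not a rebuild.

```python
# Sketch of "security knobs": one config object with level presets
# that can be tuned up later without rebuilding infrastructure.
# Feature names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class SecurityConfig:
    mfa_required: bool
    audit_logging: bool
    network_isolation: bool
    secrets_rotation_days: int

PRESETS = {
    "medium": SecurityConfig(mfa_required=False, audit_logging=True,
                             network_isolation=False, secrets_rotation_days=90),
    "high":   SecurityConfig(mfa_required=True, audit_logging=True,
                             network_isolation=True, secrets_rotation_days=30),
}

# Day one: start at medium.
cfg = PRESETS["medium"]
assert cfg.audit_logging and not cfg.mfa_required

# When the company matures: flip the preset -- days, not months.
cfg = PRESETS["high"]
assert cfg.mfa_required and cfg.network_isolation
```

Because every design ships with both presets from the start, "going to high" is configuration plus education, not new engineering.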
The cloud cost optimization nobody talks about
There are so many things I could discuss here. Let me start with the biggest misconception.
Cloud is not cheaper than bare metal
When you think about cloud the same way you think about bare metal, you're paying a premium for a lesser level of stability.
That's counterintuitive. If you compare Hetzner's offering with EC2, EC2 is far more expensive and less stable.
But here's the truth: Cloud allows you to buy resources just for the minutes you really need them.
If you design for traffic bumps that go up and down, you don't need to buy hardware for your peak loads. You buy hardware when you need it. That's the simple approach.
The serverless trap
You can develop really quickly using services where you pay per request. AWS Lambda is perfect for checking if you have any traction.
These services are usually cheap in the beginning. When you reach a certain scale, they become extremely expensive.
At a certain level, it's far more cost-effective to move to different compute options—ECS, EC2, or Kubernetes, depending on your business and scale.
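The crossover is easy to show with rough numbers. The prices below are hypothetical (not real AWS rates); what matters is the shape of the curves: linear per-request cost versus stepped fixed-server cost.

```python
# Where pay-per-request stops being cheap: a crossover sketch with
# hypothetical prices (not real AWS rates).
PER_MILLION_REQUESTS = 5.00      # serverless: cost per 1M requests
SERVER_MONTHLY = 300.00          # fixed compute: one server per 100M req/month
SERVER_CAPACITY = 100_000_000

def serverless_cost(requests):
    return requests / 1_000_000 * PER_MILLION_REQUESTS

def servers_cost(requests):
    servers = -(-requests // SERVER_CAPACITY)  # ceiling division
    return servers * SERVER_MONTHLY

for requests in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    s, f = serverless_cost(requests), servers_cost(requests)
    print(f"{requests:>13,} req/mo  serverless ${s:>8,.0f}  servers ${f:>8,.0f}")

# Cheap while you're small, expensive once you're big:
assert serverless_cost(1_000_000) < servers_cost(1_000_000)
assert serverless_cost(1_000_000_000) > servers_cost(1_000_000_000)
```

The exact crossover point depends on your workload, but the pattern is why Lambda is perfect for validating traction and a poor fit once traffic is large and steady.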
The 80% cost reduction case study
One customer has worked with us for around 10 years, predating the founding of Let's Go DevOps. They focused heavily on optimizing cloud spend from the beginning; the business had pretty low margins, so cloud costs were a significant portion of revenue.
We did extensive optimization. Compared with an already well-designed architecture at standard AWS pricing, our cloud-native approach cut infrastructure costs by about 80%.
They pay roughly 20% of what full AWS pricing would be without features like spot instances.
AWS was still five times cheaper than their previous hosting provider.
From their perspective, infrastructure runs on pennies, not full price.
Multi-cloud: The expensive insurance policy
This is a good question because two or three months ago, a huge portion of the internet went down during an outage.
We've built fully multi-cloud architectures with hot-hot setups between GCP and AWS, where traffic switched seamlessly between them.
But the amount of work required and the features we had to sacrifice were enormous.
The 20/80 problem
If you have a cool offering in AWS and the same in Google, they usually have only 10-20% of truly comparable features. The remaining 80-90% is cloud-specific—either different in configuration or completely missing in the other cloud.
With multi-cloud, you can only use that 20% of the common functionality.
The biggest pain point? Managing firewalls. You have security groups in AWS and tags for firewalls in GCP. You cannot merge them together. You need translation layers. You're back to managing IP ranges like it's the 1980s.
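A "translation layer" in this context means keeping one neutral rule model and rendering it into each cloud's shape. The sketch below is illustrative: the output dicts are simplified, not real AWS or GCP API payloads.

```python
# Sketch of a multi-cloud firewall translation layer: one neutral rule,
# rendered into an AWS-security-group-like and a GCP-firewall-like shape.
# The output structures are simplified illustrations, not real API payloads.

RULE = {"name": "allow-https", "port": 443, "protocol": "tcp",
        "source_cidr": "10.0.0.0/8"}

def to_aws(rule):
    return {"IpProtocol": rule["protocol"],
            "FromPort": rule["port"], "ToPort": rule["port"],
            "IpRanges": [{"CidrIp": rule["source_cidr"]}]}

def to_gcp(rule):
    return {"name": rule["name"], "direction": "INGRESS",
            "allowed": [{"IPProtocol": rule["protocol"],
                         "ports": [str(rule["port"])]}],
            "sourceRanges": [rule["source_cidr"]]}

aws_rule, gcp_rule = to_aws(RULE), to_gcp(RULE)
# Both render from one source of truth -- but note you're back to managing
# raw CIDR ranges, because AWS security-group references and GCP network
# tags have no common equivalent to translate between.
assert aws_rule["IpRanges"][0]["CidrIp"] == gcp_rule["sourceRanges"][0]
```

This keeps the two clouds in sync, but only at the lowest common denominator: IP ranges.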
The real question about multi-cloud
Before thinking about multi-cloud, I ask customers: What will happen if half of the internet dies?
Because that's what we're talking about when GCP or AWS dies.
Nothing works. All the pages people visit daily aren't working. If Google is down, users probably can't even search for your page.
When they try to visit and find it down, will they think you don't know how to build infrastructure? Probably not, because most of the internet isn't working.
You can live with the fact that you die with half of the internet. That's okay.
Disaster recovery: The smarter approach
What I recommend instead: Think about disaster recovery.
Make sure you can restore in another cloud. It won't be in an hour—maybe a week—but assume AWS somehow melted their server room. I want backups in a secondary location, probably a secondary cloud.
This is a hot-and-really-cold approach focusing on data, not active infrastructure.
This is cost-effective. You should genuinely consider whether you want to be online when everything else is offline. That's not an easy answer.
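The "hot and really cold" idea can be sketched as a simple replication loop: copy backups to a secondary location, run nothing there. The stores below are plain dicts standing in for object storage (say, an S3 bucket and a bucket in another cloud).

```python
# Sketch of "hot and really cold" disaster recovery: periodically copy
# backups to a secondary location without running any standby servers.
# The dicts stand in for object storage in two different clouds.

def replicate(primary, secondary):
    """Copy any backup the secondary is missing. Safe to re-run (idempotent)."""
    copied = []
    for key, blob in primary.items():
        if key not in secondary:
            secondary[key] = blob
            copied.append(key)
    return copied

primary = {"db-2024-06-01.dump": b"...", "db-2024-06-02.dump": b"..."}
secondary = {}

assert replicate(primary, secondary) == ["db-2024-06-01.dump",
                                         "db-2024-06-02.dump"]
assert replicate(primary, secondary) == []  # nothing new -> nothing copied
# Restoring from `secondary` may take days, but the data survives
# even if the primary cloud melts its server room.
```

The cost is storage plus transfer for the copies, which is a fraction of running hot infrastructure in a second cloud.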
Security in the age of AI: New attack vectors
Security is a really important part of our job. I actually have the title of DevSecOps.
Especially with young companies that don't have a CISO role, the security burden lies somewhat on us. We try to influence and build security from the beginning.
When a CISO joins the company, they usually expect to find themselves in a deep hole. Instead, they discover the first floor is pretty okay, and the second floor is half-built. They can start furnishing the place because we've built foundations from the very beginning.
AI hasn't changed attacks much (yet)
I haven't seen an increased number of AI-driven attacks yet. Even in the past, we had script kiddies and tools like Armitage (a graphical front end for Metasploit) that automatically launched everything against targets.
AI might make attacks more sophisticated. Instead of the dumb approach of firing everything at once, attackers can ask questions that steer them in more effective directions.
The real AI threat: Agents in your infrastructure
What frightens me most about AI is the attack vector for agents running inside infrastructure.
You can send documents or prompts targeting artificial intelligence living in your infrastructure. You can create a bad actor inside the infrastructure. That's scarier from my perspective.
When clients want to install agents operating on particular data sets, we start conversations. If you're scraping public web pages, let it go. Maybe it generates some traffic costs, but that's it.
We have different conversations when agents hook into critical company information or personally identifiable information (PII).
In many proof-of-concept designs, companies create a single server, open all databases to it, and let AI digest everything to learn and produce answers.
That's when we discuss potential impact, prevention measures, and safeguards. How do we catch or block rogue AI before it exfiltrates data or harms systems?
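Two of the simplest safeguards can be sketched in a few lines: an allow-list on what the agent may read, and a crude scan on anything it sends out. The table names and the email regex below are illustrative only; a real deployment would use proper data classification and DLP tooling.

```python
# Sketch of two safeguards for an in-house AI agent: allow-list the data
# it may read, and scan outbound text before it leaves the infrastructure.
# Table names and the regex are illustrative, not production-grade PII detection.
import re

READABLE_TABLES = {"products", "public_docs"}      # no PII tables here
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # crude PII pattern

def agent_read(table):
    """Refuse reads outside the allow-list."""
    if table not in READABLE_TABLES:
        raise PermissionError(f"agent may not read {table!r}")
    return f"rows from {table}"

def agent_send(text):
    """Block outbound messages that look like they contain PII."""
    if EMAIL_RE.search(text):
        raise ValueError("possible PII in outbound message; blocked")
    return "sent"

assert agent_read("products") == "rows from products"
assert agent_send("Our Q3 roadmap looks good") == "sent"

blocked = []
for fn, arg in [(agent_read, "users"), (agent_send, "mail jane@example.com")]:
    try:
        fn(arg)
    except (PermissionError, ValueError):
        blocked.append(arg)
assert blocked == ["users", "mail jane@example.com"]
```

This is the opposite of the proof-of-concept pattern above: instead of opening all databases to one server, the agent starts with nothing and is granted access table by table.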
Privacy in the cloud: Lessons from WorldCoin
WorldCoin is doing a really good job with privacy—it's one of their mission statements. I can't discuss project details, but I can tell you how far down the AWS security rabbit hole we've gone.
We started with standard approaches like AWS Nitro Enclaves for encrypting everything. AWS has extensive documentation showing how compliant and secure these setups can be.
Physical servers vs. cloud security
I had this conversation with a colleague: If you think about a physical server room in your building, and I want your hard drives, I come to your building and take them out.
In an AWS server room with probably a million drives, finding specific data is actually far more complicated.
The protections AWS builds into its servers—including the Nitro hardware components—ensure you cannot mess with other VMs or other users' data.
AWS does a really good job enabling extremely enterprise-level security features in design and execution.
One piece of advice for your DevOps journey
If you're starting up, reach out with technical questions. I'm on LinkedIn and happy to answer them.
I really root for startups. If you want to talk about your designs or need a mentor, ping me. We can get a virtual coffee and go through issues you might be having.
I'm open to sharing knowledge, especially with young companies. Come with technical questions—I'm really open to answering them.
FAQ
Why do most scalability problems happen?
Application design, not infrastructure limits. Common issues include storing sessions in memory, improper shutdown handling, and business logic in databases instead of applications.
How do you test for scalability?
Disable autoscaling and deliberately find which component breaks first. Fix it, then find the next bottleneck. This creates a deep understanding of your application's behavior under load.
Should startups prioritize security or speed?
Speed, but build security controls from the beginning with configuration knobs. Start at medium security levels, then tune up as the company matures—taking days, not months.
Is multi-cloud worth the investment?
Rarely. You're restricted to 20% of common features between clouds. Better approach: Implement disaster recovery with cold backups in a secondary cloud.
How much can cloud-native practices save?
One client achieved an 80% cost reduction compared to standard AWS pricing. AWS was still 5x cheaper than their previous hosting provider.
What's the biggest AI security threat?
AI agents with access to internal systems and data. They can be manipulated through documents or prompts to become bad actors inside your infrastructure.
Podcast: TestGuild DevOps Toolchain Show
Host: Joe Colantonio
Guest: Jacek Marmuszewski, Co-founder of Let's Go DevOps
Topics: Cloud scalability, cost optimization, multi-cloud strategy, AI security, DevOps culture


