
Jacek Marmuszewski
When Size Matters: The Cool Kids' Guide to High-Performance Computing in the Cloud
Listen to the full episode above or read the article below:
Jacek Marmuszewski, Co-Founder & DevSecOps Engineer at Let's Go DevOps, shared practical insights from deploying supercomputers in AWS at Infoshare 2025. In this session, Jacek walked through the real challenges of building ultra clusters in the cloud - from tracking hardware shipments across continents to solving millisecond-level latency issues that can cost hours of compute time.
From moon landing to your pocket
In 1969, NASA's Mission Control had 12 kiloflops of computing power. That was enough to land a person on the moon and bring them back.
Engineers designing the rocket probably used the CDC 6600 - the state-of-the-art supercomputer of the era. It had 3 megaflops of compute power.
Today, a new iPhone has 2 teraflops. That's not a thousand times faster than that supercomputer - it's a million times faster.
New MacBooks from this year? 8 teraflops.
The current champion on the Top 500 list is El Capitan - 1.7 exaflops. That's a million times faster than your smartphone.
Around the year 2000, we had supercomputers that were roughly as powerful as today's iPhones. It took us 25 years to miniaturize that compute from filling an entire room to fitting in our pockets. The same 25 years it took to record a sequel to Gladiator.
A Polish company on the Top 500 list
In November 2008, in 371st place, there was something unexpected - a Polish company that had bought a supercomputer to power its application. Not academics doing research. A Polish business.
Nasza-klasa.pl launched around the same time as Facebook. Facebook had the better approach - microservices, a more scalable architecture. Nasza-klasa's architecture had flaws that kept it from scaling out across multiple servers, so they needed a supercomputer to run their stuff.
They made it onto the Top 500 list in 2008.
When our client ran out of bigger boats
When I founded Let's Go DevOps, I thought we'd mostly do cloud stuff. Supercomputers weren't in our reach. But at some point, a client asked us to help with supercomputer deployment in the cloud.
The team - Michał, Marin, Bartek, and Krzysiek - worked on this with AWS and Nvidia engineers. When we came to the table, it wasn't ready to be shipped. We helped them ship it - or they helped us; I'm not sure which way the relationship goes.
The biggest boats in the cloud
AWS has two flavors of really big instances.
U7 series: 32 terabytes of RAM and nearly 900 CPUs in a single instance. That's designed to run SAP - seems like quite a lot of hardware for managing a company.
P5 series (p5.48xlarge): These are for high-performance compute and GPU computation. At first glance they're maybe not that beefy - only 2 terabytes of RAM and around 200 CPUs. But they have eight H100 graphics cards. Each P5 has 640 gigabytes of GPU memory and 32 fiber optic network cards providing a 3.2 terabits per second connection. 16 petaflops of compute.
Our client needed around 40 petaflops. 16 wasn't enough.
The ultra cluster challenge
AWS had ultra clusters on their NDA-restricted list. We contacted them and they said it wasn't ready yet.
The project requirements: an ultra cluster that multiple organizations could join, with separate billing for each organization. We needed to add and remove participants on the fly. Kubernetes would manage the ultra cluster and the other business components. Each organization would also get access to additional features unique to it, many of them not deployed anywhere else.
The ultra cluster needed to sit in the middle, accessible by all organizations, doing the common compute but accelerated with business-specific pieces from the different organizations.
Problem zero: there is no hardware
When we contacted AWS, they told us P5s were only available in the US. All the cool stuff goes to the US first. We needed it in Europe.
We pulled some strings. We got shipment tracking from AWS: our hardware was loaded into a container, transported to Europe, arrived on the ramp at the server room, and got installed. We were in contact with the guy who actually plugged in all the cables.
For six months, we were in contact with an on-site AWS engineer. A lot of the functionality wasn't in the UI yet, so we had a technician making changes to the physical architecture of their cloud to validate ideas. Pretty fast delivery - Amazon is great at that.
Problem one: the speed of light
The speed of light is finite. The real issue is the number of hops - the number of switches traffic has to pass through to get from one server to another. In a cloud environment, hardware can be spread across an entire data center - multiple floors; your servers could be anywhere.
For our relatively small supercomputer - only three nodes (El Capitan has thousands) - a 1-second stall in computation equals more or less a one-and-a-half-hour freeze of your new MacBook. No computation whatsoever for one and a half hours.
A 1-millisecond delay? Around 8 hours of delay on your MacBook. Even though the cluster is small, those delays add up to a really huge loss in compute power.
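To put the 1-second figure in perspective, here's a rough back-of-the-envelope calculation using only the numbers quoted above (three P5 nodes at 16 petaflops each, an 8-teraflop MacBook) - an approximation, not a benchmark:

```python
# Back-of-the-envelope math with the figures quoted in the talk.
cluster_flops = 3 * 16e15   # three P5 nodes, ~16 petaflops each
macbook_flops = 8e12        # a new MacBook, ~8 teraflops

stall_seconds = 1.0         # the whole cluster idles for one second
lost_work = cluster_flops * stall_seconds          # FLOPs never executed

macbook_hours = lost_work / macbook_flops / 3600
print(f"{macbook_hours:.1f} hours of MacBook compute")  # ~1.7 hours
```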
AWS has placement groups with three options: cluster, partition, or spread.
Cluster option: AWS co-locates all the hardware as close together as possible - possibly a single rack, possibly a single switch. Practically no delay on the network. That's what we picked (there's a minimal launch sketch after this list).
Partition: Groups of servers close together, but the groups themselves on separate hardware. Useful for something like Kafka - you don't want 900 instances on a single piece of physical equipment, because if it fails you'd lose everything.
Spread: Each VM deployed on a separate physical server.
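As a rough illustration of the cluster strategy, here's a minimal boto3 sketch - not the client's actual deployment code. The group name, AMI, and region are placeholders, and it ignores subnets, security groups, and capacity reservations for brevity:

```python
# Minimal sketch: create a "cluster" placement group and launch the
# three P5 nodes into it so AWS packs them onto nearby hardware.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region

ec2.create_placement_group(GroupName="ultra-cluster-pg", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="p5.48xlarge",
    MinCount=3,
    MaxCount=3,
    Placement={"GroupName": "ultra-cluster-pg"},
)
```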
Problem two: 32 pipes into one big pipe
We had 32 network connections between the servers - 32 fiber optic cards. But 32 separate connections aren't much use on their own when you need to push one large dataset through.
We used network trunking - a really old technology. The operating system and the network equipment treat all the connections as one single, bigger pipe, nearly as fast as a single super-fast cable would be.
In the cloud, the equivalent is the Elastic Fabric Adapter (EFA). Install the EFA driver and those 32 separate network cards disappear from the OS, replaced with a single interface - a theoretical 3.2 terabits per second.
In testing, we actually got it up to 3.8 terabits per second - even better than the documentation says.
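The OS-level part is installing the EFA driver; the cloud-level part is requesting EFA interfaces at launch instead of standard ENIs. Here's a hedged boto3 sketch of that launch step, following AWS's documented multi-network-card pattern - all IDs are placeholders, and the exact card layout should be checked against the current P5 documentation:

```python
# Minimal sketch: request an EFA interface on each of the 32 network
# cards of a p5.48xlarge. The AMI is assumed to have the EFA driver.
import boto3

SUBNET = "subnet-0123456789abcdef0"   # placeholder
SG = "sg-0123456789abcdef0"           # placeholder

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder
    InstanceType="p5.48xlarge",
    MinCount=1,
    MaxCount=1,
    NetworkInterfaces=[
        {
            "NetworkCardIndex": card,
            # Only the interface on network card 0 can be the primary one.
            "DeviceIndex": 0 if card == 0 else 1,
            "InterfaceType": "efa",
            "SubnetId": SUBNET,
            "Groups": [SG],
        }
        for card in range(32)
    ],
)
```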
Problem three: making 24 GPUs look like one
Three servers, eight graphics cards each - 24 H100s total.
With RDMA and Nvidia's NCCL, applications don't need to split the computation into buckets and process each bucket separately. Instead, the cluster looks like one huge memory space: you load all the data and compute over it as if it were a single memory space.
That's how we made a bigger card out of 24 smaller H100 cards.
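The talk doesn't show the client's code, but the idea maps onto standard tooling: NCCL exposes collective operations that let every GPU contribute to and read from what behaves like one logical pool of data. A minimal sketch, assuming PyTorch with the NCCL backend and a torchrun launch across the three nodes (the file name and rendezvous endpoint are hypothetical):

```python
# Minimal sketch: one process per GPU, NCCL backend, all-reduce across
# all 24 H100s. Launched with something like:
#   torchrun --nnodes=3 --nproc-per-node=8 \
#            --rdzv-backend=c10d --rdzv-endpoint=<head-node>:29500 \
#            allreduce_sketch.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL moves the GPU-to-GPU traffic; over EFA it goes through the
    # aws-ofi-nccl plugin, so the fast fabric is used transparently.
    dist.init_process_group(backend="nccl")

    # Each GPU holds a shard; all_reduce sums them in place, so every
    # GPU ends up with the global result - no manual buckets.
    shard = torch.ones(1_000_000, device="cuda") * dist.get_rank()
    dist.all_reduce(shard, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("global sum per element:", shard[0].item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```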
The data problem
If you have large volumes of data in your own server room, you can order AWS Snowmobile - a truck that holds 100 petabytes of data. They drive to you, load it, drive to the data center, and unload it into the cloud.
We didn't order one because all data was already in the cloud. But we needed to get it into the ultra cluster fast enough.
Options:
Local NVMe drives: fastest, but if a server dies, data is gone
EBS: need to manage data and backups yourself, eats network connection
S3: infinite, don't worry about managing anything, but quite slow
We found S3 Express One Zone in the documentation. It's a different mode of S3, deployed in only a single availability zone, and extremely fast. With some system-level magic, we got really good speed out of it. We preload data onto the cluster from S3 Express and only touch it again when we need to refresh machines or reinstall stuff.
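As a sketch of what that preloading can look like (the bucket name, prefix, and NVMe path are hypothetical, and an S3 Express One Zone directory bucket name carries its zone suffix), this pulls a dataset onto local NVMe with many parallel downloads:

```python
# Minimal sketch: preload a dataset from an S3 Express One Zone
# (directory) bucket onto local NVMe before the job starts.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import boto3

BUCKET = "training-data--eun1-az1--x-s3"   # hypothetical directory bucket
PREFIX = "datasets/run-42/"                # hypothetical prefix
DEST = Path("/mnt/nvme/preload")           # hypothetical NVMe mount

s3 = boto3.client("s3")

def download(key: str) -> None:
    target = DEST / key[len(PREFIX):]
    target.parent.mkdir(parents=True, exist_ok=True)
    s3.download_file(BUCKET, key, str(target))

def main() -> None:
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    # Many parallel GETs: a single stream will never fill a 3.2 Tbps pipe.
    with ThreadPoolExecutor(max_workers=64) as pool:
        list(pool.map(download, keys))

if __name__ == "__main__":
    main()
```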
The lie of "unlimited"
AWS documentation says no limit - unlimited storage, unlimited bandwidth.
We learned that somewhere there still has to be hardware running that software. Behind those S3 buckets they're using the equivalent of large machines with 10-gigabit connections. We have 3.2 terabits available on our side, so we saturate the S3 connection for our buckets.
Two solutions:
First: changing the language. Not all languages are equal in AWS - some SDKs don't use features that are still experimental. If you're doing high-performance stuff in the cloud, contact AWS and ask whether the language and SDK you're using are actually up for the challenge. We learned that not everything is.
Second: structure the data and shard it. Not a single instance serving the entire bucket - we shard the data so that probably around 8,000 instances end up feeding it to us (see the sketch below).
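One way to picture that sharding idea - an illustration, not the client's actual scheme: spread objects across many key prefixes so that no single backend partition has to serve the whole bucket.

```python
# Illustrative only: derive a shard prefix from each key so objects
# spread evenly across many prefixes instead of one hot path.
import hashlib

def sharded_key(original_key: str, shards: int = 256) -> str:
    digest = hashlib.sha256(original_key.encode()).hexdigest()
    shard = int(digest[:8], 16) % shards
    return f"shard-{shard:03d}/{original_key}"

# e.g. "shard-NNN/datasets/run-42/part-00017.parquet"
print(sharded_key("datasets/run-42/part-00017.parquet"))
```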
That's it. Simple, right? It takes 15 minutes in the UI now. It took six months of work back when the UI wasn't ready.


