Episode transcript:
Jose
Hello and welcome to the Smooth Scaling Podcast, where we are speaking with industry experts to uncover how to design, build, and run scalable and resilient systems. I'm your host, José Quaresma, and today I'm joined by Johannes Boumans, Engineering Manager in the SRE team at Zalando.
It was a super insightful conversation about how one of the biggest e-commerce websites in Europe delivers a unique shopping experience to customers across 25 countries. We covered a lot of ground, including the challenges of scaling for daily peak traffic and how to strike the right balance between cost, reliability, and user experience. Enjoy.
Welcome, Johannes. It's great having you here.
Johannes
Thank you very much for having me.
Jose
I've been really looking forward to this chat. I was also trying some of the stuff on the platform that we’ll be talking about, and I’m quite excited to hear a bit more from you on how you make things work in the backend. So, very excited about this.
And I would like to maybe start a little bit on Zalando itself, right? So, you've been at Zalando for 10 years now—is that right? A bit over 10 years? I guess you know the company well by now. Can you tell us a little bit about Zalando and how it has evolved throughout that time?
Johannes
Yes, definitely. So, Zalando is the leading multi-brand fashion destination in Europe. I've been with Zalando a little bit over nine years—almost 10.
Zalando brings head-to-toe fashion and lifestyle products to almost 53 million active customers, all the way from apparel to footwear, accessories, and beauty. Our assortment consists of thousands of international brands, ranging from globally famous brands all the way to local heroes.
Let me go a little bit into the numbers. Gross Merchandise Volume (GMV) is about €15.3 billion in 2024. We serve customers across 25 European countries, and we have a bit more than 15,000 employees in Europe. Out of those 15,000, roughly 3,000 people are working in tech.
Jose
So how is that? I don't know what the numbers were 10 years ago—I would guess that they were smaller. How has that journey been as you've been growing with Zalando in those almost 10 years?
Johannes
The company was founded in 2008, and since then we've seen immense growth. Right now, we have 3,000 people working in tech, and back in the day it was just tens of people.
Right now, we have nine tech hubs—eight located in Europe and one in Shenzhen, China. All of those locations now have multidisciplinary teams, which 10 years ago was something unheard of.
And also for SRE, right? The way SRE is becoming part of daily practice—building and shipping solutions to our customers on a daily basis—has really transformed not only how we operate the business, but also how we operate software.
Jose
Just still within this journey and the 10 years—can you tell us a bit about your role? How has that evolved over the almost 10 years?
Johannes
We actually started as a very small team covering the full technology spectrum end-to-end—the front-end, the back-end, and supporting roles like analytics and design.
Now, over the past decade, we've built very strong multidisciplinary teams. We don't have a single team anymore—we have more than 10 teams in a small department. I'm now supporting roughly 25 engineering teams across the unit where I'm working.
Each multidisciplinary team has a product manager, engineers, an engineering manager, an analyst, and a designer—all in one team—to really establish dedicated ownership.
Jose
And then your SRE teams are kind of teams supporting all those multidisciplinary teams that are working on the product?
Johannes
Yes, actually, we have to take a small step back here because we really believe in you build it, you run it. So we don't really have a central SRE team anymore.
We believe in enabling others through an SRE champions model, and in teams building things from the bottom up. So SRE is not a central role. It's really everywhere, and everyone has to take on that role and the accountability for it.
Jose
Do you then still have a kind of virtual SRE team across the teams, where the different SRE champions—or whoever in each team is more focused on the SRE work—come together?
So how do you, on one hand, do the best to integrate that work and keep it top of mind in the day-to-day, but then also ensure that best practices and the experience are shared across the organization?
Johannes
Yep. Really, as an organization, we went through, let's say, multiple phases. It started in 2016 with the rollout of SRE, and we really went from a legacy monolith application into a microservice architecture in the cloud.
By that time, we had a clear separation between engineers doing engineering work and operators doing the day-to-day operations. Moving into the cloud was a very disruptive change, and teams had to learn new things, like going from managing data in a data center to managing their workloads in the cloud.
So first, we had, back in those days, a central SRE team. But since the business was scaling and growing, we very quickly realized that this wasn’t really scalable. So back then, we split it up. We then had a central enablement team, which was still centrally building capabilities, and that worked really well. But we still missed the aspect of empowering teams—really building them up.
That's why we came up with the SRE champions model. They’re supported by the central enablement team, and the SRE champions really emphasize that SRE is everywhere. They’re there to challenge teams, to challenge people, and to keep them accountable to run their business—and, in the end, emphasize you build it, you run it across the full organization.
Jose
Awesome. Great to see. I worked a lot with this strategy, and thought a lot about it, in my previous role, which had a strong focus on the DevOps area. So it's great to hear about that setup being in place and being successful.
We've been talking about SRE and, I guess, reliability. Maybe before we go into the rest—can you tell us what your definition of reliability is?
Johannes
Yes, that is a very interesting one. SRE is all about finding the right balance. It's not a set of rules people should follow, but it's really about finding the balance. And it's about trade-offs.
SRE is about finding the right balance, in my perspective. It's about service level objectives—making sure that systems are available within the defined thresholds that we've all agreed on. That includes the executive management team. And that means keeping ourselves accountable.
So it's about finding the right balance. And reliability is constantly about that—whether it's doing new feature development or improving on technical debt or resiliency. SRE is, in my opinion, really about the same thing: finding balance.
Jose
And has that changed within the 10 years that you've been there? So was there a very clear definition before that has been evolving? How do you see Zalando's approach to reliability and its evolution?
Johannes
Yeah, that definitely has changed. Back in the day, when we were running monoliths, we didn't really emphasize the you build it, you run it philosophy. We basically had five on-call teams running everything, and the teams were just throwing things over the wall—"Here, let's deploy it, let's hand it off to the operators."
With the strong switch to the cloud, and later on to container orchestration with Kubernetes, we really moved from throwing things over the wall—“Now it’s your responsibility to keep it up and running”—to empowering the engineering teams themselves.
You build it, you run it starts from the very beginning: you design the system, you deploy it into the cloud, which means the teams also build their own CI/CD pipelines. They operate it. They are on call. And they stay on call from that perspective. And it also goes to the very end—maintaining it, and eventually retiring the application once the time is there.
Jose
There's a little bit of extra motivation that is quite important when the person who's building a system and contributing to it is also the person that might get a phone call at 2 a.m. if something goes down, right? So I think there's a little bit of extra motivation there—a healthy one, I would say—to be a little bit extra sure that things will be working as expected in a reliable way, right?
Johannes
It's the dedicated ownership that we're trying to establish, which brings in a form of accountability. If an engineer deploys something on a Friday afternoon and they're not certain about the change, they likely won't do it—because they'll feel the pain in the evening or even at night.
Jose
And within Zalando, you are working specifically in Lounge, right? Is it Lounge by Zalando, I think? Or Zalando Lounge?
Johannes
Yes, it's Lounge by Zalando.
Jose
And it does feel like—I was trying it out as well and refreshing my memory of it—and it is quite a unique e-commerce experience.
Can you tell us a little bit more maybe how it came to be, and what is so special about it?
Johannes
Lounge by Zalando is basically the off-price destination where we offer our customers limited-time, heavily discounted deals—up to 75% off the recommended retail price. It actually started because there was a strong need for an outlet for overstock, off-price articles—articles that were no longer being sold at full price in the industry. There are a lot of articles sitting around across the globe that need to be sold.
So that's where the opportunity arose: the chance to sell articles to customers at a very discounted price, which brings some challenges of its own.
Jose
And can you tell us a little bit more about those challenges?
Johannes
Yes. Because we offer very discounted prices, we have limited deals. Those limited deals start every day at 7 o’clock in the European time zone. And at that time, basically all of our traffic comes in at once.
What that means is: the newsletter, the daily pushes—all come in at once with this demand that hits us in a matter of minutes. It’s a significant traffic spike, which is really aggressive. Let’s say it goes from 30,000 requests per second to 200,000 requests per second in five minutes.
That brings a significantly different scale in how we operate—something quite different from traditional e-commerce.
Jose
Yeah, that's a super interesting model, and also the specificity about it and the spike, right? Is there huge variability in that spike from day to day? And is it something that you try to predict to be able to be kind of... Actually maybe I’ll start with that, because I have some follow-up questions, because I think it’s very interesting.
Johannes
Yeah, so actually, traditional scaling-out mechanisms do not really apply here. So we had to reinvent the wheel a little bit.
That means that services have to scale out rather quickly—but also scale in rather quickly, from a cost-saving perspective. That means having certain pre-warmers to preload the content, pre-warming the images, the applications—so basically the pods, as we are running in Kubernetes—but also the caching layer.
And in the end, also predicting what kind of traffic we would expect. Because we have limited deals, and sometimes deals are more popular than others, we see different kinds of traffic patterns.
It’s all about finding, once again, the right balance. What kind of traffic should we predict? How many pods should we scale out for it?
So we built a very specific prediction model to estimate how many requests per second we’ll get—plus a small buffer—to ensure we have enough capacity, while on the other hand not burning a significant amount of money on cloud compute costs.
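To make the idea concrete, here is a minimal sketch (illustrative only, not Zalando's actual model or code) of how a predicted peak plus a buffer can be turned into a pre-scaled replica count before the 7 o'clock spike; the numbers, function names, and per-pod capacity are assumptions.

```python
import math

def predicted_peak_rps(recent_peaks: list[float], campaign_factor: float = 1.0) -> float:
    """Naive forecast: take the highest recent daily peak and weight it
    by how popular we expect tomorrow's campaign to be."""
    return max(recent_peaks) * campaign_factor

def replicas_for_peak(peak_rps: float, rps_per_pod: float,
                      buffer: float = 0.2, min_replicas: int = 3) -> int:
    """Pods needed to serve the predicted peak plus a safety buffer,
    never dropping below a small floor."""
    needed = math.ceil(peak_rps * (1 + buffer) / rps_per_pod)
    return max(needed, min_replicas)

if __name__ == "__main__":
    peak = predicted_peak_rps([160_000, 185_000, 172_000], campaign_factor=1.1)
    print(replicas_for_peak(peak, rps_per_pod=400))   # pre-scale before 7:00, scale in after
```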
Jose
So as I understand it, you’re doing a little bit of both, in the sense that you're improving and trying your best at predicting the volume that you’ll be getting and being ready for that, and at the same time also improving your architecture and pushing it as far as you can toward being able to dynamically scale out as fast as possible. Are you focusing on both things? Do I understand that right?
Johannes
Yeah. The system has really been built from the ground up. Take this into perspective: we knew up front that at a certain time in the day, there would be a significant surge in traffic. So that is something we took into account from the beginning while designing such systems, yes.
Jose
Was that the thought all the way from the beginning when building the platform? Because as I understand, it's a completely different platform than the standard Zalando platform. Is that correct?
Johannes
Yes, it's actually built to scale out heavily and quickly—in a matter of minutes. But at the same time, it also has to scale in reliably, so that we don't burn a significant amount on cloud compute costs. And that's for everything—not just the pods, but also the data stores and the data volumes. It's all been built around that capability.
Jose
Very, very interesting. I'm curious—was it something like the platform was built and then you realized, oh, we need to think a bit better about the scalability because we’re having these spikes? Or was it, from the very beginning, let's build a platform from scratch to be able to handle these spikes and then contract as best as possible?
Johannes
Over the years, it has evolved a bit. But we knew up front that we wanted to work with a kind of limited deal—a limited-time offer. So that was something we knew from the beginning. And we push all the traffic in at once to give our customers equal opportunities, from a fairness perspective.
So we really have a first-come, first-serve approach. The newsletters are dispatched at the same time, and we also push notifications to our customers’ mobile devices at the same time, to really give everyone an equal, competitive chance at tomorrow’s deals.
Jose
And I guess you also give—at least the ones that I've seen in the app—you also give your customers a heads-up, right? You're saying, “Well, tomorrow at 7 a.m., this one will start,” right? So you, as a customer, you also know what that time will be. So transparency there is so important.
Johannes
Yeah, it's also about creating a bit of hype—like, “Tomorrow we have XYZ, please check it out, come back at seven o’clock in the morning to get the best fresh deals.”
Jose
And is it literally every day you have something new coming, or do you have some exceptions once in a while?
Johannes
No, every day we have new deals coming in—on weekdays at seven o’clock in the European time zone, and on weekends at eight o’clock, so customers get an extra hour.
Jose
We can sleep in. That’s good.
And has there been any kind of cross-pollination, any learnings from building Lounge and improving the setup from an architecture and infrastructure perspective? Has there been any learnings that you've brought back to the Zalando platform?
Johannes
Yeah, a couple of things. Mainly it’s about scaling in. Applications can usually scale out quite rapidly—especially in Kubernetes, using horizontal pod autoscalers. Scaling out horizontally is well known in the industry.
But scaling in is a very important aspect too, and there’s actually still a lot of opportunity to explore there—to really save some cost. So that is something we’ve given back to our counterparts.
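As a rough illustration of why scaling in is the trickier half, here is a small sketch of the standard Kubernetes HPA recommendation formula plus a scale-down stabilization window (keep the highest recent recommendation so a brief lull doesn't drop capacity); the window length and metric values are example assumptions, not Zalando's configuration.

```python
import math
from collections import deque

def desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
    # Standard HPA formula: ceil(currentReplicas * currentMetric / targetMetric)
    return math.ceil(current * current_metric / target_metric)

class ScaleDownStabilizer:
    """Scale in only to the maximum recommendation seen in the recent window,
    so capacity isn't released the moment traffic dips after the morning spike."""
    def __init__(self, window_size: int = 30):
        self.recommendations: deque[int] = deque(maxlen=window_size)

    def stabilize(self, recommendation: int) -> int:
        self.recommendations.append(recommendation)
        return max(self.recommendations)

if __name__ == "__main__":
    stabilizer = ScaleDownStabilizer()
    print(stabilizer.stabilize(desired_replicas(current=600, current_metric=95, target_metric=60)))
    print(stabilizer.stabilize(desired_replicas(current=600, current_metric=20, target_metric=60)))  # stays high
```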
Jose
In your experience, what would you say is the biggest challenge with the scaling-in part?
Johannes
Yeah, scaling is one aspect, but in our organization, we are developing in a very fast environment. We have hundreds of software deployments every single day to really double down on surprising our customers with new functionality and actual entertainment, because it’s really a hunting experience that we try to offer.
Meanwhile, our microservices are being heavily leveraged. Back in the day, before personalization, it was just an API client and a data store backed with in-memory storage. But now, every customer sees their own personalization, their own data points, and that puts things in a completely different perspective.
We're also experiencing growth in two other dimensions. First, horizontal complexity—like going live in new European markets and serving more customers and visitors. Second, vertical complexity—as we add innovation on top of our existing services. And all of that requires very strong guardrails.
At Zalando, we have service level objectives. We call them Critical Business Operations, and those are the most important operations we have to run our ecosystem. That includes things like user login, registration, entering the catalog, the product details page, adding to cart, checkout, and returns—the traditional e-commerce flow. We measure those over a rolling 28-day window, targeting 99.95% availability.
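As a quick illustration of what a 99.95% objective over a rolling 28-day window implies, here is the back-of-the-envelope error-budget arithmetic; the request counts in the example are invented, and only the 28-day window and the 99.95% target come from the conversation.

```python
SLO = 0.9995
WINDOW_MINUTES = 28 * 24 * 60                    # 40,320 minutes in the rolling window
budget_minutes = WINDOW_MINUTES * (1 - SLO)
print(f"Allowed full-outage time per window: {budget_minutes:.1f} min")   # ~20.2 min

def budget_remaining(good_events: int, total_events: int, slo: float = SLO) -> float:
    """Fraction of the error budget left, measured per request rather than per minute."""
    allowed_bad = total_events * (1 - slo)
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad if allowed_bad else 0.0

print(budget_remaining(good_events=99_981_000, total_events=100_000_000))  # 0.62 of the budget left
```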
Jose
Then the other side that we haven’t touched much on is that, in Lounge, quite often you’re selling limited inventory, right? You’re often clearing out the shelves. I don’t think you put it directly in those words, but it’s inventory that makes sense to sell at a discounted price, and it’s often limited.
Was there any kind of specific challenge related to that? I could imagine that if something is in very high demand and you have thousands of people going in at the same time, you might have some clashes with inventory. So how did you approach that?
Johannes
Yes. We have very limited stock. Usually, for some shoes, we only have one or two sizes available. And in terms of quantities, maybe two, three, four, five—it really depends.
We also built that part a bit differently compared to traditional e-commerce systems. We work with a kind of reservation system. As soon as you put an article into the cart, we reserve that article for you exclusively for 20 minutes. That’s a bit different from traditional systems, where reservation usually happens only at checkout.
For us, it really matters to create that competitiveness from the beginning. So as soon as you add an article to the cart, you make an exclusive reservation—for a limited amount of time. Once that time expires, the cart is cleaned up. That’s how we handle the first-come, first-serve model.
It’s challenging, because if you’re holding articles upfront without having completed a checkout—without the money being captured—there’s a risk of bot activity.
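To sketch what such a reservation could look like in code, here is a hypothetical in-memory store with a 20-minute TTL—purely illustrative; a production system would use an atomic, shared data store with conditional writes rather than a per-process dictionary.

```python
import time

RESERVATION_TTL = 20 * 60   # seconds, per the 20-minute exclusive hold

class ReservationStore:
    def __init__(self) -> None:
        self._held: dict[str, tuple[str, float]] = {}   # item -> (customer, expiry)

    def try_reserve(self, item: str, customer: str, now: float | None = None) -> bool:
        now = now if now is not None else time.time()
        holder = self._held.get(item)
        if holder and holder[1] > now and holder[0] != customer:
            return False                       # someone else holds the exclusive reservation
        self._held[item] = (customer, now + RESERVATION_TTL)
        return True                            # first come, first served

    def release_expired(self, now: float | None = None) -> list[str]:
        now = now if now is not None else time.time()
        freed = [item for item, (_, exp) in self._held.items() if exp <= now]
        for item in freed:
            del self._held[item]               # back on the virtual shelf
        return freed
```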
Jose
I guess there’s also this other aspect from a customer perspective: if I go in at the beginning of the hour and I see that the size I want isn’t available, maybe I should check again in a few minutes, because someone might have reserved it and then that reservation got canceled, right?
Is that how it works? Or have you thought about having a kind of waiting list scenario?
Johannes
No, we do not have a waiting list. But once an article is fully reserved—meaning all items are currently in other customers’ carts—then, if those reservations expire, customers might get a notification, if they’ve signed up for it.
Then the chances are equal again. Once the article becomes available, the first person to click "Add to Cart" will receive it.
Jose
And you mentioned the bot challenges from the perspective of being able to, as you said, reserve the item before actually having to purchase it, right? And I know bot fights are a hard one—it’s a never-ending one. But was there anything that you have seen being successful in how to mitigate that risk?
Johannes
Yeah. As the internet can be quite dangerous, we’ve had to build multiple layers of protection.
First, it starts with how customers are able to register at all—we’ve built several mechanisms into that. The way we operate our business is also somewhat unique: we’re behind a login wall, so the website is not publicly accessible. It’s free to register, of course, but that brings an extra layer of protection.
Beyond that, we also have very strong login validation, so we keep the majority of bots out at the entrance. We use several mechanisms, including third-party vendors, to help evaluate the quality of a user session in terms of behavior—like click patterns—and we run a risk model against that.
Besides that, we’ve also built a couple of in-house mechanisms like general rate limiters, to prevent ourselves from being taken down. Imagine having a set of users or bots trying to consume the entire website—we want to make sure that doesn't put us out of business.
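As a generic example of that last kind of guardrail, here is a small token-bucket rate limiter sketch; the limits and the idea of keying by session are assumptions for illustration, not a description of Zalando's actual implementation.

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False           # caller would reject the request (e.g. HTTP 429)

buckets: dict[str, TokenBucket] = {}   # e.g. keyed by session or client IP

def allow_request(key: str) -> bool:
    bucket = buckets.setdefault(key, TokenBucket(rate_per_sec=10, burst=20))
    return bucket.allow()
```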
Jose
The fact that it’s behind login, and you have user accounts—I guess that also helps you gather a lot of data on each login. That might help identify whether the behavior looks more like a person or a bot, right? So it’s a very interesting scenario there as well.
Johannes
Yeah, data is key there. All the data is being leveraged, challenged, and then evaluated. Based on that, a risk score comes into play. And that risk score will then trigger certain controls that we’ve configured.
Jose
Kind of going back into system design and building a system for this kind of peak traffic—in my experience, it's often about trade-offs. It’s about what we are prioritizing, where we want to invest our time and effort, and what we choose not to focus on.
Can you think of any examples of specific trade-offs that you had to make in the design of the system? Could be cost versus performance, or maybe user experience versus performance. Is there anything that comes to mind?
Johannes
We measure it through our service level objectives, but the trade-off actually comes one step before that.
One specific challenge we had was about loading data. In our ecosystem, we can’t really compute anything at runtime—everything has to be precomputed before our traffic peak comes in. So we have a very strong limitation: we don’t compute anything at runtime. Everything needs to be ready in advance.
Because of our limited-deals model, we know exactly which campaign or deal will go live and when. All of that is computed beforehand and sent downstream to our partners and dependencies—like CRM—in batches. Sending out the newsletter or push notification at the same time is part of that coordination. And doing all of that preloading is not something that is really common in the industry.
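A tiny sketch of that precompute-before-the-peak idea, with made-up types and names: render every deal payload into a cache and pre-batch the downstream fan-out before the campaign opens, so nothing is computed when the spike hits.

```python
from dataclasses import dataclass

@dataclass
class Deal:
    id: str
    title: str
    discount_pct: int

def render_deal_page(deal: Deal) -> str:
    return f"<h1>{deal.title}</h1><p>-{deal.discount_pct}%</p>"

def prewarm_campaign(deals: list[Deal], subscribers: list[str],
                     cache: dict[str, str], batch_size: int = 500) -> list[list[str]]:
    # 1. Precompute: everything customers will request at 7:00 is rendered and cached now.
    for deal in deals:
        cache[f"deal:{deal.id}"] = render_deal_page(deal)
    # 2. Pre-batch the downstream fan-out (newsletter / push) for dispatch at open time.
    return [subscribers[i:i + batch_size] for i in range(0, len(subscribers), batch_size)]
```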
Jose
And I’ve noticed also that, if I got it right, there’s no search functionality on the platform either. Is that part of the trade-off too? From a performance perspective, would that be hard to achieve during peak times?
Johannes
Yeah, very good callout.
The search capability is currently not available. We are, however, evaluating it, because customer behavior is changing. People want more of a curated shopping journey rather than just exploring the different deals.
So we’re looking into it, but it’s a significant challenge to do at large scale—especially when a large number of customers enter the website at the same time. One of the main issues is computing search results at runtime and delivering valuable insights quickly. That’s been one of the main concerns that kept us from having it initially.
Jose
And I do see there’s a trade-off there. I was thinking about that as a customer too. When I was trying it out, I thought, “Oh, I saw this item an hour ago,” but then I had to scroll a bit to find it.
So maybe there’s a good trade-off where you let people search during off-peak hours, but during peak hours it’s not possible. But then again, there’s the user experience question—if people get used to searching, and then they suddenly can’t, that could be frustrating. So yeah, very interesting and complex problems, I guess.
Johannes
Yep. And that brings us back, once again, to finding the right balance. That’s really what SRE is all about. It’s about service level objectives, and those always involve trade-offs.
Jose
We have a lot of technical people among our listeners. If there’s an engineering manager out there who’s starting to prepare for a major traffic surge or a big event, do you have any recommendations for them? Anything you’d tell them to focus on or think about?
Johannes
It's about getting confidence in your systems. How we do that is with service level objectives—that’s really how we safeguard our systems. But that’s only one aspect, right?
If you don’t deploy your systems, they’ll likely be reliable. But then the other aspect comes in: dependencies, traffic. Especially in Q4—it’s peak season, the Champions League of e-commerce. So it’s also about gaining extra confidence in how your systems perform.
One of the things we do is conduct monthly load testing. And that’s not something we do on staging—we actually do it on production. We simulate real-time peak traffic, plus a buffer, across all of our services. That includes all the clients—web, mobile web, native apps—and even the full edge.
That means we’re also load testing our partners. But that’s just one part. We also use chaos engineering to really see how reliable our systems are when unexpected scenarios happen. Imagine some systems have a slight increase in P99.9 latency—would circuit breakers kick in? What resiliency patterns are in place?
And it could be the case that your systems are running perfectly fine, but you’re facing upstream challenges—maybe from one of your partners. Imagine we're running a big Black Friday campaign, and one of the payment vendors experiences issues. That could impact your applications as well.
So we also practice regularly—game days, executing playbooks—to continuously evaluate how systems are actually performing.
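To make the resiliency-pattern point above concrete, here is a minimal circuit-breaker sketch of the kind those latency-injection experiments are meant to exercise; the thresholds and timings are arbitrary examples, not Zalando's settings.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, fallback):
        if self.opened_at is not None and time.monotonic() - self.opened_at < self.reset_after:
            return fallback()                  # open: fail fast, serve a degraded response
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()    # trip the breaker
            return fallback()
        self.failures, self.opened_at = 0, None      # success: close the breaker again
        return result
```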
Jose
That's some really good advice. And I was just about to ask if you were working with chaos engineering—you confirmed it before I even got to ask. That’s very, very interesting.
And I expected it, given the scale and thoughtfulness of the architecture you’ve described. Maybe one follow-up on the chaos engineering—when talking about it, it’s often about going in and, say, deleting a pod or a set of pods.
Have you considered going all the way, like turning off a data center, to see how the system handles it? Is that something you’ve tried as well?
Johannes
Yeah. Chaos engineering is really about the practice of preparation—but also about increasing your confidence in your system. And if you’re not confident, it’s about gaining that confidence.
It’s about understanding your systems inside and out. For us, that means not testing on staging—it’s really about testing in production. Both chaos and load testing. Applications behave differently in staging than in production, even though they should be the same. We always see differences in size, instance types, CPU, memory, load balancing. The amount of data is also different.
So it’s all about the practice. When we do chaos engineering, it’s in production. And it could be as heavy as turning off an availability zone in a specific cloud region. Or it could be as simple as increasing latency on a specific API call—just to see how circuit breakers and resiliency patterns kick in across dependencies.
Jose
One thing I’d like to ask—because you have this focus and experience with peaks every day—is, how different would your approach be, or would your advice change, if someone is preparing for a one-off peak event, like once a year? Versus someone who has to deal with daily peaks?
Is the advice different? Or just the intensity?
Johannes
Yeah, it depends a bit on the nature of the business. If your peaks are rather predictable—like you’d expect a peak only in November or December—then of course it likely makes sense to start preparing later in the year.
But if you can expect a peak at any time, then you have to perform at that specific time. And if you can’t predict that time, it really becomes about building the muscle and the capability to run those exercises every day—or at least every few weeks.
For us, it was important to have that capability constantly. But even in other e-commerce companies, I see that they build that muscle quite early, because the e-commerce calendar is shifting. Events are happening throughout the year—end-of-season sales, Black Friday, but also newer events like Singles’ Day are really popping up.
So e-commerce is evolving. Peaks are here, and they’re here to stay—because they drive massive customer demand, foster growth, and also keep competition healthy.
Jose
We do have Zalando—the main site—which works with Queue-it, right? Using a virtual waiting room there. Is that something you've worked with? And how do you see that in connection to the work you do in Lounge?
Johannes
Yeah, it’s a bit of a different solution—the trigger point is different. For us, the queue system kicks in when someone adds an item to the cart.
In the main Zalando shop, Queue-it gets triggered at checkout. So the way peak handling works is a bit different between the two.
But it’s definitely something we’re evaluating for the future—looking at how we can better handle those peaks and give more fairness to our customers. As I mentioned earlier, complexity is growing—both horizontally and vertically. And especially when you're operating across 25 European markets and multiple time zones, the ecosystem becomes more and more challenging.
Jose
And definitely—from our side, and from the virtual waiting room perspective—that fairness aspect is so important. As you said, making sure people are served first-in, first-out and giving them a smooth chance to purchase what they came for really matters. So thank you for sharing that as well.
Are you up for a couple of rapid-fire questions?
Johannes
Yes, let's do it.
Jose
Let’s start with—I said a couple, it’s actually three of them. First one: is there any book, podcast, or thought leader that you’d recommend to our audience?
Johannes
I have two. The Pragmatic Programmer by Hunt and Thomas—it’s really about a timeless engineering mindset, and it’s a pretty good one.
And then one of my latest reads was Team Topologies. It’s all about scaling people, not just systems. That’s becoming a more prominent part of how we work.
Jose
Very good. And Team Topologies—they’re coming out with a second edition, I think later this year. So it might be a great opportunity for people who haven’t read it yet to pick it up. Manuel and Matt are the authors—it’s a great read.
Johannes
Yeah, definitely a very strong recommendation.
Jose
Second question: is there any advice you’d give either to your younger self or to someone just starting out in this area?
Johannes
Maybe two.
I think the biggest factor is: be curious about the why, not just the how. That’s a very important mindset. These days, especially with all the new technologies, we often know how systems work. But understanding the why behind them is becoming even more important.
And the second one: don’t chase clever solutions. It’s really about optimizing—especially when it comes to communication. In software, it’s all about finding agreements. You can be the best engineer, but without communication, your impact will be limited.
So: curiosity about the why, not chasing overly clever solutions, and focusing on communication—those are probably my most important pieces of advice for people starting out.
Jose
Very good. Thank you for sharing. Final question before we wrap up—Johannes, to you, scalability is?
Johannes
Scalability is the ability of a system to handle increased load and business complexity without a proportional increase in cost, effort, and impact on reliability—which all ties back to the core of SRE. It's about finding balance.
Jose
Awesome—a very complete and thorough definition of scalability. Thank you for that. And I think it's a great way to wrap it up.
Thank you so much for being here, Johannes.
Johannes
No worries. Thank you for having me.
Jose
And that's it for this episode of the Smooth Scaling Podcast. Thank you so much for listening. If you enjoyed it, consider subscribing and perhaps share it with a friend or colleague.
If you want to share any thoughts or comments with us, send them to smoothscaling@queue-it.com.
This podcast is researched by Joseph Thwaites, produced by Perseu Mandillo, and brought to you by Queue-it—your virtual waiting room partner.
I'm your host, Jose Quaresma. Until next time, keep it smooth, keep it scalable.
[This transcript was generated using AI and may contain errors.]