
Design for Failure with Martin Larsen | Smooth Scaling Podcast

No system has 100% reliability. Failures and faults are inevitable. At scale, everything breaks. In this episode, Martin Larsen explains the design for failure approach behind Queue-it’s architecture and how it increases the platform’s availability and resilience. Larsen explores the principles behind designing for failure, the tradeoffs involved, the mechanisms implemented at Queue-it, and the tangible ways companies can bring this development approach into their processes.

Martin Larsen is a Distinguished Product Architect at Queue-it. Starting as a software developer, Martin was one of the company’s first employees. He played an instrumental role in building the foundations of Queue-it and is heavily involved in activities including the design, architecture, testing, and deployment of the virtual waiting room, as well as defining and executing on product vision.

Episode transcript:

Jose
Welcome to the Smooth Scaling podcast, where we focus on system design for high traffic. We deep dive into topics like scalability, resilience, and performance. I'm your host, José, and today we have Martin Larsen with us on the podcast.

Martin
Thanks, great to be here.

Jose
Good to have you. And Martin, you're a—let me check my notes—a distinguished product architect at Queue-it. I think it's a very distinguished title. And you’ve been at Queue-it from the very beginning, from a long time ago. So before we dive into designing for failure, can you just tell us a little bit about yourself, your background, how you got to Queue-it, and how the journey’s been?

Martin
Sure. So I’m 44 now, and I started at Queue-it about 14 years ago. Before that, I was doing a bunch of web application work. I took a master’s degree from the IT University, and then I worked at a company building software for the public sector. I was there for a couple of years, then somewhere else, and eventually I got the opportunity to join Queue-it. I was actually the first developer they hired. I knew the founders from a previous job, so I guess they liked me—and I jumped on board from the very beginning.

Jose
Very cool. And you’ve been here for those 14 years, so I’m guessing a lot has changed since then.

Martin
Yeah. I’ve done all kinds of things. I was a fairly new developer at the time. Since then, I’ve coded a lot of the platform, but I’ve also moved into roles in product management, architecture, and cloud technology, including cloud architect. Now I’ve sort of found my place in product, I guess.

Jose
And I’m very happy for that—as someone working at Queue-it too. So maybe we can start talking about design for failure. Someone who doesn’t know the topic might say, “Well, why not just design for success? Make everything work correctly.” Can you tell us what designing for failure means, the approach behind it, and where you think it applies?

Martin
Yeah, so for me, designing for failure is about embracing the fact that no matter what you do to make your platform resilient, you're still going to fail. You might achieve 99.99-something percent uptime, but eventually, something will go wrong. So the key is to consider: what happens when you fail? What’s the blast radius?

That’s why this mindset is very important. And also, the more “nines” you want in your availability, the more expensive it becomes. So you start to ask: “Okay, we know we’re going to fail. How do we survive that? What’s the user experience like when we do? What can we live with in terms of degraded experience?”

Jose
Was that mindset already part of the system design when you joined 14 years ago?

Martin
Yes, it was. Back then, this was actually kind of a hot topic. The move to the cloud had just started, and with the cloud, people began questioning some of the older design patterns.

In the past, you might have just had a web application server and a SQL database—very common architecture. But with the rise of cloud, and even a bit before that, we started seeing new kinds of services: message buses, NoSQL databases, queues, and so on. These technologies were gaining traction.

And alongside that came the idea: let’s design something that remains operational even when parts of it fail. I remember a conversation with our previous CEO, Niels, where we talked about maintenance windows. He asked, “What do we do during maintenance windows, when we need to upgrade the system?” I said, “You’re not going to have any.” He looked at me like I was crazy, because before the cloud, you always had scheduled downtime. But that’s where my thinking was already, even back then.

Jose
Really interesting to know that it was part of the thinking from the start. Do you think designing for failure applies to every company? Or are there scenarios where maybe it's not worth the effort?

Martin
I think there are levels to it. There are some things that are easy to do and that I’d suggest everyone implement.

But it's not just about picking the right tech or adding resiliency to your code. It's also about your processes. For example: how do you deploy your code? Do you roll it out to all users at once, or do it gradually? That’s part of designing for failure too.

There’s also the mindset of deploying frequently. We all know deployments can fail—even with great testing. So making smaller, more frequent changes makes sense. It lowers the risk and gives you more practice. Before the cloud, we used to do big releases every few months, and then spend days getting them to work.

So for me, designing for failure is something that should flow through the entire R&D organization and how you operate.

Jose
Those examples you just gave—would you say they’re the “simple” parts of designing for failure? Are there more extreme examples?

Martin
Yeah, of course. You can definitely take it all the way to the extreme. Let’s say our data center crashes—it becomes unavailable. What are you going to do? Are you able to run the entire setup in a different data center?

That’s kind of the extreme case, because how often does that really happen? I’ve been with Queue-it for 14 years, and I’ve never seen an AWS data center go down completely. Maybe there have been brief outages in some regions—luckily not in the ones where I was deploying code—but even then, it’s usually a very short disruption.

Now compare that to what happens when you deploy faulty code. There was a Norwegian trading platform the other day—I don’t know if you read about it—but it was down for an entire day because they deployed bad code.

So, when you compare hardware or data center failures with software failures, the latter happens much more frequently. And they’re costly—not just in downtime but in the engineering effort to recover. Replicating data across regions, maintaining hot backups—all of that is expensive.

On the other hand, doing something like deploying new code to just a small segment of your infrastructure to see how it behaves—that’s not that expensive. So yeah, definitely some of the process-oriented things are relatively easy to implement.

But when you start talking about having fully redundant database clusters in other regions—that’s where things get a lot more expensive. And at that point, it becomes a cost-benefit analysis. What’s the cost of being down versus the cost of building in that level of resiliency?

Jose
I really like the way you explained that. It’s essentially about preparing for worst-case scenarios—or at the very least, discussing them. Maybe there are some extreme cases, like a full data center outage. As a company, you might choose whether or not to address that specific case and build for it. But having the conversation means you know where you stand across that spectrum of potential failures. I think that’s a really smart approach to software development.

Martin
Like anything else in software development, it’s about trade-offs. There’s always a cost trade-off. And if you’re building in resiliency, you also need to test that it works.

So, if the data center fails once every 10 years, you need to be ready for that rare moment. And when it does go down, you have to be confident your system will fail over properly. Even if that failure never happens, the cost of maintaining that readiness is still there.

Jose
And this might be a bit of a tough question, but for people listening—how should they think about that cost trade-off? Any guidance for teams that are just starting to have those discussions?

Martin
Yeah, you really have to bring the business perspective into it. It’s not just the cost of getting the platform running again—it’s the cost of being down. You lose revenue. You get bad press. There’s reputational damage.

For us at Queue-it, it’s quite clear. We serve our customers during their most critical hours. If we’re down, it’s incredibly expensive—as you know. So for us, the calculation isn’t all that difficult.

But even then, there are trade-offs. Is a certain investment in resilience worth it, or would those same resources be better spent improving the user experience? Both benefit the customer, so it’s always about finding the right balance—just like any other product decision.

Jose
Looking specifically at Queue-it, are there any particular ways we’ve designed the system for failure that you’d like to highlight?

Martin
Sure. Let’s take a very specific example. We deal with massive traffic spikes—it could be a 100x increase within just a few minutes. So to handle that, we’ve designed the infrastructure and the application to scale horizontally. Of course, this is fairly standard practice today, but 14 years ago, it was much less common.

The idea is to use stateless servers so you can quickly spin up new instances. Within minutes, you can increase capacity tenfold. That’s one specific thing we’ve done.

But there’s a twist. You might not always be able to scale fast enough. That’s kind of the worst-case scenario for us—when you hit a resource limit like 100% CPU, and suddenly you’re down. All end users are affected.

To handle this, we’ve built a mechanism where we ask end users to reduce how often they poll us. For example, if they normally check their status every 30 seconds, we might ask them to switch to polling every five minutes. That significantly reduces the load on our servers and allows us to support more users while auto-scaling kicks in.
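To make that mechanism concrete, here is a minimal sketch of a status endpoint that tells clients how often to poll, based on current load. It is an illustration only, not Queue-it’s actual implementation; the endpoint, thresholds, and load metric are assumptions for the example.

```python
# Minimal sketch: a status endpoint that asks clients to back off when load is high.
# Endpoint shape, thresholds, and the load metric are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

NORMAL_POLL_SECONDS = 30      # normal status-check interval
DEGRADED_POLL_SECONDS = 300   # back-off interval under heavy load

def current_cpu_load() -> float:
    """Placeholder for a real load metric (CPU utilization, queue depth, etc.)."""
    return 0.5

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Under pressure, ask clients to poll less often; otherwise use the default.
        poll_interval = (
            DEGRADED_POLL_SECONDS if current_cpu_load() > 0.8 else NORMAL_POLL_SECONDS
        )
        body = json.dumps({"pollIntervalSeconds": poll_interval}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), StatusHandler).serve_forever()
```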

Jose
That’s pretty smart.

Martin
Yeah, and I think this is a really important concept—degraded user experience. There is a consequence for the users waiting in line—they don’t get frequent progress updates, and they can’t really see how their position is changing in the waiting room.

But that’s acceptable when the alternative is system downtime. And this idea of a degraded user experience is something I think every system should consider.

Even if you have something as simple as a webshop, and let’s say you have a widget that offers personalized product suggestions—those might rely on real-time session data processed through some engine. That can be pretty resource-intensive.

Now imagine you're under heavy load and approaching your system limits. Would you rather crash completely, or just turn off the personalization temporarily? Sure, it's a degraded experience, and maybe you’ll lose a few upsell opportunities. But your users can still shop and check out with the items they came for. Most of them might not even notice the personalization is missing.

And these kinds of fail-safes are relatively inexpensive to implement. In many cases, it's just a feature flag—a toggle that you can flip when the system is under pressure.
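As an illustration of that kind of fail-safe, here is a hedged sketch of a feature flag that switches off personalized recommendations while keeping the core shopping flow intact. The flag store and function names are invented for the example, not a specific library’s API.

```python
# Sketch of the feature-flag idea: recommendations are a nice-to-have,
# so they can be toggled off under pressure while checkout keeps working.
import os

def recommendations_enabled() -> bool:
    # In practice this would read a feature-flag service or config store;
    # an environment variable keeps the sketch self-contained.
    return os.environ.get("RECOMMENDATIONS_ENABLED", "true").lower() == "true"

def fetch_personalized_suggestions(product_id: str) -> list:
    """Placeholder for the resource-intensive real-time recommendation call."""
    return [f"related-to-{product_id}"]

def render_product_page(product_id: str) -> dict:
    page = {"product": product_id, "checkout": True}  # core experience always works
    if recommendations_enabled():
        # Expensive personalization only runs when the flag is on.
        page["recommendations"] = fetch_personalized_suggestions(product_id)
    return page

if __name__ == "__main__":
    print(render_product_page("sneaker-42"))
```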

Jose
It actually reminds me of something I read a few years ago about Netflix. They have something similar built into their systems. When you open Netflix, you usually get a list of personalized show recommendations. But if that recommendation service is down, the app doesn’t crash—you just see a standard list of shows instead.

So it’s that same idea: a degraded experience, but the user still gets something. I can still watch the show I came for.

Martin
Exactly. And for me, that’s the essence of designing for failure—you keep your core service running, while the nice-to-have features or added benefits get disabled.

Jose
Thinking back to when you joined Queue-it—14 years ago, you mentioned that some of this thinking was already in place. Was there any specific source of inspiration for that? Where did it come from?

Martin
Yeah, I can only speak for myself, of course. But we’d already talked about how the cloud movement was gaining traction at the time, and that definitely brought a lot of focus on resilience and scalability.

But one personal source of inspiration for me was peer-to-peer systems. You know—file-sharing platforms and even the more gray-area stuff like piracy. Those systems were still quite popular in the years leading up to that time, even if they were starting to fade a bit.

I actually did some of my master’s work and school projects on peer-to-peer networking. Those systems are fascinating—completely decentralized. Every node can do everything, and if one node goes down, no one even notices. The system just keeps running.

Jose
Exactly.

Martin
So that really influenced me. I brought some of those principles into the product. Back then, AWS only had a few services available—SimpleDB, and I think S3 for object storage. So we had to build more ourselves than you’d have to today. Nowadays, a lot of that kind of resilience is built into messaging platforms and NoSQL databases.

But back then, we implemented some of it ourselves. We even built a peer-to-peer protocol into the system. I thought that approach was really powerful—and in line with a lot of what other companies were doing at the time too. There was a broader push toward chaos engineering and similar ideas around failure resilience.

Jose
Very cool. So maybe shifting gears a bit—let’s talk about the actual causes or reasons for failure. In your many years working here, can you summarize what you see as the main causes of failure, either at Queue-it or more broadly in the industry?

Martin
Yeah. Of course, there are infrastructure-related issues—servers go down, databases fail, those kinds of things. I think those are fairly easy to handle these days, or at least easier than they used to be, because resiliency is now built into so many products.

That’s kind of the line between designing for failure and just choosing the right architecture—because a lot of that failure-handling logic is baked into modern infrastructure. But it's still relevant. If you're running a website, for example, you want to keep the application stateless, and you want to run multiple instances—ideally across different availability zones or even data centers. That’s one layer of protection.

But in my experience, the thing that causes the most problems is… humans. That’s where most failures originate. Bad code, faulty deployments—human error. That’s why I think it’s so important to focus on the process layer as well.

For example, when we deploy something at Queue-it, we do it very carefully. We deploy changes to what we call “partitions”—essentially logical groupings of customers. So if a deployment fails, it only affects a limited group. Then we can either roll it back or move those customers to a different, healthy partition.
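To make the partition idea concrete, here is an illustrative sketch of a partition-by-partition rollout with a health check between steps, so a bad release only touches one group of customers. The partition names and the deploy, health-check, and rollback functions are hypothetical stand-ins, not Queue-it’s actual tooling.

```python
# Sketch of a gradual rollout across logical customer partitions.
# A failed health check stops the rollout and rolls back only that partition.
import time

PARTITIONS = ["partition-a", "partition-b", "partition-c"]

def deploy_to(partition: str, version: str) -> None:
    print(f"Deploying {version} to {partition}")

def is_healthy(partition: str) -> bool:
    """Stand-in for real monitoring: error rates, latency, alarms."""
    return True

def rollback(partition: str) -> None:
    print(f"Rolling back {partition}")

def gradual_rollout(version: str) -> None:
    for partition in PARTITIONS:
        deploy_to(partition, version)
        time.sleep(1)  # bake time before judging health (shortened for the sketch)
        if not is_healthy(partition):
            # The blast radius is limited to this partition; stop and roll back.
            rollback(partition)
            return
    print("Rollout complete")

if __name__ == "__main__":
    gradual_rollout("v2.0.1")
```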

So it’s really about thinking along those lines—how do you minimize the impact when things go wrong?

Jose
And I really like that focus on process. Like you said, we’re all human—we make mistakes. I’ve seen in our postmortems that the emphasis is often, “Okay, how do we improve the process or the system so that the next person isn’t able to make the same mistake?” Adding guardrails—that’s a really interesting angle.

Martin
Exactly. And I think there’s maybe a third category of failure too—those unexpected errors that only show up when the system is used in a way you didn’t anticipate.

Let’s say a customer starts using the system in a way you hadn’t planned for, and suddenly that behavior puts heavy load on some backend database or service. These issues just happen—they could surface at any time.

So what do you do about those? How do you ensure your service keeps working, even when some component suddenly breaks down because of an edge case?

Those are probably the trickiest types of failures, because you need to build something into the application that can detect those conditions—and then fail over gracefully to something else. That takes careful thinking and preparation.

Jose
And I guess that’s also where monitoring, tooling, and alerting come into play. Even if it’s something we didn’t account for, at the very least, we can react quickly and address it.

Martin
Right—but that’s not really designing for failure. That’s more like reacting to failure.

Jose
Sure—but you can design how you react to failure, right?

Martin
Of course. That’s part of your processes again. But ideally, when those failures do happen, the system should be able to heal itself and recover. And that’s the hard part.

Jose
Very true. I think we’re getting to the end of the interview here. Maybe one final question—are there any specific things you’re keeping an eye on or getting excited about in this space for the future?

Martin
Yeah, actually. One thing I really appreciate is that the company continues to focus on this. Right now, we’re about to replace one of our databases with an alternative. And we’ve been really happy with the existing one—it’s been smooth, and a lot of our logic is built around it.

But now we’re switching to something new, and we’re not entirely sure how it will behave. So we’ve started thinking: what if we had a total database crash? How would the waiting room behave if we didn’t have that database? Could we still serve users?

One of the challenges is that we use consecutive ordering in our queue—so user number one, two, three, and so on, in a forward-running sequence. Just like at a pharmacy or anywhere with a physical queue. We made that choice early on because we were one of the first to build a virtual waiting room, and we wanted the experience to feel familiar.

But now we’re starting to question whether that still holds. Are people so used to virtual waiting rooms now that we don’t need strict sequential ordering anymore?

Because keeping that ordering in a distributed system is actually very hard. It centralizes responsibility in one component. And sure, you can make that component resilient—but eventually, it could still fail.
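As a toy illustration of that centralization, the sketch below issues strictly consecutive queue numbers through a single shared counter: every user has to pass through that one component, which is exactly the piece that has to stay up. The class and its API are invented for the example and are not Queue-it’s implementation.

```python
# Toy illustration of why consecutive numbering centralizes responsibility:
# all requests are serialized through one counter that must never be lost.
import threading

class QueueNumberIssuer:
    """Single authority that hands out consecutive queue numbers."""

    def __init__(self):
        self._next = 1
        self._lock = threading.Lock()  # serializes every request through one point

    def issue(self) -> int:
        with self._lock:
            number = self._next
            self._next += 1
            return number

# If this one component is unavailable, no new queue numbers can be issued.
issuer = QueueNumberIssuer()

if __name__ == "__main__":
    print([issuer.issue() for _ in range(5)])  # -> [1, 2, 3, 4, 5]
```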

So what happens then? Can we run the waiting room without this numbering? That’s something I’m really looking forward to trying to solve.

Jose
That sounds exciting. I think one thing we can agree on is that you’ll come back to the podcast at some point and tell us how that research and that part of the work turned out.

And that’s it for this episode of the Smooth Scaling Podcast. Thank you so much for listening. If you enjoyed it, consider subscribing—and maybe share it with a friend or colleague.

If you’d like to share any thoughts or comments with us, send them to smoothscaling@queue-it.com.

This podcast is researched by Joseph Thwaites, produced by Perseo Mandillo, and brought to you by Queue-it, your virtual waiting room partner.

I’m your host, José Quaresma. Until next time—keep it smooth, keep it scalable.

[This transcript was generated using AI and may contain errors.]
