Running High-Traffic Product Drops at Rapha with Tristan Watson

In this episode, seasoned platform engineer Tristan Watson shares his learnings from managing peak traffic at Rapha and Booking.com. Tristan reveals the key challenges, trade-offs, and best practices involved in preparing infrastructure for high-traffic product drops and collaborations. He offers insights into maintaining uptime with a small team, balancing technical and business needs, and why every young engineer should be on the on-call roster. Whether you're navigating traffic surges or optimizing for resilience, Tristan’s advice will help you prepare your systems to handle the pressure.

Tristan Watson has spent over a decade mastering the art of keeping websites fast, stable, and scalable. With experience leading teams and steering key projects across tech, retail, and finance, he consistently balances technical excellence with business goals. His pragmatic approach and passion for emerging tech like AI make him a sought-after consultant. Off the clock, you’ll find him exploring new tech trends or out on a bike ride. You can find Tristan on LinkedIn here.

Episode transcript:

Jose
Hello and welcome to the Smooth Scaling Podcast, where we talk with industry experts to uncover how to design, build, and run scalable and resilient systems with the ultimate goal of providing a great user experience.

I'm your host, Jose Quaresma, and today I'm joined by Tristan Watson, who's a seasoned platform engineer, has worked at Booking.com and Rapha, amongst others, and is now at G-Research focusing on the financial sector.

We had a great chat today on how to handle peak traffic and the challenges and trade-offs of scaling infrastructure. I learned a lot. I loved his learnings and the advice he would give to his younger self, and I hope you do too.

Tristan, welcome to the podcast. It's great having you.

Tristan
Yeah, really happy to be here. I think it's been a long time trying to line this up, but excited to have this conversation.

Jose
I would maybe start with a bit of a broader question, trying to understand a little bit about your background and your journey, because you've had quite a diverse career, right? Across industries, with some very interesting companies in the space of scalability and resilience. Can you tell us a little bit about your journey?

Tristan
Yeah, absolutely. My background's very technical. I started off—I did the whole university thing and all of the things around that, of course. But yeah, my background is technical, sort of in the new-school system administration space, so scaling, DevOps, infrastructure engineering, site reliability engineering—all of that good stuff. And yeah, I've had quite a diverse career working in very different industries. My first job was at a startup, a public cloud company, and it was really wonderful. We were running around doing things, trying to scale a public cloud infrastructure on OpenStack. But one of the reasons that I left, and felt it was the right time to leave, was that I had no idea how people in industry were truly using cloud at that moment. So it became quite obvious to me that I had a lot to learn.

Since then, I've gone on to work as an SRE across numerous companies, from Booking.com to Rapha Cycling. And currently, I’m taking a bit of a wild move into contracting, but I’m essentially contracting in the finance world, which is a very different scale and environment. It’s very interesting. But it's been a lot of really good learnings across the board, all around scaling, all around reliability—basically trying to keep systems online and understand them.

Jose
You mentioned Booking.com, you mentioned Rapha—and we’ll be talking a little bit more in this episode about the Rapha use case with product drops, right? But maybe before we go into that, was there a very different way of looking at scalability and resilience between Booking.com and Rapha? How would you compare the perspectives there?

Tristan
Yeah, I think it's really interesting the way that those two companies are different, and it's mainly just to do with scale and size. Booking.com is a huge international business that's very well known. The side that I was specifically working on was the rental site—rentalcars.com—which is an arm that rents vehicles to people, and that's part of the Booking Holdings group, under the Booking umbrella.

Essentially, we were trying to scale to allow people to book cars any time of day, right? You need a car, you need to be able to drive from the airport, you need to be able to drive to where you're staying. And that's the challenge we were trying to solve. It's kind of a two-way problem, these sorts of things, because you've got customers coming in, and they need a great user experience. So, can they search for a vehicle in a timely manner? Can they book a vehicle in a timely manner? How do we delight them?

But then you've got all these really complex things behind the scenes that allow you to be able to do that. I'm talking about the challenges of grouping up aggregated car rental providers—the Enterprises of this world, the Europcars, the Hertzes, the Sixts. How do you aggregate all of those together in a timely manner? How do you store them? How do you ensure the pricing is where it needs to be? It’s kind of a two-way street in terms of scalability.

I think one of the main problems we had at that point was being able to scale quickly. A lot of that was because of the hardware we were on. So at that time, and even now, to some extent, there was a lot of owned data center hardware, co-location and everything. Same kinds of problems, just at a different level. There were more modern setups at the startup, and less modern things compared to what I’m doing now. But you can’t always just plug in servers, right? And if you can, it’s going to take you a long time to get them provisioned. You have to install the images, ensure all the networking is correct, make sure it can talk to the databases it needs to talk to. So it’s a long game.

You have to do a lot of gymnastics there, and we did a lot of this at Booking—because the hardware and software you're working with in a company like that is very difficult to change. It’s making a lot of money, it’s serving a lot of traffic. Do you need to change it? Hopefully at some point. But it's still making money and keeping the lights on.

So the challenges there became: how can we get the most out of this? How can we tune it? What can we do to aggregate the traffic? That was load balancing, CDNs—all that good stuff that allows you to reduce incoming load and distribute it around. That was really challenging in some ways because it’s very slow to work with, and you’ve just got to rely on the things you've already got.

So there’s a big piece of resiliency when you're working with your own hardware. Compared to somewhere like Rapha, which is a smaller business, there was a lot more freedom and opportunity in terms of relying on other people to do that lifting for you. The unsung hero was the CDN software we were using at Rapha in a lot of ways—which, while I was there, was Cloudflare. It allowed us to absorb a lot of load and trust, ultimately, that it would serve traffic in a timely manner for us. We really piggybacked off that.
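
As an aside, the "piggybacking" described here usually comes down to cache headers: anything that looks the same to every anonymous visitor gets a long shared-cache lifetime so the CDN edge can absorb the read traffic, while anything personal bypasses the cache. Below is a minimal sketch of that split, assuming a hypothetical Express origin sitting behind a CDN such as Cloudflare; the routes, values, and payloads are invented for illustration.

```typescript
// Illustrative only: a hypothetical origin behind a CDN, not Rapha's stack.
import express from "express";

const app = express();

// Product pages look the same for every anonymous visitor, so let the CDN
// serve them: s-maxage controls the shared (edge) cache, and
// stale-while-revalidate keeps responses fast while the edge refreshes.
app.get("/products/:id", (req, res) => {
  res.set("Cache-Control", "public, s-maxage=300, stale-while-revalidate=60");
  res.json({ id: req.params.id, name: "Example jersey" }); // placeholder payload
});

// Anything tied to a session (cart, checkout) must never be cached at the edge.
app.get("/cart", (req, res) => {
  res.set("Cache-Control", "private, no-store");
  res.json({ items: [] }); // placeholder payload
});

app.listen(3000);
```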

So to loop it around and wrap all that up, the challenges were kind of the same: different scale, different technologies, but the customer expectation remained the same. People want a seamless user experience without delays, and to be able to carry out what they’re doing in a timely manner. So although one was hardware-based and serving hundreds of thousands of requests per second, and the other was not, you still had to meet the same customer expectations. That was quite interesting—to see the difference between a very customer-focused business like Rapha in the apparel world, and a services company that’s aggregating and selling access to vehicles. But the customer expectations were still the same.

Jose
The traffic—I mean, you have big load on Booking.com, right? But I guess the difference between the baseline and the peaks is way different than with Rapha?

Tristan
I think one of the most interesting things about Rapha, which you sort of take for granted at the time, but when I reflect on it, it’s quite interesting, is just how international it is as a business. By that I mean they have a presence in Asia, the US, Europe, Australia, and the Oceanic countries too. So all of these markets—for a company that's not huge—are ongoing, and you have to serve them all and delight them all at the same time.

That’s a hard challenge as well because, with something like rentalcars.com, there are different companies serving different regions and therefore different specialisms around how people rent cars. Do people even rent cars? Is it motorcycles, for example, in different areas? And we didn’t really have to deal with all of that—we were happy focusing on the key markets and making sure it was a competitive service in those. Sometimes there’s just no provider in a certain location, so you're trying to aggregate nothing, which definitely happened.

In comparison, Rapha’s website had to be up all the time. It had to be able to serve traffic. We had, at that point, numerous distribution centers—being able to just post out things from different locations. That’s an interesting problem too.

In terms of peaks and troughs, I would say it was a bit more static at Booking, because people are always doing something. You don’t always have a clear idea of what that is, but there are definitely popular spikes around school holidays in particular. There’s also the commercial element—Booking was driven by partnerships. So if you’re booking a flight, you might see a car rental. If you’re booking a hotel, you might see a car rental. And you can’t always predict that, because there might be a sale on, or a deal with a partner, and suddenly you get a spike. So there were many small peaks, driven by different things.

At Rapha, the traffic and trends we could analyze were much more based on standard consumer behavior and trends—Christmas, summer, and so on. But also around major events. You’d have a lot of racing and road cycling. For example, if someone from a major team wins a big race, traffic spikes. So then, yeah, you try and make the most of that. That’s very good.

Jose
One thing I was curious about—you mentioned it's a small company but doing business around the world. From a support and SRE perspective, how did you make that work? Did you have a central team? A distributed team across the globe? How did that work?

Tristan
Yeah, that’s a really good question, and something that was quite challenging at times. The real answer to that is: hard work. In an ideal world, we would’ve had teams supporting it, technical teams or even just a technical individual somewhere else who could understand the setup. But I think we did a really good job of making the most of a monolithic infrastructure and a monolithic service at that point. So, you know—servers with lots of resources, serving many, many requests.

A lot of it was just toil, really striving for uptime, which was pretty successful, I’d say. We had our moments, of course. I think the lockdown brought some of those moments on quickly, with the increased demand for cycling products. People had nothing else to spend their money on apart from bicycles, which led to a bicycle shortage. That made things interesting.

It was also a moment where expectations shifted in the industry. People now expect software and infrastructure teams to be capable of solving incidents. You’re seeing many companies spring up—Rootly, for example, or Incident.io—trying to bring better tooling for transparency and better communication during outages. We had a few conversations with those companies. Incident.io at that time was just starting out—incubated at Monzo Bank in the UK. They have offices now in New York and they’ve grown a lot.

But I think fundamentally, the expectation for software and infrastructure teams to be on call is becoming more and more a given. That might mean it’s part of your contract, or that it’s unpaid, or you get time off in return. At Rapha, with an international customer base and a small team, the real challenge was: how do we ensure this is up all the time for all our key markets?

The other challenge was: how do we do this without relying heavily on headcount? How do we use best-in-class services to support us? Things like CDNs and firewalls, to ensure we’re not exposed. It’s that age-old problem of trying to mature a small organization, pushing forward the need for reliability without burning everybody out.

There were some really fun incidents—and some really tough ones. The fun ones are when you're taken offline or having scaling problems because of too much traffic. That’s a good problem to have. We had a collaboration at Rapha with an artist in LA during lockdown, and we just had too much traffic. That was great—but also terrifying. It shouldn't happen, but it’s a nice problem when you’ve got that demand.

Then there are the incidents where you just mess up. And those are the ones you really need to learn from. When too many people are banging on the door, that’s a good problem. It can happen once, but you have to learn from that. But when you as a team mess up, you really have to take learnings from that.

Jose
I think it's a super interesting area, right? The team that I lead here at Queue-it—we’re across the globe, providing 24/7 support, kind of a follow-the-sun approach. I think it works super well; it’s a super good team. It also comes with its challenges in terms of time zone differences and distance itself, right? So it’s a hard one to get right.

There are always things to improve on. But it’s very rewarding when you see things working well across the globe, and teams working well together, supporting our customers and supporting the systems that we have up and running. That’s very rewarding.

Tristan
And those moments where you have to bring things together and everyone brings their own different expertise—for me at least, maybe this gets forgotten sometimes in the industry, but those moments, especially when trying to come up with solutions for things... For example, with Queue-it: how do we ensure we stay up during high demand peaks? Those moments to me are the best in technology. That is what gets me going. That’s what gets me riled up, motivated, and excited—those moments where you have to deal with something.

And yeah, it's difficult. Of course, you’re following the sun. Sometimes you're up at 4 in the morning doing something—you have to be. It’s part of the job, almost. But those are the moments where I’ve learned the most. Those are the moments where I’ve seen the best in people and had lasting experiences—where you’re in the trenches together, fixing things. It’s difficult, you’re having a bad time, but you laugh about it later and talk about it later.

The stakes are high, because normally you're just doing a job—completing different tasks, improving things, hopefully monitoring things, improving the views. But at least in the SRE world, it’s all in preparation for something going wrong. And you try to prepare the best, but you're always going to miss something. From my experience over many, many years, those are the moments I think about almost fondly. Don’t get me wrong…

Jose
Yeah, I think I get what you mean. Yeah.

Tristan
In the moment, it’s awful. But afterward, you look back fondly, and you can learn things and meet really good people. I’ve got a friend who said to me once, “I love an incident.” You know, he’s like, “Let’s get in there, be the incident commander, communicate things out, come up with a solution, action the solution.” And I think he’s right. I really like those things, and I love those moments.

I think there’s an element now where we expect incidents not to happen—because that’s kind of the downside of using services where companies promise to solve everything for you. But generally, technology is a lossy system. We should expect things to go wrong. That might be like Cloudflare’s BGP route going down tomorrow and suddenly you can’t get to your services. Those things are really interesting. They’re not easy, but yeah, I think there’s a lot of value in that.

And I hope juniors have the opportunity to just get thrown in the deep end and pray nothing goes wrong—because that’s how you learn.

Jose
I fully get what you mean. It’s when the pressure is on that, hopefully, we also learn the best and see the best in people. I see that.

Maybe just double-clicking on Rapha, right? You talked a little bit about some of the tech stack. Maybe we can go a bit more into it. You mentioned Cloudflare as the CDN. It would be interesting to talk a little bit more about the rest of the stack there. And then maybe give some background on how you got into the limited edition collabs and product drops—and give a little context before we get into how you worked toward being able to handle those peaks.

Tristan
Yeah, sure. I’ll just delve in quickly at a high level on the infrastructure at that moment. So it was very few servers—big servers, lots of resources—in a co-location, serving all of the traffic to an international audience. But it was slow, it was archaic, and it was difficult to ensure uptime during upgrades. There were a lot of manual processes, a lot of typing commands. You’d get a playbook from someone who worked there a long time ago—they knew how it worked, but they’re no longer there. And you're like, okay, this is the infrastructure.

The CDN in front did a lot for us, and I think it’s a good product, fundamentally. But behind the scenes it was interesting. Coming in from somewhere like Booking, where you’ve probably got redundancy ten times over, to see an international business being served by very few servers—it worked, but it was definitely a wake-up call.

There was a moment in my first few weeks at Rapha where I asked, “How many servers have we got?” And I’d come from around 2,000 servers—lots of hypervisors in a standard VM environment, plus some physicals for raw bare-metal performance. And the answer was, “Oh, we’ve got less than 100.” That was a moment for me—wow, that’s not very many. I expected more. And you realize that while the company is mature, it's not a tech company. That was a really interesting shift for me. What’s the focus of someone like Rapha? It’s delighting customers. It’s a loyalty-driven brand. The mission is to sell the best cycling garments possible.

So the infrastructure was rough and ready. We had to rely on, as mentioned, the services around it to ensure uptime. That costs money. It costs more than running servers yourself, but it's easier to manage. You can do changes at the click of a button, but it doesn't scale very well.

It’s one thing to serve consumer traffic in a timely manner and bundle things up via a CDN to get them to the customer. But it’s another thing entirely when you have to deal with back-end event handling—like, someone’s made a purchase, they’ve paid, now send a message back to continue the process. And those things didn’t always work very well.

I mentioned the collab with the artist earlier. That was becoming a bigger thing. Everyone knows those businesses that ride hype—Supreme, Obey, Palace in London, all those super cool “hypebeast” companies. We started doing similar product drops with different collaborators.

At that point, Rapha had this amazing, state-of-the-art custom platform. You could use it to design your own kit, different patterns, different logos, whatever you wanted. It would get manufactured and delivered. Now it’s all bespoke again, which I think makes sense. But at the time, you had these opportunities to do collaborations.

Companies or individuals would come in saying they wanted to work with Rapha, or someone at Rapha would reach out to people they admired and invite them to do a limited edition collection. They’d work on patterns and put the drop together.

So two really interesting elements there, and that set the stage for something bigger. You’ve proven you can do this at a small scale. So what about a bigger drop with a bigger name?

That moment came with maybe three weeks’ notice. We did the first for the Giro d’Italia during lockdown—it had been rescheduled slightly. And we’d already done a few successful drops. We’d learned from them, scaled to the best of our ability, put limits in place to keep the website up during various drops. But then the conversation shifted to: okay, we’ve got something bigger than we’ve ever done before. Is the infrastructure capable of staying up?

And you look around, check the metrics, dig into system load from a database perspective, from frontend all the way through. You look at it and think: captive audience, lockdown, we’re in danger here.

We needed to find a solution to scale it. And the worst thing you can do in those moments is go down. You've got a captive audience—maybe someone sneakily on their phone under the desk at work—trying to buy this super cool product. You have to stay up. You have to delight the customer. And that’s when we had to start looking into alternatives.

Jose
Do I hear it right, that you didn’t actually get to experience the bottlenecks because you kind of did the homework? You mentioned database and frontend, so I guess you’re saying that somewhere in there, you would’ve hit bottlenecks, but you actually didn’t because you planned ahead and prepared.

Tristan
For the most part, yeah. We had one. We had one outage where I hadn’t been there that long, and things just started falling over. It’s hard to see how much traffic you’ve actually got coming in sometimes. That’s the problem with these sorts of systems—when you're trying to link everything together, it becomes really difficult. You’re getting metrics at each level, and you're not entirely sure which one is correct. There are often discrepancies between them, so you have to spend a lot of time stitching things together.

And that was a luxury we just didn’t have at that moment. We didn’t have the observability we needed to make confident decisions. But yeah, everyone loves the word "iterate," right? And in that sense, iterating became: how do we stay up? Because we didn’t last time. Those were the iterations. We had a lot of learnings.

There were definitely times when the website was down. I think the team did a really good job bringing everything back up, but we had to learn from those moments to really know where the bottlenecks were.

Jose
And then once you got to the conclusion, okay, we need to do something so we can handle the really big drops—what were the alternatives that you looked at? Can you share a bit?

Tristan
We were kind of fortunate that we’d already done a big migration away from a co-location setup, where we had servers and hypervisors, into AWS. So we had the ability to scale. But, as always, you're only as strong as your weakest link. And in that case, the weakest link was definitely the database.

This happens a lot. Unfortunately, it was a particular type of database that’s expensive to scale—licensing and things like that. So then you face a trade-off: do you scale this and spend X amount of money, then scale it back down? Is that safe? Do you even want to scale it back down?

We looked at building out more servers. We looked at improving the auto-scaling groups on AWS, so you have the ability to look at those things and really improve there. But fundamentally, it becomes a moment where you’re just not confident. And we didn’t have much time. You’re thinking: this could be one of the biggest trading days for the business. And I’m not sure the systems we’re relying on are up to the task.
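
For a concrete picture of the auto-scaling work mentioned above, here is a minimal sketch using the AWS CDK in TypeScript. It is an illustration only, not Rapha’s configuration: the stack name, instance size, capacities, and CPU target are all assumptions.

```typescript
// A sketch of a web-tier auto-scaling group with a CPU-based scaling policy.
// All names and numbers are illustrative assumptions.
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as autoscaling from "aws-cdk-lib/aws-autoscaling";

class DropWebTierStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    const vpc = new ec2.Vpc(this, "Vpc", { maxAzs: 2 });

    const asg = new autoscaling.AutoScalingGroup(this, "WebAsg", {
      vpc,
      instanceType: ec2.InstanceType.of(ec2.InstanceClass.C5, ec2.InstanceSize.LARGE),
      machineImage: new ec2.AmazonLinuxImage(),
      minCapacity: 4,  // pre-warmed floor ahead of a drop
      maxCapacity: 20, // headroom for the peak
    });

    // Grow the web tier with CPU load. Note this does nothing for the
    // database, which remains the weakest link described above.
    asg.scaleOnCpuUtilization("KeepCpuReasonable", {
      targetUtilizationPercent: 50,
    });
  }
}

const app = new cdk.App();
new DropWebTierStack(app, "DropWebTier");
```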

So you start doing what you can—this is where it gets fun. The only real way to test it is to do some quick load testing. Ideally, if we had all the time in the world, I would’ve loved to do load testing monthly. But we didn’t have that much time.

So you’re doing load testing, but you don’t know where the high mark or the low mark is, so you’re hammering the systems and thinking, okay, that’s pretty bad, let’s tune it down a bit. You’re trying to find that middle ground. And in those moments, you need something you can parachute in—something both you and the business have confidence in.
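
Here is a rough sketch of that "hammer it, then tune it down" approach, written for k6, an open-source load-testing tool chosen purely for illustration (the conversation doesn't say which tool was actually used). The target URL, virtual-user counts, and thresholds are placeholders.

```typescript
// Step virtual users up until errors or latency degrade, then back off to
// find the safe ceiling. All values are placeholders.
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 200 }, // ramp up
    { duration: "5m", target: 200 }, // hold and watch the dashboards
    { duration: "2m", target: 600 }, // push harder to find the high mark
    { duration: "2m", target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_failed: ["rate<0.01"],   // more than 1% errors means back off
    http_req_duration: ["p(95)<800"], // 95th percentile under 800 ms
  },
};

export default function () {
  const res = http.get("https://staging.example.com/collections/drop"); // placeholder URL
  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1);
}
```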

Everyone’s looking around in the go/no-go meeting thinking, okay, the product’s ready, can we sell it? Is the platform ready, available, operational?

It was an interesting moment to have those conversations. And yeah, it was maybe two weeks before the drop. In an ideal world, it would’ve been months before. But when an opportunity presents itself, you have to act. That’s business. You react. So it became quite clear that we needed all the help we could get.

Jose
And then in the end, you found that help with the virtual waiting room, right?

Tristan
Yeah, I think we were really lucky in some ways. It’s a product I knew kind of existed, but I didn’t know the full use case. And what I mean by lucky is, this was during the COVID pandemic. So I’d experienced it trying to buy groceries—you’re trying to go to a supermarket website, and it’s like, “You’re in a queue, come back in an hour.” You’re like, what is this? I just want to buy some food.

And some of those shortages were just totally unreal. You’re trying, and maybe you can’t go out of your house, or you’re buying for someone else—family members who don’t know how to order food online. I spent a lot of time in those queues. That’s when it became normalized. It was like, okay, this software exists.

There were pieces of software I’d probably used before, like for ticketing, but it was really reinforced during the pandemic. Seeing it in widespread use helped it click. So instantly, I thought, okay, this is the solution. This is a big lever—a parachute we can pull and say, “Hey, come to our aid.”

Then it was just a matter of figuring out who that was. Lots of frantic Google searches: which ones are the industry leaders? What’s the best one out there? Who can we get a meeting with quickly? We basically looked around using the wonderful world of Google, figured out who the market leaders were, and started reaching out.

But I think, without the pandemic, it would’ve been a more arduous process with less certainty. You’d be trying to figure out a solution you maybe saw once. How easy is it to bring that in? But if a supermarket in the UK can parachute this in, then we probably can too. It gives you a lead, something to chase down.

Jose
Good. And you said you didn’t have a lot of time—I think I heard you say a couple of weeks. So how was the process of setting it up, and what was your experience—and your team’s experience—during the drop?

Tristan
It was super interesting. We ended up doing two launches, and this was the first one. We used Queue-it for both. But there was a lot of uncertainty, if I’m honest, because it’s not something you own, right? And it’s sitting in front of your site. If we go down—because we got the numbers wrong or the software doesn’t work as expected—it’s going to be hard. So you’re putting a lot of faith in it.

And yeah, two weeks—it was frantic. It was like, “Here’s the problem, I know what I need to do—how can I do it?” We had a few options, but we needed the easiest to deploy, the easiest to have confidence in. We were on Cloudflare as a CDN, which saved us a lot of stress, and luckily there was a connector. We’d been playing around with serverless—Cloudflare Workers were still early then. And thank God there was an integration with Queue-it.

We got access, had a call, tested things out. There was a lot of trial and error. And that’s also what’s interesting about those moments. In tech, you're always trying to replicate the big companies. Whether you like it or not, you’re looking at Google and thinking, wow, look at what they’re doing, and you’re always trying to bring something into the business that’s been proven at scale somewhere else, in the hope that your business gets to that scale too.

But now there was this idea that we’re not committing code to make these changes. That was kind of a big moment. You’re used to rolling things out in code—that’s best practice. But now you’ve got this piece of software where you’re copying and pasting code snippets, figuring out how it works. You’re in a web UI writing code to put in front of your site and handle every single request.

There’s this element of wanting to understand how it works—not just clicking around and saying, “Yeah, okay, that’ll do.” It’s like, how is this doing what it’s doing? So you end up thinking about the code, thinking about the architecture. But without it, it would’ve been really challenging.

We were able to just deploy—whether it was Cloudflare or any CDN, it became our gate. And with Queue-it, we were able to put a bouncer on that gate, so to speak. Like, “You’re coming in,” or, “Hold on, you’ve got to wait a little.”
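
To make the "bouncer on the gate" idea concrete, here is a purely illustrative Cloudflare Worker sketch of an edge gate. This is not the actual Queue-it connector: the cookie name and waiting-room URL are invented, and a real integration issues and validates signed tokens rather than doing a simple cookie check.

```typescript
// Hypothetical edge gate, NOT the Queue-it connector.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);

    // Let static assets straight through; only gate pages that hit the origin.
    if (url.pathname.startsWith("/assets/")) {
      return fetch(request);
    }

    // Visitors who have already been through the queue carry a pass cookie.
    const cookies = request.headers.get("Cookie") ?? "";
    const hasPass = cookies.includes("queue_pass="); // hypothetical cookie name

    if (!hasPass) {
      // Send everyone else to the waiting room, remembering where they wanted to go.
      const waitingRoom = new URL("https://waitingroom.example.com/"); // placeholder URL
      waitingRoom.searchParams.set("target", url.toString());
      return Response.redirect(waitingRoom.toString(), 302);
    }

    // Passed the gate: forward the request to the origin as normal.
    return fetch(request);
  },
};
```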

Being able to deploy it quickly was a real win for us. That got us to proof of concept fast. Because I didn’t want to spend all our time at the 11th hour deploying something to a web server. You could try to spin up your own queue, maybe use a cache or something, but confidence in that was low. So we wanted to get something working fast—and then start figuring out what the experience would feel like for the user.

I think we did a good job, but we didn’t do it on our own.

Jose
You mentioned one point there about putting faith in us at that time, as a partner—and that’s something we really take to heart. It’s super important to us, because the customers who work with us are exactly, as you said, putting their trust in us to help—especially during what’s usually the most difficult but also the most exciting of times. And I bet that drop was a big moment.

Tristan
Yeah, it was a real moment. The thing that’s really interesting about collabs—this one was with Palace—you have a conversation with Palace from their technical side and you're asking, "What should we expect from a request perspective?" And they’re telling you, and you’re thinking, okay...

But their whole offering online, their whole web experience, is built specifically for drops. And you’re aware that that’s not the case for you. That’s not how your infrastructure is set up. That’s not how you run a cost-effective platform day to day. So you’re panicking a little bit.

I think that’s where the human connection becomes really important—something that can easily be forgotten. You're a tech professional, and you're asking someone for help. And that’s hard. There’s a moment where it's like, psychologically, you don’t want to look like you’re not doing your job. But the reality is—you can’t scale, and someone will ask, “Why?” And you’re like, “Well... we just can’t.” And you don’t care anymore about that question. You just need a solution.

You’re kind of panicking a bit, because as a company there was a lot of uncertainty: can we stay up? Can we really do this? And it was under embargo, so I was told about it only as early as they felt they could tell me. I wish I’d known earlier.

But that’s where relationships—and those individuals, those customer success folks, team members from the partner side—really come into it. Because you're panicking, and you need someone who’s calm, collected, who’s done this before, to say, “Yeah, obviously, you just do it like this. Don’t worry about it.” And then you're like, oh, okay.

It’s that level of care and trust that’s worth everything in those moments. When you’ve got a partner who’s confident, they can lift you up and give you confidence too—especially in those moments when you’re struggling.

Jose
That’s a really good story—and thank you for sharing. It was a pleasure for us. It predates me at Queue-it, but it was a pleasure for us to be part of that journey as well.

Was there one or two main learnings for you from that whole process—something you’d maybe share with other engineering managers who might be facing similar challenges?

Tristan
Yeah, I think I mentioned it a little there, but that’s the main learning for me: don’t expect your technology team to solve every problem. You cannot solve every problem all the time. You need to play to your strengths and understand that the business needs to be flexible—and there are solutions out there that enable that flexibility.

That’s hard to accept, because you’re going to be proud. You’re going to think, “I’ve got a great team, we can solve it.” And sometimes you can. But sometimes it’s just not worth it. What’s the point of being able to scale to 100,000 requests per second, or per minute, when that’s not your normal traffic flow? You can probably spend that time elsewhere doing something more valuable.

That’s the best learning: the show must go on. And if someone can help you do that—to orchestrate the show, or give you the confidence you need to take back to business leaders and say, “We’ve solved this, we’re comfortable, here’s how it works”—that’s the win. That was the best takeaway for me.

Because before that, I was very much of the mindset: you’ve got to write it yourself, it has to be in code, infrastructure as code, all the rest of it. And this was the first time I had a solution I could just plug and play in the moment. And that was wonderful—because it worked.

It made me realize that, yeah, sometimes the business is more important. And you just have to make those calls. And that’s happening more and more—you see it with businesses like the supermarket. Others are making those same decisions. It’s a known solution now.

You’ve got to focus on your core product. That’s the most important thing. The bottom line. We get caught up in pride and expectation, but it’s about confidence and solving the real problem.

Jose
I know we’re almost out of time—which I think, in this case, is a good thing. It was a really great chat with you, Tristan.

We’ve got a few rapid-fire questions to wrap it up. You don’t have to think too much—just say what comes to mind.

First one is: scalability is?

Tristan
Hard. That would be my answer. It’s not always easy. It’s still not easy. You need a lot of expertise. You need to be in the right place—and it’s hard to get to that right place.

Jose
Is there a resource—it could be a book, a podcast, maybe a thought leader—you’d recommend to people working in this area?

Tristan
Yeah, that’s an interesting one. I think the “Bibles” are all the Google books. The Site Reliability Engineering workbook is fantastic to go through as a mental process. The learnings are valuable, but it’s not the textbook in the traditional sense. You have to adapt it.

Everything in there is great, but it’s written by engineers at companies operating at massive scale. You might want to be there someday, but you might not be there yet. So those books help if you bring pragmatism to them.

I’d still recommend the original SRE book, Seeking SRE, and The Site Reliability Workbook. Those are still worth reading—and the workbook gives you some practical, hands-on examples too.

Jose
And the last rapid-fire question—I think you already answered it a little earlier—but what advice would you give to your younger self?

I’m thinking of the point you made: you don’t have to solve everything yourself. It’s okay to find out-of-the-box solutions when you need them. That was a great insight. But is there anything else you’d share, either with your younger self or someone just starting their career in this area?

Tristan
Yeah, I think the best piece of advice I’d suggest, from my learnings so far, is: you can’t be a generalist. There’s a lot of that that goes on, but it’s very difficult to be a good generalist. It’s actually quite a bit easier to be an amazing specialist. And I think that deep, ingrained knowledge, the kind you get from weird incidents, from understanding how things work under pressure, when things aren’t behaving as expected, that’s what you should aim for.

Don’t expect to be an amazing database administrator and an amazing site reliability engineer and an amazing software developer all at once. You have to learn about all these things—start broad, but don’t be afraid to narrow in. I think that’s becoming more and more true, especially in the world we’re heading into now.

You might have a colleague that’s a generative AI model, and that model is the expert. You need to know how the systems work and the systems thinking behind it. That’s where I’d recommend people spend their time. You only really get there by specializing early on, then generalizing, then re-specializing with more context and experience.

And one more thing is just time. You’ve got to sponge things up over time. You can’t figure everything out in all the hours in the day and all the late nights. You should be on the on-call roster. It’s going to suck, but you’re going to learn a lot. And you’ll be amazed at how comfortable you become getting up at 3 a.m. to fix something—and how much satisfaction you get from fixing it.

Jose
That was a beautiful way to wrap up. Thank you so much, Tristan, for your time.

Tristan
It’s been wonderful. Thanks for the conversation. It’s been really rewarding.

Jose
And that’s it for this episode of the Smooth Scaling Podcast. Thank you so much for listening. If you enjoyed, consider subscribing—and maybe share it with a friend or colleague. If you want to share any thoughts or comments with us, send them to smoothscaling@queue-it.com.

This podcast is researched by Joseph Thwaites, produced by Perseu Mandillo, and brought to you by Queue-it—your virtual waiting room partner.

I’m your host, Jose Quaresma. Until next time—keep it smooth, keep it scalable.

 

[This transcript was generated using AI and may contain errors.]
