Kolton Andrus is the CEO and founder of Gremlin. Prior to that, he focused on building and operating reliable systems at Netflix and Amazon. At both companies he operated systems at scale, managed company-wide incidents, and helped build out their respective reliability programs and toolsets.
Episode transcript:
Jose
Hello and welcome to the Smooth Scaling Podcast, where we talk with industry experts to uncover how to design, build, and run scalable and resilient systems. I'm your host, José Quaresma, and today with me I have Kolton Andrus, the CEO of Gremlin, an enterprise reliability platform.
We had a great chat about system reliability and how Kolton has used chaos engineering strategies across several major companies and as a key part of Gremlin. I really enjoyed his insights on how important it is to focus on reliability within the engineering teams, how to measure invisible improvements, and how to use them to make the case for reliability to leadership and business. Enjoy.
Welcome, Kolton. It's a pleasure to have you here.
Kolton
Thank you very much. Pleasure to be here.
Jose
Maybe before we get right into the technology discussion, can you tell us a little bit about yourself, kind of your journey? You have been in this world of reliability for many years with very interesting roles. Can you tell us a little bit about that and how that intersects with chaos engineering and what's beyond that?
Kolton
Sure. So I'm an engineer by trade. I worked for a couple of startups before I ended up at Amazon. I was on the Amazon retail website availability team. So we were in charge of making sure the website didn't go down. And if Amazon.com is down, people notice. So we did a lot of work. When I joined, I was a single engineer on a team of PMs. And we had this idea that we need to go out and get more proactive, you know, instead of just waiting for the postmortem or the incident review and for things to go wrong, we needed to get in front of it.
And so, interestingly, great ideas come about at similar times. We had really come up with this idea for something that internally we called Gremlin, which was about going out and causing some mischief to see how people responded. And this was happening at about the same time Netflix was coming out with Chaos Monkey.
So it's kind of cool to see another company thinking about it. They were building it out. They were just rebooting hosts. We were doing the whole gamut: CPU, memory, disk, IOPS, network failures. But they were on board with the idea.
So I rolled that out within Amazon. We had dozens of teams use it. We saw a lot of value. I actually moved into management for a year. I did work on performance. So I was in charge of making sure the website was fast and we did a lot of optimizations. And I kind of finished my four year tour of duty at Amazon and I was done. I was ready for the next thing.
So I went and joined Netflix and I went and found the teams that were working on Chaos Monkey and on the reliability stuff. And I went and joined one of those teams. And when I showed up, it was great. They were culturally bought in. They cared about it, but there was still some room for improvement on tooling. And I had the opportunity to build application-level fault injection. So something, you know, code-level, that we put in our libraries, our RPC library, our platform libraries, and then everybody could inject failure basically at the method call level. And that was very powerful.
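As a rough illustration of what method-level fault injection can look like, here is a minimal Python sketch using a decorator and a central fault registry. The names (fault_registry, inject_faults, InjectedFault) are purely illustrative assumptions, not Netflix's or Gremlin's actual APIs.

```python
import functools
import random

# Central registry: which call sites should fail, and how often.
fault_registry = {
    # "service.method": probability of injected failure
    "recommendations.get_titles": 0.0,
}


class InjectedFault(Exception):
    """Raised in place of a real downstream error during an experiment."""


def inject_faults(call_site):
    """Wrap an RPC/library method so experiments can force it to fail."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            failure_rate = fault_registry.get(call_site, 0.0)
            if random.random() < failure_rate:
                raise InjectedFault(f"fault injected at {call_site}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults("recommendations.get_titles")
def get_titles(user_id):
    # A real implementation would make a network call here.
    return ["title-1", "title-2"]


if __name__ == "__main__":
    # Turn the experiment on for 100% of calls to this call site.
    fault_registry["recommendations.get_titles"] = 1.0
    try:
        get_titles("user-42")
    except InjectedFault as exc:
        print(f"Caller must now degrade gracefully: {exc}")
```

Because the wrapper lives in the shared platform libraries, every team gets the ability to inject failure at any call site without changing its own code.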
We went through and we did a lot of testing. My team was the edge platform team, so we owned the proxy and the API. All network traffic flowed through us, and if we went down, everybody noticed. And the biggest thing we did is we went and tested all the mid-tier services. We went through each of those and we made sure they could gracefully degrade and handle as much failure as possible.
And the outcome of that was when I joined Netflix, we were at three nines of uptime or about eight, eight and a half hours of outage a year. And when I left, we were at four nines of uptime or under 45 minutes of downtime for the year. So we saw demonstrable benefits from this, saw a lot of value.
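For reference, a quick back-of-the-envelope calculation of the annual downtime budget each availability level implies (the figures quoted above are the speaker's approximations; an actual year can come in under budget):

```python
# Rough annual downtime budget implied by each availability level,
# to make the "three nines vs. four nines" numbers concrete.
HOURS_PER_YEAR = 365 * 24  # 8760

for label, availability in [("three nines", 0.999), ("four nines", 0.9999)]:
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{label}: {availability:.2%} uptime "
          f"~ {downtime_hours:.1f} hours ({downtime_hours * 60:.0f} minutes) of downtime per year")

# three nines: 99.90% uptime ~ 8.8 hours (526 minutes) of downtime per year
# four nines: 99.99% uptime ~ 0.9 hours (53 minutes) of downtime per year
```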
And while I was at a conference giving a talk about the great work we've done, I bumped into some venture capitalists. They asked me if I was planning to found a company. I was. They asked me if I was planning to raise money. I told them I wasn't. I wanted to bootstrap. We talked a little bit longer. At the time, I lived in California. I had five kids. They were like, you should take some money.
So that was kind of the genesis of the company. I left Netflix. I founded Gremlin. We really thought, hey, this is something everyone's going to need. This is important. It's hard to do right. Let's go out and take what we've learned at Amazon and Netflix and bring it to everyone else.
Jose
And when was that that you then started Gremlin? How many years ago was that?
Kolton
2016, January 2016.
Jose
That's already an impressive journey in that area. So maybe when you started Gremlin, then what surprised you, right? Because you had this hypothesis, this idea that, okay, there's this need out there for this type of work. So what were the main surprises there when you started building the company and reaching out to prospects?
Kolton
Yeah, I think one pleasant surprise was there was a lot of excitement, there was a lot of interest. People were really curious. Sometimes that curiosity turned into future competitors and internal teams that wanted to build it themselves, who showed up and learned a lot. But I think that's all goodness. It makes the market stronger.
I think one of the less positive surprises was the maturity of the general market. If you come from Amazon and Netflix, you're really working with a lot of top-tier engineers. And look, there are top-tier engineers everywhere.
By the way, one of my favorite things to say is, look, the sausage is made the same everywhere. Netflix and Amazon have duct tape and baling wire all over the place. So don't feel bad. They have this perception of being great, and they are great, but I think people take it a little too far sometimes.
So working with a lot of customers, you know, their leadership wasn't bought in. They weren't culturally bought in. They didn't have the freedom and the autonomy to go make big changes or make big process changes.
And I think this is one of the things we've learned in this space with reliability. You know, if you treat it like a nice to have, it'll be nice to have. And you'll end up, you know, dealing with outages and feeling the pain. If you treat it like you do security, a must have, you can button it up and have quite a tight ship. And so I think, you know, like many things in software, the hard part isn't really the technology. It's getting the people and the processes to change in the way that we need.
Jose
You mentioned the processes, and I've heard you talk about this before. I think it happens with reliability, but also with security: how do you reward the right behaviors? I think it's super hard, because if you're doing everything right, nothing happens, right?
And my understanding is that Gremlin is really focusing on that part as well. How do you help people understand all the invisible work that is being done and all the improvements that are being done? So how did you come across that and kind of what did you do to address it?
Kolton
Well, I think you just hit on probably one of the biggest problems we had in the first arc of Gremlin and the first iteration of our product—which is, you know, we assume good intent. We're optimistic. We're engineers that want to do the right thing. And so we expect everyone else wants to behave that way. And what you learn is that's not necessarily the world everyone else is living in.
And so, the example I love to give is, you know, in the SRE and operations space in general, we have a bit of a firefighter hero culture. And we need to recognize and reward the people that jump in and save the day. And look, that was me. I was a call leader at Amazon. I fixed the Amazon retail website from the side of I-5, next to my motorcycle, in the rain. You've got to put in the time, you've got to do that work.
But you've got to reward the behavior you want. And so while you might need that behavior in the short term, the behavior you want in the long term is more boring. You want people to do this testing just like unit testing and integration testing: reliability testing, or distributed system correctness testing, I would call it.
And so if people are spending an hour a week, an hour a month, they're doing these tests, they're making sure they're passing and they're automating them, they're just not going to have issues.
Now, what's the incentive for them to do that, beyond being a good engineer and wanting to do the right thing? And this is the discussion I have with leadership: at your company, if I go do this great work and you have no outages, will anyone know, and will I get promoted? And if the answer is no, don't ask me why your engineers aren't doing this, because the answer is clear. It's motivation.
Jose
We talked previously about the number of nines at Netflix, right? Have you been able to link that to the measures you have now, I guess the scores at Gremlin? Are you trying to link them to the number of nines as well? I guess that would be a very objective way of measuring it, right?
Kolton
That would be the holy grail. I think what's hard is we have the defender's dilemma: anything and everything can go wrong. So if you do work and you prevent some of those failures, you're totally in the hypotheticals. You know, if a tree falls in the woods and no one's there, does it make a sound? Well, if you do all the work and you prevent an outage, would that outage have happened without you?
And so unfortunately, sometimes you get into these debates with the business where they're like, well, you just fixed some minor bug. And I'm like, that same minor bug might have been a four-hour, multimillion-dollar outage had it not been caught. But because we caught it early and did the right thing, you think of it as a minor bug. So again, it comes back to incentives.
Actually, a little story I'd like to share. At Gremlin, we kind of sucked at this five years ago. We had the tooling, we had the expertise, we had the intent, and we struggled to get people to just run the tests on a regular basis. It was a bit spotty. It was a bit one-off.
Over the last couple of years, I've focused a lot of my time and effort on product and engineering. One of the things we did when we built this new reliability management product was build these reliability scores, and they did a few things.
First of all, we have an on-call rotation at Gremlin. Every engineer takes a turn and every engineer runs all the reliability tests the week they're on call. So we just made it clear this is every person on the team's responsibility. We're not pushing this off to one person.
The next thing I did is in our company public meetings, our product operations meetings, and our engineering staff meetings, I pulled up the dashboard and I pulled up the score and I looked at it and I asked questions about it. And I said, this is important to me. Why is this going down? And the team actually started in the on-call handoff comparing the diffs of the scores so they could explain why it had changed.
Well, fast forward a couple of years: every person on my team has run all the tests, almost every test passes, and our scores are in the upper 90s. And, you know, Gremlin's platform, I'll brag for a minute: we've been at four to five nines for the lifetime of the company. We know what we're doing here, and we've practiced what we've preached. But the last couple of years have been exceptionally solid because of all the work we've done.
So it's hard to do, but if you, again, bring in that accountability and those incentives, you show your team that it's important, you can drive the right behavior.
Jose
Congrats on all the nines as well. I think it's always very good to see that you're also putting in practice all the things that you preach and getting the results there, right?
Kolton
Thank you.
Jose
You mentioned a hypothetical bug that was fixed and prevented a hypothetical four-hour outage, right? I would imagine that a big part of building the business case for doing this work is also trying to do the math on how much that outage would have cost, right? Do you see your customers go through that exercise? What is your experience with going through that?
Kolton
I would say, you know, what's difficult for us here is, if we could, we would measure the uptime of every service that uses Gremlin and how heavily they use Gremlin, so we could draw a correlation.
Unfortunately, a bunch of the uptime metrics are owned by the customer and we don't have them. And so we have to ask nicely: "Hey, are you doing this analysis? Could you do this analysis? Could we help you with this analysis?"
I have a customer that just did this analysis. They're a large, well-known telecommunications brand. Everybody knows them. They have a big sales event every year that they're preparing for. And one of the things they did is they went and ran this analysis. They ran it across the services, their uptime, and the number of times those services got paged.
And their analysis was: yes, the services that used Gremlin and reliability management were clearly paged less and were more reliable.
Now, what's funny is when you're a vendor, you know, you go to talk to management, you're like, “hey, your team did this great work.” And the leader goes, I don't know if I believe that. And they kind of just assume you're there to sell them more and to use it as an upsell opportunity.
And so this is what's hard: genuinely, we want the science behind this. We want to show that this works demonstrably. And it's such a hard problem, with all the what-ifs and what-could-have-happeneds, that we really need customers to do a little analysis, prove it themselves, and share that data with the community.
But people don't want to talk about their failures. People don't want to share their outage numbers. Most people, most SaaS companies I talk to are in the three nines world. Some are in the two nines world. And they don't want to be shamed and they don't want to feel bad because they're not up to where they think they should be.
So all this kind of contributes to people not really wanting to talk about it, not really wanting to dig into it. But then that means it's hard to make the business case. So with some of my customers, we'll come up on a renewal and we'll say, look, your outages are down. Your service uptime is up. Everything's good, right? And they're like, "Yeah, now we don't need Gremlin." And it's just a dagger to the heart, where, you know, my answer is always, well, would you go turn off your unit tests? Oh, all your unit tests are passing. Let's turn them off and shut it down. No, that's a bad idea.
Jose
We're talking about chaos engineering, right? And I know that you've mentioned before that maybe from a pure marketing perspective, chaos engineering was maybe not the best choice of names there, right? And I think you're also now looking at what's beyond chaos engineering. Can you tell a little bit more about that and what else do you see kind of within this area complementing chaos engineering?
Kolton
So chaos engineering is a fun name, an interesting name; it catches people's attention.
Since we haven't given a quick definition yet: chaos engineering is going out and purposely injecting failure in order to see how your system responds. We might go create a memory leak or an out-of-control CPU thread, or we might go drop all the network traffic to one of our dependencies to see what happens when that fails.
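To make the idea concrete, here is a minimal, self-contained Python sketch of one of the resource faults mentioned above: pegging a couple of CPU cores for a bounded time window. Real tooling does this far more safely and precisely, with halt conditions and cleanup; this is only an illustration, not Gremlin's implementation.

```python
import multiprocessing
import time


def burn_cpu(seconds):
    """Spin for `seconds`, consuming one CPU core."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass  # busy loop


def cpu_attack(cores=2, duration_seconds=10):
    """Peg `cores` cores for a bounded window, then let the load drop away."""
    workers = [multiprocessing.Process(target=burn_cpu, args=(duration_seconds,))
               for _ in range(cores)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    # Observe dashboards, alerts, and autoscaling behavior during this window.
    cpu_attack(cores=2, duration_seconds=5)
    print("Experiment finished; load removed.")
```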
So the chaos in the name, I think, is a bit of a misnomer, because what it implies is that we have to do this chaotically. And I think Chaos Monkey really cemented in everyone's mind: oh, we're going to randomly cause this failure to occur to you.
Now, take a half step back. I think that was a brilliant move by Netflix. Netflix, you know, early 2010s, they're moving to the cloud. They know that in the cloud, these hosts can go away. And so they need their developers to live in the real world, to understand that this happened. So they decide to make it happen to them in staging. So they feel the pain and they fix the problem. I think that's a great social mechanism to cause the right behavior. But it was a conscious choice. And I think that's where a lot of people just copy that choice without understanding why it was made or what the intent was.
And so chaos engineering, you know, chaotically, randomly, scares a lot of people. They're afraid their systems aren't going to withstand it. They're afraid they're going to cause a real outage. They're afraid they're going to cause real customer pain. And so I just want to be clear, everything we do at Gremlin is about avoiding outages, avoiding customer pain, doing it in the safest way possible. And that's how we run our ship.
And so chaos is a bit of a misnomer. We really want to engineer the chaos out of our system. We really want to understand our system. We want our system to behave simply, straightforward, to gracefully degrade.
So that's kind of my problem with the name chaos engineering. I think the approach of chaos engineering is a great idea: we're going to go out and we're going to cause this. But how have companies implemented it? Well, 10 years ago, it was tell the SRE team to go solve this problem for everyone. And by the way, that's not how we did it at Netflix and it's not how we did it at Amazon. We went team by team and said, "Reliability, efficiency, performance, those are all your problems. You write the code, you make sure it's performant, it's reliable, and it works." And so we went to those teams and we said, "Look, help us go build these reliable solutions."
So chaos engineering as an approach where you just have the SRE team go out and kind of do it for everyone doesn't really work. And one of the other reasons why is it's the engineers that need to go fix it. They need to understand why they've made a network call in a loop, and that's a bad idea. They need to batch it up. Or they need to understand that they're calling a dependency, and if it fails, they could gracefully degrade. They could have some cached fallback that they could return so they don't have to break the customer.
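A minimal sketch of the two mitigations just described, with hypothetical client and function names: falling back to a cached value when a dependency fails, and batching calls instead of making a network call per item in a loop.

```python
import random

_fallback_titles = ["popular-1", "popular-2"]  # stale but safe default


class FlakyRecommendationClient:
    """Stand-in for a real RPC client to a downstream dependency."""

    def get(self, user_id):
        if random.random() < 0.5:  # simulate the dependency failing half the time
            raise ConnectionError("recommendation service unavailable")
        return [f"personalized-{user_id}-1", f"personalized-{user_id}-2"]

    def get_many(self, user_ids):
        # One batched request instead of N network calls in a loop.
        return {uid: [f"personalized-{uid}-1"] for uid in user_ids}


def fetch_recommendations(user_id, client):
    """Degrade gracefully instead of failing the whole customer response."""
    try:
        return client.get(user_id)
    except ConnectionError:
        return _fallback_titles  # cached fallback: worse, but not broken


if __name__ == "__main__":
    client = FlakyRecommendationClient()
    for uid in ["a", "b", "c"]:
        print(uid, fetch_recommendations(uid, client))
    # Batched lookup instead of calling client.get() once per user:
    print(client.get_many(["a", "b", "c"]))
```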
So I think it's important the developer feels the pain. That's how we were taught at Amazon. You feel the pain, you'll make it more efficient.
The other thing is leadership has to think it's important. We just talked a bit about being promoted and recognizing it. I think the other thing I've learned personally in the last 10 years, if you treat chaos engineering as a nice to have, it's going to be a nice to have. If you treat reliability as a nice to have, it's going to be a nice to have. And so people that are out there running these tests willy-nilly, but not in a systematic way, people that aren't saying, “Hey, this is important, we need to do it as a business,” they're going to struggle.
The most successful customers enable the engineers, give them the tooling and the support they need to be successful. But they also draw a line and say, "Here's our quality bar and we're going to hold people accountable. And if you're not there, we're going to come ask you why you aren't there and we're going to tell you you need to get better," just like they do in security.
Jose
And by the way, are you still taking candidates for alternative names to Chaos Engineering? Because I actually had an idea the other day. What about chaos-proof engineering?
So you’re engineering something that is chaos-proof. You can inject chaos to test it out, but ultimately you want something that is chaos-proof. So if you think that there's some legs to it, you're welcome to take that suggestion further.
Kolton
I like it. I like it.
Jose
So we talked about chaos engineering and kind of deliberately, not necessarily randomly, but deliberately going out and creating some outages and some issues so that you can understand the system and improve it and making sure that it cannot happen in reality.
But what is then the issue, the gap between that and reliability, right? And reliability engineering, because you said it's more than just chaos engineering. So what are the other things around it?
Kolton
So chaos engineering is the how. We need to go do the testing, uncover the problem, and fix it.
But I think what's missing is the overall process you wrap around it. And so back to how do we give accountability? How do we help people understand where to focus their time?
What we did at Gremlin is we built reliability management. So we track your service within Gremlin. We track your code. We look for detected risks, we look for passive things. So some things we can tell you, "Hey, you're not following a best practice," without ever injecting a failure. That's just low-hanging fruit you can go fix.
We came up with our own test suite. We said, this is the 80-20 rule. These are the things that if you just do these things, this will get you the most bang for your buck. CPU, memory, lose a host, lose a zone, know your dependencies, and fail each dependency.
And then we aggregate that up into a score. And that score is really: are you passing, and have you tested everything? So it really captures coverage and quality in a single number.
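As a rough illustration, a score that combines coverage and pass rate might be computed like this; the test names and weighting are illustrative assumptions, not Gremlin's actual scoring formula.

```python
# Combine coverage (did you run everything?) with quality (did it pass?).
TEST_SUITE = ["cpu", "memory", "lose_host", "lose_zone", "dependency_failure"]


def reliability_score(results):
    """results: dict mapping test name -> 'passed' / 'failed' / 'not_run'."""
    run = [t for t in TEST_SUITE if results.get(t, "not_run") != "not_run"]
    passed = [t for t in run if results[t] == "passed"]
    coverage = len(run) / len(TEST_SUITE)
    quality = len(passed) / len(run) if run else 0.0
    return round(100 * coverage * quality)


if __name__ == "__main__":
    print(reliability_score({
        "cpu": "passed",
        "memory": "passed",
        "lose_host": "failed",
        "lose_zone": "passed",
        "dependency_failure": "not_run",
    }))  # 4 of 5 run (80% coverage), 3 of 4 passed (75%) -> score 60
```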
So my customers that have been most successful, what we see is they were doing chaos engineering. They were struggling to scale it across the organization. A few teams did it well. The SRE team did it well. Your average team didn't want to do it, didn't know how to do it, and wasn't doing it. And so you still had a lot of issues. And this new approach where they're tracking it, it becomes a standard. “Hey, across our company, we're going to do this type of testing. And we're going to go through and we're going to run this test suite. We're going to get these scores. And then we're going to look at these scores on a regular basis.”
And that's what really helps the accountability and the ability to understand what's happening and reward the right behavior. Because now they have something to look at and something to measure.
In the old world, it was one-off. Hey, we found a bug. You know, they might have reported that to leadership once or twice a year. Leadership didn't know how big of a deal that was or how important it was. Sometimes they did a lot of work and something still broke. How do you get credit for making progress but not solving the problem?
And I think that's where we really wanted to understand: where are customers starting from? We need to baseline. Let's not say, you know, your goal is 99.9% availability if today you're running at 70%. We've got to find realistic stepping stones to move you along the way.
So I think that's a lot of what we learned. Another piece we've learned is just more things we can do on the engineering side to make it easy: integrating with alerting and monitoring, integrating with logging, doing a bunch of the analysis on behalf of the customer so they don't have to do it on their own.
So that's what really allows us to take a team that maybe isn't an expert in this. They can install Gremlin. They can go run a set of tests. They can see what's failing. They can go work on those and fix those, get them passing, get them automated. And by spending an hour a week or an hour a month, they can actually prevent hours and hours of time dealing with incidents down the road.
Jose
And it's interesting, you sharing the process, that you see this practice being more successful not as a centralized practice within an organization, but as something that is spread throughout the organization, within the teams, right?
It feels like it's a pattern that has repeated itself a few times, right? I've seen it with test automation as well.
I worked quite a bit with DevOps before as well. There was a little bit of the same thing, right? You could be doing "DevOps," in quotes, in a centralized way, but you're not really taking full advantage of it. It's when you get the teams involved and working on the automation and the practices themselves that you really reap the benefits, right?
Kolton
Yeah, I think that's a perfect analogy. And we've seen this many times. And I've seen this at Amazon. I've seen it at Netflix. I've seen it debated.
Netflix had a core SRE team, and they wanted to do the embedded model for years, where they would embed directly with teams that were struggling.
The truth was the best teams didn't need that, because they knew their software well, and they partnered with that team. We worked closely with that team, but we never embedded that team in ours, because it was important for us to have that skill set and to be able to develop it.
Now, when you talk about maybe your second-tier or third-tier services, or batch jobs, things that maybe aren't in the line of fire, that aren't as important, I think that model makes more sense.
But as an engineer at heart, I guess I'll take a hard line here. Like if you're the developer writing the code, you should know how the code gets run. You should know how the code gets deployed. You should know how the code fails. You should know how the code scales. I think as a craftsman, that's an important part of our skill.
Jose
If we could make it maybe a little bit more concrete, right? So I don't know how concrete we can make it, but let's say that I have an e-commerce platform, and I'm just about to kind of start getting ready for my kind of holiday sales, right?
And it's good. We still have quite some months until we get there, right? And I reach out to Gremlin. What would that look like? Take me through a little bit of the journey that you would take my company through.
Kolton
Yeah, great example too, because you've started preparing for Black Friday now and you're kind of done by October. And if you're starting in November, it's too late.
Yeah, so we go through a sales process where we meet with the customer, we help them install it, we help them understand the types of tests and what they can run.
But I think what you're really asking about is, when we onboard people, we really want them to go model their services in Gremlin, baseline those services, and run those tests.
And so typically, you know, one thing I see is a lot of customers, they want to start with some small out-of-the-way application, which is usually a mistake because if you find something there, no one cares. So you need to pick a service someone will care about so that when you find and fix things, people will be happy that you're making progress.
So we typically start in a dev or a staging environment. One of the things I like to say is if you think about all the failures that could happen as a pie chart, like a third of that pie chart lives in production. Can't test for it in staging. Production has, you know, customer traffic, diversity of traffic, load balancer, security groups, you know, all sorts of things that are different. So do all the testing that you can in dev and staging. That's just efficient engineering. But know that you won't be done until you go into production.
And so in staging, what do we want to do? Well, we want to identify a service and we want to go test it. Typically, if you're starting from scratch, what I would say is go test a single host of that service. What happens to that service when a host goes bad or fails? Okay, that worked well. Now go fail a zone in that service. How does that service handle losing, you know, a third of its capacity? What is the side effect there? Then go run it at 100% and understand what happens when the entire service goes down or is experiencing issues. And do that for each test. I'm not saying just do it once: CPU, memory, disk, IO, network, you know, we run through the gamut, we run the tests, we run them at a small scale, and then at the larger scale in staging.
Then we've built the confidence to go run in production, but we reset, we come back down: we want to run on a sub-percent of traffic. We want to run on a single host. We want to run a small experiment that'll teach us something but mitigates the risk. And this is, I think, where people don't listen or miss it. It's like, "I don't want to run in production." Look, we're not asking you to show up day one and take out a data center in production. That's down the road; we'll get to that. In the beginning, what we need to do is take little baby steps. And with each step we take, we build confidence, and then we're able to go a little bit further.
And so we did this. One of the examples at Netflix is we needed to fail our identity service. We had a lot of outages related to our identity service. It was coupled to the service itself, our cookies, the way devices worked. There were multiple paths. It was kind of a pain. And we did a bunch of work. We had a war room for three weeks with every front-end team where we worked through all these tests and all these bugs. We found and fixed all the issues. Then we were ready to go run it in production, and the first production test we ran on 0.1% of Netflix traffic. That went well, then we went up to one percent of traffic: "Oh, we hit an issue. Hit the halt button, stop it, clean it up, pull it off, get everything back to how it was," which takes moments, and then we go analyze it. We find the bug and we fix it. Now we come back. We start at 0.1%. We go to one. Now we make it up to 10. "Oh, you know, we found something else. Now we stop." You know, we roll it back. We go fix the other thing. But eventually we were able to run that up to 100% in production.
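The ramp described here can be sketched as a simple loop that expands the blast radius only while the system stays healthy. The function names below are placeholders for real experiment tooling and monitoring, not Gremlin's or Netflix's APIs.

```python
# Expand blast radius step by step; halt and roll back the moment health degrades.
RAMP = [0.1, 1.0, 10.0, 100.0]  # percent of traffic impacted


def run_experiment(traffic_percent):
    """Placeholder: start the fault injection against a slice of traffic."""
    print(f"Injecting failure on {traffic_percent}% of traffic")


def halt_experiment():
    """Placeholder: stop injection and restore normal behavior immediately."""
    print("Halting experiment and rolling back")


def system_healthy():
    """Placeholder: check error rates and SLO dashboards during the experiment."""
    return True


def ramped_experiment():
    for percent in RAMP:
        run_experiment(percent)
        if not system_healthy():
            halt_experiment()
            return False  # go fix the bug, then restart from the smallest step
        print(f"{percent}% step passed, expanding blast radius")
    return True  # ran cleanly at 100%


if __name__ == "__main__":
    ramped_experiment()
```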
And so that's the bit of advice I have on that journey: small steps that grow in scope allow you to mitigate a lot of the risks and learn different things about your system, the functional side versus the scale side, as you go.
So yeah, what would we do? We'd want to come in. We'd want to identify your critical services. We'd want to baseline them. We'd want to run tests against them. And really, if there's one thing you take away: go test your dependencies. And I'm sure, you know, a lot of folks that have dependencies that run on GCP learned this recently, but, you know, it's like, look, if one of your dependencies goes down and you go down, that's on you. Your customer doesn't know that it was your dependency's fault. Your customer knows you went down. And so you've got to go find those dependencies and make sure you know what happens when they fail.
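A minimal sketch of "fail each dependency and check that you degrade gracefully," written as an ordinary unit test with a mocked dependency; the product-page and inventory names are hypothetical, not from any specific system.

```python
import unittest
from unittest import mock


def render_product_page(inventory_client):
    """Page should still render even if the inventory dependency is down."""
    try:
        stock = inventory_client.get_stock("sku-123")
    except Exception:
        stock = None  # degrade: hide the stock count instead of failing the page
    return {"title": "Widget", "in_stock": stock}


class DependencyFailureTest(unittest.TestCase):
    def test_page_renders_when_inventory_is_down(self):
        broken_client = mock.Mock()
        broken_client.get_stock.side_effect = TimeoutError("inventory unreachable")
        page = render_product_page(broken_client)
        self.assertEqual(page["title"], "Widget")
        self.assertIsNone(page["in_stock"])


if __name__ == "__main__":
    unittest.main()
```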
And if you've done that kind of baselining and first set of tests, then the next step is automate. Just have the teams automate the successful tests and keep them from regressing. And that way, you fall a bit into maintenance mode. If things are running smooth and nothing's regressing and you're not having problems, you can either go back to your regular work or you can go do some exploratory testing to try to get ahead of the trickier outages that are a bit more complicated.
Jose
In this specific use case, and I'm also asking because that's also kind of within our world at Queue-it and with our customers, often we're talking about high traffic and peak traffic, right?
So in this case, from a Gremlin perspective, from a chaos engineering perspective, would you also specifically test the system under load to identify the bottlenecks? Say the use case is a peak scenario: do you then have a specific test suite or a set of tests targeted towards that?
Kolton
So within Gremlin, we have this scenario workflow engine that allows you to add pre- and post-conditions, or in-between steps. Sometimes those are "make sure things are healthy before you proceed." But a lot of those preconditions are things like "kick off a load test." I think you need that in staging, because you need traffic for it to be interesting.
Most failures aren't interesting if the system is flat and there's no traffic and there's nothing going on.
I would say in production, a load test switches from being helpful to hiding things, because customer behavior is going to be so much more diverse than whatever you captured in your packet capture or whatever you have saved in your load test. Even if you have a pretty robust set of load tests, customers are just going to do wild stuff that you don't expect.
And so that's why, you know, you do the load test plus failure test in staging, you do the load test plus failure test in production, but ultimately just running it in production is the real world.
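A minimal sketch of a scenario with pre- and post-conditions around a fault, in the spirit described above: check health, kick off load, inject the fault, then verify health again. The function bodies are placeholders for real load-generation and monitoring tooling, not Gremlin's scenario engine.

```python
import time


def system_healthy():
    return True  # placeholder: query monitoring/alerting


def start_load_test():
    print("Kicking off load test")  # placeholder: trigger a load generator


def inject_fault(duration_seconds):
    print(f"Injecting fault for {duration_seconds}s")
    time.sleep(duration_seconds)


def run_scenario():
    if not system_healthy():
        raise RuntimeError("Precondition failed: system unhealthy, aborting")
    start_load_test()            # failures are only interesting under traffic
    inject_fault(duration_seconds=2)
    if not system_healthy():
        raise RuntimeError("Postcondition failed: investigate before expanding scope")
    print("Scenario passed under load")


if __name__ == "__main__":
    run_scenario()
```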
And I think one of the cruxes to chaos engineering and reliability engineering is we want to get out of the hypotheticals. There's so many layers in our code. Yes, we're testing this library. Yes, we're testing this service within integration. What we care about is production. It's an organic beast. It has multiple moving pieces. It's changing daily as multiple people make changes. Customers are doing wild things, the internets, the routing. There's all sorts of things happening. And so to really be prepared for reality, you've got to go live in reality. You've got to go test in reality.
Jose
It's a very good point. But me with kind of this e-commerce website, right, I don't really have traffic that reflects the amount of traffic that I'll have on Black Friday, right? So how would you help me approach that in production?
Kolton
Fair point. I think you need to do the load testing, you know. I guess what I would say is, load testing is a sister to fault injection and reliability testing, so doing them at the same time makes sense. But load testing has its own purpose, and that's: let's find the scaling boundaries for our services. So we know if they're going to fall over. Are we going to get alerted? Are we going to know when they're having issues? Do we have enough capacity available? And I think that's all still important. You still need to do that.
So as you're building up, you want to do some fault testing, you want to do some load testing, and then you want to do them together, because either one on its own isn't quite going to prepare you for those kinds of worst-case scenarios.
Jose
And Kolton, before we wrap up, we still have a few minutes. I don't think I would be allowed to record this with you without asking you about AI and about how you see AI impacting the reliability work? I would love to get your thoughts on it.
Kolton
Absolutely. Well, I think I'll answer the question from two sides. One side is how I view it as a technologist, and one side is how I view it as a person that cares deeply about reliability and reliability tooling.
So on the technologist side, as an operations and SRE kind of person at heart, I'm dubious that we're going to be allowing agents to make production changes anytime in the near future. And it's not really a question of whether it's technologically capable. Sure, maybe it is. But there's a people aspect, just as we were talking about with accountability.
You know, there is an old quote I love from IBM that says, to the effect of, "We're not going to allow the computer to make a decision because we can't hold the computer accountable." And I think that applies today as much as it did back then. Because, and this is a sad truth, if the AI makes a bad decision, takes down production for eight hours, and causes the company $5 million in damages, somebody's going to want to have a discussion about who made those decisions, who allowed that to come to pass. At the end of the day, you know, maybe you can fire the AI, maybe you can hold it accountable, maybe it will learn, but really you want people involved, and you want people that are able to oversee and ensure the right things are happening.
The other side of AI, especially the LLM version of AI, is that it's non-deterministic. It doesn't always give you the same answer. We're talking about distributed system correctness, the holy grail of all computer science problems. And we want to just add some non-determinism into it. I think that's a little scary. And I think we need, you know, correctness and accuracy when it comes to these types of things. So I'm excited, but I think we got a little ways to go there.
From the Gremlin perspective, you know, our goal has always been make it easy to do the right thing. Help people get value quickly, get to the answer quickly.
And this is where I think AI does a great job because it's augmenting engineers with additional intelligence and capabilities. And so that's part of what we did. The transition from chaos engineering to reliability management was really the transition from: you do all the homework about whether your test passed or failed and your service did the right thing, to Gremlin does a bunch of the homework about whether your test passed or failed and whether your service did the right thing.
But in this world, the engineer still has to go fix it. Oh, it failed. And we can tell you it failed, but you've got to go figure out how to fix it and where the fix needs to go and how to approach it. And the one thing I love about AI, it's been trained on my docs. It's been trained on all my content from the last 10 years. It's actually pretty good in general at understanding the right things to do when it comes to reliability or chaos engineering, especially if you throw in the right keywords.
And so one of the things we're building right now, we're going to have it done this summer, is a way to analyze your tests. So when you run a test, whether it passed or failed, we'll analyze it. We'll tell you why it passed or why it failed. And if it failed, we'll provide you with a set of suggestions, recommendations of how to go fix it.
Now, we're not going to do it for people, yet, for the reasons I just said. But, you know, this is where, if you have an average engineer who hasn't spent a lot of time on call or dealing with these problems, we can run the test, we can halt it safely when it doesn't work, and we can tell that engineer, here's what went wrong, and here's how to go fix it. And so that's really what I'm excited about: let's make it easy for everyone to get the value and the expertise more quickly and with less risk.
Jose
I love that usage of AI, and it does feel like the right way to approach it at the state that we're in now.
Kolton
We've moved out of the very, very dubious phase. I think a year, year and a half, two years ago, me and my team were like, “No way. It's all snake oil”. We’ve come a long way. We're seeing real value. And so I think that's an exciting time to be a technologist.
But we've also seen some of these bubbles in technology come and go, and that leaves you a little jaded. So you've got to be optimistic on one hand and pessimistic on the other and try to find your path forward.
Jose
Yeah, I do try to be carefully optimistic. So, Kolton, before we wrap up, we just have a few rapid-fire questions. So to you, scalability is?
Kolton
Scalability is elasticity. It's the ability to absorb new work without falling over or to expand or collapse.
Jose
Do you have any kind of resource that could be kind of a book, a podcast, a person that you would recommend for people to read or follow? Could be within this area, could be anything else.
Kolton
To every engineer listening: you should go read Never Split the Difference by Chris Voss so that you understand the sales and go-to-market side of things. I think a lot of engineers think sales and negotiation is win-lose. There are a lot of win-wins out there if you do it right, and it helps you have a lot of understanding and empathy for the sales team.
There's not one technology book I can point to. The resources are just so vast. Nowadays, we're using AI to simplify it and give it to us faster.
Jose
I love the suggestion. Earlier today, I saw that book by the window with one of our colleagues here, and I did read it a couple of years ago. For me as an engineer and technologist, things really clicked when I got some help thinking of sales and that work as looking for win-wins, right? And they are there, often.
Kolton
I think I gave my copy to somebody, because I was looking for it, I was going to grab it, and I don't see it on my bookshelf.
Jose
I wasn’t expecting to go that way, but awesome tip. Thank you.
Is there a particular technology that you're really excited about and you're not allowed to say AI?
Kolton
Yeah, you took the easy answer. So, you know, you got to sit and ponder for a minute.
I'm excited to see where the whole journey around containers and Kubernetes ends. I think we've gone back and forth.
There's this pattern someone pointed out to me, I think when I was in college, that through computing, we've gone from client-server to everything on the server, back to client-server. It's like, do we want to interact here and do the work here, or do we want to do it at the same place?
And I think when containers first came out, it was like, this is great, but these are just VMs. This isn't that interesting. And really, it's about dependency management, how you make the container and how you contain everything.
But as I've watched people manage Kubernetes over the last few years, it's too hard. It's too much. There's so much we're trying to do and so much value we're trying to get, and we're just shooting ourselves in the foot left and right and going through a lot of extra pain to try to get to where we want. So I'm excited to see the container world kind of coalesce down to something a little simpler that's still as valuable.
Jose
Very interesting. I'm looking forward to that as well because I do see the complexity that they can lead to as well.
And last question, do you have one piece of advice that you would give your younger self or an engineer starting right now? I guess it could be read the book, "Never Split the Difference", but do you have another one?
Kolton
Yeah, as a startup founder—and I've sat in the CEO role and I've sat in the CTO role, and I've gone through nine years of this—you get a lot of people that want to show up and give you advice. And a lot of advice you read on the internet.
You go on X or social media and you see lots of "Here's what a great founder does. Here's what this founder did. This is how you should do it. You'd better wake up at 5:30 every morning. You'd better work till..." All sorts of advice.
And I think what I've learned the most is to trust myself. That sounds maybe simple, but it's hard. You want to do a good job. You want to learn from everyone else. You want to take people's advice. And so I've experimented. I've tried a lot of things along the way.
And sometimes I've tried what other people thought was a good idea that I didn't really think was a good idea because I wanted to let them try out their ideas or I wasn't confident. And I don't regret that. I've learned a lot along the way. I think it's good to really be well-rounded in your thoughts.
But I could have saved myself and the company some time and pain if I'd just been willing to say, "Here's where my conviction lies, here's the direction we're going to go in."
Jose
Great advice Kolton, and I think it's a great way to wrap up this episode. Thank you so much for coming by and sharing your experience with us.
Kolton
Yeah this is great. Thank you very much for having me, I loved the discussion.
Jose
And that's it for this episode of the Smooth Scaling Podcast. Thank you so much for listening. If you enjoyed it, consider subscribing and perhaps share it with a friend or colleague. If you want to share any thoughts or comments with us, send them to smoothscaling@queue-it.com. This podcast is researched by Joseph Thwaites, produced by Perseu Mandillo, and brought to you by Queue-it, your virtual waiting room partner. I'm your host, José Quaresma. Until next time, keep it smooth, keep it scalable.
[This transcript was generated using AI and may contain errors.]