Chris Nesbitt-Smith has been running Kubernetes in production since version 0.4 – long before Deployments, before managed services, before most of today's tooling existed. In this episode of Smooth Scaling, he sits down with José Quaresma to share what a decade of running Kubernetes for UK government citizen-facing services has taught him about scaling critical infrastructure. The conversation covers why Kubernetes was the least bad option (and largely still is), why relying on autoscaling means you've already lost, and how Gregor Hohpe's "guardrails versus lane assist" metaphor changes the way you think about capacity. Chris makes the case for climbing the service stack – SaaS first, then Functions as a Service, then Platform as a Service, and only reluctantly managed Kubernetes – and explains why tech is one of the only industries that builds critical systems without ever pricing the risk of failure. A direct, opinionated look at what scaling really demands when the stakes are real and the budget isn't infinite.
Chris Nesbitt-Smith is an independent technology strategist, a Kubernetes instructor at LearnKube, and the architect of the UK Government's National Digital Exchange. Based in London, he works at the intersection of policy, security, and modern infrastructure – advising UK and international government departments, multinational enterprises, and large NGOs on cloud-native transformation and DevSecOps. A regular speaker at KubeCon, DevSecCon, and Open Source Summit, his talks span container security, policy-as-versioned-code, and platform engineering. He also blogs at Cloudy with Chance of Freefall.
Episode transcript (auto-generated):
Jose:
Hello and welcome to the Smooth Scaling Podcast, where we speak with industry experts to uncover how to design, build and run scalable and resilient systems. I'm your host, José Quaresma, and today we had the pleasure of chatting to Chris Nesbitt-Smith, who's a technology strategist and consultant on architectural direction for the National Digital Exchange. We had a great conversation with Chris about his work going all the way back to the early days of Kubernetes. What I really appreciated was his approach to designing and developing systems. His recommendation is to focus on the business needs and to stay as high as possible in the service stack. If you like this episode, please subscribe and leave a review. It really helps the podcast. Enjoy.
Welcome, Chris. It's great to have you on the podcast.
Chris:
Thanks for having me.
Jose:
I would actually like to start straight in with Kubernetes. You started working with Kubernetes quite some time ago – I saw that you ran it in production all the way from version 0.4. I can't quite pinpoint when 0.4 came out, but can you tell us when that was, how you ended up there, and what the experience was?
Chris:
Well, it was about a million Kubernetes years ago and about a billion AI years ago.
Jose:
Yeah.
Chris:
So, yeah, we were in a government department, looking at how we could roll out some of the new apps. UK Gov had a bit of a digital transformation around 2013, so I think this would have been 2014 or 2015, something like that – I don't know off the top of my head. We had a load of teams that had built all these amazing applications, and a reasonably fixed virtual machine estate where actually getting compute when you needed it involved bureaucratic processes. So we were looking for different ways of doing that.
We were obviously moving away from people doing lots of Java dev – where you might have ordinarily just gone, "well, we'll run Tomcat," and that would be the equivalent of your Kube cluster, right? Because everyone was doing the new shiny things: some Ruby, some Node, some Python, and other things. So we were looking at how we could fundamentally solve an application deployment challenge and do that in a more ready fashion.
It was very early days for OCI containers – Docker was the predominant one at the time – and very early days for Kubernetes. I think some of the members of the team got contributions into core Kubernetes at the time, feeding into some of those early bits. But yeah, it was a journey. It was, and still is, the least bad option a lot of the time.
It's always one of those things – if you were to go and completely design it afresh now, the whole API and so on would all look somewhat different. Hopefully no one from the project is listening, or will begrudge me saying this, but there are some weird things and warts in there that would make you go, "that doesn't make any sense."
Jose:
It's actually interesting that you mention how you started using it, because I have a similar experience from a previous job. We were working at a bank, and the main driver there for looking into Kubernetes – we used Red Hat OpenShift at the time – was also the bureaucratic process of getting virtual machines just to spin up new test environments. So we ended up having OpenShift running and it was very easy to spin up new dev or test environments.
It was heavy as hell, right? We had Oracle Database and Oracle WebLogic running in containers, and I'm sorry to the Kubernetes gods for that. But hey, it worked. It was some heavy stuff, but it reduced things from weeks – and that wasn't technical dependencies, that was process – down to around one hour, because it did have to load the database and so on.
Chris:
Well, you say you can ask the Kube gods to forgive you, but there are some weird new things that have arisen. Like being able to live-resize a pod in place, because apparently rescheduling a pod is now something people are afraid of doing. And we're going all the way back to those long-running pods that your WebLogic or whatever else would have been on.
We started with a more purist view: no, if you are changing the size and the resource limits, you will need to reschedule it. The department I was in was at one point the largest AWS Spot Instance customer, because we really embraced the virtues of rescheduling – it forced the apps to restart, and that was a good thing. Embrace it, like a built-in chaos monkey: best case, your app's going to run for 48 hours, or however long a Spot lasts, and then it will reschedule.
But yeah, it's weird seeing that go full circle, with the big enterprise customers demanding that pods should be able to resize on demand.
Jose:
I have been, I would say, away from the Kubernetes world for about three or four years now. So it's good to hear some of those updates from you. I guess it's the system and the community adapting to what it's being used for. But thinking about it from back in the day, and from the philosophy you would go into it with – in terms of immutable infrastructure, "we'll just redeploy" or "we'll reschedule it," as you were saying – yeah, it's interesting to see that shift.
Chris:
Yeah, it's a funny one. I guess it might encourage some folk to right-size stuff in the first place, because maybe they can size things down. But we've had vertical pod autoscaling for some time that would look at the actual trends and scale the thing and reschedule it. So I just found it quite peculiar.
I only saw it in the news. I've not used it in anger yet. I don't get my hands dirty with much of it apart from when I do some teaching and training on it. But yeah, it's weird – like, "oh, okay, fine, there's another oddity in the API."
I do like the philosophy of Kubernetes, which is basically that it's not much of a thing in itself. As more stuff moves out of core Kubernetes, it's going more towards just a bag of API standards and schemas, where other things actually provide the implementation. Which is an interesting way they've broken into a market and a space – by leaning in, doing a lot of the heavy lifting, and then waiting for the rest of the community and industry to pick up and fill in the schemas. That's why you'll see some API schemas like Ingress fading away: there are other, more dominant ones that folk are using – like the Gateway API – which are being filled in by the market.
Jose:
Okay. I would actually still like to circle back to the beginning and talk about how you used it in government services. It was a pretty early Kubernetes stage, right? Was it still in alpha the first time you got it running in production? I would like to know what gave you the confidence to run an alpha-stage new technology in a government, citizen-facing production service – because there are usually quite a few demands on that side, and risks and downsides.
Chris:
Yeah, I mean, for a series of citizen-facing applications – so open to the internet – it was both brave and stupid, but it was the better option. And I'd still stand by that. Even running a fresh Kubernetes cluster today is still in the space of the better answer. It's not the best – ideally you'd wish for more – but it's the best thing that we have right now. There are lots of bits that don't quite make sense, but most of the time it works.
It was an interesting bit. At the time, you could have hand-cranked it yourself, just running the containers with a load of shell scripts around them. Kubernetes probably wasn't much more than a load of shell scripts for us really, because it was very simple, with very basic primitives. It was before Deployments were a thing – you had ReplicationControllers, which was about as much as you had as a low-level resource. Very low-level primitives, none of the shiny things. The kids don't know how lucky they've got it these days, right? Yeah, the pain that that was.
However, you can still see the scars of that in some bits of the API where, if you were to design it afresh, you wouldn't do it that way. Things don't quite make sense – mixes of plural and singular wording in API schemas, and all sorts of stuff like that. Just be consistent. But it's one of those bits of legacy that you have to carry, hopefully until maybe there's a 2.0 and we can clean the slate and start again on some of those weird things knocking about.
Jose:
And is there anything you had back then that you miss now, from a Kubernetes perspective?
Chris:
No, no, I don't think so. There's a lot more tooling now to do the things I remember doing by hand. That was long before you had `kubectl exec` or anything like it to go and look into a pod. Nowadays you have ephemeral containers for that, so you can go and do some debugging in that sort of world. Whereas at the time, you were SSH-ing onto one of the nodes, and then connecting through `docker attach` to actually go and look at something. So there are some more mature bits of tooling.
I don't really long for any of that – any of that pain and misery that we had then. I guess the only thing that I'd wish for would be the original vision of truly deterministic compute – where we move away from naming your mail server Bob, and when Bob breaks it's all hands to the pumps because the CEO can't get his email, and then big celebrations for whoever manages to revive it.
The original vision of that immutable and deterministic infrastructure has just moved up an abstraction level – up to the control plane. I challenge and welcome any of the listeners to comment on the number of times you blow away and recreate a cluster from scratch on demand. Because the majority of the time, your production world has gone through a journey of upgrades that your non-production world probably hasn't. You might have taken your non-production world through incremental versions – every patch version, every minor version – while in production you will have done bigger jumps only where needed, where there might be a vulnerability that you're concerned about or anything else like that.
But in any case, production has very likely not gone on the same journey. If you were to actually inspect your etcd, or whatever is providing your cluster state, you'd find all sorts of weird orphaned artifacts knocking about in there. That ultimately gets you back to the problem Kubernetes was trying to solve way back when – giving you deterministic, predictable environments and removing a lot of the developers saying "well, it works on my machine," or "it works in our prod," and then kicking it over the line to an ops team that's looking after the prod cluster.
Which is not much different to now. Often when I go and work with organizations, you'll see the same cultural drift. It doesn't help when DevOps ends up being a job title, or SRE, or anything else like that. Whilst it increased the cognitive load, the original model meant that the developers had a lot more control. So you end up with much more of a "you build it, you run it" team approach, as opposed to "you build it, then you hand it over to ops, they'll run it in live service, they will do the patches" – and then we end up with a fight between an ops team and a build team.
Jose:
But would you see that as more of a limitation of culture and processes, than necessarily a limitation of the technology?
Chris:
The two roll around and orbit each other, right? So yes, Kubernetes has got us a fair distance, but the cultural pushback means you get oddities like Helm, for example – which is a horrendous idea that has nonetheless persisted as a technology. But culturally it kind of fits, because those two haven't managed to decouple. We've not managed to fully move away from the desire for totalitarian control over some engineers – "you'll only have these things you can change and configure" – to what the original ideal would have been. The utopia would have been that, as an engineer, you had a quite rich environment, with very much the same level of depth and control that you get from your local dev machine.
Jose:
But are you then telling us that we shouldn't be using Kubernetes at all? What's your stance? Are there cases where it makes sense? Are there cases where it's better to use something else? What's your thought on it?
Chris:
So it's definitely not my first choice. It's better than some answers, but if I were going into something greenfield, it's not going to be the hammer that I'd reach for to start with. I mean, don't get me wrong, I run Kube on my home cluster that runs all my home automation and stuff, and I have to deal with fixing that regularly.
Fundamentally, it inspires a lot of bike-shedding in an organization. If you're a bank, or an e-commerce website, or whatever your actual line of business might be, you end up having to have an operations team that's looking after the Kubernetes cluster. It hasn't given you the panacea that anyone was chasing.
I'd always look to reach for higher-order services as a business – chase the actual business demand. So look for a SaaS solution first, and then if you can't find any of those, or where you need to produce glue between the SaaS and your business world, then Functions as a Service. Exhaust that long before you then look to Platform as a Service.
And when you get to that level, definitely don't try and run your own Kubernetes cluster on your own hardware. Still reach for the managed ones that the cloud vendors will give you. Turns out they do a much better job than any of your folk can, and that's built into the price. They have a fiduciary responsibility to their shareholders.
So yeah, whilst there are some oddities, and you might find that you don't get all of the new shiny, exciting features immediately, or you find some constraints – actually, on the whole, those constraints are useful. It's the same as going into a restaurant: the menu constrains your choice, and that's a relief. I mean, look at the CNCF landscape and how that's evolved. It used to be that you could recognize the logos because they were big enough. I've not looked at it for a few months, but last time I did, it was mental – just an overwhelming number of tiny, basically two- or three-pixel-sized logos. Which is mad. You are just overwhelmed with choice.
So if something can constrain you – if running a Kubernetes cluster is not your line of business – then cool. If you're actually trying to run an actual business or serve an actual purpose, public sector, whatever you might be, that's not just running the cluster, then don't take all of that responsibility on.
And even the managed Kubernetes clusters will always be a corruption of the ideals of cloud, where the shared responsibility model puts the cloud vendor on the hook for a lot of things. So where you can, avoid taking on responsibility – for doing your own key management, for running your own storage backends, for running a service mesh over the top, and all the other stuff that you can totally do. It's all very interesting if we all stroke our beards and look very clever whilst we mess about with it. But your future self at four in the morning, when it's all gone wrong, will not thank you. And that will happen – expect it. If it doesn't happen within your career or your job, your successor will end up with that gift.
Jose:
And do you see that tendency to stay away from Functions as a Service and lean towards Platform as a Service, or even building and running your own clusters, as people being resistant to giving up control? Do you see engineers leaning towards "do it all yourself because we think maybe we can do it better"? How have you experienced that?
Chris:
I've certainly been guilty of it in the past. As you go through your professional career, you normally get closer to the actual business and further away from the weeds of the actual code, and you can see it from a different perspective. You go, "why would I try to follow some philosophical campaign that doesn't align with the actual business?" It's a bike-shedding exercise that you're investing all of that energy and time in. If you could repurpose those engineers to do something more aligned to your business, that would be infinitely more valuable.
Typically, unless you're one of the OpenAIs of the world with near-infinite money, your main constraint is people – the actual ability to hire people, bring them on board, get them to understand your business, vet them, retain them, and do all the other HR-type stuff as an organization. That's a big investment. And if you can get more actual business value out of that, then chase that. Running low-level bits of compute that you could outsource to someone else is, most of the time – I'd argue there are some edge cases, but most of the time – not where the business value is to be found.
Jose:
I think one component there is also the cost. The cost of running something managed or as-a-service is very visible – you see the bill, you see the number. There's so much hidden cost when you do it yourself, so it's usually not an apples-to-apples comparison. You have a clear number on one side – "oh, it's expensive to run it, then we can do it ourselves" – but then you're often not counting all the hidden costs of people, of having the incident, of having to wake people up at 4am. So I think that's an interesting side of it as well.
Chris:
Yeah. I think one of the main things that frustrates me in tech is that we don't quantify risk. In other parts of the world, you'll put a price tag on stuff. Your insurance company, or you as an employer – fundamentally there'll be a price tag on events, and a likelihood for those events. Maybe someone getting an electric shock whilst in the office, or tripping and falling, and then the consequence of losing a limb or whatever. Whilst you can be quite saddened by the fact that there is a price tag on it, there is one. And that ratchets your premiums.
And it's the same thing of going, "well, I could reduce my premium by putting better locks on the doors, or I could reduce my premium by other means." That provides some motivators that can lead you towards better things – you put leak detection in, or other things that an insurer might offer you: "here's an investment, it will pay off over three or five years; if you're going to be there long term, do it."
We don't typically do any of that in tech at all. As you said, any of those decisions carry a risk. And if you've not got to the point of quantifying that – yeah, you might be fine. Over the five or ten years that your Kubernetes cluster or clusters are going to live and run, it might be fine. But if your CFO were to actually see that risk on the balance sheet, you might find yourself uncomfortable with the amount you're carrying.
I'd say, quite probably, you'll be fine, but your CFO and your shareholders may have a different opinion on the gamble they're willing to take. That's where a lot of value is missed and lost, particularly in tech. And that's not specific to Kubernetes or serverless or cloud or anything – tech choices generally, even in big regulated organizations or the public sector, are typically made emotively, as opposed to with any kind of real science behind the risk evaluation.
And where it does happen, it's definitely not ongoing. You'll do it at a point in time, but very rarely – I don't think I've personally ever seen an actual continual reassessment process – in order to say, "well, we hand-cranked our Kubernetes cluster, and we defined the exit criteria: if this happens, we will stop doing it, because the industry will have caught up."
That was the thing with us – we were doing Kubernetes way back when, long before Amazon had a managed Kubernetes service, which then took a while to become mature enough to migrate and pivot everything over to. But it was very much "if this, then that, we will look to migrate stuff over." If you can do that and bed it in, it also gives you an exit point that can make some of your risk folk a bit more comfortable, because they can understand what it is you're doing.
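The insurance arithmetic Chris describes is just expected value: annualized loss expectancy (likelihood times impact) weighed against what a mitigation costs per year. A minimal sketch of that reasoning – every number here is hypothetical, purely for illustration:

```python
def annualized_loss_expectancy(p_incident_per_year: float, cost_per_incident: float) -> float:
    """Classic risk pricing: expected yearly loss = likelihood x impact."""
    return p_incident_per_year * cost_per_incident

def mitigation_pays_off(ale_before: float, ale_after: float, yearly_cost: float) -> bool:
    """A mitigation (better locks, leak detection, a managed service) is worth
    it when the risk it removes exceeds what it costs per year."""
    return (ale_before - ale_after) > yearly_cost

# Hypothetical: a self-managed cluster with a 20% yearly chance of a major
# outage costing 500k, versus 5% on a managed service costing 60k/year more.
ale_self = annualized_loss_expectancy(0.20, 500_000)     # 100,000/year at risk
ale_managed = annualized_loss_expectancy(0.05, 500_000)  # 25,000/year at risk
print(mitigation_pays_off(ale_self, ale_managed, 60_000))  # removes 75k of risk for 60k
```

Even rough numbers like these are what turn "we could run it ourselves" from an emotive argument into one a CFO can actually weigh on a balance sheet.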
Jose:
I see that. It's a good reminder to keep revisiting your decisions and seeing if they still make sense.
We do love scalability on this podcast, right? So I would actually like to dig a little bit into that, specifically with Kubernetes. I've heard you say at some point that if you're relying on Kubernetes autoscaling, it's already too late. Can you tell us why that's the case?
Chris:
Fundamentally, it's not too dissimilar from a point Gregor Hohpe makes – and he might be attributing someone else: if you've got guardrails, by the time you hit them it's already too late. He uses lots of car metaphors. The guardrails – if you even touch them, it's going to be a really bad day. What you actually need is lane assist.
So, to the point on the autoscaling stuff: by the time you are pushing up against it, you may find yourself in a world where it takes you too long, or there's no available capacity in your cloud provider or your physical on-prem infrastructure, to be able to service your need.
But that all said – I made a game a while ago for a KubeCon thing. The scenario was: it's Black Friday, you're looking after the ops, and you've got to tune the cluster so that it scales up and down over a Black Friday and a Cyber Monday in order to save the most money. The point I was trying to articulate – although I'm not sure it came across – was that it's okay to fail sometimes. Because, again, it's a business problem.
If what you end up with is a self-imposed SLA, with maybe a financial penalty and a dropped connection, then you can make an informed decision. The right answer is not always being able to service every request, because from a business point of view that quite possibly doesn't make sense. It can be overkill to over-provision and over-scale just to know that you're never going to drop something. As engineers we can all feel very clever and proud about that, and we can put on our CVs that we ran a cluster with so many thousand nodes – like, cool – but the business problem might not actually have required that.
The same goes for tuning any quality-of-service settings – that's a business decision, not really a technical one. What is the most important thing to keep running if you run out of the physical compute capacity you can lay your hands on quickly? Recognize that if you're in cloud, then best case – if the compute is available – you're a few minutes away from booting a machine, starting your container on it, and then getting data onto it, or database replication, all of those things. That's time.
The right answer is not necessarily to always be able to service everything. There's a business, risk, and financial choice behind that. So put a price tag on the SLA impact for your platform team: "here are your objectives, quite clearly – not just 'don't drop a packet'; here are your budgets, and here's what we're going to cut from them when you mess these things up."
Jose:
And with your game, was there a right answer? If anyone were playing your game, what do you think was the right strategy there?
Chris:
I mean, looking at the scoreboard of the people that did play – I built it for a company called Appvia, and they were giving away a few Oculus Quests or something at KubeCon, so consequently it got a reasonable amount of attention, as you'd imagine. The winning strategy was a sweet spot which did mean dropping some requests. Looking at how people played it: you can't just turn everything off – that's not the cheapest, that doesn't get you the best score – and it's also not always being able to handle all the requests. There's a line where you go, "this actually makes it the cheapest from a business operational cost of running the thing over your Black Friday – here is the end state where we make the most profit as a business."
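That sweet spot falls out of even a toy cost model: fixed infrastructure spend plus a fine per dropped request. This is not the game's actual scoring – all the numbers below are made up for illustration:

```python
def total_cost(nodes, hourly_demand, node_cost_per_hour, fine_per_drop, reqs_per_node):
    """Infrastructure cost of running `nodes` for the whole period, plus
    SLA fines for every request that capacity couldn't serve."""
    infra = nodes * node_cost_per_hour * len(hourly_demand)
    dropped = sum(max(0, d - nodes * reqs_per_node) for d in hourly_demand)
    return infra + dropped * fine_per_drop

# Hypothetical Black Friday traffic, in requests per hour.
demand = [100, 400, 2000, 5000, 1200, 300]

costs = {n: total_cost(n, demand, node_cost_per_hour=3.0,
                       fine_per_drop=0.125, reqs_per_node=100)
         for n in range(0, 55, 5)}
best = min(costs, key=costs.get)
print(best, costs[0], costs[best], costs[50])  # 20 1125.0 735.0 900.0
```

The shape matches the scoreboard: serving nothing costs 1125 in fines, provisioning for the full peak costs 900 in idle capacity, and the optimum (20 nodes) deliberately drops 3,000 of the 9,000 requests and comes out cheapest at 735.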
Jose:
Yeah. I know I'm extremely biased here, but I was thinking, "oh, I would love to play, but only if I were allowed to put a virtual waiting room in front of it and do a little bit of traffic orchestration there." I have no idea if that was allowed in the game. Probably not.
Chris:
No, I mean, again, it comes back to how you mitigate the thing. If you can offload onto something, or show an error page that has a graceful message on it, or any sort of things you could do to mitigate some of that – this was entirely intended to be a very crude penalty of "yes, you've dropped the request, therefore the business side would like to fine you an SLA fee."
Jose:
I think one thing that comes to mind on my side – and I think you did mention chaos engineering – is that idea of: well, we do some autoscaling, but if we don't make it, can we at the very least have some graceful degradation of the service? Some of those cases – like Netflix showing you an alternative list of shows if the service is not responding – strike a quite good balance between trying to have autoscaling in place and the ability to grow with the requests, and having a good alternative when we cannot.
Chris:
Yeah, well, as long as you don't have a reverse exponential backoff, then you're probably fine. The business side of the degraded approach – where it might not do everything, might not be the best experience, but you're still able to retain your customer, you're still able to engage them, do something with them – that's great. The more that you can do that, the better.
And then cascade your graceful degradation all the way up the stack. So if your database is failing, make sure the middleware doesn't just hammer the database such that it will never recover. Float a message up through the stack so that you can present something to your end user – because most of the time we've still got human beings driving stuff, at least for the minute, this week – or to their agent, whatever that may be. Give them a meaningful thing that is not just "here's a 500 error, I don't know why, I can't find out," but something more engaging. And if you can find inventive ways to retain them as a customer – if they're particularly enchanted – then all the better, right?
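The retry-without-hammering pattern Chris describes is usually exponential backoff with jitter, plus a degraded fallback instead of a bare 500. A minimal sketch – the function names, timings, and fallback text are illustrative, not from the episode:

```python
import random
import time

def call_with_backoff(primary, fallback, max_attempts=4, base_delay=0.05):
    """Try `primary`; on failure, retry with exponential backoff and full
    jitter so a recovering backend isn't hit by all clients in lockstep.
    If every attempt fails, degrade gracefully to `fallback`."""
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception:
            # Sleep somewhere in [0, base * 2^attempt): the jitter spreads
            # retries out instead of synchronizing a thundering herd.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback()

def fetch_recommendations():
    raise RuntimeError("db down")  # stand-in for a failing backend

# Degraded but still engaging: cached popular shows instead of a 500 page.
result = call_with_backoff(fetch_recommendations,
                           lambda: "popular shows (cached)",
                           base_delay=0.001)
print(result)  # popular shows (cached)
```

A "reverse exponential backoff" would shrink the delays and retry ever faster – exactly the middleware-smashing-the-database failure mode the cascade is meant to avoid.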
Jose:
You've done quite a lot of work with government, right? Are there any patterns specifically there that you see make things hard to scale? Any specific constraints that you've seen?
Chris:
Mostly myths. So, UK government – we were one of the first, if not the first, to have a cloud-first policy. For central gov, and highly encouraged for the rest of the public sector, there is a mandate to fully exhaust the option of using public cloud before you do other things, apart from the higher classification tiers. So it's not applicable for Secret and Top Secret, but for anything at Official – which is the vast majority of any normal operations – you should fully exhaust running on public cloud.
There are always a number of organizations that are convinced they're a special snowflake and it doesn't apply to them, or to their weird edge use case – kindly lacking the perspective that every bit of government, like every bit of any organization, is unique, because otherwise it wouldn't exist. Everyone's a special snowflake; everyone's nuanced.
So there are some beliefs around that. Beliefs around – here's the horrible sovereignty word – presumptions that data sovereignty, or digital sovereignty, is interchangeable with data residency. They are not the same. Believing that "we could only possibly host services or data in, say, UK data centres." I've seen other businesses do similar things and argue – as we're seeing play out in some of the courts – that there is very little real reason for doing it. You'll typically find that hosting here is the most expensive and least sustainable option most of the time, and the latency arguments are pretty minimal these days. You're unlikely to notice much lag for the vast majority of traffic crossing the Atlantic. It makes no odds.
But still, we see a lot of organizations and businesses reaching to deploy stuff locally and picking London as a region. I've seen the same in other countries – when I've dealt with companies elsewhere, they'll instinctively pick the nearest region, not really thinking it through or looking at the actual cost comparison and risk impact.
To the scaling point, that fundamentally means you are limited by that choice. It's interesting to see how the cloud vendors reflect this immaturity on the demand side in their own lack of progression: there aren't many truly global services. We have not moved dramatically far from "somebody else's computer" a lot of the time – and Kubernetes, like anything else, is part of this space. You are still typically deploying to a zone that's in a region that's in a cloud vendor, and often to a node that's in a zone. Even if you're using a managed service, you're still aware of zones as a primitive, and still probably aware of nodes of actual compute.
And then you take a provider like Cloudflare, where you deploy to the world.
Jose:
Yep.
Chris:
That's your only real option. You can exclude some regions, potentially, but you are deploying to the world. That's a very different paradigm to work with, a very different perspective to what the other hyperscalers are doing. Frustratingly, they seem to be generally quite quiet about it and haven't found a way to market it as much as I'd wish for them to, and I celebrate them for it. It is a very different paradigm.
But even they (I haven't spoken to them that much about it) seem to have compromised on some of their original philosophies. You had a Wasm container, and now they've fallen back from some of that early purist philosophy: you can now do normal, classic containers, I think they call them. But you were in a world where, if you were doing Wasm containers, you didn't care about the CPU architecture. It would run on ARM or PowerPC or x86 or whatever; not your problem. As an engineer, I want my application to run, and most of the time I shouldn't care about the CPU architecture, as long as it compiles and runs. So just run it in the cheapest, most sustainable, fastest way, somewhere on the planet that's hopefully near my end user. Delegating that entirely to your provider is really fortuitous.
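[Editor's note: as a concrete sketch of that "deploy to the world" model, a minimal Cloudflare Worker is just a fetch handler. There is no region, zone, or node anywhere in the code; the platform runs it at whichever of its locations is closest to the user. This example is illustrative, not from the conversation; `request.cf.colo` is Cloudflare's runtime metadata for the serving data centre.]

```javascript
// Minimal sketch of the "deploy to the world" model: one fetch handler,
// no region/zone/node selection anywhere in the code. The platform decides
// which of its global locations serves each request.
const worker = {
  async fetch(request) {
    // Cloudflare's runtime attaches metadata such as the serving data
    // centre (request.cf.colo); a standard Request object won't have it,
    // hence the fallback. Nothing here pins the code to a region.
    const where = request.cf?.colo ?? "somewhere nearby";
    return new Response(`Hello from ${where}`, { status: 200 });
  },
};
// In a real Worker this object would be the module's default export,
// deployed globally with `wrangler deploy`.
```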
But yes, to your original question on the things that go wrong: a lot of it is the myths, the "oh, we couldn't possibly do that," because something that might have been true at one point is not regularly re-evaluated. It needs a fresh "actually, is this still true?", properly evaluated against business demands and what's right for the organization.
Jose:
I know we're getting close to the end and we do have a few rapid-fire questions to wrap up, but before we get there: in the work on the government side, do you still see a lot of legacy systems there? You were saying that there's a cloud-first policy in the UK government, but do you still see a lot of the legacy systems sticking around?
Chris:
Yeah. So, credit to some of the folk in one of the teams I work with: early last year they published what was called the State of Digital Government Review. The very short, abridged version: it's an interesting read, and there's a lot in it, whether you're a UK government employee, a UK citizen, or otherwise. As a very honest publication, it's really fascinating to see the scale of the problem space of technology.
The very abridged version is that they identified, was it £45 billion, of annual savings that could be realised if we were to transform and address that. So that's the operational, day-to-day run cost of legacy across the UK public sector. And that's obviously money being spent today.
Whilst we do have a cloud-first policy, and I'm so grateful for that existing because it removes a lot of choice and decision, and it's further cemented with language like "services, not servers" (again: look for the higher-order services, don't just look for low-level compute, go for your SaaS, go up the stack as far as you can), whilst all that's great, there is still plenty of legacy. And there will be naive, new legacy too: legacy being built in cloud, or places where we've lifted and shifted and not quite concluded a strangler pattern, because there'll be a data centre exit with contractual or sometimes physical requirements that mean you just need to move it. And then public money, same as in any kind of large enterprise, may not come for the actual rework. That's always the can that's kicked down the road. You don't ever get round to the full rebuild that you really wanted, or that people committed to: "oh yeah, we're going to do this now, and in three months' time we're going to completely re-engineer it and make it better." More often than not, that doesn't come about.
I particularly come back to the thought that there are very few genuinely technically competent people you can get, certainly in the public sector, for the amount of money that the public sector and government will pay. There's very little you can actually do about it; you're resource-constrained more than anything else. And there'll be other things that, quite rightly, are important. It's not typically a shortage of money; it's mostly the human beings. If you had infinite human resource, then yeah, the whole thing would be a lot better and we'd save tons of money.
But the number of people I would personally trust to go and produce a cloud landing zone architecture and run it into an organization is maybe 10 or 12 out of the thousands of people I've ever spoken to, where I'd actually go, "yeah, I'd vouch for them to do a good enough job that might meet the organizational needs and not just create further ongoing technical debt."
Jose:
So getting to the rapid-fire questions now, just three. You don't have to overthink it; you can just share what comes to mind. The first one is maybe the more complex one. Do you have any suggestions, tips, or guidance for someone out there getting ready to start a new platform for critical infrastructure?
Chris:
Focus on the business problem. The higher up the technical stack you can go, offloading as much responsibility as possible onto a vendor or provider whose day job it is to provide the stability, the better. You are inevitably limited by human resource. You don't have the capacity to run the whole stack; don't try to. You won't do a better job. Accept the limitations and constraints that are present, focus on the business problem at the very top of the stack, and go for tried, tested, boring technology beneath it. So probably not AI, most of the time.
Jose:
Awesome, thank you. Is there any book or thought leader or blog or podcast that you recommend?
Chris:
*Acquired* as a podcast: very long-form stories that will ground you in the reasons why business things happen, and in the actual scale of things. They're fantastic, really well-curated stories that they put together.
Gregor Hohpe with *The Software Architect Elevator*, I think it's called; he's got a few books, and they're all fantastic.
Simon Wardley with Wardley mapping. He's a personal friend of mine, he's fantastic, and he will change the way you think about things, particularly some of the technical choices you look at. He's got some fascinating stories, and all the war wounds from the early days of cloud and Infrastructure as a Service behind him, and he can produce a very articulate narrative around when to build versus buy, which, when you're looking at scaling, or at building anything that might be critical national infrastructure, is really important to factor in.
Jose:
Awesome. Thank you. Last question. To you, scalability is...?
Chris:
Resilience.
Jose:
There we go. Short and crisp. Thank you so much for being here, Chris. It was wonderful talking to you.
Chris:
Thank you. Thanks so much. Thanks for having me.
Jose:
And that's it for this episode of the Smooth Scaling Podcast. Thank you so much for listening. If you enjoyed it, consider subscribing and perhaps share it with a friend or colleague. If you want to share any thoughts or comments with us, send them to smoothscaling at queue-it.com. This podcast is researched by Joseph Thwaites, produced by Perseu Mandillo, and brought to you by Queue-it, your virtual waiting room partner. I'm your host, JosΓ© Quaresma. Until next time, keep it smooth, keep it scalable.
[This transcript was auto-generated and may contain errors.]