In this episode, José Quaresma sits down with two Queue-it engineers — Zaigham Sarfaraz, Engineering Manager, and Šimon Bučko, Senior Software Engineer — to talk autoscaling in production. They cover the fundamentals of horizontal and vertical scaling, why stateless architecture matters for scaling out, and what happens when the metrics you're scaling on don't match your actual bottleneck. The conversation gets real when Zaigham shares a war story of autoscaling failing during an iPhone launch — one million users in one second — and how that experience reshaped how the team thinks about pre-scaling for extreme traffic. Šimon challenges the temptation to rely on default configurations and explains why the days you most need autoscaling to work are exactly the days it might not.
Šimon Bučko is a Senior Software Engineer at Queue-it, working across full-stack development. He is an AWS Certified Solutions Architect Professional with strong experience in software architecture and bridging the gap between business needs and technical execution.
Zaigham Sarfaraz is an Engineering Manager at Queue-it with over 15 years of experience across frontend, backend, infrastructure, and people leadership. He is an AWS Certified Cloud Practitioner and plays a key role in ensuring stable system operations while contributing to the continuous improvement of Queue-it's backend architecture.
Episode transcript (auto-generated):
José
Hello and welcome to the Smooth Scaling Podcast, where we speak with industry experts to uncover how to design, build, and run scalable and resilient systems. I'm your host, José Quaresma, and today I had the pleasure of talking to Šimon Bučko, a senior software engineer, and Zaigham Sarfaraz, an engineering manager, both here at Queue-it. We went all in on autoscaling in production, when it works and when it doesn't, and we ended up with a pretty detailed conversation about the qualities, challenges, and trade-offs that need to be top of mind when making infrastructure decisions. If you like this episode, please subscribe and leave a review. It really helps the podcast. Enjoy. All right. Zaigham, Šimon, welcome to the podcast.
Zaigham
Thank you.
Šimon
Thank you.
José
I would actually like to start straight on, right? We're talking about autoscaling in production, when it works and when it doesn't. So maybe let's start from the basics, and I'll start with you, Zaigham. Can you tell us a little bit about autoscaling? When we talk about autoscaling, what's actually happening under the hood?
Zaigham
Yeah, sure. So when we say autoscaling, especially in the context of cloud computing, we're actually referring to the ability, or flexibility, of our cloud infrastructure to shrink and grow with the changing load on the system. What happens under the hood is that the cloud infrastructure monitors our servers on several different metrics. These metrics could be CPU usage, the number of network requests, or memory utilization. It constantly monitors the servers that are hosting our applications, and based on the autoscaling conditions you have configured, if some of those metrics increase because of changing traffic, the autoscaling condition kicks in, meaning it adds more servers alongside your existing servers to serve the increased load you're getting. It's also important to note that the load balancer plays a very important role here, because the load balancer distributes the traffic evenly across all the old and new servers added by the autoscaling condition, so that a single server doesn't become the bottleneck receiving all the traffic. And then when the load goes down, the autoscaling condition is no longer relevant, so the cloud infrastructure takes back the servers it had previously added, and you no longer pay for capacity you don't need. That's what happens under the hood.
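To make this concrete, here is a minimal sketch of the kind of CPU-based autoscaling condition Zaigham describes, expressed as an AWS target-tracking policy with boto3. The group name, target value, and warmup time are illustrative assumptions, not configuration from the episode:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical Auto Scaling group name; substitute your own.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        # Add or remove instances to keep average CPU near 50%.
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
    # Give new instances time to boot before their metrics count.
    EstimatedInstanceWarmup=180,
)
```

A policy like this covers both directions Zaigham mentions: it adds servers when the metric climbs and hands capacity back when the load drops.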
José
So I think often when we talk about autoscaling, we're more focused on how fast we're adding capacity to our system. But I guess we need to remember there's also the other side: you can also remove capacity, because otherwise, if it's not needed, you're just... Exactly. It's an added cost that you don't need.
Zaigham
Exactly. So autoscaling makes sure that you have the capacity you need at the moment, and when you don't need that capacity, you no longer have it.
José
Šimon, do you have anything to add here? It was a pretty complete answer, but anything?
Šimon
Only one small detail I would add on top of that: people usually talk about how fast you can scale, how fast you can meet the demand where it's needed. But it's also important not to downscale too fast, right? Because if you downscale too fast, you're risking that suddenly you don't have enough resources to handle the load. So it's very important that people have this in mind when testing their systems: not only how fast they can scale up, but how slowly they should scale back down to the baseline.
José
And I think also on autoscaling, we often talk about horizontal scaling and vertical scaling. I sometimes get a bit confused; I need to think about it and usually picture it as horizontal or vertical. Can you lay it out for us? What is the definition?
Zaigham
I can talk about it. I can define what we actually mean by vertical scaling and horizontal scaling. Vertical scaling means making your existing machines more powerful, by adding more CPUs, more memory, or more disk, so that your existing machines can handle more load or more traffic without adding any more servers. Horizontal scaling means not touching the configuration of your existing machines, but adding more servers or machines, up to an allowed number, because there's always a limit on how far you can extend with horizontal scaling, so that going forward you have more machines to handle the load on your system. It's also very important to note that when we say autoscaling, we generally mean horizontal scaling. Vertical scaling, for the most part, is a manual process; you can automate it to some extent, but autoscaling usually means horizontal scaling.
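As a rough illustration of the distinction, horizontal scaling changes how many machines run, while vertical scaling changes how big one machine is. A sketch with boto3, where the group name, instance id, and instance type are all hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

# Horizontal: ask the group for more machines of the same size.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="web-tier-asg",  # hypothetical group
    DesiredCapacity=8,
)

# Vertical: make one existing machine bigger. This is the mostly
# manual process Zaigham mentions; the instance must stop first.
instance = "i-0123456789abcdef0"  # hypothetical instance id
ec2.stop_instances(InstanceIds=[instance])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance])
ec2.modify_instance_attribute(
    InstanceId=instance,
    InstanceType={"Value": "m5.2xlarge"},
)
ec2.start_instances(InstanceIds=[instance])
```

The stop-and-start in the vertical path is part of why it is hard to automate safely: resizing an instance implies downtime for that instance.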
José
Okay. And it's also, I think, the most common thing that we're seeing these days, also in the world of containers and all that. Šimon, can you tell us a little bit, are there any scenarios where vertical scaling is better?
Šimon
I think scenarios where your team is small, or you might not have the experts or the capacity to manage all of this horizontal scaling. Because it sounds nice as a concept, but once you actually need to do it, once you need to take care of it, it might become an issue if you're a small team or don't have experts to do so. Also, horizontal scaling usually goes hand in hand with a microservice architecture, which can bring its own problem: if you do it incorrectly, you might end up in what we call distributed spaghetti. You could have handled all of that on one server, which is much easier and cheaper, and suddenly you're managing it across multiple services, right? So in those situations, I still feel that vertical scaling might be the better option.
José
Yeah. And I think we're hearing that these days, right? If you're starting up, maybe don't over-engineer your architecture and go straight for the, I don't know if it's called nirvana, but the full setup with microservices and all that. You maybe don't need all that complexity when you're starting, but at some point it may make sense, and then you can always reevaluate, right?
Šimon
Yeah. Maybe just one thing to add here: I feel like sometimes people don't even know where their limits are, and that can be a reason they go for horizontal scaling too soon, you know? Because I believe vertical scaling can still handle quite a lot of load for you to meet the demand where it is. So I feel like people are maybe not comparing the risk of moving to something new versus how much further they can still go with vertical scaling from where they are right now.
José
But once you want to go into horizontal scaling, there are also some requirements from an architectural perspective, right? Especially when we talk about stateless versus stateful. Can you tell us a bit more about that?
Šimon
Yeah, so just to give a bit of definition for stateful and stateless. A stateless application is an application, or an architecture, where any instance in the system can respond to a request. And here's a very important detail: it's not just HTTP requests, it's any request that goes into the instance. Of course, we might be talking about HTTP calls to your server architecture, but we might also talk about messaging, and how you meet the demand of a long queue you need to process, or a long message bus you need to poll and then do your workload on. So with a stateless application, no matter which instance picks up the message or the HTTP request, it can answer or do its job without extra information. A stateful application, on the other hand, usually needs either previous state from earlier actions, or it might need to store a piece of information in order to answer a later request from the client.
José
And can you help take the listeners through how that impacts the ability to do horizontal or vertical scaling?
Šimon
So maybe to go into a concrete example: you can have sessions for your login, right? You would like to know who is authenticated, who they are, and so on. And if you don't have a proper setup for this kind of use case, because sessions are usually stateful operations in your application, and you go to horizontal scaling without actually thinking about how this might be impacted, you might end up in a situation where your clients start losing their login sessions, because suddenly a request hits another instance that might not have the knowledge about the customer or client who is calling, right? There are some things you could do about it, like sticky sessions, where the load balancer remembers which instance was hit before and keeps sending the traffic in the same direction. But those bring their own big issues later on, for example when you need to update that instance. And in the real world sticky sessions are usually risky, because you might have a customer that is much bigger than the others, and suddenly you're loading one instance much more than the others. So in that sense, by going directly to horizontal scaling without solving the stateful parts, you could introduce issues into your system without knowing it.
José
Probably. And I like this example. How would you solve that? In this example with authentication, how would you solve it to enable horizontal scaling?
Šimon
I think a pretty common pattern is to use an external session storage, like, I don't know, Redis or something similar. There are existing tools, because it's a pretty common issue people face, so I wouldn't try to reinvent the wheel: use whatever is out there, and take the state out of your application so it becomes stateless. Then, when it needs to answer a certain request and check the session, it can ask for the session from this external resource.
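A minimal sketch of the external-session pattern Šimon describes, using the redis-py client. The host name, key format, and TTL are illustrative assumptions; the point is only that session state lives outside any single instance:

```python
import json
import uuid

import redis

# Every instance talks to the same session store.
store = redis.Redis(host="sessions.example.internal", port=6379)  # hypothetical host

SESSION_TTL_SECONDS = 1800  # arbitrary 30-minute expiry


def create_session(user_id: str) -> str:
    """Store session state externally and hand the client a token."""
    session_id = uuid.uuid4().hex
    store.setex(
        f"session:{session_id}",
        SESSION_TTL_SECONDS,
        json.dumps({"user_id": user_id}),
    )
    return session_id


def get_session(session_id: str) -> dict | None:
    """Any instance, old or freshly autoscaled, can resolve the session."""
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```

With the state externalized, the application itself is stateless, so the load balancer can route each request to any instance and sticky sessions are no longer needed.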
José
Maybe something that was a bit counterintuitive for me when first talking about this concept: just because it's stateless doesn't mean the whole interaction is stateless. It means the instance we're talking about can pick up the state somewhere else, right? And that means you can run as many instances as you want without that issue, because they can pick up the state from a common place like Redis, as you suggested. Nice. Thank you. And I think both of you have already mentioned load balancers and their importance here, especially in horizontal scaling. First of all, can you tell us a little bit about why load balancers are so important in that case?
Šimon
It is very important because you do not want to sit and manually connect every newly started instance to the load balancer, right? The load balancer is able to discover what is in the cluster via health checks and the other tools it has in its tool set, let's say it that way. So the load balancer can detect when there are more instances and distribute the load to the new ones. There are also different algorithms you can use in the load balancer to distribute the load, for example toward bigger instances, or away from an instance that is already heavily utilized, switching to another one. So it plays quite a critical role in knowing what is there and how to distribute the traffic between those instances.
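On AWS, the health-check mechanism Šimon refers to is typically configured on the load balancer's target group. A sketch with boto3, where the name, path, port, and VPC id are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

# The load balancer polls each registered instance at this path
# and only routes traffic to targets that pass the checks.
elbv2.create_target_group(
    Name="web-tier-targets",           # hypothetical name
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",     # hypothetical VPC
    HealthCheckPath="/healthz",        # assumed health endpoint
    HealthCheckIntervalSeconds=15,
    HealthyThresholdCount=2,           # passes twice: in service
    UnhealthyThresholdCount=2,         # fails twice: drained
)
```

When an Auto Scaling group is attached to a target group like this, newly launched instances are registered automatically, which is exactly the no-manual-connecting behavior described above.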
José
Nice. And have you seen cases, or do you know of cases, where it's actually the load balancer that becomes the bottleneck?
Šimon
Yeah, we have seen these issues before. And it's something AWS is aware of, so you can sometimes even ask for the load balancer to be warmed up if you expect huge traffic. It is pretty performant, I would say; most companies and people don't even need to worry about it. But in certain cases the traffic can go well beyond what AWS can handle on its normal scaling schedule, and then you might need to ask AWS to pre-warm it or do something about it.
José
Okay, nice. And still within autoscaling: Zaigham, you hinted at this when you defined autoscaling for us, you talked about which metrics can be used, right? When I think about autoscaling, I mostly default to CPU and memory for the instances that are running. Šimon, is that what you see as the most useful metrics? Is CPU and memory enough, or are there others that are a better signal for the load?
Šimon
I'm usually not a big fan of calling anything the best solution, because in my opinion, especially in engineering, there is no such thing as a silver bullet. You usually need to think about your context, your use case. Of course, as you said, the rule of thumb is that you might use CPU, maybe memory, or maybe how many requests are coming in, right? But let's say your application is memory-heavy, that it actually needs to load a lot of stuff into memory. In that case, if you're too slow to scale based on your memory, your CPU might not spike much, but the memory will hit its limits even before the CPU does, which can cause you an issue, right? So yes, there are rules of thumb, for example for HTTP traffic or for processing a queue, but it's very important to test how it works for your workload, and to decide in which cases you want to be more sensitive to CPU, to traffic, and so on.
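To make the memory example concrete: memory is not a built-in EC2 metric, so it has to be published first, for example by the CloudWatch agent, which reports mem_used_percent under the CWAgent namespace by default. A sketch of scaling on it with boto3, where the group name, threshold, and adjustment are illustrative assumptions:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# A simple scale-out policy: add two instances when triggered.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="worker-asg",  # hypothetical group
    PolicyName="scale-out-on-memory",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
)

# Fire before memory is exhausted: a memory-heavy app can hit its
# RAM limit while the CPU barely moves, as described above.
cloudwatch.put_metric_alarm(
    AlarmName="high-memory",
    Namespace="CWAgent",
    MetricName="mem_used_percent",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```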
José
Yeah, of course. And maybe there's the other side too: if you have an application or a component that always tries to keep as much as possible in memory, memory will be almost full on every instance regardless of load, so it's a metric you cannot use either, right? So it cuts both ways. And you talked about silver bullets, so I'd like to piggyback on that and talk about serverless architecture, because I think it was presented as maybe the silver bullet for autoscaling, right? If we have AWS Lambda or Azure Functions, we don't need to worry about autoscaling anymore, because it's just there, magically. Is it really a silver bullet, or are there still things we need to be concerned about if we're using them?
Šimon
And are we talking specifically about those functions that clouds are providing us or serverless in general?
José
Serverless in general, yeah.
Šimon
Well, I think it's important to first make this division, because there are serverless ways to deploy your long-running applications, like your server APIs, and then there's this other part where you deploy applications as functions. With serverless applications, you can, quote unquote, go really high with the numbers; the sky is the limit, in a way. You could add as many servers as you wish, as long as there's someone who pays for it, right? But with serverless functions, you usually get limits that your cloud provider allows you to scale up to. That might be sufficient for most use cases, but you may well find use cases where it's not enough, and then you hit a very hard limit. So when your function gets throttled, which means it's not able to respond to more requests right now, you might need to rethink the solution: hey, my Lambda is now the limit, right? Maybe I need to build this on top of some long-running task instead. And a very important thing that I find missing when I look at those LinkedIn posts saying, hey, just use Lambda, it's so easy: most of the time you're locking yourself in to a specific vendor. Of course there are similarities between the functions across clouds, but they usually have their own specifics, rules you need to follow, so you might lock yourself in to, for example, Lambda, and it might be very hard for you to then migrate to something else, or it might be very expensive for you later on.
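One concrete knob behind those hard limits on AWS: a Lambda function's concurrency can be reserved, which also caps it, and callers see throttling errors once the cap is hit. A sketch, with the function name and cap as hypothetical values:

```python
import boto3
from botocore.exceptions import ClientError

lam = boto3.client("lambda")

# Cap this function at 100 concurrent executions; beyond that,
# synchronous invocations are throttled rather than served.
lam.put_function_concurrency(
    FunctionName="checkout-handler",  # hypothetical function
    ReservedConcurrentExecutions=100,
)

try:
    lam.invoke(FunctionName="checkout-handler", Payload=b"{}")
except ClientError as err:
    if err.response["Error"]["Code"] == "TooManyRequestsException":
        # The "my Lambda is now the limit" moment: retry with
        # backoff, queue the work, or rethink the design.
        pass
```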
José
But does it really remove the autoscaling issues? In the end, it's still physics, right? There's still something behind it supporting it. One option, of course, is that the cloud provider you're using is so heavily over-provisioned that the autoscaling looks magical and you never see the cost, which I don't think is the case. Otherwise, there's still some compute behind it that needs to scale with the increasing demand. I remember back in the day we talked about warming up functions to prevent cold starts and the issues around that. Is that still the case these days? Do you still see that?
Šimon
You mean the issue if we still need to pre-warm the functions?
José
Yes, or that it's not a silver bullet for auto-scaling.
Šimon
Yeah, I would still stick to the point that it's not a silver bullet. You might have a cold start, but that's the same as when you add a new instance to your cluster, so in that sense there's not a big difference. The one huge advantage I see is that if you're just starting out and don't yet know how your load will behave, it might be a good option to start with. Then, as you learn more about the patterns of how your system behaves, you can either stick with it or transition to something else, if you hit the limits of your cloud provider.
Zaigham
And I would like to add something on top of that. We have seen in the past that completely relying on these serverless offerings, Azure Functions or Lambda functions or DynamoDB or any serverless tool that is supposed to be guaranteed to scale, can backfire. With increasing demand, we hit the thresholds, and it backfired. So it's always important to learn your own traffic patterns and set up your infrastructure accordingly.
José
And Zaigham, talking about setting up the infrastructure accordingly: with autoscaling, and you already hinted at this in the beginning, there's a balance that has a huge impact on cost, right? You want to be provisioned just enough to handle all the requests, but not so over-provisioned that you're overpaying for the infrastructure. How do you see that balance?
Zaigham
Yeah, when you talk about cost, it's a very hot topic in cloud computing, and in autoscaling too. Technically, and by definition, autoscaling should have a positive impact on cost, because by definition it gives you the capacity you need at the moment and takes it away when you don't need it. So you're only paying for what you need. Imagine you're not using autoscaling: what would you be doing instead? You'd be paying for peak capacity all the time, plus the manual process of scaling down when you don't need the capacity. So you would definitely pay more. But there's a big "but": you need to set up your autoscaling policy smartly, by understanding your traffic patterns. If you configure your autoscaling condition poorly, you can have a negative impact on cost. For example, if you set your autoscaling condition too aggressively when it doesn't need to be, you'll spin up many servers you didn't need and ultimately pay more, not less. On the other hand, if you autoscale too slowly, there's a possibility of downtime: you actually needed to scale more aggressively, but you were slow, it took time to spin up more servers, and meanwhile your system was down. So the best course of action is to understand your traffic pattern and set up your autoscaling conditions smartly; then there's definitely a cost benefit.
José
And do you have any examples of those extremes you talked about, of, I guess you could say, autoscaling going wrong one way or the other?
Zaigham
Yeah, I can recall one example from around three years ago, when we had a big iPhone launch, with many of our customers using our product to sell iPhones. An iPhone launch, or something similar like a very famous artist's ticket sale, is the kind of event where the whole world comes in and tries to buy, both legitimate traffic and bot traffic. What we did wrong at that time is that we did some prep work in advance, we did do vertical scaling, but for the most part we were relying on autoscaling. And I would also say we did not set up our autoscaling condition very aggressively; we were relying mostly on autoscaling to take care of all the traffic that came in. That backfired, because the traffic pattern was such that at one moment there was zero traffic, and the next second there were a million people in the queue. The autoscaling condition did kick in, and it did add more servers, but those servers also need time to be ready to serve the new load. It took around three to four minutes, and those three to four minutes were too much. The impact was that end users in the queue saw error messages, and we could not serve them. We learned a lot of lessons from it: that we need to understand our traffic more, and that we need to be a bit more aggressive if we want to handle uncertain spikes in the traffic.
José
Yes, and I think you're hinting at the fact that there are a few very special events throughout the year. We run hundreds and hundreds of events every hour, every day, all year, but there's a tiny number of global-scale events where we know autoscaling alone would not be enough, and in those cases we do some pre-scaling. Can you tell us a little bit about that?
Zaigham
Yeah, I'm going to talk about this pre-scaling process. But before that, I want to share an interesting story of my own. When I was getting hired at Queue-it and going through the process, I was introduced to how we act like an insurance company for our big customers, who trust us on their most important days, and how we work. And I was really wondering: why do these customers need to use the Queue-it product? Why don't they simply use cloud technology and autoscaling and all the things we mentioned to make sure their websites don't crash? Because autoscaling is supposed to guarantee that your website will not crash and you keep on serving. But then I got the answer right away with this pre-scaling term. Autoscaling is only a good tool for traffic with linear spikes, where traffic shrinks and grows in a linear pattern. We serve customers whose traffic patterns are never linear: one second you have zero requests, the next second you have a million requests, and the second after that you have zero again. No autoscaling can handle that sort of traffic, and this is where we come in. Our infrastructure is already warmed up enough, because we have all those big customers on our plate, to handle any business-as-usual traffic spikes. But we also have our limits, right? We cannot serve the whole world, and there are situations when the whole world wants to buy one thing. That is where we have a process called prescale. It means we have contracts with our customers, with defined thresholds and limits we have communicated to them, and when customers expect traffic beyond that threshold, they are supposed to inform us in advance that they are expecting traffic at this time of day, and so on. Then we do our prep work. That prep work is called prescaling, which is a combination of vertical scaling and a little bit of reliance on autoscaling as well: we make our machines bigger for the expected spikes in the traffic, and for the smaller part we also rely on autoscaling.
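Part of that pre-scaling can be expressed as scheduled capacity: when a customer announces a sale window, the capacity floor is raised ahead of time instead of reacting to the spike. A rough sketch of that one piece on AWS, with the group name, times, and sizes invented for illustration; the actual process described here also involves vertical scaling and contractual thresholds:

```python
import datetime

import boto3

autoscaling = boto3.client("autoscaling")

# Raise the capacity floor before the announced on-sale moment; a
# zero-to-a-million spike leaves reactive autoscaling no time to act.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-tier-asg",   # hypothetical group
    ScheduledActionName="launch-prescale",
    StartTime=datetime.datetime(2025, 9, 12, 6, 30, tzinfo=datetime.timezone.utc),
    MinSize=40,        # capacity exists before the spike arrives
    MaxSize=120,       # autoscaling still covers the smaller part
    DesiredCapacity=40,
)
```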
José
And that actually reminds me of a quote from Martin Jensen, who you both know well, from a few episodes ago: autoscaling is necessary but not sufficient. Do you agree? It sounds like you agree, Zaigham. What do you say, Šimon? Yeah, that's good. And thank you, Zaigham, for sharing your personal experience, because that was also a bit my experience before joining Queue-it: do people really need this? Why don't they just re-architect their system, or just do autoscaling? But when you start looking at the sharp requirements, the sharp traffic peaks, and the need to control the online traffic into their infrastructure, you start saying, hmm, okay, that makes sense. I saw the use case, so much so that I decided to join, right? And yeah, Šimon, what was your experience?
Šimon
Yeah, I just wanted to add to what you said, because I fully agree. There's also the opportunity cost, right? The time you spend trying to solve this by yourself, you could spend actually focusing on your business and what you want to serve to your customers. So it's just a small add-on.
José
And for our listeners, who are most likely interested in autoscaling, is there something you would like to share with them, something they might be underestimating or overlooking about autoscaling? Šimon, is there anything you would share?
Šimon
I'm not sure if it's a term or I just made it up.
José
It's okay. It's allowed. We're allowed to make the terms up.
Šimon
I feel like it's this fallacy of the rule of thumb. And of course I'll mention AI, in this age: AI can write you some very good defaults for your autoscaling, these rules of thumb coming from your cloud providers or learned from patterns it has seen before. And yes, it might work for you in most cases, but it might not work in the patterns and situations where you really need it, right? So the mistake I actually see is using things because someone else used them, without really understanding them and without ever testing how the application will perform on the days you really need it to perform.
José
And is that where load testing and chaos engineering and some of those other practices come in? Would they be helpful in those cases, helping you see how your system behaves under different circumstances? Of course, having it running in production is also a way of finding out, but I guess you'd rather do it beforehand, right?
Šimon
It's definitely something you should do. And maybe going back to your previous point: things like this you want to do on a regular basis, right? The architecture isn't static; it evolves with time, with the customers, and with the demand you're getting. If you're not doing this regularly, you might be caught out by some event you run, or caught completely off guard, and then you're like, oh, I actually thought the autoscaling was enough, or, oh, I thought I was fine. People miss that they are evolving their systems, and that evolution brings new parts, for example adding an extra database, or an extra call to some third-party company, and without load testing how much you can handle, you could actually fail on the days you really need it to work.
José
That's a fair point. Thank you. And I think we're getting close to the end. We usually wrap up with a few rapid-fire questions; we try to ask the same three, and I'll split them between the two of you. I'll start with you, Zaigham. Do you have any advice you would give to someone starting out now in the infrastructure engineering space?
Zaigham
Yeah, the only advice I would give them is: don't just hear fancy terms and words and try to implement them in your cloud infrastructure, whether it's serverless or autoscaling or anything else. Understand your business needs first, and then implement whatever suits your infrastructure.
José
Nice. Thank you. And Šimon, do you have any resource, a book, a podcast, a thought leader, that you would recommend?
Šimon
Not specifically a person, but what I really like to read or listen to is how the big players, like Discord or WhatsApp, talk about how they served billions of customers while, in their initial stages, they were just a few people. Those kinds of podcasts and YouTube videos are really interesting, to see how they pushed not only the limits of autoscaling, but then also figured out how to shard their databases, shard their services, and all kinds of other things you might not think about if you're not in this space, or if you just assume your scaling is fine.
José
Thank you. Zaigham, anything you would recommend?
Zaigham
What I would recommend is the AWS certification called AWS Certified Solutions Architect Associate, because it gives you very nitty-gritty detail on what is best and what is optimal. So I would really recommend going for that certification, especially for somebody starting out as an infrastructure engineer.
José
Thank you. And Šimon, last question, for you: scalability is?
Šimon
It's the ability of a system to meet its demand, both when the demand goes up and, as I said, when it goes down.
José
Awesome. Thank you so much for joining the podcast. Thank you very much. And that's it for this episode of the Smooth Scaling Podcast. Thank you so much for listening. If you enjoyed it, consider subscribing, and perhaps share it with a friend or colleague. If you want to share any thoughts or comments with us, send them to smoothscaling@Queue-it.com. This podcast is researched by Joseph Thwaites, produced by Perseu Mandillo, and brought to you by Queue-it, your virtual waiting room partner. I'm your host, José Quaresma. Until next time, keep it smooth, keep it scaled.
[This transcript was auto-generated and may contain errors.]