How to design for failure AND success: 7 key steps we take
Downtime is poison for most SaaS services. But eliminating all failures in computer systems is impossible. At our SaaS startup, we've used these 7 strategies to design for failure AND success.
The list of failure types that can cause downtime for a SaaS service like Queue-it is seemingly endless. There’s network latency, hardware failures, database errors, programming errors and human mistakes, just to name a few. Eliminating all failures in computer systems is unrealistic, if not impossible.
We are a virtual waiting room SaaS that ensures web performance during heavy online traffic. We exist to protect our customers on their most business-critical days. If our service is down, it becomes useless and could make our customers worse off than if they didn’t use our service. For some customers, the cost of failure reaches millions of dollars and is something they won't forget. Downtime is poison at Queue-it. That’s why we work with the “design for failure” approach to our services.
What does design for failure mean to us?
Designing for failure means that we anticipate system errors and build software that handles them and heals itself. It doesn’t mean we want to fail, but that we acknowledge the reality that failures will happen.
Here are the 7 key steps we take to design for failure AND success.
1. Redundant Components
Every component we host and every service we consume is designed for redundancy and high availability. The software we develop runs on Amazon Web Services (AWS) in multiple data centers around the world. Every component or service is deployed to several regional data centers, so a power outage or network issue in one location cannot become a single point of failure. We only use third-party components that are designed with redundancy and high availability in mind. Redundancy is the foundation of high availability, and something we continuously improve upon.
2. Server Replacement
Server instances in the Amazon Cloud are very stable. Yet, we acknowledge that servers will eventually fail. All components are developed with redundancy at their core. When failures are detected, replacement instances launch automatically and connect to the network. As a result, our services are unaffected and continue to operate.
3. Software Defects and Human Errors
Once the infrastructure is in place, the biggest threats to availability are software defects and human errors. We aim to eliminate such errors by using automated testing extensively and by automating all processes, such as deployment and scaling. We deploy small updates continuously into production and initially release new features to only a subset of users to limit the damage of any errors. When errors do occur, and they will, our software is designed so the service stays operational. The user experience is degraded, but the application as a whole is more resilient.
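As a rough illustration of releasing to a subset of users and degrading gracefully, here is a minimal Python sketch. It is not our production code: the names in_rollout and serve, the hash-based bucketing and the 5% default are all assumptions made for the example.

```python
import hashlib
from typing import Callable


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hashing feature + user ID gives each user a stable bucket from 0 to 99,
    so the same user always gets the same answer for the same feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def serve(
    user_id: str,
    new_handler: Callable[[str], str],
    stable_handler: Callable[[str], str],
    percent: int = 5,  # hypothetical rollout size
) -> str:
    """Try the new code path for a subset of users, fall back on any error."""
    if in_rollout(user_id, "new-feature", percent):
        try:
            return new_handler(user_id)
        except Exception:
            # A defect in the new path degrades the experience for this
            # request but does not take the service down.
            pass
    return stable_handler(user_id)
```

Raising percent step by step widens the release, and setting it to 0 switches the new path off without a deployment.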
4. Safeguards Against Success
Success is another threat to SaaS services, but it’s often ignored. Services become successful so fast that their design cannot handle the flood of users accessing them. They drown in their own success. In fact, many of our customers are using Queue-it because their web applications cannot manage the overwhelming success of their own services. At Queue-it, we not only have to deal with the success of our own service, but also the success of our customers. Any online queue we host might be so successful that it will bring down our service, which means we risk DDoSing our own service.
5. Linear Horizontal Scaling
The key to handling heavy volumes of users is to build elasticity into the software. Our services are designed for linear horizontal scaling. In other words, doubling the resources doubles the capacity. Resources scale automatically when needed, with no manual intervention.
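Under linear scaling, capacity planning reduces to simple arithmetic. The sketch below is illustrative only: the function name instances_needed, the per-instance capacity and the headroom factor are assumed values, not figures from our platform.

```python
import math


def instances_needed(
    requests_per_second: float,
    capacity_per_instance: float,
    headroom: float = 0.25,  # assumed safety margin on top of the load
) -> int:
    """Instances required if each one adds the same fixed capacity."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    target = requests_per_second * (1 + headroom)
    return max(1, math.ceil(target / capacity_per_instance))
```

With no headroom and 100 requests per second per instance, 1,000 requests per second needs 10 instances and 2,000 needs 20. If the system were not linear, the second number would be higher than double the first.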
6. Client Back Off
Sometimes the number of users accessing a queue will spike rapidly. Within a few minutes, the load on the system can rise above what the allocated resources can handle. In these cases, auto-scaling alone will not help, because the load changes faster than we can allocate new resources. If this happens, the client code running in end-users’ browsers will automatically back off from the system and reduce the load on the resources, giving the system a break while the new resources are allocated. Obviously, this results in a reduced user experience. But after a few minutes, the resources will be scaled and ready to serve requests at a normal rate.
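The real client runs in the browser and this post does not describe its algorithm, so the Python sketch below swaps in a common technique, exponential backoff with jitter, purely to illustrate the idea. The names backoff_delay and poll_with_backoff, the convention that None means "server overloaded", and the base and cap values are all assumptions.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Seconds to wait after failed attempt number `attempt` (0-based).

    The upper bound doubles with every attempt up to `cap`. Picking a random
    point below it spreads clients out so they do not all retry together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def poll_with_backoff(
    fetch: Callable[[], Optional[str]],
    max_attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it returns a response, waiting longer each time."""
    for attempt in range(max_attempts):
        response = fetch()
        if response is not None:  # assumption: None signals an overloaded server
            return response
        sleep(backoff_delay(attempt))
    return None
```

Each client that backs off this way sends fewer requests per minute, which is what buys the system time while new resources come online.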
7. Noisy Neighbor
You all know about the upstairs neighbor who plays garage punk on his electric guitar at 3 AM. SaaS services have the same problem. Customers share physical resources, and a high volume of users in one queue can cause issues in other queues. Resource sharing is required to have a solid business case in SaaS services, but we do protect customers from each other. Some services put hard limits on the amount of resources a customer can use. But we opted for a different approach at Queue-it. We partition customers across our hardware so that if we run low on resources because of a specific queue, it will only affect a subset of customers. Finally, if we have a high-risk queue, we run it on dedicated hardware.
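One simple way to picture this kind of partitioning is a deterministic mapping from customer to a shared partition, with an override for high-risk queues. The sketch below is a hypothetical illustration: the function assign_partition, the hash-based placement and the partition names are assumptions, not a description of how our placement actually works.

```python
import hashlib
from typing import Dict, Optional


def assign_partition(
    customer_id: str,
    queue_id: str,
    partition_count: int,
    dedicated_queues: Optional[Dict[str, str]] = None,
) -> str:
    """Pick the hardware partition that should host this queue.

    High-risk queues listed in `dedicated_queues` get their own hardware.
    Every other queue lands on the shared partition chosen by hashing the
    customer ID, so one customer's load only touches one partition.
    """
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    if dedicated_queues and queue_id in dedicated_queues:
        return dedicated_queues[queue_id]
    digest = hashlib.sha256(customer_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % partition_count
    return f"shared-{index}"
```

With four shared partitions, a runaway queue can affect at most the customers on its own partition, and a queue flagged as high-risk never shares hardware at all.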
We take availability seriously at Queue-it by designing for failure – and for success. Our availability in the past year (July 2018 – July 2019) was 100%. We’re proud of that. But we’ve had downtime in the past and know the possibility of failure never goes away. That’s why we continue to refine our design and refactor our code, always bearing in mind to design for failure – and success.
Written by: Martin Larsen, Queue-it's Director of Product, MScIT
(This post has been updated since it was originally written in 2015.)