The cost of downtime: IT outages, brownouts & your bottom line
Downtime affects organizations of all sizes across industries. But just how common is IT downtime? How much does it cost businesses? And how can you prevent it? Drawing on the latest research and reports, this article covers everything you need to know about the cost of downtime and how you can avoid paying it.
What is IT downtime?
IT downtime occurs when a system can’t complete its primary function. It can be broken up into two types: outages and brownouts.
IT brownouts occur when a system is slowed or partially available. With an IT outage, the system is completely unavailable. In the case of Meta’s 2021 IT outage, not only were Meta’s flagship services down, but so too were their internal systems—employees couldn’t even get into offices using their keycards.
Some IT downtime is planned for system maintenance. But most downtime is unplanned, occurring due to high traffic, system failures, or malicious attacks.
How common is IT downtime?
Now that you understand downtime, let’s look at how much of an issue it really is. IT outages and brownouts are more common than you’d think—and they’re on the rise.
LogicMonitor’s survey of enterprise-level IT leaders found that over the past three years:
- 97% of enterprises experienced an IT brownout
- 94% of enterprises experienced an IT outage
- The average number of brownouts for enterprises was 19 per year
- For IT outages, that number was 15
51% of IT leaders say downtime has increased since March 2020. 59% of these leaders suggest the rise of mobile computing is responsible for this increase, and 57% attribute it to the rise of digital transformation.
And it's not just the frequency of downtime that's on the rise. The costs of downtime are rising too. ITIC reported a 2% year-over-year rise in downtime costs in 2021, and predicts that number will continue to grow as organizations rely more on the internet for business-critical processes.
It's clear downtime is no rare occurrence. And as internet use continues to grow, the world’s IT leaders are convinced it’ll become more common still.
But if IT downtime is so common, how bad can it really be? What’s the cost of downtime?
What’s the cost of downtime?
Every number you can find on downtime tells one simple story: avoid it at all costs. For enterprises, the costs of downtime are staggering. And the effects go far beyond revenue lost while servers are down:
- Costs are up to 16x higher for companies that have frequent outages and brownouts compared with companies that have fewer instances of downtime
- 91% of enterprises report downtime costs exceeding $300,000 per hour
- For 44% of enterprises, costs exceed $1 million per hour
- And for 18% of enterprises, downtime costs exceed $5 million per hour
Where do these costs come from? Forrester’s survey of IT directors in large US enterprises gives us a clue. When asked where the cost of downtime comes from:
- 53% said lost revenue
- 47% said lost productivity
- 41% said lost brand equity or trust
LogicMonitor's survey of global IT decision makers revealed similar concerns among IT professionals, finding:
- 53% think their company will experience a brownout or outage so severe that it makes national media headlines
- The same percentage (53%) think their company will experience a brownout or outage so severe that someone loses their job as a result
- 31% say they have experienced brand/reputation damage due to IT brownouts, while 32% say they have experienced brand/reputation damage due to IT outages
- 30% said brownouts and outages lowered their stock price
When Meta suffered its outage, the company’s stock plummeted 5%. And while few will pity Mark Zuckerberg, this fall represented a $6 billion hit to his net worth in a single day. That’s more than the average lifetime earnings of over 2,000 Americans—lost by one man in one day.
But Zuckerberg and Meta should consider themselves lucky. 16% of IT leaders say their organization was shut down permanently because of IT outages over the past three years.
The long-term costs of downtime: A threat to customer experience
In the long term, outages and brownouts are a major threat to customer experience (CX). In their Future of CX report, PwC surveyed 15,000 consumers and found that 1 in 3 customers will leave a brand they love after just one bad experience, while 92% would completely abandon a company after two or three negative interactions. These findings are mirrored in a Fullstory survey which found:
- 77% of consumers leave a site without buying if they encounter an error
- 60% are unlikely to return to a site later if they encounter an error
- 65% trust a business less when they experience a problem
To make things worse, IT outages and brownouts are most likely to occur when you’re at your most visible—think Black Friday sales, product drops, successful marketing and PR campaigns. This is because high traffic and sudden usage spikes are among the most common causes of downtime.
In a recent survey, 46% of Brits said they’d ditch retailers altogether if their apps crashed on Black Friday. It's no wonder 70% of retail marketers reported concerns about downtime during the holiday season.
Downtime costs add up
Let’s put some of these percentages into concrete terms for an enterprise-level retail site.
Say you’re a large retail site running a Black Friday sale. You have 1,000 visitors per minute, an average order value of $20, and a 10% conversion rate. This would give you an hourly visitor count of 60,000, and an hourly revenue stream of $120,000 during the sale. To keep things simple, let’s say your site crashes for exactly one hour. That means:
- $120,000 in lost revenue
- 36,000 customers (60%) who are unlikely to return to your site
- 20,000 customers (33%) who won’t return to your site
- 39,000 customers (65%) who have less trust in your business than before
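The arithmetic above can be sketched as a quick back-of-envelope calculator. This is an illustrative estimate only: the traffic figures are the example’s, and the percentages come from the PwC and Fullstory surveys cited earlier.

```python
def downtime_impact(visitors_per_minute, avg_order_value, conversion_rate, outage_hours=1):
    """Rough estimate of the revenue and customer impact of an outage.

    Uses survey-based rates: 60% of visitors who hit an error are unlikely
    to return (Fullstory), 1 in 3 leave a brand after one bad experience
    (PwC), and 65% trust the business less (Fullstory).
    """
    hourly_visitors = visitors_per_minute * 60
    hourly_revenue = hourly_visitors * conversion_rate * avg_order_value
    affected = hourly_visitors * outage_hours  # only counts visitors during the outage
    return {
        "lost_revenue": hourly_revenue * outage_hours,
        "unlikely_to_return": round(affected * 0.60),
        "wont_return": affected // 3,
        "lost_trust": round(affected * 0.65),
    }

# The Black Friday example: 1,000 visitors/minute, $20 AOV, 10% conversion
impact = downtime_impact(visitors_per_minute=1000, avg_order_value=20, conversion_rate=0.10)
```

Plugging in the example’s numbers reproduces the figures above: $120,000 in lost revenue and tens of thousands of customers lost or alienated, from a single hour of downtime.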
And these numbers only account for the customers who visited your site while it was down. They don’t include impacts to productivity, negative reviews, the potential hit to your stock price, or the damage done by negative media headlines.
The good news: downtime is avoidable (sometimes)
IT outages and brownouts are terrible for business. And while the bad news is they’re common and on the rise, there’s good news too:
51% of experts consider IT outages and brownouts avoidable.
What would avoiding downtime look like for your business? Forrester asked IT directors what benefits they would expect for their organization if it had no downtime:
- 63% said increased revenue
- 53% said reduced operational cost
- 51% said improved competitive advantage
- 50% said improved employee productivity
You’d be hard-pressed to find any other business change with an impact as substantial as eradicating downtime.
So in the second half of this article, we’ll dive into exactly that: how you can avoid outages and brownouts. But first, we need to understand what’s causing them.
What causes IT outages and brownouts?
There are 6 main reasons websites and servers crash or experience slowdowns:
1. Code errors
A typo at Amazon took a backbone of the internet offline in 2017.
2. Domain name system (DNS) provider failures
A DDoS attack aimed at DNS provider Dyn cut off dozens of top websites in 2016.
3. Web hosting provider issues
Danish construction workers dug up a fiber cable to a server center on Black Friday 2018, causing over 2,000 websites to go dark.
4. Malicious attacks
In 2022, over 70 Ukrainian government websites crashed after an onslaught of cyber attacks.
5. Expired domain name
The Dallas Cowboys forgot to renew their domain name, causing their website to crash on the same day they fired their head coach in 2010.
6. Website traffic surges
Coinbase’s 2022 Super Bowl ad had a QR code linking to its website. The massive influx of traffic from the ad brought their service crashing down.
LogicMonitor’s survey of enterprise-level IT leaders found that the most common of these downtime causes were:
- Network failure (web hosting provider issues)
- Usage spikes (website traffic surges)
- Human error (code errors or expired domain names)
5 proven strategies to prevent website downtime
1. Embrace monitoring
Monitoring isn’t itself a solution to website overload. But the first step to preventing errors is understanding them. And without detailed monitoring, it’s not just your customers that get left in the dark.
If you don’t have insights into your application’s metrics, you’re inviting issues you’ll never understand. Monitoring helps alert you to failures and provides more detailed insight into uptime and traffic.
A massive 74% of global IT teams rely on proactive monitoring to detect and mitigate outages. And among the IT professionals surveyed by LogicMonitor, those with proactive monitoring had the fewest number of outages and brownouts.
It’s not just proactive monitoring that helps mitigate downtime. Detailed monitoring means that after downtime occurs, you can conduct a root cause analysis based on facts, not guesswork. An accurate understanding of the root failure will allow you to optimize your system for the future.
If your application is a black box, there’s good news: there’s a slew of log management and application performance management (APM) SaaS solutions that can help you and that are easy to get up and running (Datadog, New Relic, Loggly, and Splunk, to name a few).
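At its simplest, monitoring means measuring availability against a target and alerting when you fall short. The sketch below shows the basic uptime math behind any such check; the 99.9% SLO figure is an illustrative assumption, not a standard any particular tool enforces.

```python
def uptime_percent(total_seconds, downtime_seconds):
    """Availability over a measurement window, as a percentage."""
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds

def breaches_slo(total_seconds, downtime_seconds, slo=99.9):
    """True when measured availability falls below the SLO target."""
    return uptime_percent(total_seconds, downtime_seconds) < slo

# A single 1-hour outage in a 30-day month gives ~99.86% uptime,
# already below a 99.9% ("three nines") target.
month_seconds = 30 * 24 * 3600
availability = uptime_percent(month_seconds, downtime_seconds=3600)
```

The takeaway: availability targets leave remarkably little room for error, which is why proactive alerting matters more than after-the-fact reports.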
2. Use a CDN and make the most of caching
If you’re not already using a content delivery network (CDN), you’re missing out on important improvements to your site’s resilience and performance.
By using a CDN, you can deliver your site’s static content without placing additional load on your web server. This frees up your server to do what it should—serve up dynamic content.
But even among organizations who already use a CDN, many don’t optimize their processes to make the most of the service.
CDNs can do more than just serve static content. They can also cache certain dynamic content, protect against DDoS attacks, optimize routing, and even execute code. The more of your website you can offload onto your CDN provider’s infrastructure, the more strain you can take off your own servers.
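One practical way to offload work to a CDN is through `Cache-Control` headers, which tell the CDN’s edge servers what they may cache and for how long. The policy below is a minimal illustrative sketch, not any CDN’s defaults; the TTL values and path conventions are assumptions you’d tune for your own site.

```python
def cache_headers(path):
    """Pick an illustrative Cache-Control policy by content type:
    long-lived for static assets (served from the CDN edge),
    cautious for dynamic content."""
    static_suffixes = (".css", ".js", ".png", ".jpg", ".woff2", ".svg")
    if path.endswith(static_suffixes):
        # Fingerprinted static assets can be cached at the edge for a year.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.startswith("/api/"):
        # Personalized API responses: don't let shared caches store them.
        return {"Cache-Control": "no-store"}
    # HTML pages: short edge TTL, with stale-while-revalidate so the CDN
    # can keep serving a slightly stale copy while it refreshes in the
    # background -- useful for absorbing traffic spikes.
    return {"Cache-Control": "public, max-age=60, stale-while-revalidate=300"}
```

Every request the edge answers from cache is a request your origin servers never see.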
3. Scale your servers and prepare for traffic spikes
The typical response to downtime caused by server overload is to scale your servers. This is an important first step, but it’s far from a downtime cure-all.
While scaling servers is crucial, it’s often insufficient on its own. The issue with approaching traffic-related issues through scaling alone is that even with advanced autoscaling, bottlenecks often go unaddressed and servers remain unprepared for sudden traffic peaks.
For an in-depth look at what makes autoscaling so difficult, check out our blog Autoscaling explained: why scaling your site is so hard.
In short, successfully scaling your site involves much more than just increasing server capacity. Autoscaling fails to address performance-intensive bottlenecks such as search features and third-party services like payment gateways.
The shortcomings of autoscaling are less about the capacity of website servers, and more about the on-site behavior of customers. So when autoscaling isn’t enough, the solution is to manage and control the traffic.
This is what a virtual waiting room does. It complements scaling to ensure infrastructure stays online during peak demand.
A virtual waiting room is a cloud-based solution for websites and applications to manage surges in online traffic. When traffic exceeds an organization’s site or app capacity, visitors are redirected to a waiting room using a standard HTTP 302 redirect. They're placed in a customizable waiting room, given transparent wait information, then redirected back to the site or app in a first-come, first-served order.
It manages and controls what autoscaling can’t: the customers.
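The admission logic described above can be sketched in a few lines. This is a toy illustration of the concept, not Queue-it’s implementation: a real waiting room runs as a distributed cloud service, and the capacity number and URL paths here are hypothetical.

```python
SITE_CAPACITY = 500  # illustrative max concurrent users the site can handle

def route_visitor(active_users, queue, visitor_id):
    """Decide whether a visitor goes straight to the site or into the
    waiting room. Returns an (HTTP status code, location) pair."""
    if active_users < SITE_CAPACITY and not queue:
        return (200, "/site")  # capacity available and no one waiting
    queue.append(visitor_id)   # first come, first served
    return (302, f"/waiting-room?pos={len(queue)}")  # standard HTTP 302 redirect

def admit_next(queue):
    """As capacity frees up, redirect the longest-waiting visitor back."""
    if queue:
        return (302, f"/site?token={queue.pop(0)}")
    return None
```

The key property is the cap: no matter how large the surge, the number of visitors hitting the site itself never exceeds what the infrastructure can handle.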
4. Load test your website
Most customers who come to Queue-it looking to prevent website crashes have no idea how many concurrent users their site can handle. Or, worse still, they believe their site’s capacity is far greater than it actually is. This is one of the most common website capacity mistakes, and there’s a simple tool we point customers to for understanding it: load testing.
Load testing is a process which tests the performance of a site, software, or app under a specified load. This helps organizations determine how their service behaves when accessed by large numbers of users.
With load testing, organizations can discover how much traffic their site can handle before bugs, errors, and crashes become an issue. They can also identify bottlenecks in their systems to understand their vulnerabilities.
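The core idea of a load test, stripped to its essentials, is to fire many concurrent sessions at a target and measure latency and error rate. The sketch below simulates this against a stand-in handler; in practice you’d point a dedicated tool (JMeter, k6, Locust, and the like) at a staging environment, and script the heavy flows such as search and checkout.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request():
    """Stand-in for a real endpoint; a real test would hit a staging URL."""
    time.sleep(0.001)  # simulate a small amount of server work
    return 200

def load_test(concurrent_users, requests_per_user):
    """Run concurrent user sessions and report volume, error rate, latency."""
    statuses, latencies = [], []

    def user_session():
        for _ in range(requests_per_user):
            start = time.perf_counter()
            statuses.append(handle_request())
            latencies.append(time.perf_counter() - start)

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        for _ in range(concurrent_users):
            pool.submit(user_session)
        # exiting the with-block waits for all sessions to finish

    errors = sum(1 for s in statuses if s >= 500)
    return {
        "requests": len(statuses),
        "error_rate": errors / len(statuses),
        "avg_latency": sum(latencies) / len(latencies),
    }
```

Ramping `concurrent_users` upward until error rates or latencies spike is how you find the real capacity ceiling, rather than guessing at it.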
The bad news is it’s often difficult to simulate true user behavior with load testing, and even for sites that run load tests, some common bottlenecks are easily missed.
This happened to equestrian brand LeMieux, who load tested their site before running their hugely popular Black Friday sales. They identified the capacity of the site and implemented a virtual waiting room to ensure the outflow of customers to the site stayed below the capacity their load tests revealed.
The issue was, when the sale went live, there were slowdowns caused by customers using the site search and filter features. The load tests didn’t run scripts that tested these server-intensive features, resulting in slowdowns on the site.
"We believed previous problems were caused by volume of people using the site. But it’s so important to understand how they interact with the site and therefore the amount of queries that go back to the servers."
Jodie Bratchell, Ecommerce & Digital Platforms Administrator at LeMieux
With Queue-it in place, they took control over the customer flow by lowering the outflow of visitors from waiting room to website in real-time. The rest of the day was a resounding success. The limited-release products sold out and customer complaints disappeared completely.
5. Block bad traffic & bots
The above steps are important to handle traffic as it reaches your site. But there’s also traffic you’ll want to keep off your site altogether. This unwanted traffic often comes in the form of bad bots or attacks on your servers. Keeping them off your infrastructure is essential to mitigating downtime.
Bots make up nearly two-thirds of internet traffic. And while there are good bots, like the Google crawlers that help your site appear in search results, there are also bad bots looking to abuse your system. These bad bots are responsible for a staggering 39% of all internet traffic.
But there’s traffic you’ll want to keep off your site even more than bots: Distributed Denial-of-Service (DDoS) attacks.
DDoS attacks work by flooding a target server or network with massive amounts of malicious traffic. They’re called Denial-of-Service attacks because this massive flood of traffic can overwhelm servers, meaning regular users (your customers) are denied service.
DDoS attacks are getting cheaper and more common. These website hitmen are now available for hire for around $300. In the second half of 2021, Microsoft mitigated 359,713 unique DDoS attacks, a 43% increase from the first half of the year.
While most businesses are concerned about high traffic caused by real users, many are unaware of the real risks that come from bot, DDoS, and data center traffic.
Queue-it offers several tools to block bad bots and malicious traffic before it hits your site. These include data center IP blocking, proof of work and CAPTCHA challenges, soft blocking of suspicious traffic, and an invite-only waiting room letting only the users you choose get access to your site.
Don’t pay the cost of downtime
It’s safe to say downtime is prohibitively expensive. It’s an avoid-at-all-costs problem that many large organizations remain vulnerable to.
While downtime may be increasing, so too is our capacity to understand and mitigate it.
Queue-it specializes in keeping websites and applications online no matter the demand. Virtual waiting rooms give organizations control over their web traffic to deliver a fair and seamless user experience to visitors.
With advanced monitoring features, robust tools to block bots and abuse, and superior protection against usage spikes, a virtual waiting room equips organizations to avoid the costs of downtime. It empowers them to get back to doing what they do best—delivering high-quality goods and services to their visitors and employees.