Website crashing? Here are your recovery plan essentials
Is your website crashing? Here are the key steps you can take right now to get your website up and running again.
*If your website is down right now because of overwhelming traffic, we can help. Implement Queue-it’s virtual waiting room in under 30 minutes to regain control of your site performance.*
Your website is crashing, but know this: it’s not the first website to go down, and it won’t be the last. Even internet giants like Amazon, H&M, Target, Twitter, and Walmart have seen their websites crash.
Websites can crash for several reasons:
1. Code errors
A typo at Amazon took a backbone of the internet offline in 2017.
2. Domain name system (DNS) provider failures
A DDoS attack aimed at DNS provider Dyn cut off dozens of top websites in 2016.
3. Web hosting provider issues
Danish construction workers dug up a fiber cable to a server center on Black Friday 2018, causing over 2,000 websites to go dark.
4. Malicious attacks
The BBC website crashed for several hours due to a DDoS attack in 2015.
5. Expired domain name
The Dallas Cowboys forgot to renew their domain name, causing their website to crash on the same day they fired their head coach in 2010.
6. Website traffic surges
After Meghan Markle wore one of their dresses, women’s fashion retailer Goat experienced a sudden surge in traffic that crashed their site in 2018.
But when your site is crashing, it’s little consolation to know other sites do too. Time is money, and getting control of the situation is critical.
Getting your website back online is analogous to handling someone in medical distress. If you saw someone you thought needed help, you’d first check to see if the person was ok. If not, you’d get an overview of the situation and ensure no one is in danger. Next, you’d call emergency services for help. Then you’d perform first-aid, like CPR, until help arrives. The patient would have the medical problem fixed by hospital staff. Finally, in the analogy, the patient notifies loved ones that everything is ok. Each of these steps has a parallel in dealing with your website outage.
Here are the steps you should take to understand why your website is down and how to get it up quickly.
Before the alarm bells go off, ensure there’s a problem with your website in the first place.
A reported issue could always be an isolated case of poor internet connection. Or if your site was down briefly, the cache in a visitor’s browser could continue showing an error page, even if your site is back up and running.
Always verify the problem exists before trying to solve it.
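One way to verify the problem is to request the page from a script rather than a browser, so a stale browser cache can’t mislead you. The sketch below is a minimal example using only Python’s standard library. The function names, the `User-Agent` string, and the four verdict labels are illustrative assumptions, not part of any particular monitoring tool.

```python
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status code (or None for no response) to a rough verdict."""
    if status is None:
        return "unreachable"   # no HTTP response at all
    if status < 400:
        return "up"            # success (redirects are followed by urlopen)
    if status < 500:
        return "client error"  # 4xx: the server is answering, the request is the problem
    return "server error"      # 5xx: the server itself is failing


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrives."""
    request = urllib.request.Request(
        url,
        # Ask intermediaries for a fresh copy; the User-Agent value is a made-up label.
        headers={"Cache-Control": "no-cache", "User-Agent": "uptime-check/0.1"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None      # DNS failure, refused connection, timeout, and similar
```

Calling `classify(fetch_status("https://www.example.com/"))` with your own URL in place of the placeholder gives a first signal. Run it from more than one network if you can, since a single failing check could still be a local connection problem.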
Get a preliminary overview of the situation. Which of the typical causes of website crashes listed above are most likely in your case?
If there is a risk of a data breach, or loss or corruption of data, take measures to mitigate that first.
How you do this depends on the type of attack and the system(s) affected. Usually you’ll want to isolate the system(s) accessed by the bad actors to prevent their attack from spreading. This could involve stopping the database or disconnecting breached user accounts. After you contain the attack, you’ll need to eradicate it. Again, this depends on the type of attack, but it could involve blocking certain IP addresses or deleting affected files and restoring them from a backup.
If your website has experienced a data breach, you’ll want to follow steps specific to breaches to perform extra cleanup and stakeholder management.
Now it’s time to enact your escalation plan and notify the responsible contact people. For example, if you’re an online retailer, you’d want to notify your internal IT, digital, and marketing departments, and contact your hosting provider and any applicable consultants or digital agencies you work with.
Quick, clear communication can make a world of difference in mitigating your crashing website.
Even if your website is crashing, there are several concrete steps you can take to give visitors a better experience.
For example, you could redirect to a landing page that provides relevant information and keeps visitors feeling like they’re still in your ecosystem (cute dogs certainly help alleviate some stress, too).
Another option is to move visitors to a virtual waiting room. There you can send visitors real-time updates while you fix the website issues. Once your website comes back online, you can provide transparent wait time information as visitors return to your site in a controlled, first-in-first-out manner. If traffic peaks were the root cause of your crash, this could actually be the solution to your worries instead of a temporary band-aid.
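To make the first-in-first-out idea concrete, here is a toy sketch of a waiting room that admits a fixed number of visitors per time interval. It is an illustration only and not how Queue-it or any production waiting room is implemented. The class name, the `admit_per_tick` parameter, and the notion of a "tick" are assumptions made for the example.

```python
from collections import deque


class WaitingRoom:
    """Toy first-in-first-out waiting room with a fixed admission rate."""

    def __init__(self, admit_per_tick):
        self.admit_per_tick = admit_per_tick  # visitors released per interval
        self._queue = deque()

    def join(self, visitor_id):
        """Add a visitor to the back of the line and return their 1-based position."""
        self._queue.append(visitor_id)
        return len(self._queue)

    def estimated_wait_ticks(self, position):
        """Intervals until the visitor at `position` is admitted, rounded up."""
        return -(-position // self.admit_per_tick)

    def admit(self):
        """Release up to admit_per_tick visitors from the front of the line."""
        batch = []
        while self._queue and len(batch) < self.admit_per_tick:
            batch.append(self._queue.popleft())
        return batch
```

Because the site only ever receives `admit_per_tick` visitors per interval, traffic stays below whatever threshold you choose, and each visitor’s position gives a simple basis for the wait time you show them.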
Studies on the service recovery paradox have shown that recovering well from a failure has the potential to generate higher satisfaction than never failing to begin with.
But to do so, effective initial, ongoing, and post-downtime communication is absolutely essential. Clearly setting expectations, showing empathy for customers, and demonstrating earnest effort to fix the problem all factor into how customers perceive you handled the situation.
Hopefully you have a template prepared for such situations. But if not, there are fantastic resources that outline guidelines for great status updates. Remember that your goals in the communication are to inform your customers and build their confidence in you.
How should you communicate with your visitors? Here are three main channels:
Have a status page that everyone can access. It doesn’t help if you’re pushing out communications on a page that no one can see. That’s why it makes sense to host your status page on separate infrastructure.
Queue-it's status page shows the availability of our services and website.
For example, during an August 2020 outage of G Suite products like Gmail, Google used its status page to keep users around the world informed in a centralized, controlled way.
Leverage your social media accounts to spread your outage communication, linking to your status page when applicable. If you’re able to serve customers by phone, by email, or in-store, remind them of those opportunities. If your resources allow, take time to field and respond to complaints and questions on social media channels and email.
There’s no point in sending visitors to your site when it’s down. Your marketing team will need to be aware of the outage so they can pause any marketing campaigns (this again highlights why internal escalation plans are so important). It could be they have a huge email or social media promotion planned that would just leave customers frustrated. What’s more, pausing paid ad campaigns ensures you’re not paying for ads that have no chance for ROI.
Your team will need to diagnose and treat the root cause of the website crash. Is it a code conflict with a new plugin you added to your site? Has traffic overwhelmed your payment and inventory bottlenecks, causing a cascading failure? If you have monitoring or logging set up, you’ll already be a step ahead in identifying the issue.
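If you do have access logs, a quick tally of server errors per URL can point to where the failure is concentrated. The sketch below assumes your web server writes the widely used Common Log Format (or the combined variant). The sample line in the comment and the function name are made up for illustration, and your log layout may differ.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line, e.g.
# 203.0.113.7 - - [27/Nov/2020:10:00:01 +0000] "GET /checkout HTTP/1.1" 502 512
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def server_errors_by_path(lines):
    """Count 5xx responses per requested path; lines that don't match are skipped."""
    errors = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("status").startswith("5"):
            errors[match.group("path")] += 1
    return errors
```

Passing an open log file to `server_errors_by_path` and calling `.most_common(5)` on the result lists the five paths with the most server errors. A single dominant path, such as a checkout or payment endpoint, suggests a bottleneck there rather than a site-wide failure.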
There’s no way to outline here exactly what steps you need to take, as that depends on many variables including your type of company, the root cause of the problem, your infrastructure setup, and what internal resources you have available. But do remember to continuously update your customer base using the channels outlined earlier.
Once your team has identified and resolved the issue(s) and your site is back up and running (congrats!), you’ll need to share the good news.
You’re ready to communicate that your website is back up and running. But first, check a few things, especially if your website crashed because of overwhelming traffic.
If you’re using a CDN (if not, you really should be), its cache normally removes a lot of strain from your web servers when people visit your site. When your system fails, this cache can be cleared.
What happens when the site returns online? It can crash again. Visitors hit the site while no content is cached, and everything has to load from databases and render at the same time. So, pre-load your cache before the system goes back online, if possible. Implementing a virtual waiting room is another way to ensure traffic remains under your website’s thresholds.
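Pre-loading a cache can be as simple as requesting your most important pages once, in sequence, before you reopen the site to visitors. The sketch below shows that idea with Python’s standard library. The function names are hypothetical, the list of URLs is something you would supply yourself, and whether a request actually populates the cache depends on your CDN and caching configuration.

```python
import urllib.error
import urllib.request


def http_get_status(url, timeout=10):
    """Fetch url fully and return its status code, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # download the whole body so the cache can store it
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return None


def warm_cache(urls, fetch=http_get_status):
    """Request each URL once, one at a time, and record the status per URL."""
    results = {}
    for url in urls:
        results[url] = fetch(url)
    return results


def not_warmed(results):
    """URLs that did not return 200 and probably need another look."""
    return [url for url, status in results.items() if status != 200]
```

Requesting pages one at a time is deliberate: the goal is to fill the cache without recreating the load spike that took the site down. Anything returned by `not_warmed` is worth checking before you announce that the site is back.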
After you communicate your website is up and running, you should write a post-mortem statement explaining what went wrong and apologizing to your customers. This statement shouldn’t shift blame or beat around the bush. It should get straight to the point.
Atlassian recommends using the following outline:
- Acknowledge the problem, empathize with those affected and apologize
- Explain what went wrong and why
- Explain what was done to fix the incident and what was done to prevent repeat incidents
- Acknowledge, empathize, and apologize once again
Remember, even the biggest companies have outages. If you handle the situation well, you’ll be able to bolster and regain the trust of your customers.
We’ve just reviewed the main reasons behind website crashes and what steps you should take if your website is down right now.
(This post has been updated since it was originally written in 2019.)