Understanding website outages during timed onsales & launches

Queue-it saw many website outages caused by end-user overload at large telcos during the iPhone 6 pre-order launch, even though the timing of Apple’s launch was widely known in advance.

Published: 22. Sep 2014

Recently, there have been large investments in telco infrastructure, and website outages are highly costly in terms of both revenue and reputation. So, even if these end-user peaks are sporadic by nature, why do telcos continue to experience webshop outages?

Queue-it helps hundreds of customers handle end-user peaks, multiple flash sales, and onsales each day. Based on our dialogue with hundreds of web-based organizations and our expertise, we strongly believe that the answer is two-fold in that:

  1. The traditional way of defining capacity is misleading and leads to incorrect conclusions.
  2. The change in webshop dynamics when a set release time or date is defined (e.g. a 00:01 release) requires capacity well beyond what is commonly thought, and that capacity is probably unrealistic to deploy.

Traditional way of defining capacity

The traditional way of defining the capacity of a web-based system is by “number of concurrent users”. This definition is widely used and is also found in Google Analytics within “active users on site” under Real-Time/Overview.

This way of defining capacity is based on the assumption that end-users are spread throughout the entire user-journey. However, it does not take users entering and exiting the user-journey into account. Please see the end of this post, where an example of a timed-release situation (a set time or date, e.g. a 00:01 release) shows how the traditional way of defining capacity leads to an incorrect understanding of, and conclusion about, capacity.

The flow-based approach

Our approach to capacity is a flow-based approach based on Little’s law. At optimum over a longer period, a given transactional system must have an average inflow per time unit equal to its average outflow. If average inflow is below average outflow, the capacity is not fully used. If inflow exceeds outflow, the system will congest.

Little’s law states that “The long-term average number of customers in a stable system L is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or expressed algebraically: L = λW.”

So, for example, if the maximum number of concurrent sessions is L=1000 and the average session length is W=10 minutes, then the arrival rate must not exceed λ=L/W=1000/10=100 users per minute in order for the system to be stable.

The rate can be calculated using Little’s law equation, given the maximum number of concurrent sessions, the average length of the session and a large enough sample size.
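As a minimal sketch of that calculation, the Python snippet below rearranges Little’s law to λ = L / W. The function name and parameter names are illustrative assumptions, not part of any Queue-it tooling; the numbers are the ones from the example above.

```python
def max_arrival_rate(max_concurrent_sessions: float, avg_session_minutes: float) -> float:
    """Little's law rearranged: lambda = L / W (users per minute).

    Hypothetical helper for illustration only.
    """
    if avg_session_minutes <= 0:
        raise ValueError("average session length must be positive")
    return max_concurrent_sessions / avg_session_minutes


# L = 1000 concurrent sessions, W = 10 minutes per session
rate = max_arrival_rate(1000, 10)
print(f"Sustainable arrival rate: {rate:.0f} users per minute")
```

With L=1000 and W=10 this gives the 100 users per minute stated above.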

In the example transactional system below, the average inflow is 1 user per time unit and each user stays in the system for an average of 20 time units.

[Figure: inflow through a transactional system during a ticketing onsale]

At optimum, the average number of users in the transactional system is: average inflow x average time = 1 x 20 = 20.
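The steady state can also be checked with a small simulation. The sketch below is an illustration under simplifying assumptions (exactly 1 user arrives per time unit and every user stays exactly 20 time units, rather than on average); the function name is hypothetical.

```python
from collections import deque


def simulate_users_in_system(inflow_per_tick: int, stay_ticks: int, total_ticks: int) -> list:
    """Return the number of users in the system at the end of each tick."""
    departures = deque()  # departure tick of every user currently in the system
    counts = []
    for tick in range(total_ticks):
        # Users whose stay has ended leave the system first.
        while departures and departures[0] <= tick:
            departures.popleft()
        # New users arrive and will leave after stay_ticks.
        for _ in range(inflow_per_tick):
            departures.append(tick + stay_ticks)
        counts.append(len(departures))
    return counts


counts = simulate_users_in_system(inflow_per_tick=1, stay_ticks=20, total_ticks=100)
print(counts[0], counts[-1])  # the system fills up, then levels off
```

After the first 20 time units the system levels off at 20 users, matching inflow x time = 1 x 20.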

Set release time

Within ticketing, it is an industry standard that popular tickets go on sale at a specific, pre-announced time, like 10:00 am. Increasingly, online retailers are adopting this approach, for example, with capped Black Friday/Cyber Monday campaigns and launches like the Apple iPhone 6 00:01 release.

This tends to set an unnatural block in the website flow. End-users cannot continue the flow until the given release time arrives. Even displaying a “System under maintenance” page will create the same situation, with an unnatural block in the flow.

Although this does not seem like a major issue, it radically changes the entire webshop dynamics and, hence, the entire capacity situation, as demonstrated in the following example:

Example

Assume that:

  • Website capacity is defined as 1000 concurrent users
  • Campaign release begins at 10:00
  • There is a 10 min. user-journey

So, if you have 1000 users on the website at 10:00, would everything be ok?

If you use the flow-based approach (using Little’s law), you will get:

  • Capacity: 1000 users / 10 min. = 100 users per minute
  • All 1000 users begin the user-journey at 10:00
  • Capacity is therefore overshot by 10x in the first minute, as all 1000 users arrive within that minute

Furthermore, assuming that only 10 minutes’ worth of users (1000) are waiting at the release time is optimistic. End-users tend to arrive early when waiting for a popular item, like an iPhone 6, so the wait before the sale is, in practice, closer to 60 minutes. Users arriving within the 60 minutes leading up to the timed release all end up waiting until the user-journey can be continued, so there will be e.g. 6000 users at 10:00.

Therefore, capacity will be overshot in the first minute by: (6000 users / 1 min) / (100 users per minute) = 60x.
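The arithmetic in this example can be reproduced with a short Python sketch. The function and parameter names are hypothetical, and it assumes, as the example does, that every waiting user starts the user-journey within the first minute after release.

```python
def overshoot_factor(users_at_release: float,
                     max_concurrent_users: float,
                     journey_minutes: float,
                     burst_minutes: float = 1.0) -> float:
    """Ratio of the burst arrival rate to the sustainable rate from Little's law."""
    sustainable_rate = max_concurrent_users / journey_minutes  # lambda = L / W
    burst_rate = users_at_release / burst_minutes              # users per minute at release
    return burst_rate / sustainable_rate


# 1000 concurrent users of capacity, 10 min. user-journey
print(overshoot_factor(1000, 1000, 10))  # 1000 users waiting at 10:00
print(overshoot_factor(6000, 1000, 10))  # 6000 users waiting at 10:00
```

The two calls correspond to the 10x and 60x overshoot figures above.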

Conclusion

Understanding capacity in the flow-based approach is critical when you plan for a timed release on a webshop.

As demonstrated in the example, an announced set start time radically changes the entire webshop dynamics and the need for capacity, which becomes clear as soon as capacity is viewed from a flow-based perspective.

Therefore, we highly recommend that organizations start using the flow-based way of understanding and describing capacity.

 
