Improving UIlicious Downtime Response

By Eugene Cheah | July 9, 2020

Cause of 8th July 2020 downtime

Our memory cache servers crashed on 8th July 2020. In most cases, losing a single server would not cause a service disruption: as with most distributed systems, the infrastructure stays "ok" as long as any 2 out of the 3 servers are operational. Unfortunately, this was not a single-server failure. Two servers went down, during what was "1-4 am local time" for most of our staff members.
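To make the "2 out of 3" rule concrete, here is a minimal sketch of a majority check. The function names and the server labels are hypothetical, and treating the rule as a strict majority is an assumption; the post only states that 2 of the 3 servers must be operational.

```python
from typing import Mapping


def has_quorum(healthy: int, total: int = 3) -> bool:
    """Return True when a strict majority of servers is healthy.

    Assumption: "2 out of 3" is read as a majority rule.
    """
    return healthy > total // 2


def service_ok(statuses: Mapping[str, bool]) -> bool:
    """Decide whether the service is 'ok' from per-server up/down flags."""
    healthy = sum(1 for up in statuses.values() if up)
    return has_quorum(healthy, total=len(statuses))


# One server down: still fine. Two servers down: quorum is lost.
one_down = service_ok({"cache-1": True, "cache-2": True, "cache-3": False})
two_down = service_ok({"cache-1": True, "cache-2": False, "cache-3": False})
```

With one server down the check still passes; with two down, as happened here, it fails.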

As a result, no one woke up to the message notification we had previously set up. While we are a fully remote team, all of our staff members currently operate from the Asia timezone, and this led to an extended downtime for our login systems.

So what was done to prevent this from happening again?

The memory cache servers were restarted, and we have added health checks and scripts to automate the restart process for this issue. This is something we do for every major component whenever we, unfortunately, find a "new way for it to go wrong". We already had checks for whether the "server is running", with automated restarts, but that health check was not specific enough to validate the memory cache process itself.
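As an illustration of the difference between "the server is running" and "the memory cache process actually works", here is a rough sketch of a round-trip health check. It is not our actual script: the CacheClient interface, the probe key prefix, and the restart callback are all hypothetical stand-ins, since the real check depends on the cache software in use.

```python
import uuid
from typing import Callable, Dict, Optional, Protocol


class CacheClient(Protocol):
    """Hypothetical minimal interface for a memory cache client."""

    def set(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


def cache_is_healthy(client: CacheClient) -> bool:
    """Write a probe value and read it back, rather than only pinging the host."""
    probe_key = "healthcheck:" + uuid.uuid4().hex
    probe_value = uuid.uuid4().hex
    try:
        client.set(probe_key, probe_value)
        return client.get(probe_key) == probe_value
    except Exception:
        # Any failure to talk to the cache counts as unhealthy.
        return False


def check_and_repair(client: CacheClient, restart: Callable[[], None]) -> bool:
    """Return True if healthy; otherwise trigger the restart hook and return False."""
    if cache_is_healthy(client):
        return True
    restart()
    return False


class InMemoryCache:
    """Stand-in for a working cache, used only for this demo."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class BrokenCache:
    """Stand-in for a host that is up while its cache process is not responding."""

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("cache process not responding")

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("cache process not responding")
```

A plain "is the server up" check would pass for BrokenCache, while the round-trip check fails and triggers the restart hook.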

But that fix only prevents the exact same crash pattern. What about other patterns?

While some of our staff members may be night owls, what we realized from this is that a single message notification is insufficient during the time windows when most of the team is fast asleep. In those windows, a service downtime can last much longer than is acceptable.

Originally, our downtime monitoring service, uptimerobot.com, was configured to escalate issues only via message notification.

This was insufficient to wake anyone up. We have therefore changed it to escalate beyond a single message notification: if the downtime persists, it now places phone calls every few minutes (a feature we did not know had been added until recently), which will hopefully wake someone on the team up more quickly.

This escalation to repeated phone calls was long overdue as a way to prevent extended downtime. And if there were a "banging on the front door as a service", I would consider adding it as the next form of escalation in our alerting system.
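The escalation described above can be sketched as a small policy function. This only illustrates the logic and is not uptimerobot.com's configuration or API. The function name, the action labels, and the 5-minute call interval are assumptions; the actual setup only says "every few minutes".

```python
from typing import List


def escalation_actions(minutes_down: int, call_interval: int = 5) -> List[str]:
    """Return the alerts to fire at a given minute of an ongoing downtime.

    Assumptions: one message when the downtime is first detected (minute 0),
    then a phone call every `call_interval` minutes while it persists.
    """
    actions: List[str] = []
    if minutes_down == 0:
        actions.append("message")
    elif minutes_down % call_interval == 0:
        actions.append("phone_call")
    return actions


# Minutes within the first quarter hour at which a phone call would be placed.
call_minutes = [m for m in range(0, 16) if "phone_call" in escalation_actions(m)]
```

Under these assumptions, a downtime that starts at 1 am would trigger a message immediately and then calls at 1:05, 1:10, 1:15, and so on until someone responds.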

How about longer-term fixes?

Our team will keep adding health checks and automation to help repair infrastructure issues before they impact users. Still, we have to acknowledge that Murphy's law will hold true, and our rather small team needs to be much better prepared to handle downtime across all time zones and to bring the time to resolve down to a few minutes.

While most of our team is in the Asia timezone, we realize that a majority of our users are from either Europe or the USA. As such, one thing planned for our next fundraise this year is to expand our remote team to include members in the US and Europe, specifically to help ensure we have staff members available across all time zones.

This process, however, is expected to take place over the next 6 months.

Overall, I hope the above clarifies, transparently, how we intend to improve our internal processes and, with them, overall system uptime for everyone testing with UIlicious. And once again...

~ Happy Testing 🖖🚀

About Eugene Cheah

Does UI test automation and web app development, and is part of the GPUJS team.