Cause of 8th July 2020 downtime
Our memory cache server crashed on 8th July 2020. In most cases a single server going down would not have caused a service disruption: as with most distributed systems, the infrastructure would be "ok" if any 2 out of the 3 servers were operational. Unfortunately, this was not a single server going down, but two servers, which went down during "1-4am local time" for most of our staff members.
As such, no one woke up to the message notification previously set up. And while we are a fully remote team, all of our staff members are currently operating from the Asia timezone. This caused an extended downtime for our login systems.
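To make the "2 out of 3" rule concrete, here is a minimal sketch. The function name `cluster_is_ok` and the list-of-booleans representation are assumptions made for illustration only, not our actual infrastructure code.

```python
def cluster_is_ok(server_up, required=2):
    """Return True when at least `required` servers are operational.

    `server_up` is a list of booleans, one per server (hypothetical
    representation, used only for this illustration).
    """
    return sum(1 for up in server_up if up) >= required


# One server down out of three: the cluster still counts as "ok".
one_down = cluster_is_ok([True, True, False])

# Two servers down out of three: this is the situation we hit.
two_down = cluster_is_ok([True, False, False])
```

With one server down the check passes; with two down it fails, which is what turned a routine crash into a login outage.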
So what was done to prevent this from happening again?
The memory cache servers were restarted, and we have added health checks and scripts to automate the restart process for this issue. This is something we do for every major component whenever we unfortunately find a "new way for it to go wrong". While we had existing checks for whether the "server is running", which already automate this process, that health check was not specific enough to validate the memory cache process itself.
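As a rough sketch of the difference between "server is running" and "cache process is actually working", the check below exercises the cache itself and only triggers a restart when that probe keeps failing. Everything here is hypothetical: `check_and_repair`, the `FakeCache` stand-in, and the retry count are assumptions for illustration, not our production scripts or any specific cache client's API.

```python
import time


def check_and_repair(probe, restart, retries=3, wait_seconds=0.0):
    """Probe the cache; restart it if every attempt fails.

    `probe` is any callable returning True when the cache answers
    correctly; `restart` is any callable that restarts the process.
    Returns "healthy", "restarted" or "failed".
    """
    def probe_ok():
        # A probe that raises is treated the same as a failed probe.
        try:
            return bool(probe())
        except Exception:
            return False

    for attempt in range(retries):
        if probe_ok():
            return "healthy"
        if attempt < retries - 1:
            time.sleep(wait_seconds)

    # Every attempt failed: restart, then verify with one more probe.
    restart()
    return "restarted" if probe_ok() else "failed"


class FakeCache:
    """Stand-in for a cache whose host is up but whose process is not serving."""

    def __init__(self):
        self.serving = False
        self.store = {}

    def round_trip(self):
        # Write a key and read it back, as a process-level check would.
        if not self.serving:
            raise ConnectionError("cache not responding")
        self.store["healthcheck"] = "ping"
        return self.store.get("healthcheck") == "ping"

    def restart(self):
        self.serving = True


cache = FakeCache()
first_result = check_and_repair(cache.round_trip, cache.restart)
second_result = check_and_repair(cache.round_trip, cache.restart)
```

The point of the design is that the probe does a real write and read against the cache, so a host that is "up" with a dead cache process still fails the check.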
But that fix will only prevent the exact same crash pattern. What about other patterns?
While some of our staff members may be night owls, what we realised from this is that a single message notification is insufficient for certain time windows, when most of the team is deep asleep. A service downtime can then last much longer than is acceptable.
Originally, our downtime monitoring (uptimerobot.com) was configured to only escalate issues via message notification.
This was insufficient to wake anyone up. As such, we have changed it to escalate the downtime beyond a single message notification, to phone calls every few minutes if the downtime persists (a feature that we did not know had been added until recently). This will hopefully help wake up someone in the team quicker.
This escalation to spammed phone calls was long overdue, to prevent extended downtime. And if there were a "banging on your front door" as a service, that would be something I could consider adding as the next form of escalation in our alerting system.
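To illustrate the escalation idea described above (one message first, then repeated phone calls while the downtime persists), here is a small sketch. The function `escalation_schedule` and the 5-minute values are assumptions picked for the example; they are not our actual uptimerobot.com settings, and this is not its API.

```python
def escalation_schedule(downtime_minutes, first_call_after=5, call_interval=5):
    """Return (minute, action) pairs for a downtime of the given length.

    A message goes out immediately; phone calls start after
    `first_call_after` minutes and repeat every `call_interval` minutes
    for as long as the downtime lasts. All values are illustrative.
    """
    if call_interval <= 0:
        raise ValueError("call_interval must be positive")

    schedule = [(0, "message")]
    minute = first_call_after
    while minute <= downtime_minutes:
        schedule.append((minute, "phone_call"))
        minute += call_interval
    return schedule


short_outage = escalation_schedule(3)
long_outage = escalation_schedule(12)
```

Under these assumed settings, a 3-minute blip produces only the message, while a 12-minute outage produces the message plus phone calls at minutes 5 and 10.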
How about longer-term fixes?
Our team will always be adding health checks and automation to help repair any infrastructure issue before it impacts users. We do have to acknowledge that Murphy's law will hold true, and that our rather small team needs to be much better prepared to handle downtime across all timezones and minimize the time to resolution to a few minutes.
So while most of our team is in the Asia timezone, we realise that a majority of our users are from either Europe or the USA. As such, one thing planned for our next fundraise this year is to expand our remote team to include the US/Europe, specifically to help ensure we have staff members available at all times.
This process, however, is expected to happen over the next 6 months.
Overall, I hope the above helps clarify, transparently, how we intend to improve our internal processes to improve overall system uptime for everyone testing with uilicious. And once again ...
~ Happy Testing 🖖🚀