Incident Report - 03 Dec
Just before 5pm GMT on 03 December, multiple services on Evance were affected by an incomplete update. Evance experienced intermittent service and prolonged downtime lasting approximately 5 hours.
An update to our database cluster crashed, resulting in a partial update, a subsequent database failure and inconsistencies across some nodes within the cluster. Unfortunately, this affected the cluster's ability to synchronise. Our reporting and monitoring systems alerted us to the issue immediately. To protect the integrity of your data, we took the unusual decision to block traffic from reaching the databases whilst we addressed the problem. Ultimately, no data was lost.
This incident affected the database cluster only. In particular:
- No static content was affected by this incident. This includes your images, videos, audio files etc.
- Application servers were not affected by this incident. This means none of the code for your sites or themes was affected.
- This was not a security-related incident.
How we fixed it
With our database cluster out of sync and reporting multiple failures, we took the cluster offline to avoid any data integrity problems. Unusual discrepancies between database nodes meant that internal constraints, which are designed to ensure data integrity, prevented the nodes from restarting successfully. Once we had isolated the issue, we were able to rebuild the nodes and bring them back into the cluster. Unfortunately, this took several attempts because data constraints had to be reconciled and because of significant performance issues when traffic was redirected to the cluster. This resulted in an intermittent and degraded service.
When connectivity was finally restored, we received several bug reports at the application level, all resulting from an out-of-date schema cache. Regenerating the cache resolved all of these issues immediately.
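As a rough illustration only, a stale schema cache can be detected and regenerated along the following lines. This is not Evance's actual implementation: the use of SQLite, the `SchemaCache` class and the `schema_fingerprint` helper are all assumptions chosen to keep the sketch self-contained.

```python
import hashlib
import sqlite3


def schema_fingerprint(conn):
    """Hash the DDL of every table so that any schema change alters the result."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    digest = hashlib.sha256()
    for name, sql in rows:
        digest.update(f"{name}:{sql}\n".encode("utf-8"))
    return digest.hexdigest()


class SchemaCache:
    """Hypothetical application-level cache of table column names."""

    def __init__(self, conn):
        self.conn = conn
        self.fingerprint = None  # None means the cache has never been built
        self.columns = {}

    def regenerate(self):
        """Rebuild the cached column lists from the live database."""
        tables = [
            row[0]
            for row in self.conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        self.columns = {
            table: [
                col[1]  # index 1 of each table_info row is the column name
                for col in self.conn.execute(f'PRAGMA table_info("{table}")')
            ]
            for table in tables
        }
        self.fingerprint = schema_fingerprint(self.conn)

    def is_stale(self):
        """True when the database schema no longer matches the cached copy."""
        return self.fingerprint != schema_fingerprint(self.conn)

    def columns_for(self, table):
        """Return cached column names, regenerating the cache first if stale."""
        if self.is_stale():
            self.regenerate()
        return self.columns[table]
```

In this sketch the cache compares a fingerprint of the live schema against the one it stored when it was last built, so a schema change made while the application was disconnected is picked up on the next lookup rather than surfacing as application errors.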
An accidental positive
During the process of addressing this issue we identified small bottlenecks on some of our database tables. Making small adjustments during the offline period has delivered marginal speed improvements to some live services.
We've analysed this event for opportunities to serve you better. Our aim is to find "silver linings" - solutions that make long-term improvements to our service and prevent similar incidents.
- Our communication capabilities were insufficient during this incident. We'll be implementing improvements to how we relay incident status.
- Although we push hundreds of updates to our servers every year, we've identified potential improvements to our toolkits to effect safer upgrade processes, in conjunction with improvements to disaster recovery.
- We have identified improvements to how we manage our backup facilities and how regularly we run them.
We are committed to the infrastructural and organisational investments required to address the issues above.
We are aware of the importance of our services to your business. All of us at Evance would like to sincerely apologise for the impact this caused. We're aware of the trust you place in us and take pride in building a resilient platform. In this instance we failed that trust and exposed flaws in parts of our platform we need to address. Rest assured, we're taking steps to ensure incidents like this do not happen again.