Incident reports
Data loss - 22nd April 2025
On the morning of Tuesday 22nd April 2025, it became apparent there was an issue with the service (at this time, the service was still in private beta).
Timeline
- 09:00 - incident identified
- 09:14 - incident message posted in #data-catalogue
- 09:20 - recovery options discussed
- 09:50 - RDS snapshot restoration commenced:
- 10:05 - Cloud Platform PR raised and posted in ask-cloud-platform
- 10:27 - after a Concourse build failure, a subsequent Cloud Platform PR raised to remove RDS deletion protection
- 11:15 - further Cloud Platform PR to add the RDS snapshot ID for restore
- 11:35 - service fully restored; service restored message posted in #ask-data-catalogue and #data-catalogue
- 12:22 - final Cloud Platform PR to switch DB protection back on, and remove snapshot ID
- 2025-04-23 - post-incident review conducted and circulated
While the service was most likely down over the long Easter weekend, it was restored around 2.5 hours after we first became aware of the issue.
Symptoms
- Multiple Sentry alerts:
FIND-MOJ-DATA-D9 - CatalogueError: Unable to execute list domains query
- DataHub user accounts had been reset to lowest-privilege defaults (this was when we noticed that 1Password was missing the admin credentials)
- A user reporting the issue in #ask-data-catalogue
- The home page displaying the “there is a problem with the service” page
- The DataHub home page was displaying the URNs of domains, not their names
- All DataHub assets were missing (datasets, domains, glossary etc). Note that DataHub still displayed a count of datasets
- Ingestions failing from the following day as a result of tokens no longer being present in the database
Options considered
- Manual restoration of user accounts
- Before it became clear what the problem was, we considered manually updating the DataHub database to restore admin permissions to our accounts. We discounted this when it became clear that the scope of the problem was wider than data domains and user permissions
- Restoring an RDS snapshot
- Since user role assignments were affected (not just data assets), and we were reasonably sure that the previous Thursday’s snapshot was good, we quickly decided that restoring the DataHub database was the quickest route to restoring the service
Cause
Since low RDS storage has been a constant worry for the team, a manually executed database cleanup script was run on Thursday 17th April ahead of the 4-day Easter weekend. The script has been run several times before by the team, but it does need to be manually updated before each run.
It appears that a bug in this script was responsible for removing more data than intended. The intent was to remove only older dataProcessInstance records (“log” entries related to removing data). But since DataHub stores everything in one table, any slight bug in the script has much wider consequences.
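To illustrate how a single-table cleanup can be guarded against this kind of over-deletion, here is a minimal sketch. It is not the team's script. It uses an in-memory SQLite table as a simplified stand-in for the DataHub storage table, and the table name `metadata_aspect`, its columns, the sample rows and the function names are all assumptions made for illustration. The idea is to count matches in a dry run first, then delete inside a transaction and roll back if any row outside the `dataProcessInstance` URN prefix was touched.

```python
import sqlite3

# Assumed URN prefix of the only records the cleanup is meant to remove.
TARGET_PREFIX = "urn:li:dataProcessInstance:"

# substr() gives an exact prefix match; LIKE would treat "_" and "%" as wildcards.
WHERE = "substr(urn, 1, ?) = ? AND createdon < ?"


def count_other(conn):
    """Count rows that do NOT belong to the target entity type."""
    return conn.execute(
        "SELECT COUNT(*) FROM metadata_aspect WHERE substr(urn, 1, ?) <> ?",
        (len(TARGET_PREFIX), TARGET_PREFIX),
    ).fetchone()[0]


def guarded_cleanup(conn, cutoff, dry_run=True):
    """Remove target rows created before `cutoff` (ISO date string).

    Returns the number of rows matched. Nothing is deleted when dry_run is True.
    """
    params = (len(TARGET_PREFIX), TARGET_PREFIX, cutoff)
    others_before = count_other(conn)
    matched = conn.execute(
        f"SELECT COUNT(*) FROM metadata_aspect WHERE {WHERE}", params
    ).fetchone()[0]
    if dry_run:
        return matched
    try:
        cur = conn.execute(f"DELETE FROM metadata_aspect WHERE {WHERE}", params)
        # Abort if the delete differs from the dry run or hit other entity types.
        if cur.rowcount != matched or count_other(conn) != others_before:
            raise RuntimeError("cleanup touched unexpected rows")
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    return matched


def demo_connection():
    """Build a tiny hypothetical table: two process-instance rows, two others."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata_aspect (urn TEXT, aspect TEXT, createdon TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadata_aspect VALUES (?, ?, ?)",
        [
            ("urn:li:dataProcessInstance:old-run", "dataProcessInstanceProperties", "2025-01-10"),
            ("urn:li:dataProcessInstance:new-run", "dataProcessInstanceProperties", "2025-04-16"),
            ("urn:li:domain:example", "domainProperties", "2024-11-01"),
            ("urn:li:corpuser:example", "corpUserInfo", "2024-11-01"),
        ],
    )
    conn.commit()
    return conn
```

Running `guarded_cleanup(conn, "2025-03-01")` first as a dry run shows how many rows would go before anything is removed. Only the second call with `dry_run=False` deletes, and it leaves the domain and user rows in place even though they are older than the cutoff.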
Lessons learned
Lesson | Action |
---|---|
We should have posted an ongoing incident message in #ask-data-catalogue | Added clear and simple list of “steps to take” to the top of the disaster recovery runbook |
We no longer need manual database cleanup | Prioritise the ticket to implement DataHub 1.0.0’s new native cleanup |
RDS restoration could be more efficient | Update our runbooks to set out steps, rather than just relying on Cloud Platform docs and ask channel |
DataHub admin password missing from the team 1Password vault | Admin credentials added to shared vault |
Other conclusions
Restoring the service in around 2.5 hours was well within our stated aims, and we have identified improvements to the database restore process which would potentially reduce this to two hours.
As part of this review, we revisited the assumption in our disaster recovery guidance that re-ingesting all metadata is the preferred option over restoring the database (as we did in this case). Given that the ingestion process currently takes around one hour, we feel that in some cases this is still the correct advice.
In cases like this one, where DataHub user account settings are also affected, a database restore is the best option. The disaster recovery runbook page has been updated to make this distinction.
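The distinction above can be written down as a small decision rule. This is only a sketch of the reasoning in this review, not the runbook's actual wording. The `Damage` type, its field names and the returned strings are hypothetical, and the "no recovery needed" branch is an assumption added for completeness.

```python
from dataclasses import dataclass


@dataclass
class Damage:
    """What an incident has affected (hypothetical fields for illustration)."""
    metadata_lost: bool            # datasets, domains, glossary and similar assets
    user_accounts_affected: bool   # DataHub user account settings, e.g. roles


def recovery_option(damage: Damage) -> str:
    """Pick a recovery route following the distinction drawn in this review."""
    if damage.user_accounts_affected:
        # Re-ingestion only rebuilds metadata, so account settings need a restore.
        return "restore database snapshot"
    if damage.metadata_lost:
        # Ingestion currently takes around one hour, so this can be the faster route.
        return "re-ingest metadata"
    return "no recovery needed"  # assumption: nothing to recover
```

For this incident the rule gives `recovery_option(Damage(metadata_lost=True, user_accounts_affected=True))`, which selects the snapshot restore that was actually carried out.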