Disaster recovery plan: Find MoJ data
Overview
This runbook provides step-by-step guidance for recovering the ‘Find MoJ data’ Django application hosted on the Ministry of Justice’s internal Cloud Platform (CP) using AWS EKS. It covers the actions required by the Data Catalogue team in the event of a disaster, focusing on minimising downtime and data loss.
Summary
Find MoJ data is a Django application that calls a backend service, DataHub, for all of its data. Both services are hosted on the MoJ’s Cloud Platform, and use CP-integrated services such as the application database (AWS RDS PostgreSQL) and search service (AWS OpenSearch).
Both the Find MoJ data and DataHub deployments have RDS PostgreSQL databases deployed via CP, so they receive automated daily backups, with the last 7 retained.
Note: due to an implementation bug, as of 2024-09-03 the snapshot backups are taken every other morning, meaning the oldest retained backup is 12-13 days old and the most recent backup can be up to 48 hours old.
Following loss or corruption of database data, we can either re-ingest metadata through GitHub Actions workflows, or restore from one of these backups by following CP’s RDS snapshot restore guidance, depending on the scenario.
- The process for restoring from snapshot should be tested every 6-12 months to ensure the team is familiar with the process, and that there are no unexpected issues.
- The Find MoJ data RDS database contains:
- app usage data, the loss of which would not be overly detrimental to the operation of the service
- survey data and reported issues, which should be retained where possible
- The DataHub RDS database contains catalogue metadata aggregated from other data-handling services, along with application data (user data, usage data). As of 2024-09-03, the aim for our DataHub is to aggregate metadata as-is from data sources, so total loss of this metadata would not result in loss of information for the business, as it could all be re-ingested into the catalogue application.
The Find MoJ data service is not considered to be a critical service, and as such we have single RDS instances for each of our deployed environments. We do not have multi-Availability Zone (AZ) deployments or read replicas for our databases, as the business can tolerate reasonable outages caused by AWS-level disruptions (e.g. AZ- or region-level outages).
In the medium term, we do not expect more than 50 simultaneous users for any length of time. At full operation, we could expect to see hundreds to thousands of unique monthly users, and at that point we may need to revisit our recovery objectives.
Scope
- Application: Django-based Python application
- Backend services: DataHub
- Hosting: MoJ’s Cloud Platform: AWS EKS
- Database: AWS RDS PostgreSQL
- Key Components:
- Django application code
- DataHub application
- AWS RDS PostgreSQL database
- AWS ElastiCache for Redis
- Docker images stored in Amazon ECR
Business appetite for time to recovery
- Our service has not yet reached full deployment within the organisation, and there are no services dependent upon its function. As such, our end-users would likely not be overly affected if the service were non-functional for extended periods.
- We aim to be able to restore all service within 2 working days of disruption.
Disaster Scenarios
Disaster | Severity | Likelihood | Recovery/Mitigation |
---|---|---|---|
Disruption to hosting services | High | Low | - Following recovery of the platform, we may need to re-deploy our applications |
AWS Region outage | High | Low | - Wait - This may lead to Disruption to hosting services (above) |
Loss of data due to misconfiguration of AWS resources in cloud-platform-environments repo | Medium | Low | - Restore database from RDS snapshot |
Intentional disruption by malicious actor e.g. Distributed Denial of Service (DDoS) | High | Medium | - Horizontal auto-scaling with a sensible scaling ceiling - Rate limiting on the NGINX ingress (see the example after this table) - AWS Shield Standard safeguards applications running on AWS; CP are looking into rolling out AWS Shield Advanced for more comprehensive integrated protection |
Deletion / corruption of application database data | Medium | Low | - see Find MoJ data loss section. Briefly: loss is currently acceptable, can choose not to restore. Prefer restoring by simplest method (migration > re-ingestion > snapshot restore) |
Dependency poisoning attacks | High | Low | - Using Dependabot - Trivy scanning on pipeline - npm audit - Container scanning (Cloud Platform do this) |
Accidental misconfiguration of code leading to data loss | High | Low | - Find MoJ data: write a database migration and roll out via the deploy pipeline. DataHub: use DataHub CLI to repair. - Restore from snapshot |
Entra ID disruption (loss of authentication function) | High | Low | - Wait |
GitHub Actions disruption / outage (no deployments or ingestions) | Medium | Low | - Wait - ingestions can be triggered from DataHub’s UI if they are needed urgently. It is likely that they would be re-triggered once GitHub Actions is restored, so this is not recommended |
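The rate limiting mentioned in the table above is applied through ingress-nginx annotations on the application's ingress. A minimal sketch, assuming illustrative ingress and namespace names and an illustrative limit; in practice the annotation would normally be set in the deployment's Helm/Kubernetes config rather than applied by hand:

```bash
# Illustrative only: apply a requests-per-second limit via the ingress-nginx
# annotation. The ingress name, namespace, and value are assumptions.
kubectl annotate ingress find-moj-data \
  -n <find-moj-data-namespace> \
  nginx.ingress.kubernetes.io/limit-rps="10" \
  --overwrite
```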
Backup Procedures
Find MoJ data (Django application)
- Application code:
- All application code is pushed to the public GitHub repository.
- Custom metadata ingestions are found in the data-catalogue repo
Databases
PostgreSQL database in AWS RDS (Find MoJ data, DataHub):
- RDS automated daily snapshots are enabled by default in CP.
- Ensure point-in-time recovery is active.
TODO
Schedule regular manual snapshots and store them in a different AWS region.
Note that manual snapshots created this way must also be deleted periodically, as they do not expire automatically and there is an account-level snapshot quota.
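A hedged sketch of what that routine could look like with the AWS CLI (the instance identifier, snapshot names, account ID, and regions are illustrative assumptions; CP resources are normally managed via the cloud-platform-environments repo, so treat this as the underlying calls rather than the recommended mechanism):

```bash
# Confirm automated backups are enabled (a retention period greater than 0
# means daily snapshots and point-in-time recovery are active).
aws rds describe-db-instances \
  --db-instance-identifier <find-moj-data-db-instance> \
  --query 'DBInstances[0].BackupRetentionPeriod'

# Take a manual snapshot.
aws rds create-db-snapshot \
  --db-instance-identifier <find-moj-data-db-instance> \
  --db-snapshot-identifier find-moj-data-manual-2024-09-03

# Copy the snapshot to a different region (run against the target region).
# Encrypted snapshots also need --kms-key-id for a key in the target region.
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:eu-west-2:<account-id>:snapshot:find-moj-data-manual-2024-09-03 \
  --target-db-snapshot-identifier find-moj-data-manual-2024-09-03 \
  --region eu-west-1

# Delete superseded manual snapshots to stay within the account-level quota.
aws rds delete-db-snapshot \
  --db-snapshot-identifier <old-snapshot-identifier> \
  --region eu-west-1
```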
OpenSearch (DataHub):
- OpenSearch index snapshots are handled by Cloud Platform as part of OpenSearch deployment
Recovery Procedures
Note: If recovering both Find MoJ data and DataHub, the DataHub instance must be restored and tested for correct operation before attempting to restore Find MoJ data.
Find MoJ data (Django application)
The fastest way to recover Find MoJ data (if there has been no loss of user metadata that needs restoring from database snapshots) is simply to re-deploy it.
- First, check that the relevant Cloud Platform environments are correctly configured.
- Re-run the latest run of the Staged deploy to Test, Preprod, and Prod workflow for the desired environment.
- If this is not possible, a pull request (PR) merged into main will be required to kick off the workflow. Such a PR can consist of an empty commit (see the sketch below).
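A hedged sketch using the GitHub CLI (the repository name, workflow file name, branch name, and run ID are assumptions; check the repo for the exact names):

```bash
# Find and re-run the most recent run of the staged deploy workflow.
gh run list --repo ministryofjustice/find-moj-data \
  --workflow "deploy.yml" --limit 1
gh run rerun <run-id> --repo ministryofjustice/find-moj-data

# If a re-run is not possible, raise a PR containing an empty commit so that
# merging it to main kicks off the workflow.
git checkout -b redeploy-trigger
git commit --allow-empty -m "Empty commit to trigger a re-deploy"
git push origin redeploy-trigger
gh pr create --fill
```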
Data loss
As of 2024-09-03, any data loss from the Find MoJ data database is acceptable and we could choose not to restore.
If data is not completely lost from Find MoJ data, recovery via a database migration will probably be quicker than a full restore.
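As a sketch of that route (the app label is an assumption, and in practice the migration would be written, reviewed, and rolled out via the deploy pipeline rather than run by hand):

```bash
# Create an empty migration to hold the data-repair step, then fill in a
# RunPython operation that restores or corrects the affected rows.
python manage.py makemigrations <app_label> --empty --name repair_lost_data

# Apply migrations (normally done by the deploy pipeline).
python manage.py migrate
```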
If the application database (not the backend DataHub metadata database) needs to be restored, see the Database recovery section below.
DataHub
Re-deploy
If the DataHub application is taken down with no other changes or loss of function, it can be redeployed similarly to the Find MoJ data application above:
- First, check that the relevant Cloud Platform environments are correctly configured.
- Re-run the latest run of the Deploy Staged workflow for the desired environment.
- If this is not possible, a pull request (PR) merged into main will be required to kick off the workflow. Such a PR can consist of an empty commit.
Note: Whenever the DataHub application is deployed from scratch (where there is no existing deployment), the API tokens that Find MoJ data uses to communicate with DataHub must be recreated and stored in the secret store for Find MoJ data.
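A hedged sketch of storing a regenerated token, assuming the secret lives in the Find MoJ data namespace as a standard Kubernetes secret (the namespace, secret name, key, and deployment name are illustrative; check the deployment config for the real names):

```bash
# Patch the new DataHub API token into the existing secret (stringData takes
# the plain, un-encoded value).
kubectl -n <find-moj-data-namespace> patch secret <datahub-api-token-secret> \
  -p '{"stringData": {"DATAHUB_TOKEN": "<new-token>"}}'

# Restart the deployment so pods pick up the new value.
kubectl -n <find-moj-data-namespace> rollout restart deployment <find-moj-data-deployment>
```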
Restore content
There are currently two ways to restore content for DataHub if it is lost:
1. If no DataHub-specific metadata needs restoring (i.e. DataHub usage stats, user records):
   - Re-trigger metadata ingestions. This is likely the simplest solution, requiring the least effort from the team, and should be preferred where it will solve the problem. Depending on the ingestions that need to be re-run, this may take some time (see the sketch after this list).
2. Restore all DataHub content to its previous state by restoring from an RDS snapshot:
   - This will require refreshing any API keys for DataHub in that environment, e.g. those used in Find MoJ data or for local development.
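A hedged sketch of re-triggering an ingestion with the GitHub CLI (the workflow file name is illustrative; the ingestion workflows live in the data-catalogue repo, as noted in the Backup Procedures section):

```bash
# List the workflows in the data-catalogue repo to find the relevant ingestion.
gh workflow list --repo ministryofjustice/data-catalogue

# Trigger the chosen ingestion workflow (name is illustrative).
gh workflow run "ingest-datahub-metadata.yml" --repo ministryofjustice/data-catalogue
```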
Rollback after disruptive upgrade / deployment
If DataHub needs restoring due to issues with a DataHub version upgrade, or with changes to the Helm deployment configuration, the deployment can be rolled back via Helm. Use the following commands to identify the previous desired release revision, and roll back to that revision.
```bash
# get list of historical revisions for a given release, along with release status
helm history datahub -n [NAMESPACE]

# rollback to the desired release revision
helm rollback datahub [REVISION] -n [NAMESPACE]
```
Note: the `datahub` release name comes from the name of the helm chart being deployed in the namespace.
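After rolling back, it is worth confirming that the release and its pods look healthy, for example:

```bash
# Check the release status and that pods have restarted cleanly.
helm status datahub -n [NAMESPACE]
kubectl get pods -n [NAMESPACE]
```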
Microsoft Entra ID integration
The Microsoft Entra ID tenant is not managed by this team, and any significant outage would cause much wider disruption than just to the Data Catalogue. The only course of action in this event would be to monitor communications on the issue.
The more likely scenario - which is in scope for this team - is that the API keys/secrets which the applications use to communicate with the Entra ID tenant expire or are deleted. In this event, follow the guidance in the secrets management runbook.
Database recovery
- Restore PostgreSQL database from RDS snapshots (Find MoJ data, DataHub):
- If needed, restore the database to a specific point in time following CP’s RDS snapshot restore guidance (a snapshot-listing sketch follows this list).
- Restore OpenSearch indices:
- via re-index job:
- In most scenarios, it will be more straightforward to run a re-index job to recreate the indices from the metadata database.
- from snapshots in S3 (DataHub):
- Note: if other restore processes have been run due to the same outage, it will likely be simpler to re-index rather than restoring from snapshots, to ensure DataHub does not get out of sync
- Recover OpenSearch indices from the latest snapshots stored in S3 (AWS instructions; see the sketch after this list)
- Verify search functionality is restored in the application.
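For the RDS restore above, it can help to first identify the snapshot to restore from. A hedged AWS CLI sketch (the instance identifier is an assumption):

```bash
# List available snapshots for the instance, sorted oldest to newest.
aws rds describe-db-snapshots \
  --db-instance-identifier <datahub-db-instance> \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table
```

For the OpenSearch restore from S3, the restore itself goes through the OpenSearch snapshot API. A minimal sketch, assuming the snapshot repository has already been registered; the endpoint, repository, and snapshot names are placeholders:

```bash
# List snapshots in the registered repository, then restore one.
curl -XGET "https://<opensearch-endpoint>/_snapshot/<repository>/_all"
curl -XPOST "https://<opensearch-endpoint>/_snapshot/<repository>/<snapshot-name>/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "*", "include_global_state": false}'
```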
Issues at Cloud Platform level
- Redeploy environments:
- Once Cloud Platform have confirmed that their platform is available, re-deploy affected environment applications
- Trigger GitHub Actions deployments either through a manual re-trigger of the latest workflow run, or through merging a PR to `main` and running the relevant deploy pipelines.
Communication Plan
- Internal Communication:
- Notify the Data Catalogue team immediately when a disaster is identified, via either the #data-catalogue or #data-catalogue-alerts-prod Slack channels.
- Create an incident thread in one of the above channels and tag the team using the `@data-catalogue-team` tag.
- External Communication:
- Inform users about any relevant service disruptions via the #ask-data-catalogue Slack channel and the Ask Data Catalogue Teams channel. Provide updates until full restoration.
Post-Disaster Review
- Incident Analysis:
- Conduct an incident review to identify contributing factors to the disaster.
- Assess the effectiveness of the recovery process and document lessons learned.
- Plan Update:
- Update the DR plan based on the incident analysis and any new risks identified.