
Disaster recovery plans: Find MoJ Data

Overview

This runbook provides step-by-step guidance for recovering the ‘Find MoJ Data’ Django application hosted on the Ministry of Justice’s internal Cloud Platform (CP) using AWS EKS. It covers the actions required by the Data Catalogue team in the event of a disaster, focusing on minimising downtime and data loss.

Steps to take

If an incident which causes data loss is identified, follow these steps:

  • Communicate the issue - see the section below
  • Identify the cause
  • Agree on recovery action(s)
  • Update communications
  • Perform recovery actions
  • Verify that operations are restored
  • Update communications
  • Perform a post-incident review and identify actions (this can be done later)

Summary

  • Find MoJ Data is a Django application that calls a backend service DataHub for all of its data. Both services are hosted on the MoJ’s Cloud Platform, and use CP for integrated services like the application database (AWS RDS PostgreSQL) and search service (AWS OpenSearch).

  • Both the Find MoJ Data and DataHub deployments have RDS PostgreSQL databases deployed via CP, so they receive automated daily backups; the last 7 backups are retained.

  • Following loss or corruption of database data, we can either re-ingest metadata through GitHub Actions workflows, or restore from one of these backups (see the Database recovery section below), depending on the scenario.

    • The process for restoring from snapshot should be tested every 6-12 months to ensure the team is familiar with the process, and that there are no unexpected issues.
    • The Find MoJ Data RDS database contains:
      • app usage data, which would not be overly detrimental to operation of the service if it were to be lost (we also have Google Analytics data)
      • survey data and reported issues, which should be retained where possible
    • The DataHub RDS database contains catalogue metadata aggregated from other data-handling services, along with application data (user data, usage data)
  • The Find MoJ Data service is not considered to be a critical service, and as such we have single RDS instances for each of our deployed environments. We do not have multi-Availability Zone (AZ) deployments and/or read replicas for our databases, as the business will be able to tolerate reasonable outages due to AWS-level disruptions (e.g. AZ- or region-level outages).

Scope

  • Application: Django-based Python application
  • Backend services: DataHub
  • Hosting: MoJ’s Cloud Platform: AWS EKS
  • Database: AWS RDS PostgreSQL
  • Key Components:

Business appetite for time to recovery

  • Our service has not yet reached full deployment within the organisation, and there are no services dependent upon its function. As such, our end users would likely not be overly affected if the service were non-functional for a few hours.
  • We aim to restore full service within 2 working days of disruption, and ideally much faster.

Disaster Scenarios

  • Disruption to hosting services (Severity: High, Likelihood: Low)
    • Following recovery of the platform, we may need to re-deploy our applications
  • AWS Region outage (Severity: High, Likelihood: Low)
    • Wait
    • This may lead to disruption to hosting services (above)
  • Loss of data due to misconfiguration of AWS resources in the cloud-platform-environments repo (Severity: Medium, Likelihood: Low)
    • Restore database from RDS snapshot
  • Intentional disruption by a malicious actor, e.g. Distributed Denial of Service (DDoS) (Severity: High, Likelihood: Medium)
    • Horizontal auto-scaling with a sensible scaling ceiling
    • Rate limiting on NGINX ingress
    • AWS Shield Standard safeguards applications running on AWS. CP are looking into rolling out AWS Shield Advanced for more comprehensive integrated protection
  • Deletion / corruption of application database data (Severity: Medium, Likelihood: Low)
    • See the Database recovery section below. Briefly: loss is currently acceptable, so we can choose not to restore. Prefer restoring by the simplest method (migration > re-ingestion > snapshot restore)
  • Dependency poisoning attacks (Severity: High, Likelihood: Low)
    • Using Dependabot
    • Trivy scanning on the pipeline
    • npm audit
    • Container scanning (Cloud Platform do this)
  • Accidental misconfiguration of code leading to data loss (Severity: High, Likelihood: Low)
    • Find MoJ Data: write a database migration and roll out via the deploy pipeline. DataHub: use the DataHub CLI to repair.
    • Restore from snapshot
  • Entra ID disruption (loss of authentication function) (Severity: High, Likelihood: Low)
    • Wait
  • GitHub Actions disruption / outage (no deployments or ingestions) (Severity: Medium, Likelihood: Low)
    • Wait
    • Ingestions can be triggered from DataHub’s UI if they are needed urgently. It is likely that they would be re-triggered once GitHub Actions is restored, so this is not recommended

Backup Procedures

Find MoJ Data (application and metadata ingestion code)

Databases

  • PostgreSQL database in AWS RDS (Find MoJ Data, DataHub):

    • RDS automated daily snapshots are enabled by default in CP.
    • Ensure point-in-time recovery is active.
    • TODO: Schedule regular manual snapshots and store them in a different AWS region (see the sketch at the end of this section).

    Note: manual snapshots set up this way must also be deleted periodically, as there is an account-level snapshot quota.

  • OpenSearch (DataHub):
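
A minimal sketch of the manual cross-region snapshot process flagged in the TODO above, using the AWS CLI. The instance and snapshot identifiers, account ID, and regions are placeholders, and this is not yet an agreed process:

# create a manual snapshot of the RDS instance (identifiers are placeholders)
aws rds create-db-snapshot \
  --db-instance-identifier [DB_INSTANCE_ID] \
  --db-snapshot-identifier [SNAPSHOT_ID]

# copy the snapshot to a different region (run against the destination region)
# encrypted snapshots also need --kms-key-id for a key in the destination region
aws rds copy-db-snapshot \
  --region eu-west-1 \
  --source-region eu-west-2 \
  --source-db-snapshot-identifier arn:aws:rds:eu-west-2:[ACCOUNT_ID]:snapshot:[SNAPSHOT_ID] \
  --target-db-snapshot-identifier [SNAPSHOT_ID]-copy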

Recovery Procedures

Note: If recovering both Find MoJ Data and DataHub, the DataHub instance must be restored and tested for correct operation before attempting to restore Find MoJ Data.

Find MoJ Data (Django application)

The fastest way to recover Find MoJ Data (if there has not been loss of user metadata that needs restoring from database snapshots) is simply to re-deploy it.

  • First check that the relevant Cloud Platform environments are correctly configured.
  • Re-run the latest run of the ‘Staged deploy to Preprod, and Prod’ workflow to the desired environment (see the sketch after this list).
    • If this is not possible, a pull request (PR) merged into main will be required to kick off the workflow. Such a PR can consist of an empty commit.
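
A minimal sketch of both options, assuming the GitHub CLI (gh) is authenticated against the repository; the workflow name, run ID, and branch name are placeholders:

# option 1: find and re-run the most recent run of the deploy workflow
gh run list --workflow "[WORKFLOW NAME]" --limit 1
gh run rerun [RUN_ID]

# option 2: raise a PR containing only an empty commit, to trigger the workflow on merge
git checkout -b [BRANCH_NAME]
git commit --allow-empty -m "Empty commit to trigger re-deploy"
git push origin [BRANCH_NAME]
# then open a PR for this branch and merge it into main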

DataHub

Re-deploy

If the DataHub application is taken down with no other changes or loss of function, it can be redeployed similarly to the Find MoJ Data application above:

  • First check that the relevant Cloud Platform environments are correctly configured.
  • Re-run the latest run of the Deploy Staged workflow to the desired environment.
    • If this is not possible, a pull request (PR) merged into main will be required to kick off the workflow. Such a PR can consist of an empty commit.

Note: Whenever the DataHub application is deployed from scratch (where there is no existing deployment), the API tokens that Find MoJ Data uses to communicate with DataHub must be recreated and stored in the secret store for Find MoJ Data.
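
If the token is held as a Kubernetes secret in the Find MoJ Data namespace (an assumption; the namespace, secret, key, and deployment names below are placeholders), updating it might look like this:

# update the secret holding the DataHub API token (names are placeholders)
kubectl -n [FIND_MOJ_DATA_NAMESPACE] create secret generic [SECRET_NAME] \
  --from-literal=[KEY_NAME]=[NEW_DATAHUB_API_TOKEN] \
  --dry-run=client -o yaml | kubectl apply -f -

# restart the deployment so pods pick up the new secret value
kubectl -n [FIND_MOJ_DATA_NAMESPACE] rollout restart deployment [DEPLOYMENT_NAME]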

Restore content

There are currently 2 ways to restore content for DataHub if it is lost:

  1. If no DataHub-specific metadata needs restoring (i.e. DataHub usage stats, user records):
    • Re-trigger metadata ingestions (see the sketch after this list). This is likely the simplest solution requiring the least effort from the team. Where this will solve the problem, it should be preferred. Depending on which ingestions need to be re-run, this may take some time.
  2. Restore all DataHub content to previous state: restore from an RDS snapshot
    • This will require refreshing any API keys for DataHub for that environment, e.g. used in Find MoJ Data or for local development.
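
Re-triggering an ingestion from the command line might look like the following, assuming the ingestion is defined as a manually dispatchable GitHub Actions workflow (the workflow name and run ID are placeholders):

# trigger an ingestion workflow against main (workflow name is a placeholder)
gh workflow run "[INGESTION WORKFLOW NAME]" --ref main

# find and watch the run that was just started
gh run list --workflow "[INGESTION WORKFLOW NAME]" --limit 1
gh run watch [RUN_ID]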

Rollback after disruptive upgrade / deployment

If DataHub needs restoring due to a problematic version upgrade, or due to changes to the helm deployment configuration, the deployment can be rolled back via helm. Use the following commands to identify the desired previous release revision and roll back to it.

# get list of historical revisions for a given release, along with release status
helm history datahub -n [NAMESPACE]

# rollback to desired release revision
helm rollback datahub [REVISION] -n [NAMESPACE]

Note: the datahub release name comes from the name of the helm chart deployed in the namespace; [REVISION] is the revision number chosen from the helm history output.
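
After rolling back, the release state can be checked before verifying the application itself:

# confirm the release is now deployed at the expected revision
helm status datahub -n [NAMESPACE]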

Microsoft Entra ID integration

The Microsoft Entra ID tenant is not managed by this team, and any significant outage would cause much wider disruption than just to the Data Catalogue. The only course of action in this event would be to monitor communications on the issue.

The more likely scenario - which is in scope for this team - is that the API keys/secrets which the applications use to communicate with the Entra ID tenant expire or are deleted. In this event, follow the guidance in the secrets management runbook.

Database recovery

In many cases, restoring metadata by re-running the ingestion process will be faster than an RDS restore. At the time of writing, the ingestion takes around one hour to run, and RDS restoration requires PR approval from the Cloud Platform team and could take around two hours.

If more than just asset metadata is affected (for example, DataHub user account role assignments or manual glossary entries), then an RDS restore is the better option.
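
Before raising the restore PRs described below, the available snapshots and their identifiers can be listed with the AWS CLI (the instance identifier is a placeholder):

# list automated and manual snapshots for the instance, with identifiers and creation times
aws rds describe-db-snapshots \
  --db-instance-identifier [DB_INSTANCE_ID] \
  --query 'DBSnapshots[].{ID:DBSnapshotIdentifier,Created:SnapshotCreateTime,Type:SnapshotType}' \
  --output table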

  • Restore PostgreSQL database from RDS snapshots (Find MoJ Data, DataHub):
    • If needed, restore the database to a specific point in time following CP’s RDS snapshot restore guidance, but also note:
      • Your first PR should set RDS deletion_protection to false or the restore will fail - example
      • A subsequent PR should specify the snapshot identifier - example
      • Once the restore is complete, raise a final PR to set deletion_protection to true and remove the snapshot identifier - example
  • Restore OpenSearch indices:
    • Via re-index job (see the sketch after this list):
      • In most scenarios, it will be more straightforward to run a re-index job to recreate the indices from the metadata database.
    • From snapshots in S3 (DataHub):
      • Note: if other restore processes have been run due to the same outage, it will likely be simpler to re-index rather than restore from snapshot, to ensure DataHub does not get out of sync
      • Recover OpenSearch indices from the latest snapshots stored in S3 (AWS instructions)
      • Verify search functionality is restored in the application.
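
A sketch of triggering the re-index, assuming DataHub is deployed via its helm chart, which ships a suspended restore-indices cron job template (the exact job template name may differ between chart versions and releases, so it is treated as a placeholder here):

# list cron jobs in the namespace to find the restore-indices job template
kubectl -n [NAMESPACE] get cronjobs

# create a one-off job from the template, then follow its logs until it completes
kubectl -n [NAMESPACE] create job --from=cronjob/[RESTORE_INDICES_CRONJOB_NAME] restore-indices-manual
kubectl -n [NAMESPACE] logs job/restore-indices-manual -f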

Issues at Cloud Platform level

  • Redeploy environments:
    • Once Cloud Platform have confirmed that their platform is available, re-deploy affected environment applications
    • Trigger GitHub Actions deployments either by manually re-triggering the latest workflow run, or by merging a PR to main and running the relevant deploy pipelines.

Communication plan

  • Internal Communication:
    • Notify the Data Catalogue team immediately when a disaster is identified, via either the #data-catalogue or #data-catalogue-alerts-prod Slack channel.
    • Create an incident thread in one of the above channels and tag the team using the @data-catalogue-team tag
  • External Communication:

Post-Incident Review

  • Incident Analysis:
    • Conduct an incident review to identify contributing factors to the disaster.
    • Assess the effectiveness of the recovery process and document lessons learned.
  • Plan Update:
    • Update the DR plan based on the incident analysis and any new risks identified.
This page was last reviewed on 23 April 2025. It needs to be reviewed again on 23 October 2025 by the page owner #data-catalogue.