Housekeeping

DataHub

dataProcessInstance cleanup

DataHub creates dataProcessInstance entities for each ingestion run, and these are not automatically deleted once ingestion completes, so a large number accumulate over time. In our environment approximately 60,000 dataProcessInstance rows are created per ingestion run (the daily accumulation is visible in the example output in step 6). Left unchecked, this leads to performance degradation and increased storage usage in the DataHub instance.

There is a garbage collection process that cleans up dataProcessInstance entities, among others; however, despite being enabled, it does not appear to be functioning as expected in our environment. As a result, I have been periodically pruning old dataProcessInstance entities manually using the steps below.

This task should be carried out every couple of months, during a maintenance window or off-peak hours, as it may impact database performance. It should be repeated across all environments where DataHub is deployed (e.g. production, preprod, development).

1) Obtain & decode the RDS endpoints and credentials for the relevant environment (e.g. production) via kubectl.

    kubectl get secret -n data-platform-datahub-catalogue-prod rds-postgresql-instance-output -o json | jq '.data | map_values(@base64d)'
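
If you only need the instance address, it can be pulled out directly. This is a convenience one-liner, not part of the original steps; it assumes the secret key is named rds_instance_address, matching the placeholder used in step 2:

    # extract and decode a single field from the secret (key name assumed)
    kubectl get secret -n data-platform-datahub-catalogue-prod rds-postgresql-instance-output \
      -o jsonpath='{.data.rds_instance_address}' | base64 -d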

2) Launch the port-forward pod in the appropriate environment (e.g. production) by specifying the namespace (-n data-platform-datahub-catalogue-prod) and replacing <rds_instance_address from above> with the RDS instance address obtained in step 1:

    kubectl \
    -n data-platform-datahub-catalogue-prod \
    run port-forward-pod \
    --image=ministryofjustice/port-forward \
    --port=5432 \
    --env="REMOTE_HOST=<rds_instance_address from above>" \
    --env="LOCAL_PORT=5432" \
    --env="REMOTE_PORT=5432"

3) Forward a local port to the pod:

    kubectl -n data-platform-datahub-catalogue-prod \
    port-forward \
    port-forward-pod 1234:5432
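
With the forward running in another terminal, you can optionally confirm the tunnel is up before connecting. This assumes the PostgreSQL client tools (which include pg_isready) are installed locally:

    # reports "accepting connections" if the tunnel reaches the database
    pg_isready -h localhost -p 1234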

4) Connect to the database using psql:

    psql postgresql://<username>:<password>@localhost:1234/<database_name>
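
Once connected, it can be useful to record the table's size before the cleanup so you can compare it after the vacuum in step 10. A minimal check using standard PostgreSQL functions:

    -- total on-disk size of the table, including indexes and TOAST data
    SELECT pg_size_pretty(pg_total_relation_size('public.metadata_aspect_v2'));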

5) Validate the number of dataProcessInstance rows present, grouped by day:

    SELECT DATE_TRUNC('day', createdon) AS day, COUNT(*) AS event_count
    FROM public.metadata_aspect_v2
    WHERE urn LIKE 'urn:li:dataProcessInstance%' GROUP BY day ORDER BY day;

6) Example output:

db2182d9f2ebfb645b=> SELECT  DATE_TRUNC('day', createdon) AS day, COUNT(*) AS event_count
FROM public.metadata_aspect_v2
WHERE urn like 'urn:li:dataProcessInstance%' GROUP BY day ORDER BY day;
         day         | event_count
---------------------+-------------
 2025-11-03 00:00:00 |        6384
 2025-11-04 00:00:00 |        3686
 2025-11-05 00:00:00 |        4038
 2025-11-06 00:00:00 |        3686
 2025-11-07 00:00:00 |       68529
 2025-11-10 00:00:00 |       77815
 2025-11-11 00:00:00 |       66573
 2025-11-12 00:00:00 |       72944
 2025-11-13 00:00:00 |       63479
 2025-11-14 00:00:00 |       65498
 2025-11-17 00:00:00 |       69269
 2025-11-18 00:00:00 |       68473
 2025-11-19 00:00:00 |       75186
 2025-11-20 00:00:00 |       55133
 2025-11-21 00:00:00 |       77601
 2025-11-24 00:00:00 |       68353
 2025-11-25 00:00:00 |       69553
 2025-11-26 00:00:00 |       68125
 2025-11-27 00:00:00 |       72161
 2025-11-28 00:00:00 |       74241
 2025-12-01 00:00:00 |       56933
 2025-12-02 00:00:00 |       95288
 2025-12-03 00:00:00 |       95531
 2025-12-04 00:00:00 |       60142
 2025-12-05 00:00:00 |       82654
 2025-12-08 00:00:00 |       69524
 2025-12-09 00:00:00 |       78643
 2025-12-10 00:00:00 |       84827

7) Delete dataProcessInstance entities older than a specific date (e.g. 2025-11-10). I usually retain approximately one to two months of data.

BE VERY CAREFUL WITH THIS COMMAND: ensure the syntax and the date format are correct before running it. You have been warned!

    DELETE FROM public.metadata_aspect_v2 WHERE urn LIKE 'urn:li:dataProcessInstance%' AND createdon::date < '2025-11-10';

Example output:

db2182d9f2ebfb645b=> DELETE FROM public.metadata_aspect_v2 WHERE urn like 'urn:li:dataProcessInstance%' AND createdon::date < '2025-11-10';

DELETE 86323
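
If you want an extra safety net, the same delete can be wrapped in a transaction so the reported row count can be sanity-checked before it is made permanent. A sketch, not part of the original procedure:

    BEGIN;
    DELETE FROM public.metadata_aspect_v2
    WHERE urn LIKE 'urn:li:dataProcessInstance%' AND createdon::date < '2025-11-10';
    -- psql prints DELETE <count>; if it looks wrong, run ROLLBACK; instead of COMMIT;
    COMMIT;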

8) Re-run the validation query from step 5 to confirm the deletion:

    SELECT DATE_TRUNC('day', createdon) AS day, COUNT(*) AS event_count
    FROM public.metadata_aspect_v2
    WHERE urn LIKE 'urn:li:dataProcessInstance%' GROUP BY day ORDER BY day;

9) Example output after deletion:

db2182d9f2ebfb645b=> SELECT  DATE_TRUNC('day', createdon) AS day, COUNT(*) AS event_count
    FROM public.metadata_aspect_v2
    WHERE urn like 'urn:li:dataProcessInstance%' GROUP BY day ORDER BY day;
         day         | event_count
---------------------+-------------
 2025-11-10 00:00:00 |       77815
 2025-11-11 00:00:00 |       66573
 2025-11-12 00:00:00 |       72944
 2025-11-13 00:00:00 |       63479
 2025-11-14 00:00:00 |       65498
 2025-11-17 00:00:00 |       69269
 2025-11-18 00:00:00 |       68473
 2025-11-19 00:00:00 |       75186
 2025-11-20 00:00:00 |       55133
 2025-11-21 00:00:00 |       77601
 2025-11-24 00:00:00 |       68353
 2025-11-25 00:00:00 |       69553
 2025-11-26 00:00:00 |       68125
 2025-11-27 00:00:00 |       72161
 2025-11-28 00:00:00 |       74241
 2025-12-01 00:00:00 |       56933
 2025-12-02 00:00:00 |       95288
 2025-12-03 00:00:00 |       95531
 2025-12-04 00:00:00 |       60142
 2025-12-05 00:00:00 |       82654
 2025-12-08 00:00:00 |       69524
 2025-12-09 00:00:00 |       78643
 2025-12-10 00:00:00 |       84827
(23 rows)

10) Run a vacuum to reclaim storage space:

CAUTION: This operation can be resource-intensive and may impact database performance. It is advisable to perform this during off-peak hours.

    db2182d9f2ebfb645b=> VACUUM FULL public.metadata_aspect_v2;
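
Note that VACUUM FULL rewrites the whole table and takes an ACCESS EXCLUSIVE lock, blocking reads and writes while it runs. If that lock is not acceptable, a plain vacuum reclaims the space for reuse within the table (without returning it to the operating system) and refreshes planner statistics:

    -- lighter-weight alternative: no exclusive lock, space reused rather than released
    VACUUM (VERBOSE, ANALYZE) public.metadata_aspect_v2;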

11) Once complete, exit the psql session and terminate the port-forward pod:

    kubectl -n data-platform-datahub-catalogue-prod delete pod port-forward-pod

Find MoJ Data

Login session cleanup

The Find MoJ Data application creates a login session for each user when they authenticate. These sessions are stored in the database and are not automatically deleted, so a large number can accumulate over time.
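
Django ships a clearsessions management command that deletes expired sessions, which may be a simpler long-term fix if it can be scheduled. A sketch of running it in the application pod; the deployment name is a placeholder, not taken from this runbook:

    # run Django's built-in expired-session cleanup inside the app container
    kubectl -n data-platform-find-moj-data-prod exec deploy/<deployment-name> -- python manage.py clearsessions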

1) Obtain & decode the RDS endpoints and credentials for the relevant environment (e.g. production) via kubectl.

    kubectl get secret -n data-platform-find-moj-data-prod rds-postgresql-instance-output -o json | jq '.data | map_values(@base64d)'

2) Launch the port-forward pod in the appropriate environment (e.g. production) by specifying the namespace (-n data-platform-find-moj-data-prod) and replacing <rds_instance_address from above> with the RDS instance address obtained in step 1:

    kubectl \
    -n data-platform-find-moj-data-prod \
    run port-forward-pod \
    --image=ministryofjustice/port-forward \
    --port=5432 \
    --env="REMOTE_HOST=<rds_instance_address from above>" \
    --env="LOCAL_PORT=5432" \
    --env="REMOTE_PORT=5432"

3) Forward a local port to the pod:

    kubectl -n data-platform-find-moj-data-prod \
    port-forward \
    port-forward-pod 1234:5432

4) Connect to the database using psql:

    psql postgresql://<username>:<password>@localhost:1234/<database_name>

5) Display the tables present in the database:

    \dt

Example output:

db3a14acc6ad1a9fb6=> \dt
                        List of relations
 Schema |               Name                | Type  |   Owner
--------+-----------------------------------+-------+------------
 public | auth_group                        | table | cpVy6xNtU9
 public | auth_group_permissions            | table | cpVy6xNtU9
 public | auth_permission                   | table | cpVy6xNtU9
 public | django_admin_log                  | table | cpVy6xNtU9
 public | django_content_type               | table | cpVy6xNtU9
 public | django_migrations                 | table | cpVy6xNtU9
 public | django_session                    | table | cpVy6xNtU9
 public | feedback_feedbackno               | table | cpVy6xNtU9
 public | feedback_feedbackreport           | table | cpVy6xNtU9
 public | feedback_feedbackyes              | table | cpVy6xNtU9
 public | feedback_issue                    | table | cpVy6xNtU9
 public | users_customuser                  | table | cpVy6xNtU9
 public | users_customuser_groups           | table | cpVy6xNtU9
 public | users_customuser_user_permissions | table | cpVy6xNtU9
 public | waffle_flag                       | table | cpVy6xNtU9
 public | waffle_flag_groups                | table | cpVy6xNtU9
 public | waffle_flag_users                 | table | cpVy6xNtU9
 public | waffle_sample                     | table | cpVy6xNtU9
 public | waffle_switch                     | table | cpVy6xNtU9
(19 rows)
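
To see how many of these sessions have already expired relative to the total, a quick aggregate helps pick a sensible cutoff date. A sketch using PostgreSQL's FILTER clause:

    -- compare the overall session count with the already-expired portion
    SELECT count(*) AS total_sessions,
           count(*) FILTER (WHERE expire_date < now()) AS expired_sessions
    FROM django_session;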

6) Validate the number of login sessions present before a cleanup:

   db3a14acc6ad1a9fb6=> select count(*) from django_session where expire_date::date < '2025-12-05';
 count
--------
 775502
(1 row)

7) Delete login sessions with an expire_date older than a specific date (e.g. 2025-12-05). I usually retain approximately one month of data:

   DELETE FROM django_session WHERE expire_date::date < '2025-12-05';

Example output:

db3a14acc6ad1a9fb6=> DELETE FROM django_session WHERE expire_date::date < '2025-12-05';
DELETE 775502

8) Re-run the validation query from step 6 to confirm the deletion:

   db3a14acc6ad1a9fb6=> select count(*) from django_session where expire_date::date < '2025-12-05';
 count
-------
     0
(1 row)

9) Run a vacuum to reclaim storage space:

CAUTION: This operation can be resource-intensive and may impact database performance. It is advisable to perform this during off-peak hours.

    db3a14acc6ad1a9fb6=> VACUUM FULL django_session;

10) Once complete, exit the psql session and terminate the port-forward pod:

    kubectl -n data-platform-find-moj-data-prod delete pod port-forward-pod

This page was last reviewed on 24 October 2025. It needs to be reviewed again on 24 April 2026 by the page owner #find-moj-data.