Disaster recovery

Directory services are critical to authentication, session management, authorization, and more. When directory services are broken, quick recovery is a must.

In DS directory services, a disaster is a serious data problem affecting the entire replication topology. Replication can’t help you recover from a disaster because it replays data changes everywhere.

Disaster recovery comes with a service interruption, the loss of recent changes, and a reset for replication. It is justified only in the event of a real disaster. You don’t need the disaster recovery procedure for a hardware failure or a server that’s been offline too long and needs reinitialization. Even if you lose most of your DS servers, you can still rebuild the service without interruption or data loss.

For disaster recovery to be quick, you must prepare in advance.

Don’t go to production until you have successfully tested your disaster recovery procedures.

The following example helps you prepare to recover from a disaster. It shows the following tasks:

Back up a DS directory service.
Restore the service to a known state.
Validate the procedure.

Tasks

The following tasks demonstrate a disaster recovery procedure on a single computer with two replicated DS servers set up for evaluation.

In deployment, the procedure involves multiple computers, but the order and content of the tasks remain the same. Before you perform the procedure in production, make sure you have copies of the following:

The deployment description, documentation, plans, runbooks, and scripts.
The system configuration and software, including the Java installation.
The DS software and any customizations, plugins, or extensions.
A recent backup of any external secrets required, such as an HSM or a CA key.
A recent backup of each server’s configuration files, matching the production configuration.
The deployment ID and password.

This procedure applies to DS versions providing the dsrepl disaster-recovery command.

For deployments with any earlier DS servers that don’t provide the command, you can’t use this procedure. Instead, refer to How do I perform disaster recovery steps in DS?

Disaster recovery has these characteristics:

You perform disaster recovery on a stopped server, one server at a time.
Disaster recovery is per base DN, like replication.
On each server you recover, you use the same disaster recovery ID, a unique identifier for this recovery.

To minimize the service interruption, this example recovers the servers one by one. It is also possible to perform disaster recovery in parallel by stopping and starting all servers together.

Task 1: Back up directory data

Back up data while the directory service is running smoothly. For additional details, refer to Backup and restore.

Back up the directory data.

The following command backs up directory data created for evaluation:

```
$ /path/to/opendj/bin/dsbackup \
 create \
 --start 0 \
 --backupLocation /path/to/opendj/bak \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin
```

The command returns, and the DS server runs the backup task in the background.

When adapting the recovery process for deployment, schedule a backup task to run regularly for each database backend.

Check the backup task finishes successfully:

```
$ /path/to/opendj/bin/manage-tasks \
 --summary \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
```

The status of the backup task is "Completed successfully" when it is done.
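When scripting this check, one option is to poll the task summary until the status string appears. The following is a minimal sketch, not a DS-provided tool: the helper names `backup_completed` and `wait_for_backup` are hypothetical, and matching on the text "Completed successfully" assumes the summary output contains that string for the finished backup task.

```shell
# Sketch: decide whether a backup task has finished by scanning the text
# printed by `manage-tasks --summary`. Matching on the status string is an
# assumption about the summary layout; adapt the pattern to your output.

# Reads summary text on stdin; succeeds if any line reports success.
backup_completed() {
  grep -q "Completed successfully"
}

# Hypothetical wrapper: poll until the task completes or attempts run out.
# $1 = command that prints the task summary
# $2 = maximum number of attempts
# $3 = delay between attempts, in seconds
wait_for_backup() {
  summary_cmd=$1
  attempts=$2
  delay=$3
  while [ "$attempts" -gt 0 ]; do
    if $summary_cmd | backup_completed; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}
```

To use it, wrap the `manage-tasks --summary` command from this step in a small script or function, and pass its name as the first argument, for example `wait_for_backup ./task-summary.sh 30 10`.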

Recovery from disaster means stopping the directory service and losing the latest changes. The more recent the backup, the fewer changes you lose during recovery. Backup operations are cumulative, so you can schedule them regularly without using too much disk space as long as you purge outdated backup files. As you script your disaster recovery procedures for deployment, schedule a recurring backup task to have safe, current, and complete backup files for each backend.
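As a starting point for scheduling, the following sketch prints, but does not run, a `dsbackup create` command that replaces `--start 0` with a recurring schedule. It assumes the `--recurringTask` option with a crontab-style pattern is available in your DS version; confirm this with `dsbackup create --help`. The `DS_HOME` and `SCHEDULE` variables and the function name are hypothetical.

```shell
# Sketch: print a dsbackup command that schedules a recurring backup task.
# Assumption: --recurringTask accepts a crontab-style schedule pattern.

DS_HOME=${DS_HOME:-/path/to/opendj}   # same install path as the example above
SCHEDULE=${SCHEDULE:-"00 02 * * *"}   # hypothetical schedule: daily at 02:00

# Prints the command text so you can review it before running it.
recurring_backup_command() {
  cat <<EOF
$DS_HOME/bin/dsbackup \\
 create \\
 --recurringTask "$SCHEDULE" \\
 --backupLocation $DS_HOME/bak \\
 --hostname localhost \\
 --port 4444 \\
 --bindDN uid=admin \\
 --bindPassword password \\
 --usePkcs12TrustStore $DS_HOME/config/keystore \\
 --trustStorePassword:file $DS_HOME/config/keystore.pin
EOF
}

recurring_backup_command
```

Printing the command rather than running it keeps the sketch safe to try. Review the output, then run it against a test server before relying on it.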

Task 2: Recover from a disaster

This task restores the directory data from backup files created before the disaster. Adapt this procedure as necessary if you have multiple directory backends to recover.

All changes since the last backup operation are lost.

Subtasks:

Prepare for recovery
Recover the first directory server
Recover remaining servers

Prepare for recovery

If you have lost DS servers, replace them with servers configured as before the disaster.

In this example, no servers were lost. Reuse the existing servers.

On each replica, prevent applications from making changes to the backend for the affected base DN. Changes made during recovery would be lost or could not be replicated:

```
$ /path/to/opendj/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:internal-only \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
```

```
$ /path/to/replica/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:internal-only \
 --hostname localhost \
 --port 14444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
```

In this example, the first server’s administrative port is 4444. The second server’s administrative port is 14444.
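With more than two servers, repeating the command by hand becomes error-prone. The following sketch wraps the `dsconfig set-backend-prop` command shown above in a function and applies it to each server. The function name and the `DRY_RUN` switch are hypothetical; with `DRY_RUN=1`, the default here, the script only prints the commands it would run.

```shell
# Sketch: set the backend writability mode on each server in the example.
# DRY_RUN=1 prints the commands; set DRY_RUN=0 to run them.

DRY_RUN=${DRY_RUN:-1}
KEYSTORE=/path/to/opendj/config/keystore

# $1 = server install directory
# $2 = administrative port
# $3 = writability mode (internal-only or enabled)
set_writability() {
  # Rebuild the positional parameters as the full dsconfig command line.
  set -- "$1/bin/dsconfig" set-backend-prop \
    --backend-name dsEvaluation \
    --set "writability-mode:$3" \
    --hostname localhost --port "$2" \
    --bindDN uid=admin --bindPassword password \
    --usePkcs12TrustStore "$KEYSTORE" \
    --trustStorePassword:file "$KEYSTORE.pin" \
    --no-prompt
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Block external writes on both example servers before recovery.
set_writability /path/to/opendj 4444 internal-only
set_writability /path/to/replica 14444 internal-only
```

The same function restores normal operation after recovery when called with `enabled` as the mode.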

Recover the first directory server

DS uses the disaster recovery ID to set the generation ID, an internal, shorthand form of the initial replication state. Replication only works when the data for the base DN share the same generation ID on each server.

There are two approaches to using the dsrepl disaster-recovery command. Use one or the other:

(Recommended) Let DS generate the disaster recovery ID on a first replica. Use the generated ID on all other servers you recover.

When you use the generated ID, the dsrepl disaster-recovery command verifies each server you recover has the same initial replication state as the first server.

Use the recovery ID of your choice on all servers.

Don’t use this approach if the replication topology includes one or more standalone replication servers. It won’t work.

This approach works when you can’t define a "first" replica, for example, because you’ve automated the recovery process in an environment where the order of recovery is not deterministic.

When you choose the recovery ID, the dsrepl disaster-recovery command doesn’t verify the data match. The command uses your ID as the random seed when calculating the new generation ID. For the new generation IDs to match, your process must have restored the same data on each server. Otherwise, replication won’t work between servers whose data does not match.

If you opt for this approach, skip these steps. Instead, proceed to Recover remaining servers.

Don’t mix the two approaches in the same disaster recovery procedure. Use the generated recovery ID or the recovery ID of your choice, but do not use both.

This process generates the disaster recovery ID to use when recovering the other servers.

Stop the directory server you use to start the recovery process:
```
$ /path/to/opendj/bin/stop-ds
```

Restore the affected data on this directory server:

```
$ /path/to/opendj/bin/dsbackup \
 restore \
 --offline \
 --backendName dsEvaluation \
 --backupLocation /path/to/opendj/bak
```

Changes to the affected data that happened after the backup are lost. Use the most recent backup files prior to the disaster.

This approach to restoring data works in deployments with the same DS server version. When all DS servers share the same DS version, you can restore all the DS directory servers from the same backup data.

Backup archives are not guaranteed to be compatible across major and minor server releases. Restore backups only on directory servers with the same major and minor version as the server that created them.

Run the command to begin the disaster recovery process.

When this command completes successfully, it displays the disaster recovery ID:
```
$ /path/to/opendj/bin/dsrepl \
 disaster-recovery \
 --baseDn dc=example,dc=com \
 --generate-recovery-id \
 --no-prompt
Disaster recovery id: <generatedId>
```
Record the <generatedId>. You will use it to recover all other servers.

Start the recovered server:
```
$ /path/to/opendj/bin/start-ds
```
Test that the data you restored is what you expect.

Start backing up the recovered directory data.

As explained in New backup after recovery, you can no longer rely on pre-recovery backup data after disaster recovery.

Allow external applications to make changes to directory data again:

```
$ /path/to/opendj/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:enabled \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
```

You have recovered this replica and begun to bring the service back online. For replication with the other servers to resume, recover the remaining servers.
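If you script the first recovery, you can capture the generated ID instead of copying it by hand. The following sketch parses the "Disaster recovery id:" line displayed by the command in the steps above. The function name is hypothetical, and the sketch assumes the ID is the remainder of that line.

```shell
# Sketch: extract the generated ID from dsrepl disaster-recovery output.
# Reads the command output on stdin; prints the ID, or fails if none is found.
extract_recovery_id() {
  id=$(sed -n 's/^Disaster recovery id: *//p' | head -n 1)
  [ -n "$id" ] && printf '%s\n' "$id"
}

# Example use when recovering the first server (not run here):
#   DR_ID=$(/path/to/opendj/bin/dsrepl disaster-recovery \
#     --baseDn dc=example,dc=com --generate-recovery-id --no-prompt \
#     | extract_recovery_id) || exit 1
```

Because the function fails when the line is missing, a calling script can stop before it tries to recover other servers with an empty ID.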

Recover remaining servers

Make sure you have a disaster recovery ID. Use the same ID for all DS servers in this recovery procedure:

(Recommended) If you generated the ID as described in Recover the first directory server, use it.
If not, use a unique ID of your choosing for this recovery procedure.

For example, you could use the date at the time you begin the procedure.
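A date-based ID can be generated once and shared with every server you recover. The following sketch builds one from the UTC time. The `dr-` prefix, the timestamp layout, and the function name are hypothetical choices for illustration. Check that the resulting string is acceptable to the command in your DS version.

```shell
# Sketch: derive a recovery ID from the UTC start time of the procedure.
# The only requirements stated in this guide are that the ID is unique to
# this recovery and that every server uses the same value.
make_recovery_id() {
  printf 'dr-%s\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Generate the ID once, then reuse the same value on every server.
DR_ID=$(make_recovery_id)
echo "$DR_ID"
```

Generate the ID in one place, such as the orchestration host, and distribute it. Calling the function separately on each server would produce different IDs.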

You can perform this procedure in parallel on all remaining servers or on one server at a time. For each server:

Stop the server:
```
$ /path/to/replica/bin/stop-ds
```

Unless the server is a standalone replication server, restore the affected data:

```
$ /path/to/replica/bin/dsbackup \
 restore \
 --offline \
 --backendName dsEvaluation \
 --backupLocation /path/to/opendj/bak
```

Run the recovery command.

The following command uses a generated ID. It verifies this server’s data matches the first server you recovered:
```
$ export DR_ID=<generatedId>
$ /path/to/replica/bin/dsrepl \
 disaster-recovery \
 --baseDn dc=example,dc=com \
 --generated-id ${DR_ID} \
 --no-prompt
```
If the recovery ID is a unique ID of your choosing, use dsrepl disaster-recovery --baseDn <base-dn> --user-generated-id <recoveryId> instead. This alternative doesn’t verify that the data on each replica matches, and it won’t work if the replication topology includes one or more standalone replication servers.

Start the recovered server:
```
$ /path/to/replica/bin/start-ds
```
If this is a directory server, test that the data you restored is what you expect.

If this is a directory server, allow external applications to make changes to directory data again:

```
$ /path/to/replica/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:enabled \
 --hostname localhost \
 --port 14444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
```

After completing these steps for all servers, you have restored the directory service and recovered from the disaster.
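The per-server steps above can be sketched as a single function for use in a recovery script. This is an illustration under stated assumptions rather than a supported tool: the function names and the `DRY_RUN` switch are hypothetical, the commands and paths are those of this example, and the sketch covers only the generated-ID approach. With `DRY_RUN=1`, the default here, it prints the commands in order without running them.

```shell
# Sketch: recover one remaining server with the commands from the steps above.
DRY_RUN=${DRY_RUN:-1}
BACKUP_DIR=/path/to/opendj/bak

# Print the command in dry-run mode; otherwise run it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# $1 = server install directory
# $2 = generated disaster recovery ID from the first server
# $3 = "yes" for a directory server, "no" for a standalone replication server
recover_server() {
  run "$1/bin/stop-ds" || return 1
  if [ "$3" = yes ]; then
    # Standalone replication servers hold no directory data to restore.
    run "$1/bin/dsbackup" restore --offline \
      --backendName dsEvaluation --backupLocation "$BACKUP_DIR" || return 1
  fi
  run "$1/bin/dsrepl" disaster-recovery \
    --baseDn dc=example,dc=com --generated-id "$2" --no-prompt || return 1
  run "$1/bin/start-ds"
}

recover_server /path/to/replica "<generatedId>" yes
```

The function stops at the first failing command, so a server is not restarted if the restore or recovery step fails. Testing the restored data and re-enabling writes remain separate steps.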

Validation

After recovering from the disaster, validate replication works as expected. Use the following steps as a simple guide.

Modify an entry on one replica.

The following command updates Babs Jensen’s description to Post recovery:

```
$ /path/to/opendj/bin/ldapmodify \
 --hostname localhost \
 --port 1636 \
 --useSsl \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --bindDn uid=bjensen,ou=People,dc=example,dc=com \
 --bindPassword hifalutin <<EOF
dn: uid=bjensen,ou=People,dc=example,dc=com
changetype: modify
replace: description
description: Post recovery
EOF
# MODIFY operation successful for DN uid=bjensen,ou=People,dc=example,dc=com
```

Read the modified entry on another replica:

```
$ /path/to/replica/bin/ldapsearch \
 --hostname localhost \
 --port 11636 \
 --useSsl \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --bindDN uid=bjensen,ou=People,dc=example,dc=com \
 --bindPassword hifalutin \
 --baseDn dc=example,dc=com \
 "(cn=Babs Jensen)" \
 description
dn: uid=bjensen,ou=People,dc=example,dc=com
description: Post recovery
```

You have shown the recovery procedure succeeded.
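To automate this validation, a script can compare the value read from the second replica with the value written to the first. The following sketch checks LDIF output like that shown above. The function name is hypothetical, and the sketch assumes the value appears as plain text on a `description:` line rather than base64-encoded on a `description::` line.

```shell
# Sketch: check that LDIF read on stdin carries the expected description.
# $1 = expected description value
description_matches() {
  actual=$(sed -n 's/^description: //p' | head -n 1)
  [ "$actual" = "$1" ]
}

# Example with LDIF matching the search result above.
if printf 'dn: uid=bjensen,ou=People,dc=example,dc=com\ndescription: Post recovery\n' \
    | description_matches "Post recovery"; then
  echo "replication check passed"
else
  echo "replication check failed"
fi
```

In a real check, pipe the output of the ldapsearch command into `description_matches`, and allow a short delay after the modification for replication to deliver the change.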

Before deployment

When planning to deploy disaster recovery procedures, take these topics into account.

Recover before the purge delay

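When recovering from backup, you must complete the recovery procedure while the backup is newer than the replication purge delay. In other words, the time elapsed since the backup was taken must be shorter than the purge delay when you finish recovering the last server.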

If this is not possible for all servers, recreate the remaining servers from scratch after recovering as many servers as possible and taking a new backup.
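A recovery script can check the age of the backup before it starts. The following sketch compares the backup age with the purge delay. The function name is hypothetical, and the three-day value is an assumption; read the actual purge delay from your replication configuration.

```shell
# Sketch: decide whether a backup is still newer than the purge delay.
# $1 = backup completion time, in seconds since the epoch
# $2 = current time, in seconds since the epoch
# $3 = replication purge delay, in seconds
backup_usable() {
  [ $(( $2 - $1 )) -lt "$3" ]
}

# Assumption: a purge delay of three days. Check your configuration.
PURGE_DELAY=$((3 * 24 * 60 * 60))

now=$(date +%s)
backup_time=$((now - 24 * 60 * 60))   # hypothetical: backup finished a day ago
if backup_usable "$backup_time" "$now" "$PURGE_DELAY"; then
  echo "backup is within the purge delay"
else
  echo "backup is too old for disaster recovery"
fi
```

Leave a margin for the time the recovery itself takes, because the backup must still be newer than the purge delay when the last server is recovered.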

New backup after recovery

Disaster recovery resets the replication generation ID to a different format than you get when importing new directory data.

After disaster recovery, you can no longer use existing backup files for the recovered base DN. Directory servers can only replicate data under a base DN with directory servers having the same generation ID. The old backups no longer have the right generation IDs.

Instead, immediately after recovery, back up data from the recovered base DN and use the new backups going forward.

You can purge older backup files to prevent someone accidentally restoring from a backup with an outdated generation ID.

Change notifications reset

Disaster recovery clears the changelog for the recovered base DN.

If you use change number indexing for the recovered base DN, disaster recovery resets the change number.

Standalone servers

If you have standalone replication servers and directory servers, you might not want to recover them all at once.

Instead, in each region, alternate between recovering a standalone directory server then a standalone replication server to reduce the time to recovery.
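The alternating order can be computed from two lists of servers. The following sketch interleaves them, starting with a directory server. The function name and the server names are hypothetical, and the lists are plain space-separated strings.

```shell
# Sketch: print a recovery order that alternates between directory servers
# and replication servers in one region.
# $1 = space-separated standalone directory servers
# $2 = space-separated standalone replication servers
alternate_order() {
  ds_list=$1
  rs_list=$2
  while [ -n "$ds_list" ] || [ -n "$rs_list" ]; do
    if [ -n "$ds_list" ]; then
      # Take the first directory server and keep the rest for later.
      set -- $ds_list; echo "$1"; shift; ds_list=$*
    fi
    if [ -n "$rs_list" ]; then
      # Take the first replication server and keep the rest for later.
      set -- $rs_list; echo "$1"; shift; rs_list=$*
    fi
  done
}

alternate_order "ds1 ds2 ds3" "rs1 rs2"
```

When one list is longer than the other, the remaining servers follow in their original order.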

Reference material

| Reference | Description |
| --- | --- |
| About replication | In-depth introduction to replication concepts |
| Backup and restore | The basics, plus backing up to the cloud and using filesystem snapshots |
| Cryptographic keys | About keys, including those for encrypting and decrypting backup files |
| Data storage | Details about exporting and importing LDIF, common data stores |