Disaster recovery
Directory services are critical to authentication, session management, authorization, and more. When directory services are broken, quick recovery is a must.
In DS directory services, a disaster is a serious data problem affecting the entire replication topology. Replication can’t help you recover from a disaster because it replays data changes everywhere.
Disaster recovery comes with a service interruption, the loss of recent changes, and a reset for replication. It is rational in the event of a real disaster. It’s unnecessary to follow the disaster recovery procedure for a hardware failure or a server that’s been offline too long and needs reinitialization. Even if you lose most of your DS servers, you can still rebuild the service without interruption or data loss.
For disaster recovery to be quick, you must prepare in advance. Don’t go to production until you have successfully tested your disaster recovery procedures.
The following example helps you prepare to recover from a disaster. It shows the following tasks:
-
Back up a DS directory service.
-
Restore the service to a known state.
-
Validate the procedure.
Tasks
The following tasks demonstrate a disaster recovery procedure on a single computer with two replicated DS servers set up for evaluation.
In deployment, the procedure involves multiple computers, but the order and content of the tasks remain the same. Before you perform the procedure in production, make sure you have copies of the following:
-
The deployment description, documentation, plans, runbooks, and scripts.
-
The system configuration and software, including the Java installation.
-
The DS software and any customizations, plugins, or extensions.
-
A recent backup of any external secrets required, such as an HSM or a CA key.
-
A recent backup of each server’s configuration files, matching the production configuration.
-
The deployment ID and password.
This procedure applies to DS versions providing the dsrepl disaster-recovery command. For deployments with any earlier DS servers that don’t provide the command, you can’t use this procedure. Instead, refer to How do I perform disaster recovery steps in DS?
Disaster recovery has these characteristics:
-
You perform disaster recovery on a stopped server, one server at a time.
-
Disaster recovery is per base DN, like replication.
-
On each server you recover, you use the same disaster recovery ID, a unique identifier for this recovery.
To minimize the service interruption, this example recovers the servers one by one. It is also possible to perform disaster recovery in parallel by stopping and starting all servers together.
Task 1: Back up directory data
Back up data while the directory service is running smoothly. For additional details, refer to Backup and restore.
-
Back up the directory data.
The following command backs up directory data created for evaluation:
$ /path/to/opendj/bin/dsbackup \
 create \
 --start 0 \
 --backupLocation /path/to/opendj/bak \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin
The command returns, and the DS server runs the backup task in the background.
When adapting the recovery process for deployment, schedule a backup task to run regularly for each database backend.
-
Check the backup task finishes successfully:
$ /path/to/opendj/bin/manage-tasks \
 --summary \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
The status of the backup task is "Completed successfully" when it is done.
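When scripting this check, one option is to filter the task summary rather than read it by eye. The following is a minimal sketch, not part of the DS tooling. It assumes the summary prints each task on one line containing both the task ID and its status; the helper name backup_task_completed is hypothetical.

```shell
# Return 0 if the task summary on stdin shows the given task ID
# with the status "Completed successfully", non-zero otherwise.
# Assumption: the task ID and its status appear on the same line.
backup_task_completed() {
    task_id="$1"
    grep -F -- "$task_id" | grep -q -F "Completed successfully"
}

# Hypothetical usage against a running server (not executed here):
#   /path/to/opendj/bin/manage-tasks --summary <connection options> --no-prompt \
#     | backup_task_completed "<taskId>"
```

A wrapper script can call the helper in a loop with a short sleep until it returns 0 or a timeout expires.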
Recovery from disaster means stopping the directory service and losing the latest changes. The more recent the backup, the fewer changes you lose during recovery. Backup operations are cumulative, so you can schedule them regularly without using too much disk space as long as you purge outdated backup files. As you script your disaster recovery procedures for deployment, schedule a recurring backup task to have safe, current, and complete backup files for each backend.
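As a starting point for scheduling, the sketch below wraps the backup command from this task in a function that requests a recurring task. The --recurringTask option and its crontab-style schedule are assumptions not shown in this page; confirm them with dsbackup create --help for your DS version. The run helper, the DRY_RUN and DS_HOME variables, and the 02:00 schedule are hypothetical choices for illustration.

```shell
# Print commands instead of running them unless DRY_RUN is set to 0.
DRY_RUN="${DRY_RUN:-1}"
DS_HOME="${DS_HOME:-/path/to/opendj}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Ask the server to run a backup task every night at 02:00.
# Assumption: --recurringTask takes a crontab-style schedule.
schedule_nightly_backup() {
    run "$DS_HOME/bin/dsbackup" create \
        --recurringTask "00 02 * * *" \
        --backupLocation "$DS_HOME/bak" \
        --hostname localhost \
        --port 4444 \
        --bindDN uid=admin \
        --bindPassword password \
        --usePkcs12TrustStore "$DS_HOME/config/keystore" \
        --trustStorePassword:file "$DS_HOME/config/keystore.pin"
}

# In dry-run mode, this only prints the command line.
schedule_nightly_backup
```

Reviewing the printed command in dry-run mode before setting DRY_RUN=0 helps avoid scheduling a task with the wrong location or schedule.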
Task 2: Recover from a disaster
This task restores the directory data from backup files created before the disaster. Adapt this procedure as necessary if you have multiple directory backends to recover.
All changes since the last backup operation are lost.
Subtasks:
Prepare for recovery
-
If you have lost DS servers, replace them with servers configured as before the disaster.
In this example, no servers were lost. Reuse the existing servers.
-
On each replica, prevent applications from making changes to the backend for the affected base DN. Changes made during recovery would be lost or could not be replicated:
$ /path/to/opendj/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:internal-only \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
$ /path/to/replica/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:internal-only \
 --hostname localhost \
 --port 14444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
In this example, the first server’s administrative port is 4444. The second server’s administrative port is 14444.
Recover the first directory server
DS uses the disaster recovery ID to set the generation ID, an internal, shorthand form of the initial replication state. Replication only works when the data for the base DN share the same generation ID on each server. There are two approaches to using the dsrepl disaster-recovery command: let DS generate the recovery ID, or supply a unique recovery ID of your choice.
Don’t mix the two approaches in the same disaster recovery procedure. Use the generated recovery ID or the recovery ID of your choice, but do not use both.
This process generates the disaster recovery ID to use when recovering the other servers.
-
Stop the directory server you use to start the recovery process:
$ /path/to/opendj/bin/stop-ds
-
Restore the affected data on this directory server:
$ /path/to/opendj/bin/dsbackup \
 restore \
 --offline \
 --backendName dsEvaluation \
 --backupLocation /path/to/opendj/bak
Changes to the affected data that happened after the backup are lost. Use the most recent backup files prior to the disaster.
This approach to restoring data works in deployments with the same DS server version. When all DS servers share the same DS version, you can restore all the DS directory servers from the same backup data.
Backup archives are not guaranteed to be compatible across major and minor server releases. Restore backups only on directory servers of the same major and minor version.
-
Run the command to begin the disaster recovery process.
When this command completes successfully, it displays the disaster recovery ID:
$ /path/to/opendj/bin/dsrepl \
 disaster-recovery \
 --baseDn dc=example,dc=com \
 --generate-recovery-id \
 --no-prompt
Disaster recovery id: <generatedId>
Record the <generatedId>. You will use it to recover all other servers.
-
Start the recovered server:
$ /path/to/opendj/bin/start-ds
-
Test the data you restored is what you expect.
-
Start backing up the recovered directory data.
As explained in New backup after recovery, you can no longer rely on pre-recovery backup data after disaster recovery.
-
Allow external applications to make changes to directory data again:
$ /path/to/opendj/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:enabled \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
You have recovered this replica and begun to bring the service back online. To enable replication with other servers to resume, recover the remaining servers.
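If you script the procedure, you can capture the generated recovery ID instead of copying it by hand. The following sketch assumes only the output line shown in this section, Disaster recovery id: <generatedId>, and that it is written to standard output. The helper name extract_recovery_id is hypothetical.

```shell
# Read dsrepl output on stdin and print only the recovery ID.
# Lines that do not start with "Disaster recovery id:" are ignored.
extract_recovery_id() {
    sed -n 's/^Disaster recovery id: *//p'
}

# Hypothetical usage on the first server (not executed here):
#   DR_ID="$(/path/to/opendj/bin/dsrepl disaster-recovery \
#       --baseDn dc=example,dc=com --generate-recovery-id --no-prompt \
#       | extract_recovery_id)"
#   [ -n "$DR_ID" ] || { echo "No recovery ID found" >&2; exit 1; }
```

Checking that the captured value is non-empty before continuing keeps a failed first step from propagating to the other servers.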
Recover remaining servers
Make sure you have a disaster recovery ID. Use the same ID for all DS servers in this recovery procedure.
You can perform this procedure in parallel on all remaining servers or on one server at a time. For each server:
-
Stop the server:
$ /path/to/replica/bin/stop-ds
-
Unless the server is a standalone replication server, restore the affected data:
$ /path/to/replica/bin/dsbackup \
 restore \
 --offline \
 --backendName dsEvaluation \
 --backupLocation /path/to/opendj/bak
-
Run the recovery command.
The following command uses a generated ID. It verifies this server’s data matches the first server you recovered:
$ export DR_ID=<generatedId>
$ /path/to/replica/bin/dsrepl \
 disaster-recovery \
 --baseDn dc=example,dc=com \
 --generated-id ${DR_ID} \
 --no-prompt
If the recovery ID is a unique ID of your choosing, use dsrepl disaster-recovery --baseDn <base-dn> --user-generated-id <recoveryId> instead. This alternative doesn’t verify that the data on each replica matches, and it won’t work if the replication topology includes one or more standalone replication servers.
-
Start the recovered server:
$ /path/to/replica/bin/start-ds
-
If this is a directory server, test the data you restored is what you expect.
-
If this is a directory server, allow external applications to make changes to directory data again:
$ /path/to/replica/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:enabled \
 --hostname localhost \
 --port 14444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
After completing these steps for all servers, you have restored the directory service and recovered from the disaster.
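The per-server steps above can be sketched as a single function that runs the stop, restore, recover, and start commands in order and stops at the first failure. This is an illustration under stated assumptions, not a definitive script: the function name recover_replica, the run helper, and the DRY_RUN variable are hypothetical, and the sketch covers only directory servers recovered with a generated ID. It does not skip the restore for standalone replication servers, and it leaves testing the data and re-enabling writes to you.

```shell
# Print commands instead of running them unless DRY_RUN is set to 0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# recover_replica <server-home> <recovery-id> <backup-location>
# Each step runs only if the previous one succeeded.
recover_replica() {
    server_home="$1"
    dr_id="$2"
    backup_location="$3"
    run "$server_home/bin/stop-ds" &&
    run "$server_home/bin/dsbackup" restore --offline \
        --backendName dsEvaluation --backupLocation "$backup_location" &&
    run "$server_home/bin/dsrepl" disaster-recovery \
        --baseDn dc=example,dc=com --generated-id "$dr_id" --no-prompt &&
    run "$server_home/bin/start-ds"
}

# In dry-run mode, this prints the four commands for the second server.
recover_replica /path/to/replica "<generatedId>" /path/to/opendj/bak
```

To recover several servers, call the function once per server home with the same recovery ID.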
Validation
After recovering from the disaster, validate replication works as expected. Use the following steps as a simple guide.
-
Modify an entry on one replica.
The following command updates Babs Jensen’s description to Post recovery:
$ /path/to/opendj/bin/ldapmodify \
 --hostname localhost \
 --port 1636 \
 --useSsl \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --bindDn uid=bjensen,ou=People,dc=example,dc=com \
 --bindPassword hifalutin <<EOF
dn: uid=bjensen,ou=People,dc=example,dc=com
changetype: modify
replace: description
description: Post recovery
EOF
# MODIFY operation successful for DN uid=bjensen,ou=People,dc=example,dc=com
-
Read the modified entry on another replica:
$ /path/to/replica/bin/ldapsearch \
 --hostname localhost \
 --port 11636 \
 --useSsl \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --bindDN uid=bjensen,ou=People,dc=example,dc=com \
 --bindPassword hifalutin \
 --baseDn dc=example,dc=com \
 "(cn=Babs Jensen)" \
 description
dn: uid=bjensen,ou=People,dc=example,dc=com
description: Post recovery
You have shown the recovery procedure succeeded.
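To automate the comparison, a script can extract the attribute value from the search output and compare it with the value written on the first replica. The helper below is a hypothetical sketch that assumes plain LDIF output like the example above. It does not handle base64-encoded values or folded lines.

```shell
# Print the value(s) of one attribute from LDIF on stdin.
# Matches only lines that start with "<attribute>: ".
ldif_attr() {
    awk -v attr="$1" '{
        prefix = attr ": "
        if (index($0, prefix) == 1) print substr($0, length(prefix) + 1)
    }'
}

# Hypothetical usage (not executed here):
#   value="$(/path/to/replica/bin/ldapsearch <options> | ldif_attr description)"
#   [ "$value" = "Post recovery" ] && echo "Replication works"
```

Because replication is asynchronous, a validation script may need to retry the read briefly before reporting a failure.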
Before deployment
When planning to deploy disaster recovery procedures, take these topics into account.
Recover before the purge delay
When recovering from backup, you must complete the recovery procedure while the backup is newer than the replication purge delay.
If this is not possible for all servers, recreate the remaining servers from scratch after recovering as many servers as possible and taking a new backup.
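A runbook can include a quick arithmetic check of the backup age against the purge delay before recovery starts. The sketch below is illustrative only: the function name backup_is_usable is hypothetical, the timestamps are seconds since the epoch, and the 3-day value is an assumed setting, so substitute the purge delay configured in your deployment.

```shell
# backup_is_usable <backup-epoch-seconds> <now-epoch-seconds> <purge-delay-seconds>
# Returns 0 when the backup is newer than the purge delay.
backup_is_usable() {
    [ $(( $2 - $1 )) -lt "$3" ]
}

# Assumed purge delay of 3 days, expressed in seconds.
PURGE_DELAY=$((3 * 24 * 60 * 60))

# Hypothetical usage with a known backup timestamp (not executed here):
#   backup_is_usable "$BACKUP_EPOCH" "$(date +%s)" "$PURGE_DELAY" \
#     || echo "Backup is older than the purge delay" >&2
```

The check compares against the current time only. Leave a margin for how long the recovery itself takes, since the procedure must finish before the delay expires.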
New backup after recovery
Disaster recovery resets the replication generation ID to a different format than you get when importing new directory data.
After disaster recovery, you can no longer use existing backup files for the recovered base DN. Directory servers can only replicate data under a base DN with directory servers having the same generation ID. The old backups no longer have the right generation IDs.
Instead, immediately after recovery, back up data from the recovered base DN and use the new backups going forward.
You can purge older backup files to prevent someone accidentally restoring from a backup with an outdated generation ID.
Change notifications reset
Disaster recovery clears the changelog for the recovered base DN.
If you use change number indexing for the recovered base DN, disaster recovery resets the change number.