Disaster recovery
Directory services are critical to authentication, session management, authorization, and more. When directory services are broken, quick recovery is a must.
In DS directory services, a disaster is a serious data problem affecting the entire replication topology. Replication can’t help you recover from a disaster because it replays data changes everywhere.
Disaster recovery comes with a service interruption, the loss of recent changes, and a reset for replication. It is rational in the event of a real disaster. It’s unnecessary to follow the disaster recovery procedure for a hardware failure or a server that’s been offline too long and needs reinitialization. Even if you lose most of your DS servers, you can still rebuild the service without a service interruption or data loss.
For disaster recovery to be quick, you must prepare in advance. Don't go to production until you have successfully tested your disaster recovery procedures.
Estimated time to complete: 30 minutes
In this use case, you:
Back up a DS directory service.
Simulate a disaster.
Restore the service to a known state.
Validate the procedure.
In completing this use case, you learn to:
Back up and restore directory data.
Restart cleanly from backup files to recover from a disaster.
Example scenario
Pat has learned how to install and configure replicated directory services and recognizes broken directory services could bring identity and access management services to a halt, too.
Pat understands replication protects directory services from single points of failure. However, what happens if a misbehaving application or a mistaken operator deletes all the user accounts, for example? Pat realizes replication replays the operations everywhere. In the case of an error like this, replication could amplify a big mistake into a system-wide disaster. (For smaller mistakes, refer to Recover from user error.)
Pat knows the pressure on the people maintaining directory services to recover quickly would be high. It would be better to plan for the problem in advance and to provide a scripted and tested response. No one under pressure should have to guess how to recover a critical service.
Pat decides to demonstrate a safe, scripted procedure for recovering from disaster:
Start with a smoothly running, replicated directory service.
Cause a "disaster" by deleting all the user accounts.
Recover from the disaster by restoring the data from a recent backup.
Verify the results.
Pat knows this procedure loses changes between the most recent backup operation and the disaster. Losing some changes is still better than a broken directory service. If Pat can discover the problem and repair it quickly, the procedure minimizes lost changes.
Before you start, bring yourself up to speed with Pat:
Pat is familiar with the command line and command-line scripting on the target operating system, a Linux distribution in this example. Pat uses shell scripts to automate administrative tasks.
Pat knows how to use basic LDAP commands, having worked examples to learn LDAP.
Pat has already scripted and automated the directory service installation and setup procedures. Pat already saves copies of the following items:
The deployment description, documentation, plans, runbooks, and scripts.
The system configuration and software, including the Java installation.
The DS software and any customizations, plugins, or extensions.
A recent backup of any external secrets required, such as an HSM or a CA key.
A recent backup of each server’s configuration files, matching the production configuration.
The deployment ID and password.
This example scenario focuses on the application and user data, not the directory setup and configuration. For simplicity, Pat chooses to demonstrate disaster recovery with two replicated DS servers set up for evaluation.
Pat has a basic understanding of DS replication, including how replication makes directory data eventually consistent.
Before you try this example, set up two replicated DS directory servers on your computer as described in Install DS and Learn replication.
Pat demonstrates this recovery procedure on a single computer. In deployment, the procedure involves multiple computers, but the order and content of the tasks remain the same.
This procedure applies to DS versions providing the dsrepl disaster-recovery command. For deployments with any earlier DS servers that don't provide the command, you can't use this procedure. Instead, refer to How do I perform disaster recovery steps in DS (All versions)?
You perform disaster recovery on a stopped server, one server at a time.
Disaster recovery is per base DN, like replication.
On each server you recover, you use the same disaster recovery ID, a unique identifier for this recovery.
To minimize the service interruption, this example recovers the servers one by one. It is also possible to perform disaster recovery in parallel by stopping and starting all servers together.
Task 1: Back up directory data
Back up data while the directory service is running smoothly.
Back up the directory data created for evaluation:
$ /path/to/opendj/bin/dsbackup \
 create \
 --start 0 \
 --backupLocation /path/to/opendj/bak \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin
The command returns, and the DS server runs the backup task in the background.
When adapting the recovery process for deployment, you will schedule a backup task to run regularly for each database backend.
Check the backup task finishes successfully:
$ /path/to/opendj/bin/manage-tasks \
 --summary \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
The status of the backup task is "Completed successfully" when it is done.
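A scripted procedure can't rely on someone reading the task summary. The following is a minimal sketch, not part of DS, of one way to wait for the task from a script. The function names task_completed and wait_for_task are hypothetical, and the sketch assumes only that the summary text contains the status string quoted above.

```shell
#!/bin/sh
# Sketch: wait until a task summary reports a successfully completed task.
# Assumption: the summary text contains the literal status
# "Completed successfully", as described above.

# Returns 0 if the text on stdin reports a successfully completed task.
task_completed() {
  grep -q "Completed successfully"
}

# Polls a command that prints the task summary until it reports success.
# $1: maximum number of attempts
# $2: seconds to sleep between attempts
# Remaining arguments: the command to run, such as the manage-tasks call.
wait_for_task() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@" | task_completed; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_task 30 10 followed by the manage-tasks command and its connection options polls up to 30 times, 10 seconds apart. The summary lists every task, so a stricter script would also match the ID of the backup task it started.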
Recovery from disaster means stopping the directory service and losing the latest changes. The more recent the backup, the fewer changes you lose during recovery. Backup operations are cumulative, so you can schedule them regularly without using too much disk space as long as you purge outdated backup files. As you script your disaster recovery procedures for deployment, schedule a recurring backup task to have safe, current, and complete backup files for each backend.
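As a starting point for the recurring backup, here is a hedged sketch. It assumes the dsbackup create command accepts a --recurringTask option with a crontab-style schedule; confirm this with dsbackup create --help for your DS version. The run and schedule_backup helpers and the nightly schedule are hypothetical choices for illustration.

```shell
#!/bin/sh
# Sketch: schedule a recurring backup task for the evaluation server.
# Assumption: dsbackup create supports --recurringTask <crontab schedule>.

# Prints the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1: crontab-style schedule, for example "0 2 * * *" for 02:00 every day.
schedule_backup() {
  run /path/to/opendj/bin/dsbackup \
    create \
    --recurringTask "$1" \
    --backupLocation /path/to/opendj/bak \
    --hostname localhost \
    --port 4444 \
    --bindDN uid=admin \
    --bindPassword password \
    --usePkcs12TrustStore /path/to/opendj/config/keystore \
    --trustStorePassword:file /path/to/opendj/config/keystore.pin
}
```

Run DRY_RUN=1 schedule_backup "0 2 * * *" first to review the command before sending it to a server.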
Task 2: Simulate a disaster
Delete all user entries in the evaluation backend:
$ /path/to/opendj/bin/ldapdelete \
 --deleteSubtree \
 --hostname localhost \
 --port 1636 \
 --useSsl \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --bindDN uid=admin \
 --bindPassword password \
 ou=people,dc=example,dc=com
This command takes a few seconds to remove over 100,000 user entries. It takes a few seconds more for replication to replay all the deletions on the other DS replica.
Why is this a disaster? Suppose you restore a DS replica from the backup to recreate the missing user entries. After the restore operation finishes, replication replays each deletion again, ensuring the user entries are gone from all replicas.
Although this example looks contrived, it is inspired by real-world outages. You cannot restore the entries permanently without a recovery procedure.
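To confirm the deletions reached both replicas, you can count the user entries each server returns. The following is a minimal sketch; count_entries is a hypothetical helper that counts entries in LDIF search output, and the search filter in the comment is an assumption about how to select user entries in the evaluation data.

```shell
#!/bin/sh
# Sketch: count the entries in LDIF read from stdin.
# Each LDIF entry starts with a line beginning "dn:".
count_entries() {
  grep -c '^dn:'
}

# Hypothetical usage against the first server (not run here):
#   /path/to/opendj/bin/ldapsearch \
#     --hostname localhost --port 1636 --useSsl \
#     --usePkcs12TrustStore /path/to/opendj/config/keystore \
#     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
#     --bindDN uid=admin --bindPassword password \
#     --baseDn dc=example,dc=com "(uid=*)" 1.1 | count_entries
# Repeat with /path/to/replica and port 11636 for the second server.
```

The same count is useful later in the procedure: after the restore, it should match the number of user entries you had before the disaster.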
Task 3: Recover from the disaster
This task restores the directory data from backup files created before the disaster. Adapt this procedure as necessary if you have multiple directory backends to recover.
All changes since the last backup operation are lost.
Prepare for recovery
If you have lost DS servers, replace them with servers configured as before the disaster.
In this example, no servers were lost. Reuse the existing servers.
On each replica, prevent applications from making changes to the backend for the affected base DN. Changes made during recovery would be lost or could not be replicated:
$ /path/to/opendj/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:internal-only \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt

$ /path/to/replica/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:internal-only \
 --hostname localhost \
 --port 14444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
In this example, the first server’s administrative port is 4444. The second server’s administrative port is 14444.
Recover the first directory server
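DS uses the disaster recovery ID to set the generation ID, an internal, shorthand form of the initial replication state. Replication only works when the data for the base DN share the same generation ID on each server. There are two approaches to using the dsrepl disaster-recovery command: let DS generate the recovery ID, as in this example, or supply a unique recovery ID of your choice.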
Don't mix the two approaches in the same disaster recovery procedure. Use the generated recovery ID or the recovery ID of your choice, but do not use both.
This process generates the disaster recovery ID to use when recovering the other servers.
Stop the directory server you use to start the recovery process:
$ /path/to/opendj/bin/stop-ds
Restore the affected data on this directory server:
$ /path/to/opendj/bin/dsbackup \
 restore \
 --offline \
 --backendName dsEvaluation \
 --backupLocation /path/to/opendj/bak
Changes to the affected data that happened after the backup are lost. Use the most recent backup files prior to the disaster.
This approach to restoring data works in deployments with the same DS server version. When all DS servers share the same DS version, you can restore all the DS directory servers from the same backup data.
Backup archives are not guaranteed to be compatible across major and minor server releases. Restore backups only on directory servers of the same major or minor version.
Run the command to begin the disaster recovery process.
When this command completes successfully, it displays the disaster recovery ID:
$ /path/to/opendj/bin/dsrepl \
 disaster-recovery \
 --baseDn dc=example,dc=com \
 --generate-recovery-id \
 --no-prompt

Disaster recovery id: <generatedId>
Record the <generatedId>. You will use it to recover all other servers.
Start the recovered server:
$ /path/to/opendj/bin/start-ds
Test the data you restored is what you expect.
Start backing up the recovered directory data.
As explained in New backup after recovery, you can no longer rely on pre-recovery backup data after disaster recovery.
Allow external applications to make changes to directory data again:
$ /path/to/opendj/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:enabled \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
You have recovered this replica and begun to bring the service back online. To enable replication with other servers to resume, recover the remaining servers.
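When you script this task, capture the generated recovery ID rather than copying it by hand. The following is a minimal sketch; parse_recovery_id is a hypothetical helper, and it assumes the command prints the ID on a line starting with "Disaster recovery id:" as in the output shown for the recovery command.

```shell
#!/bin/sh
# Sketch: extract the generated disaster recovery ID from command output.
# Assumption: the ID appears on a line of the form
#   Disaster recovery id: <generatedId>
parse_recovery_id() {
  sed -n 's/^Disaster recovery id: *//p' | head -n 1
}

# Hypothetical usage (not run here):
#   DR_ID=$(/path/to/opendj/bin/dsrepl disaster-recovery \
#     --baseDn dc=example,dc=com --generate-recovery-id --no-prompt \
#     | parse_recovery_id)
#   [ -n "$DR_ID" ] || { echo "No recovery ID found" >&2; exit 1; }
```

Stopping when the ID is empty keeps the script from continuing to the remaining servers without a valid recovery ID.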
Recover remaining servers
Make sure you have a disaster recovery ID. Use the same ID for all DS servers in this recovery procedure.
You can perform this procedure in parallel on all remaining servers or on one server at a time. For each server:
Stop the server:
$ /path/to/replica/bin/stop-ds
Unless the server is a standalone replication server, restore the affected data:
$ /path/to/replica/bin/dsbackup \
 restore \
 --offline \
 --backendName dsEvaluation \
 --backupLocation /path/to/opendj/bak
Run the recovery command.
The following command uses a generated ID. It verifies this server’s data match the first server you recovered:
$ export DR_ID=<generatedId>
$ /path/to/replica/bin/dsrepl \
 disaster-recovery \
 --baseDn dc=example,dc=com \
 --generated-id ${DR_ID} \
 --no-prompt
If the recovery ID is a unique ID of your choosing, use dsrepl disaster-recovery --baseDn <base-dn> --user-generated-id <recoveryId> instead. This alternative doesn’t verify the data on each replica match and won’t work if the replication topology includes one or more standalone replication servers.
Start the recovered server:
$ /path/to/replica/bin/start-ds
If this is a directory server, test the data you restored is what you expect.
If this is a directory server, allow external applications to make changes to directory data again:
$ /path/to/replica/bin/dsconfig \
 set-backend-prop \
 --backend-name dsEvaluation \
 --set writability-mode:enabled \
 --hostname localhost \
 --port 14444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt
After completing these steps for all servers, you have restored the directory service and recovered from the disaster.
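The per-server steps for a remaining directory server can be collected into one function, which is easier to test and repeat than a list of manual commands. The following is a minimal sketch using only the commands from this task; the run and recover_replica helpers are hypothetical, and the sketch covers a directory server recovered with a generated ID, not a standalone replication server.

```shell
#!/bin/sh
# Sketch: recover one remaining directory server with a generated recovery ID.

# Prints the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1: server installation directory, for example /path/to/replica
# $2: administrative port, for example 14444
# $3: disaster recovery ID generated on the first server
recover_replica() {
  local dir="$1" port="$2" dr_id="$3"
  run "$dir/bin/stop-ds" &&
  run "$dir/bin/dsbackup" restore --offline \
    --backendName dsEvaluation \
    --backupLocation /path/to/opendj/bak &&
  run "$dir/bin/dsrepl" disaster-recovery \
    --baseDn dc=example,dc=com \
    --generated-id "$dr_id" --no-prompt &&
  run "$dir/bin/start-ds" &&
  # In a real script, test the restored data here before allowing writes.
  run "$dir/bin/dsconfig" set-backend-prop \
    --backend-name dsEvaluation \
    --set writability-mode:enabled \
    --hostname localhost --port "$port" \
    --bindDN uid=admin --bindPassword password \
    --usePkcs12TrustStore /path/to/opendj/config/keystore \
    --trustStorePassword:file /path/to/opendj/config/keystore.pin \
    --no-prompt
}
```

Each step runs only if the previous one succeeded. With DRY_RUN=1, recover_replica /path/to/replica 14444 "$DR_ID" prints the five commands in order so you can review them before running the function for real.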
After recovering from the disaster, validate replication works as expected. Use the following steps as a simple guide.
Modify a user entry on one replica.
The following command updates Babs Jensen’s description to Post recovery:
$ /path/to/opendj/bin/ldapmodify \
 --hostname localhost \
 --port 1636 \
 --useSsl \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --bindDn uid=bjensen,ou=People,dc=example,dc=com \
 --bindPassword hifalutin <<EOF
dn: uid=bjensen,ou=People,dc=example,dc=com
changetype: modify
replace: description
description: Post recovery
EOF

# MODIFY operation successful for DN uid=bjensen,ou=People,dc=example,dc=com
Read the modified entry on another replica:
$ /path/to/replica/bin/ldapsearch \
 --hostname localhost \
 --port 11636 \
 --useSsl \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --bindDN uid=bjensen,ou=People,dc=example,dc=com \
 --bindPassword hifalutin \
 --baseDn dc=example,dc=com \
 "(cn=Babs Jensen)" \
 description

dn: uid=bjensen,ou=People,dc=example,dc=com
description: Post recovery
You have shown the recovery procedure succeeded.
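To automate this validation, a script can compare the value read from the second replica with the value written to the first. The following is a minimal sketch; ldif_attr is a hypothetical helper that assumes the attribute appears in the search output as a plain "name: value" line, not a base64-encoded "name:: value" line.

```shell
#!/bin/sh
# Sketch: print the values of one attribute from LDIF read on stdin.
# $1: attribute name exactly as it appears in the LDIF, for example description
ldif_attr() {
  sed -n "s/^$1: //p"
}

# Hypothetical usage (not run here), piping in the ldapsearch output:
#   value=$(... ldapsearch command ... | ldif_attr description)
#   [ "$value" = "Post recovery" ] || echo "Change not replicated yet" >&2
```

Because replication is eventually consistent, a script may need to repeat the search briefly before it treats a missing value as a failure.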
What’s next
Example scenario
With the plan for disaster recovery off to a good start, Pat’s next steps are to:
Develop tests and detailed procedures for recovering from a disaster in deployment.
Put in place backup plans for directory services.
The backup plans address more routine maintenance cases, such as a hardware failure or a server that has been offline too long, and keep the directory service running smoothly.
Document the procedures in the deployment runbook.
Explore further
This use case can serve as a template for DS test and production deployments. Adapt this example for deployment:
Back up files as a regularly scheduled task to ensure you always have a recent backup of each backend.
Regularly export the data to LDIF from at least one DS replica in case all backups are lost or corrupted. This LDIF serves as a last resort when you can’t recover the data from backup files.
Store the backup files remotely with multiple copies in different locations.
Purge old backup files to avoid filling up the disk space.
Be ready to restore each directory database backend.
Before deployment
When planning to deploy disaster recovery procedures, take these topics into account.
Recover before the purge delay
When recovering from backup, you must complete the recovery procedure while the backup is newer than the replication purge delay.
If this is not possible for all servers, recreate the remaining servers from scratch after recovering as many servers as possible and taking a new backup.
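A recovery script can check this condition before it starts. The following is a minimal sketch of the arithmetic; backup_within_purge_delay is a hypothetical helper, and the three-day purge delay in the usage comment is an assumption, so use the value configured in your deployment.

```shell
#!/bin/sh
# Sketch: check that a backup is newer than the replication purge delay.
# $1: backup creation time, in seconds since the epoch
# $2: current time, in seconds since the epoch
# $3: replication purge delay, in seconds
# Returns 0 if the backup age is strictly less than the purge delay.
backup_within_purge_delay() {
  [ $(( $2 - $1 )) -lt "$3" ]
}

# Hypothetical usage with an assumed purge delay of 3 days (259200 seconds):
#   backup_within_purge_delay "$backup_time" "$(date +%s)" 259200 \
#     || echo "Backup is older than the purge delay" >&2
```

The check should leave a margin for the time the recovery itself takes, because the procedure must finish, not just start, before the backup becomes too old.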
New backup after recovery
Disaster recovery resets the replication generation ID to a different format than you get when importing new directory data.
After disaster recovery, you can no longer use existing backup files for the recovered base DN. Directory servers can only replicate data under a base DN with directory servers having the same generation ID. The old backups no longer have the right generation IDs.
Instead, immediately after recovery, back up data from the recovered base DN and use the new backups going forward.
You can purge older backup files to prevent someone accidentally restoring from a backup with an outdated generation ID.
Change notifications reset
Disaster recovery clears the changelog for the recovered base DN.
If you use change number indexing for the recovered base DN, disaster recovery resets the change number.
