DS 7.5.0

Disaster recovery

Directory services are critical to authentication, session management, authorization, and more. When directory services are broken, quick recovery is a must.

In DS directory services, a disaster is a serious data problem affecting the entire replication topology. Replication can’t help you recover from a disaster because it replays data changes everywhere.

Disaster recovery comes with a service interruption, the loss of recent changes, and a reset for replication. Such drastic measures are justified only in the event of a real disaster. It’s unnecessary to follow the disaster recovery procedure for a hardware failure or a server that’s been offline too long and needs reinitialization. Even if you lose most of your DS servers, you can still rebuild the service without a service interruption or data loss.

For disaster recovery to be quick, you must prepare in advance.

Don’t go to production until you have successfully tested your disaster recovery procedures.

Description

Estimated time to complete: 30 minutes

In this use case, you:

  • Back up a DS directory service.

  • Simulate a disaster.

  • Restore the service to a known state.

  • Validate the procedure.

Goals

In completing this use case, you learn to:

  • Back up and restore directory data.

  • Restart cleanly from backup files to recover from a disaster.

Example scenario

Pat has learned how to install and configure replicated directory services and recognizes broken directory services could bring identity and access management services to a halt, too.

Pat understands replication protects directory services from single points of failure. However, what happens if a misbehaving application or a mistaken operator deletes all the user accounts, for example? Pat realizes replication replays the operations everywhere. In the case of an error like this, replication could amplify a big mistake into a system-wide disaster. (For smaller mistakes, refer to Recover from user error.)

Pat knows the pressure on the people maintaining directory services to recover quickly would be high. It would be better to plan for the problem in advance and to provide a scripted and tested response. No one under pressure should have to guess how to recover a critical service.

Pat decides to demonstrate a safe, scripted procedure for recovering from disaster:

  • Start with a smoothly running, replicated directory service.

  • Cause a "disaster" by deleting all the user accounts.

  • Recover from the disaster by restoring the data from a recent backup.

  • Verify the results.

Pat knows this procedure loses changes between the most recent backup operation and the disaster. Losing some changes is still better than a broken directory service. If Pat can discover the problem and repair it quickly, the procedure minimizes lost changes.

Prerequisites

Knowledge

Before you start, bring yourself up to speed with Pat:

  • Pat is familiar with the command line and command-line scripting on the target operating system, a Linux distribution in this example. Pat uses shell scripts to automate administrative tasks.

  • Pat knows how to use basic LDAP commands, having worked through examples to learn LDAP.

  • Pat has already scripted and automated the directory service installation and setup procedures. Pat already saves copies of the following items:

    • The deployment description, documentation, plans, runbooks, and scripts.

    • The system configuration and software, including the Java installation.

    • The DS software and any customizations, plugins, or extensions.

    • A recent backup of any external secrets required, such as an HSM or a CA key.

    • A recent backup of each server’s configuration files, matching the production configuration.

    • The deployment ID and password.

    This example scenario focuses on the application and user data, not the directory setup and configuration. For simplicity, Pat chooses to demonstrate disaster recovery with two replicated DS servers set up for evaluation.

  • Pat has a basic understanding of DS replication, including how replication makes directory data eventually consistent.

Actions

Before you try this example, set up two replicated DS directory servers on your computer as described in Install DS and Learn replication.

Tasks

Pat demonstrates this recovery procedure on a single computer. In deployment, the procedure involves multiple computers, but the order and content of the tasks remain the same.

This procedure applies to DS versions providing the dsrepl disaster-recovery command.

For deployments with any earlier DS servers that don’t provide the command, you can’t use this procedure. Instead, refer to How do I perform disaster recovery steps in DS (All versions)?

  • You perform disaster recovery on a stopped server, one server at a time.

  • Disaster recovery is per base DN, like replication.

  • On each server you recover, you use the same disaster recovery ID, a unique identifier for this recovery.

To minimize the service interruption, this example recovers the servers one by one. It is also possible to perform disaster recovery in parallel by stopping and starting all servers together.

Task 1: Back up directory data

Back up data while the directory service is running smoothly.

  1. Back up the directory data created for evaluation:

    $ /path/to/opendj/bin/dsbackup \
     create \
     --start 0 \
     --backupLocation /path/to/opendj/bak \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin

    The command returns, and the DS server runs the backup task in the background.

    When adapting the recovery process for deployment, you will schedule a backup task to run regularly for each database backend.
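
    For example, a recurring backup task can take a nightly backup. A sketch, reusing the evaluation setup; the crontab-style schedule is illustrative:

    $ /path/to/opendj/bin/dsbackup \
     create \
     --recurringTask "00 02 * * *" \
     --backupLocation /path/to/opendj/bak \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin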

  2. Check that the backup task finishes successfully:

    $ /path/to/opendj/bin/manage-tasks \
     --summary \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --no-prompt

    The status of the backup task is "Completed successfully" when it is done.

Recovery from disaster means stopping the directory service and losing the latest changes. The more recent the backup, the fewer changes you lose during recovery. Backup operations are cumulative, so you can schedule them regularly without using too much disk space as long as you purge outdated backup files. As you script your disaster recovery procedures for deployment, schedule a recurring backup task to have safe, current, and complete backup files for each backend.
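
For example, the dsbackup purge command can remove outdated backup files. This sketch keeps the three most recent backups; the retention count is illustrative:

    $ /path/to/opendj/bin/dsbackup \
     purge \
     --backupLocation /path/to/opendj/bak \
     --keepCount 3 \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin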

Task 2: Simulate a disaster

  1. Delete all user entries in the evaluation backend:

    $ /path/to/opendj/bin/ldapdelete \
     --deleteSubtree \
     --hostname localhost \
     --port 1636 \
     --useSsl \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --bindDN uid=admin \
     --bindPassword password \
     ou=people,dc=example,dc=com

    This command takes a few seconds to remove over 100,000 user entries. It takes a few seconds more for replication to replay all the deletions on the other DS replica.

Why is this a disaster? Suppose you restore a DS replica from the backup to recreate the missing user entries. After the restore operation finishes, replication replays each deletion again, ensuring the user entries are gone from all replicas.

Although this example looks contrived, it is inspired by real-world outages. You cannot restore the entries permanently without a recovery procedure.
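
You can confirm the "disaster" with a quick search on either replica. A sketch using the evaluation setup; once replication replays the deletions, no user entries remain:

    $ /path/to/opendj/bin/ldapsearch \
     --hostname localhost \
     --port 1636 \
     --useSsl \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --bindDN uid=admin \
     --bindPassword password \
     --baseDn dc=example,dc=com \
     "(uid=bjensen)"
    # The search returns no entries on any replica.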

Task 3: Recover from the disaster

This task restores the directory data from backup files created before the disaster. Adapt this procedure as necessary if you have multiple directory backends to recover.

All changes since the last backup operation are lost.

Subtasks:

Prepare for recovery

  1. If you have lost DS servers, replace them with servers configured as before the disaster.

    In this example, no servers were lost. Reuse the existing servers.

  2. On each replica, prevent applications from making changes to the backend for the affected base DN. Changes made during recovery would either be lost or fail to replicate:

    $ /path/to/opendj/bin/dsconfig \
     set-backend-prop \
     --backend-name dsEvaluation \
     --set writability-mode:internal-only \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --no-prompt
    
    $ /path/to/replica/bin/dsconfig \
     set-backend-prop \
     --backend-name dsEvaluation \
     --set writability-mode:internal-only \
     --hostname localhost \
     --port 14444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --no-prompt

    In this example, the first server’s administrative port is 4444. The second server’s administrative port is 14444.
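
    With more servers, the same change is easier to apply in a loop. A minimal sketch; any server’s dsconfig can target another server over its administration port, and the port list is illustrative:

    $ for PORT in 4444 14444; do
        /path/to/opendj/bin/dsconfig \
         set-backend-prop \
         --backend-name dsEvaluation \
         --set writability-mode:internal-only \
         --hostname localhost \
         --port ${PORT} \
         --bindDN uid=admin \
         --bindPassword password \
         --usePkcs12TrustStore /path/to/opendj/config/keystore \
         --trustStorePassword:file /path/to/opendj/config/keystore.pin \
         --no-prompt
      done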

Recover the first directory server

DS uses the disaster recovery ID to set the generation ID, an internal, shorthand form of the initial replication state. Replication only works when the data for the base DN share the same generation ID on each server.

There are two approaches to using the dsrepl disaster-recovery command. Use one or the other:

  • (Recommended) Let DS generate the disaster recovery ID on a first replica. Use the generated ID on all other servers you recover.

    When you use the generated ID, the dsrepl disaster-recovery command verifies each server you recover has the same initial replication state as the first server.

  • Use the recovery ID of your choice on all servers.

    Don’t use this approach if the replication topology includes one or more standalone replication servers. It won’t work.

    This approach works when you can’t define a "first" replica, for example, because you’ve automated the recovery process in an environment where the order of recovery is not deterministic.

    When you choose the recovery ID, the dsrepl disaster-recovery command doesn’t verify the data match. The command uses your ID as the random seed when calculating the new generation ID. For the new generation IDs to match, your process must have restored the same data on each server. Otherwise, replication won’t work between servers whose data does not match.

    If you opt for this approach, skip these steps. Instead, proceed to Recover remaining servers.

Don’t mix the two approaches in the same disaster recovery procedure. Use the generated recovery ID or the recovery ID of your choice, but do not use both.

This process generates the disaster recovery ID to use when recovering the other servers.

  1. Stop the directory server you use to start the recovery process:

    $ /path/to/opendj/bin/stop-ds
  2. Restore the affected data on this directory server:

    $ /path/to/opendj/bin/dsbackup \
     restore \
     --offline \
     --backendName dsEvaluation \
     --backupLocation /path/to/opendj/bak

    Changes to the affected data that happened after the backup are lost. Use the most recent backup files prior to the disaster.

    This approach to restoring data works when all the DS servers share the same server version: you can then restore all the DS directory servers from the same backup data.

    Backup archives are not guaranteed to be compatible across major and minor server releases. Restore backups only on directory servers of the same major and minor version.

  3. Run the command to begin the disaster recovery process.

    When this command completes successfully, it displays the disaster recovery ID:

    $ /path/to/opendj/bin/dsrepl \
     disaster-recovery \
     --baseDn dc=example,dc=com \
     --generate-recovery-id \
     --no-prompt
    Disaster recovery id: <generatedId>

    Record the <generatedId>. You will use it to recover all other servers.
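
    In a script, you can capture the generated ID instead of copying it by hand. A sketch that assumes the output format shown above:

    $ DR_ID=$(/path/to/opendj/bin/dsrepl \
       disaster-recovery \
       --baseDn dc=example,dc=com \
       --generate-recovery-id \
       --no-prompt | awk '/Disaster recovery id:/ {print $NF}')
    $ echo "${DR_ID}"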

  4. Start the recovered server:

    $ /path/to/opendj/bin/start-ds
  5. Test that the data you restored is what you expect.
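
    For example, check that a known entry is back. A sketch using the evaluation data; the "1.1" attribute list returns only the DN:

    $ /path/to/opendj/bin/ldapsearch \
     --hostname localhost \
     --port 1636 \
     --useSsl \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --bindDN uid=admin \
     --bindPassword password \
     --baseDn dc=example,dc=com \
     "(uid=bjensen)" \
     1.1
    dn: uid=bjensen,ou=People,dc=example,dc=com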

  6. Start backing up the recovered directory data.

    As explained in New backup after recovery, you can no longer rely on pre-recovery backup data after disaster recovery.
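
    For example, back up to a fresh location so post-recovery archives never mix with pre-recovery ones. The bak-post-recovery path is illustrative:

    $ /path/to/opendj/bin/dsbackup \
     create \
     --start 0 \
     --backupLocation /path/to/opendj/bak-post-recovery \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin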

  7. Allow external applications to make changes to directory data again:

    $ /path/to/opendj/bin/dsconfig \
     set-backend-prop \
     --backend-name dsEvaluation \
     --set writability-mode:enabled \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --no-prompt

You have recovered this replica and begun to bring the service back online. To let replication with the other servers resume, recover the remaining servers.

Recover remaining servers

Make sure you have a disaster recovery ID. Use the same ID for all DS servers in this recovery procedure:

  • (Recommended) If you generated the ID as described in Recover the first directory server, use it.

  • If not, use a unique ID of your choosing for this recovery procedure.

    For example, you could use the date at the time you begin the procedure.
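
    A UTC timestamp makes a convenient unique ID. A sketch:

    $ DR_ID=$(date -u +%Y%m%dT%H%M%SZ)
    $ echo "${DR_ID}"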

You can perform this procedure in parallel on all remaining servers or on one server at a time. For each server:

  1. Stop the server:

    $ /path/to/replica/bin/stop-ds
  2. Unless the server is a standalone replication server, restore the affected data:

    $ /path/to/replica/bin/dsbackup \
     restore \
     --offline \
     --backendName dsEvaluation \
     --backupLocation /path/to/opendj/bak
  3. Run the recovery command.

    The following command uses a generated ID. It verifies that this server’s data matches the data on the first server you recovered:

    $ export DR_ID=<generatedId>
    $ /path/to/replica/bin/dsrepl \
     disaster-recovery \
     --baseDn dc=example,dc=com \
     --generated-id ${DR_ID} \
     --no-prompt

    If the recovery ID is a unique ID of your choosing, use dsrepl disaster-recovery --baseDn <base-dn> --user-generated-id <recoveryId> instead. This alternative doesn’t verify that the data on each replica matches, and it won’t work if the replication topology includes one or more standalone replication servers.
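
    Formatted as a full command and reusing the DR_ID variable set earlier, that alternative looks like this:

    $ /path/to/replica/bin/dsrepl \
     disaster-recovery \
     --baseDn dc=example,dc=com \
     --user-generated-id ${DR_ID} \
     --no-prompt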

  4. Start the recovered server:

    $ /path/to/replica/bin/start-ds
  5. If this is a directory server, test that the data you restored is what you expect.

  6. If this is a directory server, allow external applications to make changes to directory data again:

    $ /path/to/replica/bin/dsconfig \
     set-backend-prop \
     --backend-name dsEvaluation \
     --set writability-mode:enabled \
     --hostname localhost \
     --port 14444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --no-prompt

After completing these steps for all servers, you have restored the directory service and recovered from the disaster.

Validation

After recovering from the disaster, validate replication works as expected. Use the following steps as a simple guide.

  1. Modify a user entry on one replica.

    The following command updates Babs Jensen’s description to Post recovery:

    $ /path/to/opendj/bin/ldapmodify \
     --hostname localhost \
     --port 1636 \
     --useSsl \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --bindDN uid=bjensen,ou=People,dc=example,dc=com \
     --bindPassword hifalutin <<EOF
    dn: uid=bjensen,ou=People,dc=example,dc=com
    changetype: modify
    replace: description
    description: Post recovery
    EOF
    # MODIFY operation successful for DN uid=bjensen,ou=People,dc=example,dc=com
  2. Read the modified entry on another replica:

    $ /path/to/replica/bin/ldapsearch \
     --hostname localhost \
     --port 11636 \
     --useSsl \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --bindDN uid=bjensen,ou=People,dc=example,dc=com \
     --bindPassword hifalutin \
     --baseDn dc=example,dc=com \
     "(cn=Babs Jensen)" \
     description
    dn: uid=bjensen,ou=People,dc=example,dc=com
    description: Post recovery

You have shown the recovery procedure succeeded.

What’s next

Example scenario

With the plan for disaster recovery off to a good start, Pat’s next steps are to:

  • Develop tests and detailed procedures for recovering from a disaster in deployment.

  • Put in place backup plans for directory services.

    The backup plans address more routine maintenance cases and keep the directory service running smoothly.

  • Document the procedures in the deployment runbook.

Explore further

This use case can serve as a template for DS test and production deployments. Adapt this example for deployment:

  • Back up files as a regularly scheduled task to ensure you always have a recent backup of each backend.

  • Regularly export the data to LDIF from at least one DS replica in case all backups are lost or corrupted. This LDIF serves as a last resort when you can’t recover the data from backup files; a sketch follows this list.

  • Store the backup files remotely with multiple copies in different locations.

  • Purge old backup files to avoid filling up the disk space.

  • Be ready to restore each directory database backend.
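
As a sketch of the LDIF export mentioned above, the following runs an export task on one replica. The file name is illustrative, and the option names assume the export-ldif task tool:

    $ /path/to/opendj/bin/export-ldif \
     --backendId dsEvaluation \
     --ldifFile /path/to/opendj/bak/export.ldif \
     --start 0 \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin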

Before deployment

When planning to deploy disaster recovery procedures, take these topics into account.

Recover before the purge delay

When recovering from backup, you must complete the recovery procedure while the backup is newer than the replication purge delay.

If this is not possible for all servers, recreate the remaining servers from scratch after recovering as many servers as possible and taking a new backup.
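
To know how much time you have, read the configured purge delay. A sketch, assuming the standard synchronization provider configuration:

    $ /path/to/opendj/bin/dsconfig \
     get-replication-server-prop \
     --provider-name "Multimaster Synchronization" \
     --property replication-purge-delay \
     --hostname localhost \
     --port 4444 \
     --bindDN uid=admin \
     --bindPassword password \
     --usePkcs12TrustStore /path/to/opendj/config/keystore \
     --trustStorePassword:file /path/to/opendj/config/keystore.pin \
     --no-prompt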

New backup after recovery

Disaster recovery resets the replication generation ID, and the resulting ID differs from the one you get when importing new directory data.

After disaster recovery, you can no longer use existing backup files for the recovered base DN. Directory servers can only replicate data under a base DN with other directory servers that have the same generation ID for that base DN. The old backups no longer have the right generation ID.

Instead, immediately after recovery, back up data from the recovered base DN and use the new backups going forward.

You can purge older backup files to prevent someone accidentally restoring from a backup with an outdated generation ID.

Change notifications reset

Disaster recovery clears the changelog for the recovered base DN.

If you use change number indexing for the recovered base DN, disaster recovery resets the change number.

Standalone servers

If you have standalone replication servers and directory servers, you might not want to recover them all at once.

Instead, in each region, alternate between recovering a standalone directory server and a standalone replication server to reduce the time to recovery.

Reference material

The following reference material relates to this use case:

  • In-depth introduction to replication concepts

  • Backup basics, plus backing up to the cloud and using filesystem snapshots

  • About keys, including those for encrypting and decrypting backup files

  • Details about exporting and importing LDIF, common data stores

  • Examples you can use when scripting installation procedures

  • Examples you can use when scripting server configuration

Copyright © 2010-2024 ForgeRock, all rights reserved.