HTTP-based monitoring

This page covers the HTTP interfaces for monitoring DS servers. For the same capabilities over LDAP, refer to LDAP-based monitoring.

DS servers publish monitoring information at these HTTP endpoints:

/alive

Whether the server is currently alive, meaning its internal checks have not found any errors that would require administrative action.

/healthy

Whether the server is currently healthy, meaning it is alive, the replication server is accepting connections on the configured port, and any replication delays are below the configured threshold.

/metrics/prometheus

Monitoring information in Prometheus monitoring software format. For details, refer to Prometheus metrics reference.

The following example command accesses the Prometheus endpoint:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus

To give a regular user privileges to read monitoring data, refer to Monitor privilege.

Basic availability

Server is alive (HTTP)

The following example reads the /alive endpoint anonymously. If the DS server’s internal tests do not find errors that require administrative action, then it returns HTTP 200 OK:

$ curl --cacert ca-cert.pem --head https://localhost:8443/alive

HTTP/1.1 200 OK
...

If the server finds that it is subject to errors requiring administrative action, it returns HTTP 503 Service Unavailable.

If there are errors, anonymous users receive only the 503 error status. Error strings for diagnosis are returned as an array of "alive-errors" in the response body, but the response body is only returned to a user with the monitor-read privilege.

When a server returns "alive-errors", diagnose and fix the problem, and then either restart or replace the server.

Server health (HTTP)

The following example reads the /healthy endpoint anonymously. If the DS server is alive, as described in Server is alive (HTTP), any replication listener threads are functioning normally, and any replication delay is below the threshold configured as max-replication-delay-health-check (default: 5 seconds), then it returns HTTP 200 OK:

$ curl --cacert ca-cert.pem --head https://localhost:8443/healthy

HTTP/1.1 200 OK
...

If the server is subject to a replication delay above the threshold, then it returns HTTP 503 Service Unavailable. This result only indicates a problem if the replication delay is steadily high and increasing for the long term.

If there are errors, anonymous users receive only the 503 error status. Error strings for diagnosis are returned as an array of "ready-errors" in the response body, but the response body is only returned to a user with the monitor-read privilege.

When a server returns "ready-errors", route traffic to another server until the current server is ready again.
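The different reactions to /alive and /healthy failures could be encoded in a probe or load balancer hook roughly as follows. This is a minimal sketch, not part of DS: the function name `probe_action` and the returned strings are hypothetical, and the function takes the HTTP status codes as inputs rather than performing the requests itself.

```python
def probe_action(alive_status: int, healthy_status: int) -> str:
    """Map the HTTP status codes of /alive and /healthy to a reaction.

    The function name and return values are illustrative assumptions.
    """
    if alive_status != 200:
        # /alive failed: the server found errors requiring administrative action.
        return "fix-then-restart-or-replace"
    if healthy_status != 200:
        # Alive but not healthy: send traffic elsewhere until it is ready again.
        return "route-traffic-elsewhere"
    return "serve-traffic"


print(probe_action(200, 200))  # serve-traffic
print(probe_action(200, 503))  # route-traffic-elsewhere
print(probe_action(503, 503))  # fix-then-restart-or-replace
```

Checking /alive first reflects that a server that is not alive cannot be healthy, so the more severe reaction takes precedence.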

Server health (Prometheus)

In addition to the HTTP status endpoints above, you can check whether a server is alive and able to handle requests by reading Prometheus metrics:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep health_status

# HELP ds_health_status_alive Indicates whether the server is alive
# TYPE ds_health_status_alive gauge
ds_health_status_alive 1.0
# HELP ds_health_status_healthy Indicates whether the server is able to handle requests
# TYPE ds_health_status_healthy gauge
ds_health_status_healthy 1.0
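A script that consumes this output might turn the two gauges into booleans. The following is a minimal sketch, assuming a value of 1.0 means "yes" and any other value means "no". The helper name `read_gauges` is hypothetical, and the sample text is copied from the output above.

```python
def read_gauges(metrics_text: str) -> dict:
    """Collect unlabeled samples from Prometheus text output, skipping comments."""
    gauges = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines, HELP/TYPE comments, and labeled samples.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        gauges[parts[0]] = float(parts[1])
    return gauges


sample = """\
# HELP ds_health_status_alive Indicates whether the server is alive
# TYPE ds_health_status_alive gauge
ds_health_status_alive 1.0
# HELP ds_health_status_healthy Indicates whether the server is able to handle requests
# TYPE ds_health_status_healthy gauge
ds_health_status_healthy 1.0
"""

gauges = read_gauges(sample)
alive = gauges.get("ds_health_status_alive") == 1.0      # assumption: 1.0 means alive
healthy = gauges.get("ds_health_status_healthy") == 1.0  # assumption: 1.0 means healthy
print(alive, healthy)  # True True
```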

Disk space (Prometheus)

The following example shows monitoring metrics you can use to check whether the server is running out of disk space:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep disk

# HELP ds_disk_free_space_bytes The amount of free disk space (in bytes)
# TYPE ds_disk_free_space_bytes gauge
ds_disk_free_space_bytes{disk="<partition>",} <bytes>
# HELP ds_disk_free_space_full_threshold_bytes The effective full disk space threshold (in bytes)
# TYPE ds_disk_free_space_full_threshold_bytes gauge
ds_disk_free_space_full_threshold_bytes{disk="<partition>",} <bytes>
# HELP ds_disk_free_space_low_threshold_bytes The effective low disk space threshold (in bytes)
# TYPE ds_disk_free_space_low_threshold_bytes gauge
ds_disk_free_space_low_threshold_bytes{disk="<partition>",} <bytes>

In your monitoring software, compare free space with the disk low and disk full thresholds. For database backends, these thresholds are set using the configuration properties: disk-low-threshold and disk-full-threshold.
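The comparison could look like the following sketch. The function name `disk_states`, the state labels, the `/data` partition, and the byte values in the sample are all made up for illustration. Treating "free space at or below the threshold" as crossing it is also an assumption.

```python
import re

# Matches the three disk metrics shown above, capturing name, disk label, and value.
SAMPLE_RE = re.compile(r'^(ds_disk_free_space\w*_bytes)\{disk="([^"]*)",?\}\s+(\S+)$')


def disk_states(metrics_text: str) -> dict:
    """Return "ok", "low", or "full" per disk from Prometheus text output."""
    values = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            name, disk, value = match.groups()
            values.setdefault(disk, {})[name] = float(value)

    states = {}
    for disk, v in values.items():
        free = v["ds_disk_free_space_bytes"]
        # Assumption: free space at or below a threshold counts as crossing it.
        if free <= v["ds_disk_free_space_full_threshold_bytes"]:
            states[disk] = "full"
        elif free <= v["ds_disk_free_space_low_threshold_bytes"]:
            states[disk] = "low"
        else:
            states[disk] = "ok"
    return states


# Hypothetical sample: the partition name and byte counts are invented.
sample = """\
ds_disk_free_space_bytes{disk="/data",} 5.0E9
ds_disk_free_space_full_threshold_bytes{disk="/data",} 1.0E9
ds_disk_free_space_low_threshold_bytes{disk="/data",} 6.0E9
"""
print(disk_states(sample))  # {'/data': 'low'}
```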

When you read from cn=monitor instead, as described in LDAP-based monitoring, the relevant data are exposed on child entries of cn=disk space monitor,cn=monitor.

Certificate expiration (Prometheus)

The following example shows how you can use monitoring metrics to check whether the server certificate is due to expire soon:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep cert

# HELP ds_certificates_certificate_expires_at_seconds Certificate expiration date and time
# TYPE ds_certificates_certificate_expires_at_seconds gauge
ds_certificates_certificate_expires_at_seconds{alias="ssl-key-pair",key_manager="PKCS12",} <sec_since_epoch>

In your monitoring software, compare the expiration date with the current date.
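Because the metric value is expressed in seconds since the epoch, the comparison is simple arithmetic. The following is a minimal sketch: the function names and the 30-day warning window are assumptions chosen for illustration, not DS defaults.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(expires_at_seconds: float, now_seconds: Optional[float] = None) -> float:
    """Days remaining before the certificate expires; negative if already expired."""
    if now_seconds is None:
        now_seconds = time.time()
    return (expires_at_seconds - now_seconds) / SECONDS_PER_DAY


def expires_soon(expires_at_seconds: float, now_seconds: Optional[float] = None,
                 warn_days: float = 30) -> bool:
    """True when the certificate expires within warn_days (hypothetical window)."""
    return days_until_expiry(expires_at_seconds, now_seconds) <= warn_days


# Made-up timestamps: "now" and an expiration exactly 10 days later.
now = 1_700_000_000
expires_at = now + 10 * SECONDS_PER_DAY
print(days_until_expiry(expires_at, now))  # 10.0
print(expires_soon(expires_at, now))       # True
```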

When you read from cn=monitor instead, as described in LDAP-based monitoring, the relevant data are exposed on child entries of cn=certificates,cn=monitor.

Activity

Active users (Prometheus)

DS server connection handlers respond to client requests. The following example uses the default monitor user account to read active connections on each connection handler:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep "active_[cp]"

Request statistics (Prometheus)

DS server connection handlers respond to client requests. The following example uses the default monitor user account to read statistics about client operations on each of the available connection handlers:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep connection_handlers

Work queue (Prometheus)

DS servers have a work queue to track request processing by worker threads, and whether the server has rejected any requests due to a full queue. If enough worker threads are available, then no requests are rejected. The following example uses the default monitor user account to read statistics about the work queue:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep work_queue

To adjust the number of worker threads, refer to the settings for Traditional Work Queue.

Counts

ACIs (Prometheus)

DS maintains counts of ACIs:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep _aci

Database size (Prometheus)

DS servers maintain counts of the number of entries in each backend. The following example uses the default monitor user account to read the counts:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep backend_entry_count

Entry caches (Prometheus)

DS servers maintain entry cache statistics:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep entry_cache

Groups (Prometheus)

The following example reads counts of static, dynamic, and virtual static groups, and statistics on the distribution of static group size:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep -i group

At startup time, DS servers log a message showing the number of different types of groups and the memory allocated to cache static groups.

Subentries (Prometheus)

DS maintains counts of LDAP subentries:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep subentries

Indexing

Index cost (Prometheus)

DS maintains metrics about index cost. The metrics count the number of updates and how long they took since the DS server started.

The following example demonstrates how to read the metrics for all monitored indexes:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep index_cost

Index use (Prometheus)

DS maintains metrics about index use. The metrics indicate how often an index was accessed since the DS server started.

The following example demonstrates how to read the metrics for all monitored indexes:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep index_uses

Replication

Monitor the following to ensure replication runs smoothly. Take action as described in these sections and in the troubleshooting documentation for replication problems.

Replication delay (Prometheus)

The following example reads a metric to check the delay in replication:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep receive_delay

# HELP ds_replication_replica_remote_replicas_receive_delay_seconds Current local delay in receiving replicated operations
# TYPE ds_replication_replica_remote_replicas_receive_delay_seconds gauge
ds_replication_replica_remote_replicas_receive_delay_seconds{<labels>} <delay>

DS replicas measure replication delay as the local delay when receiving and replaying changes. A replica calculates these delays only from the changes it has actually received from other replicas, so delay metrics become inaccurate during network outages.

A replica calculates delay metrics based on times reflecting the following events:

t₀: the remote replica records the change in its data
t₁: the remote replica sends the change to a replica server
t₂: the local replica receives the change from a replica server
t₃: the local replica applies the change to its data

[Figure: timeline of the replication events t₀ through t₃.]

Replication keeps track of changes using change sequence numbers (CSNs), opaque and unique identifiers for each change that indicate when and where each change first occurred. The tₙ values are CSNs.

When the CSNs for the last change received and the last change replayed are identical, the replica has applied all the changes it has received. In this case, there is no known delay. The receive and replay delay metrics are set to 0 (zero).

When the last received and last replayed CSNs differ:

Receive delay is set to the time t₂ - t₀ for the last change received. Another name for receive delay is current delay.
Replay delay is approximately t₃ - t₂ for the last change replayed. In other words, it is an approximation of how long it took the last change to be replayed.
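The two rules can be worked through with concrete numbers. The sketch below models each tₙ as a plain timestamp in seconds, which is a simplification: in DS the values are derived from CSNs. The `Change` type, the `delays` function, and the sample values are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Change(NamedTuple):
    csn: str                   # placeholder identifier standing in for a real CSN
    t0: float                  # remote replica records the change
    t2: float                  # local replica receives the change
    t3: Optional[float] = None # local replica applies the change, if replayed


def delays(last_received: Change, last_replayed: Change) -> Tuple[float, float]:
    """Return (receive_delay, replay_delay) in seconds."""
    if last_received.csn == last_replayed.csn:
        # Everything received has been applied: no known delay.
        return 0.0, 0.0
    receive_delay = last_received.t2 - last_received.t0
    replay_delay = last_replayed.t3 - last_replayed.t2
    return receive_delay, replay_delay


replayed = Change("csn-41", t0=100.0, t2=100.5, t3=100.75)
received = Change("csn-42", t0=101.0, t2=103.0)

print(delays(received, replayed))  # (2.0, 0.25)
print(delays(replayed, replayed))  # (0.0, 0.0)
```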

As long as replication delay tends toward zero regularly and over the long term, temporary spikes and increases in delay measurements are normal. When all replicas remain connected and yet replication delay remains high and increases over the long term, the high replication delay indicates a problem. Steadily high and increasing replication delay shows that replication is not converging, and the service is failing to achieve eventual consistency.
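One way to tell a temporary spike from a delay that is steadily high and increasing is to look at a window of samples rather than a single reading. The following is a rough sketch, not a DS feature: the function name, the choice of a least-squares slope, the threshold, and the sample values are all assumptions to adapt to your monitoring software.

```python
def delay_not_converging(samples, threshold_seconds):
    """Flag a window of (timestamp_seconds, delay_seconds) samples, oldest first.

    Returns True only when every sample exceeds the threshold AND the
    least-squares trend of delay over time is increasing.
    """
    if len(samples) < 2:
        return False
    steadily_high = all(delay > threshold_seconds for _, delay in samples)

    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_d = sum(d for _, d in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return False  # all samples share one timestamp; no trend to compute
    slope = sum((t - mean_t) * (d - mean_d) for t, d in samples) / denom

    return steadily_high and slope > 0


# Made-up windows: a spike that recovers, and a delay that keeps growing.
spike = [(0, 1.0), (60, 20.0), (120, 2.0), (180, 0.0)]
growing = [(0, 10.0), (60, 14.0), (120, 19.0), (180, 25.0)]

print(delay_not_converging(spike, 5.0))    # False
print(delay_not_converging(growing, 5.0))  # True
```

In practice the window should be long enough to cover the "long term" for your deployment, so that normal bursts of write activity do not trigger the check.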

For a current snapshot of replication delays, you can also use the dsrepl status command. For details, refer to Replication status.

Replication status (Prometheus)

The following example checks the replication status metrics:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep replica_status

The effective replica status is the gauge whose value is 1.0. For example, this output shows normal status:

ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="BAD_DATA",} 0.0
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="DEGRADED",} 0.0
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="FULL_UPDATE",} 0.0
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="INVALID",} 0.0
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="NORMAL",} 1.0
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="NOT_CONNECTED",} 0.0
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="TOO_LATE",} 0.0

The DEGRADED status is for backwards compatibility only.
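To extract the effective status programmatically, select the sample whose value is 1.0 for each replica. The following is a minimal sketch: the function name `effective_statuses` is hypothetical, and the sample lines are a subset of the output above.

```python
import re

STATUS_RE = re.compile(
    r'^ds_replication_replica_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def effective_statuses(metrics_text: str) -> dict:
    """Map (domain_name, server_id) to the status whose gauge value is 1.0."""
    result = {}
    for line in metrics_text.splitlines():
        match = STATUS_RE.match(line.strip())
        if match and float(match.group("value")) == 1.0:
            # Label values may contain commas, so parse the quoted pairs.
            labels = dict(LABEL_RE.findall(match.group("labels")))
            result[(labels["domain_name"], labels["server_id"])] = labels["status"]
    return result


sample = """\
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="BAD_DATA",} 0.0
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="NORMAL",} 1.0
ds_replication_replica_status{domain_name="dc=example,dc=com",server_id="evaluation-only",status="TOO_LATE",} 0.0
"""
print(effective_statuses(sample))
# {('dc=example,dc=com', 'evaluation-only'): 'NORMAL'}
```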

If the status is not Normal, how you react depends on the value of the ds-mon-status attribute for LDAP, or ds_replication_replica_status{status} for Prometheus:

Bad data

Replication is broken.

Internally, DS replicas store a shorthand form of the initial state called a generation ID. The generation ID is a hash of the first 1000 entries in a backend. If the replicas' generation IDs match, the servers can replicate data without user intervention. If the replicas' generation IDs do not match for a given backend, you must manually initialize replication between them to force the same initial state on all replicas.

This status arises for one of the following reasons:

The replica and the replication server have different generation IDs for the data because the replica was not initialized with the same data as its peer replicas.
The fractional replication configuration for this replica does not match the backend data. For example, you reconfigured fractional replication to include or exclude different attributes, or you configured fractional replication in an incompatible way on different peer replicas.

DS 7.3 introduced this status. Earlier releases included this state as part of the Bad generation id status.

Whenever this status displays:

If fractional replication is configured, make sure the configuration is compatible on all peer replicas. For details, refer to Fractional replication (advanced).
Reinitialize replication to fix the bad generation IDs. For details, refer to Manual initialization.

Full update

Replication is operating normally.

You have chosen to initialize replication over the network.

The time to complete the operation depends on the network bandwidth and volume of data to synchronize.

Monitor the server output and wait for initialization to complete.

Invalid

This status arises for one of the following reasons:

The replica has encountered a replication protocol error. This status can arise due to faulty network communication between the replica and the replication server.
The replica has just started, and is initializing.

If this status happens during normal operation:

Review the replica and replication server error logs, described in About logs, for network-related replication error messages.
Independently verify network communication between the replica and the replication server systems.

Normal

Replication is operating normally.

Nothing to do.

Not connected

This status arises for one of the following reasons:

The replica has just started and is not yet connected to the replication server.
The replica cannot connect to a replication server.

If this status happens during normal operation:

Review the replica and replication server error logs for network-related replication error messages.
Independently verify network communication between the replica and the replication server systems.

Too late

The replica has fallen further behind the replication server than allowed by the replication-purge-delay. In other words, the replica is missing too many changes, and lacks the historical information required to synchronize with peer replicas.

The replica no longer receives updates from replication servers. Other replicas that recognize this status stop returning referrals to this replica.

DS 7.3 introduced this status. Earlier releases included this state as part of the Bad generation id status.

Whenever this status displays:

Reinitialize replication for the replica that is too late. For details, refer to Manual initialization.

Change number indexing (Prometheus)

DS replication servers maintain a changelog database to record updates to directory data. The changelog database serves to:

Replicate changes, synchronizing data between replicas.
Let client applications get change notifications.

DS replication servers purge historical changelog data after the replication-purge-delay in the same way replicas purge their historical data.

Client applications can get changelog notifications using cookies (recommended) or change numbers.

To support change numbers, the servers maintain a change number index to the replicated changes. A replication server maintains the index when its configuration properties include changelog-enabled:enabled. (Cookie-based notifications do not require a change number index.)

The change number indexer must not be interrupted for long. Interruptions can arise when, for example, a DS server:

Stays out of contact, not sending any updates or heartbeats.
Gets removed without being shut down cleanly.
Gets lost in a system crash.

Interruptions prevent the change number indexer from advancing. When a change number indexer cannot advance for almost as long as the purge delay, it may be unable to recover as the servers purge historical data needed to determine globally consistent change numbers.

The following example checks the state of change number indexing:

$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep change_number_

# HELP ds_change_number_indexing_state Automatically generated
# TYPE ds_change_number_indexing_state gauge
ds_change_number_indexing_state{indexing_state="BLOCKED_BY_REPLICA_NOT_IN_TOPOLOGY",} 0.0
ds_change_number_indexing_state{indexing_state="INDEXING",} 1.0
ds_change_number_indexing_state{indexing_state="WAITING_ON_UPDATE_FROM_REPLICA",} 0.0
# HELP ds_change_number_time_since_last_indexing_seconds Duration since the last time a change was indexed
# TYPE ds_change_number_time_since_last_indexing_seconds gauge
ds_change_number_time_since_last_indexing_seconds 0.0
# HELP ds_replication_changelog_purge_waiting_for_change_number_indexing Indicates whether changelog purging is waiting for change number indexing to advance. If true, check the ds-mon-indexing-state and ds-mon-replicas-preventing-indexing metrics
# TYPE ds_replication_changelog_purge_waiting_for_change_number_indexing gauge
ds_replication_changelog_purge_waiting_for_change_number_indexing 0.0

When the ds_change_number_indexing_state gauge for BLOCKED_BY_REPLICA_NOT_IN_TOPOLOGY or WAITING_ON_UPDATE_FROM_REPLICA is greater than 0, refer to ds_change_number_time_since_last_indexing_seconds for the wait time in seconds, and to the LDAP ds-mon-replicas-preventing-indexing attribute for the list of problem servers.
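A monitoring script could apply this rule to the Prometheus output as in the following sketch. The function name `indexing_problem` is hypothetical, and the second sample, including the 120-second wait, is invented to show the non-normal case.

```python
import re

STATE_RE = re.compile(
    r'^ds_change_number_indexing_state\{indexing_state="([^"]*)",?\}\s+(\S+)$'
)
SINCE_RE = re.compile(r'^ds_change_number_time_since_last_indexing_seconds\s+(\S+)$')


def indexing_problem(metrics_text: str):
    """Return (state, seconds_since_last_indexing) when indexing is blocked
    or waiting, or None when the indexer is advancing normally."""
    problem_state = None
    since = None
    for line in metrics_text.splitlines():
        line = line.strip()
        state = STATE_RE.match(line)
        if state and state.group(1) != "INDEXING" and float(state.group(2)) > 0:
            problem_state = state.group(1)
        elapsed = SINCE_RE.match(line)
        if elapsed:
            since = float(elapsed.group(1))
    return (problem_state, since) if problem_state else None


sample_ok = """\
ds_change_number_indexing_state{indexing_state="BLOCKED_BY_REPLICA_NOT_IN_TOPOLOGY",} 0.0
ds_change_number_indexing_state{indexing_state="INDEXING",} 1.0
ds_change_number_indexing_state{indexing_state="WAITING_ON_UPDATE_FROM_REPLICA",} 0.0
ds_change_number_time_since_last_indexing_seconds 0.0
"""

# Invented values for illustration only.
sample_waiting = """\
ds_change_number_indexing_state{indexing_state="BLOCKED_BY_REPLICA_NOT_IN_TOPOLOGY",} 0.0
ds_change_number_indexing_state{indexing_state="INDEXING",} 0.0
ds_change_number_indexing_state{indexing_state="WAITING_ON_UPDATE_FROM_REPLICA",} 1.0
ds_change_number_time_since_last_indexing_seconds 120.0
"""

print(indexing_problem(sample_ok))       # None
print(indexing_problem(sample_waiting))  # ('WAITING_ON_UPDATE_FROM_REPLICA', 120.0)
```

Comparing the returned wait time against the purge delay configured for your deployment indicates how close the indexer is to being unable to recover.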

Filtering results (Prometheus)

By default, DS servers return all Prometheus metrics. To limit what the server returns, set one of these HTTP endpoint properties for the /metrics/prometheus endpoint:

excluded-metric-pattern
included-metric-pattern

Set these properties to valid Java regular expression patterns.

The following configuration change causes the server to return only ds_connection_handlers_ldap_requests_* metrics. As mentioned in the reference documentation, "The metric name prefix must not be included in the filter." Notice that the example uses connection_handlers_ldap_requests, not including the leading ds_:

$ dsconfig \
 set-http-endpoint-prop \
 --endpoint-name /metrics/prometheus \
 --set included-metric-pattern:'connection_handlers_ldap_requests' \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt

The following configuration change causes the server to exclude metrics whose names start with ds_jvm_. Notice that the example uses the regular expression jvm_.*:

$ dsconfig \
 set-http-endpoint-prop \
 --endpoint-name /metrics/prometheus \
 --set excluded-metric-pattern:'jvm_.*' \
 --hostname localhost \
 --port 4444 \
 --bindDN uid=admin \
 --bindPassword password \
 --usePkcs12TrustStore /path/to/opendj/config/keystore \
 --trustStorePassword:file /path/to/opendj/config/keystore.pin \
 --no-prompt

Directory Services 7.4.3
