HTTP-Based Monitoring
DS servers publish monitoring information at these HTTP endpoints:
/alive
Whether the server is currently alive, meaning that its internal checks have not found any errors that would require administrative action.
/healthy
Whether the server is currently healthy, meaning that it is alive and any replication delays are below a configurable threshold.
/metrics/api
Read-only, JSON-based view of cn=monitor and the monitoring backend. Each LDAP entry maps to a resource under /metrics/api.
/metrics/prometheus
Monitoring information for Prometheus monitoring software.
For details, see Prometheus Metrics Reference.
The following example command accesses the Prometheus endpoint:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus
To give a regular user privileges to read monitoring data, see "Monitor Privilege".
The following example reads the /alive endpoint anonymously. If the DS server's internal tests do not find errors that require administrative action, then it returns HTTP 200 OK:
$ curl --cacert ca-cert.pem --head https://localhost:8443/alive
HTTP/1.1 200 OK
Content-Length: 0
Date: <date>
If the server finds that it is subject to errors requiring administrative action, it returns HTTP 503 Service Unavailable.
If there are errors, anonymous users receive only the 503 error status. Error strings for diagnosis are returned as an array of "alive-errors" in the response body, but the response body is only returned to a user with the monitor-read privilege.
When a server returns "alive-errors", diagnose and fix the problem, and then either restart or replace the server.
The following example reads the /healthy endpoint anonymously. If the DS server is alive as described in "Server is Alive (HTTP)", and any replication delay is below the threshold configured as max-replication-delay-health-check (default: 5 seconds), then it returns HTTP 200 OK:
$ curl --cacert ca-cert.pem --head https://localhost:8443/healthy
HTTP/1.1 200 OK
Content-Length: 0
Date: <date>
If the server is subject to a replication delay above the threshold, then it returns HTTP 503 Service Unavailable. This result only indicates a problem if the replication delay is steadily high and increasing for the long term.
If there are errors, anonymous users receive only the 503 error status. Error strings for diagnosis are returned as an array of "ready-errors" in the response body, but the response body is only returned to a user with the monitor-read privilege.
When a server returns "ready-errors", route traffic to another server until the current server is ready again.
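The status handling described above for /alive and /healthy can be sketched in Python. This is a minimal illustration, not part of DS: the function name interpret_probe is hypothetical, and the code assumes the diagnostic body is a JSON object that holds the error strings as an array under the "alive-errors" or "ready-errors" key. Check an actual response from your server before relying on that shape.

```python
import json


def interpret_probe(status, body="", errors_key="alive-errors"):
    """Return (ok, errors) for a response from /alive or /healthy.

    status: HTTP status code returned by the endpoint.
    body: response body text; empty for anonymous requests.
    errors_key: "alive-errors" for /alive, "ready-errors" for /healthy.
    """
    if status == 200:
        return True, []
    errors = []
    if body:
        try:
            payload = json.loads(body)
        except ValueError:
            payload = None
        # Assumption: the body is a JSON object with the array under errors_key.
        if isinstance(payload, dict):
            raw = payload.get(errors_key, [])
            if isinstance(raw, list):
                errors = [str(item) for item in raw]
    # Any status other than 200 (normally 503) is treated as a failed check.
    return False, errors


# Anonymous probe: only the status is available.
print(interpret_probe(503))
# Probe by a user with the monitor-read privilege (hypothetical body).
print(interpret_probe(503, '{"ready-errors": ["example error"]}', "ready-errors"))
```

A load balancer health check typically needs only the boolean. The error list is useful for alerting when the check runs as a user with the monitor-read privilege.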
In addition to the examples above, you can monitor whether a server is alive and able to handle requests as Prometheus metrics:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep health_status
# HELP ds_health_status_alive Indicates whether the server is alive
# TYPE ds_health_status_alive gauge
ds_health_status_alive 1.0
# HELP ds_health_status_healthy Indicates whether the server is able to handle requests
# TYPE ds_health_status_healthy gauge
ds_health_status_healthy 1.0
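If you process this output in a script rather than with grep, a small parser is enough for the unlabeled health gauges. The following Python sketch is illustrative only: read_gauges is a hypothetical helper, the sample text copies the output shown above, and the parser assumes each sample line has the form "name value" with no trailing timestamp.

```python
def read_gauges(text, prefix):
    """Map metric name to value for unlabeled samples whose name starts with prefix."""
    gauges = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if name.startswith(prefix) and "{" not in name:
            gauges[name] = float(value)
    return gauges


sample = """\
# HELP ds_health_status_alive Indicates whether the server is alive
# TYPE ds_health_status_alive gauge
ds_health_status_alive 1.0
# HELP ds_health_status_healthy Indicates whether the server is able to handle requests
# TYPE ds_health_status_healthy gauge
ds_health_status_healthy 1.0
"""

status = read_gauges(sample, "ds_health_status_")
print(status)
```

In practice, the text passed to read_gauges would be the body returned by the /metrics/prometheus endpoint.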
The following example reads a metric to check the delay in replication:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep receive_delay
# HELP ds_replication_replica_remote_replicas_receive_delay_seconds Current local delay in receiving replicated operations
# TYPE ds_replication_replica_remote_replicas_receive_delay_seconds gauge
ds_replication_replica_remote_replicas_receive_delay_seconds{<labels>} <delay>
DS replicas measure replication delay as the local delay when receiving and replaying changes. A replica calculates these local delays based on changes received from other replicas. Therefore, a replica can only calculate delays based on changes it has received. Network outages cause inaccuracy in delay metrics.
A replica calculates delay metrics based on times reflecting the following events:
t0: the remote replica records the change in its data
t1: the remote replica sends the change to a replica server
t2: the local replica receives the change from a replica server
t3: the local replica applies the change to its data
This figure illustrates when these events occur:
Replication keeps track of changes using change sequence numbers (CSNs), opaque and unique identifiers for each change that indicate when and where each change first occurred. The tn values are CSNs.
When the CSNs for the last change received and the last change replayed are identical, the replica has applied all the changes it has received. In this case, there is no known delay. The receive and replay delay metrics are set to 0 (zero).
When the last received and last replayed CSNs differ:
Receive delay is set to the time t2 - t0 for the last change received.
Another name for receive delay is current delay.
Replay delay is approximately t3 - t2 for the last change replayed. In other words, it is an approximation of how long it took the last change to be replayed.
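The rules above can be restated as a short Python sketch. It is a simplified model for illustration, not the server's implementation: the Change class and delay_metrics function are hypothetical, CSNs are represented as plain strings, and the event times are assumed to be available as seconds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Change:
    csn: str                    # opaque change sequence number
    t0: float                   # remote replica recorded the change
    t2: float                   # local replica received the change
    t3: Optional[float] = None  # local replica applied the change, once replayed


def delay_metrics(last_received: Change, last_replayed: Change) -> Tuple[float, float]:
    """Return (receive_delay, replay_delay) in seconds.

    Assumes last_replayed.t3 is set, since that change has been applied.
    """
    # Identical CSNs: everything received has been replayed, so no known delay.
    if last_received.csn == last_replayed.csn:
        return 0.0, 0.0
    receive_delay = last_received.t2 - last_received.t0
    replay_delay = last_replayed.t3 - last_replayed.t2
    return receive_delay, replay_delay


replayed = Change("csn-1", t0=99.0, t2=99.25, t3=99.5)
received = Change("csn-2", t0=100.0, t2=100.5)
print(delay_metrics(received, replayed))  # (0.5, 0.25)
print(delay_metrics(replayed, replayed))  # (0.0, 0.0)
```

The sketch also shows why a replica that receives no changes, for example during a network outage, cannot report an accurate delay: both values depend only on changes that have reached the local replica.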
As long as replication delay tends toward zero regularly and over the long term, temporary spikes and increases in delay measurements are normal. When all replicas remain connected and yet replication delay remains high and increases over the long term, the high replication delay indicates a problem. Steadily high and increasing replication delay shows that replication is not converging, and the service is failing to achieve eventual consistency.
For a current snapshot of replication delays, you can also use the dsrepl status command. For details, see "Replication Status".
The following example shows monitoring metrics you can use to check whether the server is running out of disk space:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep disk
# HELP ds_disk_free_space_bytes The amount of free disk space (in bytes)
# TYPE ds_disk_free_space_bytes gauge
ds_disk_free_space_bytes{disk="<partition>",} <bytes>
# HELP ds_disk_free_space_full_threshold_bytes The effective full disk space threshold (in bytes)
# TYPE ds_disk_free_space_full_threshold_bytes gauge
ds_disk_free_space_full_threshold_bytes{disk="<partition>",} <bytes>
# HELP ds_disk_free_space_low_threshold_bytes The effective low disk space threshold (in bytes)
# TYPE ds_disk_free_space_low_threshold_bytes gauge
ds_disk_free_space_low_threshold_bytes{disk="<partition>",} <bytes>
In your monitoring software, compare free space with the disk low and disk full thresholds. For database backends, these thresholds are set using the configuration properties: disk-low-threshold and disk-full-threshold.
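As a rough illustration of that comparison, the following Python sketch groups the three disk metrics by partition and classifies each one. The function name disk_states, the sample values, and the "/data" partition are hypothetical. The sketch assumes disk is the only label on these samples and treats free space at or below a threshold as having crossed it.

```python
import re

# Matches samples such as: ds_disk_free_space_bytes{disk="/data",} 4.0E9
SAMPLE_RE = re.compile(r'^(ds_disk_free_space\w*)\{disk="([^"]*)",?\}\s+(\S+)$')


def disk_states(text):
    """Classify each disk as "ok", "low", or "full" from Prometheus text output."""
    per_disk = {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            name, disk, value = match.groups()
            per_disk.setdefault(disk, {})[name] = float(value)
    states = {}
    for disk, values in per_disk.items():
        free = values["ds_disk_free_space_bytes"]
        if free <= values["ds_disk_free_space_full_threshold_bytes"]:
            states[disk] = "full"
        elif free <= values["ds_disk_free_space_low_threshold_bytes"]:
            states[disk] = "low"
        else:
            states[disk] = "ok"
    return states


# Hypothetical values for a single partition.
sample = """\
ds_disk_free_space_bytes{disk="/data",} 4.0E9
ds_disk_free_space_full_threshold_bytes{disk="/data",} 1.0E9
ds_disk_free_space_low_threshold_bytes{disk="/data",} 5.0E9
"""
print(disk_states(sample))
```

A production check would also handle a partition for which one of the three metrics is missing, which this sketch does not.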
When you read from cn=monitor instead as described in LDAP-Based Monitoring, the relevant data are exposed on child entries of cn=disk space monitor,cn=monitor.
The following example shows how you can use monitoring metrics to check whether the server certificate is due to expire soon:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep cert
# HELP ds_certificates_certificate_expires_at_seconds Certificate expiration date and time
# TYPE ds_certificates_certificate_expires_at_seconds gauge
ds_certificates_certificate_expires_at_seconds{alias="ssl-key-pair",key_manager="PKCS12",} <sec_since_epoch>
In your monitoring software, compare the expiration date with the current date.
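One way to make that comparison is to convert the metric to days remaining. The Python sketch below is illustrative: days_until_expiry is a hypothetical helper, the expiration value in the sample is made up, and the code assumes the metric value is expressed in seconds since the epoch, as the <sec_since_epoch> placeholder suggests.

```python
import re
import time

CERT_RE = re.compile(
    r'^ds_certificates_certificate_expires_at_seconds\{(.*)\}\s+(\S+)$'
)
ALIAS_RE = re.compile(r'alias="([^"]*)"')


def days_until_expiry(text, now=None):
    """Map certificate alias to the number of days before it expires."""
    now = time.time() if now is None else now
    remaining = {}
    for line in text.splitlines():
        match = CERT_RE.match(line.strip())
        if not match:
            continue
        labels, value = match.groups()
        alias = ALIAS_RE.search(labels)
        # Fall back to the raw label string if no alias label is present.
        key = alias.group(1) if alias else labels
        remaining[key] = (float(value) - now) / 86400
    return remaining


# Hypothetical expiration time, 30 days after the reference time below.
sample = (
    'ds_certificates_certificate_expires_at_seconds'
    '{alias="ssl-key-pair",key_manager="PKCS12",} 1702592000\n'
)
print(days_until_expiry(sample, now=1700000000))
```

Passing now explicitly keeps the example deterministic. Omit it to compare against the current time, and alert when the result drops below whatever renewal window you use.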
When you read from cn=monitor instead as described in LDAP-Based Monitoring, the relevant data are exposed on child entries of cn=certificates,cn=monitor.
DS server connection handlers respond to client requests. The following example uses the default monitor user account to read statistics about client operations on each of the available connection handlers:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep connection_handlers
DS servers have a work queue to track request processing by worker threads, and whether the server has rejected any requests due to a full queue. If enough worker threads are available, then no requests are rejected. The following example uses the default monitor user account to read statistics about the work queue:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep work_queue
To adjust the number of worker threads, see the settings for "Traditional Work Queue".
DS servers maintain counts of the number of entries in each backend. The following example uses the default monitor user account to read the counts:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep backend_entry_count
DS server connection handlers respond to client requests. The following example uses the default monitor user account to read active connections on each connection handler:
$ curl --cacert ca-cert.pem --user monitor:password https://localhost:8443/metrics/prometheus 2>/dev/null | grep "active_[cp]"