High CPU and unresponsive servers when DS (All versions) is running behind an F5 Load Balancer
This article helps you troubleshoot high CPU usage, or CPU and/or memory consumption by DS that keeps increasing until the server becomes unresponsive. This issue only occurs when DS is behind an F5® load balancer.
Symptoms
You will notice one or more of the following symptoms:
- High CPU and/or memory usage.
- Unresponsive servers.
- An increasing number of connections.
- Running out of Java heap space.
- A restart restores service for a short period only.
DS logs and commands
You might see errors similar to the following when this happens:
[07/Jul/2020:14:14:56 +0200] category=SYNC severity=INFORMATION msgID=105 msg=Replication server accepted a connection from ds.example.com/203.0.113.0:9989 to local address 203.0.113.10:8989 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
[07/Jul/2020:14:15:07 +0200] category=SYNC severity=WARNING msgID=97 msg=Directory server DS(1234) is closing its connection to replication server RS(5678) at ds.example.com/203.0.113.0:9989 for domain "dc=example,dc=com" because it could not detect a heart beat
...
[07/Jul/2020:14:15:19 +0200] category=SYNC severity=ERROR msgID=178 msg=Directory server 1234 was attempting to connect to replication server 5678 but has disconnected in handshake phase. Error: SocketTimeoutException(Read timed out)
If you try to run a command against the server, for example, the status command, you will see a response similar to the following:
Connect Error: The connection attempt to server ds.example.com/203.0.113.0:4444 has failed because the connection timeout period of 30000 ms was exceeded
Heap dumps
If you capture a heap dump, you will see that a grizzly.Buffer object is consuming nearly all the memory.
You can capture a heap dump as described in How do I collect JVM data for troubleshooting DS? Alternatively, a heap dump is captured when you run the Support Extract tool (see How do I use the Support Extract tool in DS 6.5.x and 7.x to capture troubleshooting data?).
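To gauge how often the replication messages shown under DS logs and commands recur, you can tally them by msgID. The following Python sketch is illustrative only and is not a ForgeRock tool: the function names tally_symptoms and tally_file are hypothetical, the msgID values (105, 97 and 178) are taken from the sample messages in this article, and the location of your DS error log is an assumption that you must adjust for your deployment.

```python
import re
from collections import Counter

# msgIDs taken from the sample log messages in this article (assumption:
# your DS version logs the same IDs for these conditions).
SYMPTOM_MSG_IDS = {"105", "97", "178"}

# Matches the category, severity and msgID fields as they appear in the samples.
FIELDS = re.compile(r"category=(\S+) severity=(\S+) msgID=(\d+)")


def tally_symptoms(lines):
    """Count SYNC log lines whose msgID is one of the symptom IDs.

    Returns a Counter keyed by (severity, msgID).
    """
    counts = Counter()
    for line in lines:
        match = FIELDS.search(line)
        if match is None:
            continue  # not a log line in the expected format
        category, severity, msg_id = match.groups()
        if category == "SYNC" and msg_id in SYMPTOM_MSG_IDS:
            counts[(severity, msg_id)] += 1
    return counts


def tally_file(path):
    """Hypothetical helper: tally the symptom messages in one log file."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return tally_symptoms(log)
```

For example, calling tally_file with the path to your DS error log returns a count for each (severity, msgID) pair. Counts that keep rising between restarts are consistent with the symptoms described in this article, but they do not confirm the cause on their own.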
Recent Changes
Enabled the F5 OneConnect feature.
Causes
The F5 OneConnect feature is designed to optimize HTTP/HTTPS traffic. When it is enabled for other protocols with long-lived connections such as LDAP, you will see unexpected behavior and performance issues such as the symptoms listed above.
Solution
This issue can be resolved by switching off the F5 OneConnect feature. See On Load Balancers for further recommendations on using a load balancer.
Note
There are other external factors that can result in similar symptoms, such as:
- Running DS on VMware: see Very high CPU seen on ForgeRock products running on VMware for ForgeRock's advice on hosting DS on VMware.
- Running antivirus, intrusion detection or network security scans (particularly if these issues occur on a regular schedule): see Antivirus interference for advice on preventing interference.
Additionally, there is a known issue related to the grizzly.Buffer object: OPENDJ-6681 (Build up of Grizzly TCPNIOConnection objects lead to a FilterChain Exception), which is fixed in DS 6.5.3 and later.
See Also
How do I enable Garbage Collector (GC) Logging for DS?
AskF5 - Overview of the OneConnect profile
Related Training
N/A
Related Issue Tracker IDs
N/A