Solutions

Backend goes offline due to Latch timeouts in DS (All versions) and OpenDJ 3.5.2, 3.5.3

Last updated Oct 7, 2019

The purpose of this article is to provide assistance if the backend database goes offline due to "Latch timeouts" in DS/OpenDJ. You may experience high CPU usage, general performance issues or replication problems when this occurs. Login and registration failures will also be noticed in AM/OpenAM if you are using DS/OpenDJ as the Core Token Service (CTS) store and/or Identity store.


Symptoms

Symptoms may vary depending on your system and the type of operations that were occurring at the same time the Latch errors were written to the server logs:

AM/OpenAM

Users are not able to log in or register when DS/OpenDJ is being used as the CTS store and/or Identity store. 

DS/OpenDJ

The DS or RS hosts may stop responding (hang), exhibit high CPU, replication problems or other disruption of normal operations may occur.

An error similar to one of the following is shown in the Errors log when the backend goes offline:

  • Could not replay operation error:
    [27/Aug/2019:10:02:57 -1100] category=SYNC severity=ERROR msgID=67 msg=Could not replay operation ModifyOperation(connID=-1, opID=966061, dn=coreTokenId=TJpjX56064598qEljRNF0,dc=cts,dc=openam,dc=forgerock,dc=org) with ChangeNumber 0000016bf727b3fb000f077b8bb1 error Other com.sleepycat.je.EnvironmentFailureException: (JE 7.3.7) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.3.7) /opt/opendj/db/userRoot Latch timeout. 
    com.sleepycat.je.log.LogBufferPool_FullLatch currentThread: Thread[Cleaner-1,5,main] currentTime: 1560370712007 exclusiveOwner: -none- UNEXPECTED_STATE_FATAL: Unexpected internal state,
    unable to continue. Environment is invalid and must be closed.
    
  • Error Other when updating server state error:
    [27/Aug/2019:10:02:57 -1100] category=SYNC severity=ERROR msgID=14 msg=Error Other when updating server state ModifyRequest(dn=dc=openam,dc=forgerock,dc=org, changes=[Modification(modificationType=replace, attributeDescription=ds-sync-state, attributeValues={000001662f86f849251601fa03b2, 00000163b830cc087df9000111ff})], controls=[]) : com.sleepycat.je.EnvironmentFailureException: (JE 7.3.7) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.3.7) /opt/opendj/db/userRoot Latch timeout. com.sleepycat.je.log.LogBufferPool_FullLatch currentThread: Thread[Worker Thread 7,5,Directory Server Thread Group] currentTime: 1527804006763 exclusiveOwner: Thread[Replica replay thread 6,5,Directory Server Thread Group] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed. base dn: dc=openam,dc=forgerock,dc=org
    

Additionally, you will see a corresponding error in the je.info.0 file from the backend itself, which also shows the FSync time exceeding the 5 second timeout limit:

Suppressed: com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.4.5) /opt/opendj/db/userRoot Latch timeout.
com.sleepycat.je.log.LogBufferPool_FullLatch currentThread: Thread[Replica replay thread 14,5,Directory Server Thread Group] currentTime: 1560197365050 exclusiveOwner: Thread[Replica replay thread 13,5,Directory Server Thread Group] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
   at com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:228)
   at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766)
   at com.sleepycat.je.dbi.CursorImpl.checkEnv(CursorImpl.java:3869)
   at com.sleepycat.je.Cursor.checkEnv(Cursor.java:5394)
   at com.sleepycat.je.Cursor.close(Cursor.java:540)
   at com.sleepycat.je.Database.put(Database.java:1575)
   ... 40 more
...
2019-08-27 10:03:11.956 UTC WARNING [/opt/opendj/db/userRoot] FSync time of 897286 ms exceeds limit (5000 ms)

Recent Changes

Upgraded to, or installed DS 5 and later.

Upgraded to, or installed OpenDJ 3.5.2 or 3.5.3.

Causes

The Latch Timeout is a feature of the Berkeley JE, which was introduced in OpenDJ 3.5.2 when ForgeRock upgraded the JE version to 7.3.7.

A latch timeout is similar to a network timeout; the instance times out waiting on the File Syncs (FSync) of the backend files and consequently disables the backend for its own safety. Since the backend is disabled, the instance cannot access the backend database files.

Latch timeouts and slow FSyncs typically occur for the following reasons:

  • Slow or bad disks that are being saturated by other processes. This can be evidenced by long FSync times shown in the je.info.0 file that exceed the 5 second limit (the log snippet example above shows 897286 ms, which is nearly 15 minutes).
  • Poorly tuned DS/OpenDJ instances.

Solution

This issue can be resolved in the first instance by restarting or re-enabling the backend. You can re-enable the backend using dsconfig, for example:

$ ./dsconfig set-backend-prop --backend-name userRoot --set enabled:true --hostname ds1.example.com --port 4444 --trustAll --bindDN "cn=Directory Manager" --bindPassword password --no-prompt

Longer term, you need to address the root cause to prevent reoccurrence:

  • Check if you have any slow or bad disks, and upgrade or replace as necessary.
  • If this is a Virtual Machine (VM), ensure the resources you have are not impacted by other processes running on the same VM.
  • Avoid using network file systems such as NFS for database storage or backups as stated in the release notes: Release Notes › Fulfilling Storage Requirements.
  • Tune DS/OpenDJ appropriately.

Tuning DS/OpenDJ 

You should tune DS/OpenDJ per the recommendations in the documentation: Administration Guide › Tuning Servers For Performance, paying particular attention to the Java Settings section.

Although you should follow all the recommendations, it is crucial that you apply the following JVM settings at the very least:

  • Add the -XX:MaxTenuringThreshold=1 option to create only short lived objects. This option also helps with Garbage Collection (GC) performance by only promoting genuinely long lived objects to the old generation.
  • Add the -XX:+UseCompressedOops option if you have a 64-bit JVM and your heap size is less than 32GB.
  • Add either the -XX:+UseConcMarkSweepGC or -XX:+UseG1GC option to specify which Garbage Collector should be used. If you do not explicitly set one, the JVM uses the default Serial Garbage Collector, which is very inefficient and uses the Stop The World GC (meaning it pauses the process while it runs a full GC).
  • Add the -XX:+DisableExplicitGC option if you using JMX to avoid hourly full GC taking place.

See Also

How do I tune DS/OpenDJ (All versions) process sizes: JVM heap and database cache?

Best practice for JVM Tuning

Troubleshooting DS/OpenDJ

How do I use the Support Extract tool in DS 5.x, 6 and OpenDJ 3.x to capture troubleshooting data?

Performance tuning and monitoring ForgeRock products

Installation Guide › CTS Tuning Considerations

Related Training

N/A

Related Issue Tracker IDs

N/A



Copyright and TrademarksCopyright © 2019 ForgeRock, all rights reserved.
Loading...