A warning similar to the following is shown repeatedly in the logs for all nodes:
2019-02-11 17:04:32 INFO org.forgerock.openidm.cluster.ClusterManager$ClusterManagerThread$1 run Attempting recovery 2019-02-11 17:04:32 WARNING org.forgerock.openidm.cluster.ClusterManager recoverFailedInstance Failed to update repo instance state during failed instance recovery. Failed cluster node instance1 will not be recovered by node instance2. org.forgerock.json.resource.PreconditionFailedException: Update rejected as current Object revision 17249 is different than expected by caller (17247), the object has changed since retrieval. at org.forgerock.openidm.repo.jdbc.impl.GenericTableHandler.update(GenericTableHandler.java:522) at org.forgerock.openidm.repo.jdbc.impl.JDBCRepoService.update(JDBCRepoService.java:436) at org.forgerock.openidm.cluster.ClusterManager.updateInstanceState(ClusterManager.java:431) at org.forgerock.openidm.cluster.ClusterManager.recoverFailedInstance(ClusterManager.java:611) at org.forgerock.openidm.cluster.ClusterManager.access$900(ClusterManager.java:84) at org.forgerock.openidm.cluster.ClusterManager$ClusterManagerThread$1.run(ClusterManager.java:841) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
Restarting nodes does not resolve the issue.
Querying the clusterobjects table will show all nodes are running (state 1), but you may notice the nodes cycling through state 1, state 2, state 3 and then back to state 1 repeatedly without actually being shutdown.
You can also observe this change of state using the following REST call:
$ curl -X GET -H "X-OpenIDM-Username: openidm-admin" -H "X-OpenIDM-Password: openidm-admin" "http://localhost:8080/openidm/cluster?_queryId=query-cluster-instances&_prettyPrint=true"
If you only see the "Update rejected as current Object revision 1417149 is different than expected by caller (1417147), the object has changed since retrieval" part of the message during synchronization, see FAQ: Clusters in IDM/OpenIDM (Q. Why am I seeing errors about current Object revision being different than expected during synchronization?) for further information.
In a cluster, each node checks in every 5 seconds (default setting for instanceCheckInInterval in cluster.json) and records a timestamp for the check in. This is recorded in the clusterobjects table as state 1 to indicate the node is running. If this timestamp is not updated for 30 seconds (default setting for instanceTimeout in cluster.json), the state is changed to 2 and recovery is attempted. If recovery fails, then the node is marked as down (state 3).
These failures occur when node X detects node Y is down, but the state of node Y changes before node X can set it to state 2. This can happen if node Y checks in late (now back to state 1 as it's running); in which case this is expected behavior as the node is not actually down. Possible reasons for node Y checking in late are IDM is busy doing something else, (e.g. backups or batch jobs, etc.) there are issues with the database repository, or environmental issues such as system or network.
If node Y is down, then you would expect node X to change it to state 2, fail to recover it and then change it to state 3 to indicate it is down.
This issue can be resolved as follows:
- Check what might be causing this bottleneck and resolve if possible. For example, do you have any scheduled jobs that are causing the issue or a sudden increase in load that corresponds to the failures?
- Increase the instanceTimeout as follows:
- Update the instanceTimeout value in the cluster.json file from 30000 to 60000 to give you a 60 second timeout.
- Restart IDM to apply these changes.
- Monitor your logs to see if this timeout value resolves the issue. You might need to increase it further if not.
Related Issue Tracker IDs