"Failed to peek." and "failuresCount: 200/200" Errors in JIRA DATA CENTER and low performance

Hamid Gholami January 27, 2019

Hello,

We run Jira Data Center on three nodes with 200,000 issues. We also use several add-ons, such as Structure, ScriptRunner, eazyBI, and others.

During some peak hours, CPU usage rises abnormally and the number of established sessions on port 8443 increases for no apparent reason and does not decrease. Jira then becomes very slow and stops responding, so we have to restart the application. This happens repeatedly.

 

In the Jira log file we found two errors.

Can anyone help me with this?

 


ERROR NUM1:


2019-01-27 15:30:51,033 localq-reader-3 ERROR      [c.a.j.c.distribution.localq.LocalQCacheOpReader] Critical state of local cache replication queue - cannot peek from queue: [queueId=queue_NODE1_1_4b1b4dc8cf38b3c64b1d657da8f5ac8c, queuePath=/opt/atlassian/application-data/jira/localq/queue_NODE1_1_4b1b4dc8cf38b3c64b1d657da8f5ac8c], error: Failed to peek.
com.squareup.tape.FileException: Failed to peek.
        at com.squareup.tape.FileObjectQueue.peek(FileObjectQueue.java:59)
        at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpQueue.peek(TapeLocalQCacheOpQueue.java:198)
        at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpQueue.peekOrBlock(TapeLocalQCacheOpQueue.java:216)
        at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpQueueWithStats.peekOrBlock(LocalQCacheOpQueueWithStats.java:198)
        at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpReader.peekOrBlock(LocalQCacheOpReader.java:157)
        at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpReader.run(LocalQCacheOpReader.java:71)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.lang.ClassNotFoundException: com.codebarrel.jira.plugin.automation.store.CachingAutomationConfigStore$TenantRuleId
        at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpConverter.from(TapeLocalQCacheOpConverter.java:25)
        at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpConverter.from(TapeLocalQCacheOpConverter.java:16)
        at com.squareup.tape.FileObjectQueue.peek(FileObjectQueue.java:57)
        ... 10 more
Caused by: java.lang.ClassNotFoundException: com.codebarrel.jira.plugin.automation.store.CachingAutomationConfigStore$TenantRuleId
        ... 1 filtered
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:628)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
        at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpConverter.from(TapeLocalQCacheOpConverter.java:23)
        ... 12 more

 

ERROR NUM2:

 

2019-01-27 15:23:30,039 localq-reader-11 ERROR      [c.a.j.c.distribution.localq.LocalQCacheOpReader] Abandoning sending: LocalQCacheOp{cacheName='org.marvelution.jji.releasereport.CiBuildReleaseReportColumn', action=REMOVE, key=ERP-2285, value=null, creationTimeInMillis=1548602413120} from cache replication queue: [queueId=queue_NODE2_0_31fe71b00865ae60db401068d5159de9, queuePath=/opt/atlassian/application-data/jira/localq/queue_NODE2_0_31fe71b00865ae60db401068d5159de9], failuresCount: 200/200. Removing from queue. Error: java.rmi.NotBoundException: org.marvelution.jji.releasereport.CiBuildReleaseReportColumn
com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpSender$UnrecoverableFailure: java.rmi.NotBoundException: org.marvelution.jji.releasereport.CiBuildReleaseReportColumn
        at com.atlassian.jira.cluster.distribution.localq.rmi.LocalQCacheOpRMISender.send(LocalQCacheOpRMISender.java:88)
        at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpReader.run(LocalQCacheOpReader.java:83)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.rmi.NotBoundException: org.marvelution.jji.releasereport.CiBuildReleaseReportColumn
        at sun.rmi.registry.RegistryImpl.lookup(RegistryImpl.java:166)
        at sun.rmi.registry.RegistryImpl_Skel.dispatch(Unknown Source)
        at sun.rmi.server.UnicastServerRef.oldDispatch(UnicastServerRef.java:411)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:272)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:276)
        at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:253)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:379)
        at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
        at com.atlassian.jira.cluster.distribution.localq.rmi.BasicRMICachePeerProvider.lookupRemoteCachePeer(BasicRMICachePeerProvider.java:64)
        at com.atlassian.jira.cluster.distribution.localq.rmi.BasicRMICachePeerProvider.create(BasicRMICachePeerProvider.java:39)
        at com.atlassian.jira.cluster.distribution.localq.rmi.CachingRMICachePeerManager.getCachePeerFor(CachingRMICachePeerManager.java:58)
        at com.atlassian.jira.cluster.distribution.localq.rmi.CachingRMICachePeerManager.withCachePeer(CachingRMICachePeerManager.java:91)
        at com.atlassian.jira.cluster.distribution.localq.rmi.LocalQCacheOpRMISender.send(LocalQCacheOpRMISender.java:63)
        ... 6 more

 

Answer accepted
Andy Heinzer
Atlassian Team
January 28, 2019

The first error seems to come from the Automation for Jira plugin code base:

Caused by: java.io.IOException: java.lang.ClassNotFoundException: com.codebarrel.jira.plugin.automation.store.CachingAutomationConfigStore$TenantRuleId

I'm not sure why exactly this would happen just yet though.

 

The second error appears to be referring to a different plugin:

Caused by: java.rmi.NotBoundException: org.marvelution.jji.releasereport.CiBuildReleaseReportColumn

Specifically, this is the Jenkins Integration for Jira plugin. I noticed that this second plugin does not appear to have a Data Center supported tag, which likely means that the vendor has not tested it in a Data Center environment. It's possible they do not yet support using this plugin in a Data Center deployment.
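
For context on the `failuresCount: 200/200. Removing from queue` part of the second message: the log wording suggests that the replication queue retries a failed operation a fixed number of times and then drops it. Below is a minimal Java sketch of that retry-then-abandon pattern. `BoundedRetryQueue`, `drain`, and the `send` predicate are hypothetical names for illustration only, and this is not Jira's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

final class BoundedRetryQueue {

    /**
     * Drains the queue in order. Each operation is attempted until it succeeds
     * or has failed maxFailures times; operations that hit the limit are
     * removed from the queue and returned as "abandoned".
     */
    static <T> List<T> drain(Deque<T> queue, Predicate<T> send, int maxFailures) {
        List<T> abandoned = new ArrayList<>();
        while (!queue.isEmpty()) {
            T op = queue.peek();           // look at the head without removing it
            int failures = 0;
            while (failures < maxFailures && !send.test(op)) {
                failures++;                // count each failed delivery attempt
            }
            if (failures == maxFailures) {
                abandoned.add(op);         // give up on this operation
            }
            queue.remove();                // delivered or abandoned: drop from queue
        }
        return abandoned;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>(Arrays.asList("op-ok", "op-unbound"));
        // Assumption for the demo: only "op-ok" can ever be delivered.
        List<String> dropped = drain(queue, op -> op.equals("op-ok"), 200);
        System.out.println("abandoned: " + dropped);
    }
}
```

In this sketch an operation that can never be delivered (for example, because the target cache is not registered on the other node) costs 200 attempts before it is dropped, which is consistent with such errors adding load without ever succeeding.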

 

What version of Jira Data Center are you using?

I ask because there were a number of known replication problems in earlier Data Center versions. There have been lots of bug fixes and improvements to Data Center since the 7.1.x and 7.2.x versions, for example.

Hamid Gholami January 28, 2019

Hello @Andy Heinzer

 

Thank you for your advice.

We use Jira Data Center version 7.9.2.

Do you have any idea about the first error?

Also, we have 32 GB of memory in total with a 32-core CPU, and I have allocated 6 GB to the JVM for the Jira application.

Is it possible that the JVM memory setting is causing the increased CPU usage?

I have JMX monitoring. What is a suitable amount of JVM memory?

Andy Heinzer
January 29, 2019

@Hamid Gholami I think that the first error is an example of this specific bug: https://jira.atlassian.com/browse/JRASERVER-65246

The error message is very similar to the description of that bug, and your first error message shows a ClassNotFoundException in relation to a plugin. That bug wasn't fixed until versions 7.9.3, 7.10.1, and 7.11.0, so your Data Center instance on 7.9.2 is affected by this problem with replicating caches in relation to third-party plugins.

Upgrading to one of those versions or higher should help prevent this specific error. It is still possible that you are encountering another performance problem underlying this one, but from the information I have so far, it's difficult to say for sure. This cache replication problem was known to cause a large number of support cases for other Data Center customers as well, so upgrading to a version with this fix would be a good first step.
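
To illustrate the general mechanism behind a ClassNotFoundException during deserialization (this is a sketch of standard Java behavior, not Jira's code): `ObjectInputStream` resolves each class through a class loader, and if that loader cannot see the class that was serialized, reading fails even though the bytes are valid. In the example below, `PluginKey` is a hypothetical stand-in for a plugin-defined class, and `LoaderBoundInputStream` is a hypothetical helper that forces resolution through a chosen loader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class RestrictedDeserializationDemo {

    /** Hypothetical stand-in for a class defined inside a plugin. */
    static final class PluginKey implements Serializable {
        private static final long serialVersionUID = 1L;
        final String ruleId;
        PluginKey(String ruleId) { this.ruleId = ruleId; }
    }

    /** ObjectInputStream that resolves classes only through the given loader. */
    static final class LoaderBoundInputStream extends ObjectInputStream {
        private final ClassLoader loader;
        LoaderBoundInputStream(InputStream in, ClassLoader loader) throws IOException {
            super(in);
            this.loader = loader;
        }
        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc) throws ClassNotFoundException {
            // A null loader means the bootstrap loader, which also cannot see PluginKey.
            return Class.forName(desc.getName(), false, loader);
        }
    }

    /** Serializes a value with standard Java serialization. */
    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns "ok", or the simple name of the exception thrown while reading. */
    static String tryRead(byte[] data, ClassLoader loader) {
        try (ObjectInputStream in =
                 new LoaderBoundInputStream(new ByteArrayInputStream(data), loader)) {
            in.readObject();
            return "ok";
        } catch (IOException | ClassNotFoundException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new PluginKey("rule-1"));
        // Loader that defined PluginKey: the class can be resolved.
        System.out.println(tryRead(data, PluginKey.class.getClassLoader()));
        // Parent of the system loader: it only sees JDK classes, so resolution fails.
        System.out.println(tryRead(data, ClassLoader.getSystemClassLoader().getParent()));
    }
}
```

The first read succeeds and the second reports a ClassNotFoundException, which mirrors the shape of the stack trace above: the data in the queue is intact, but the reader cannot load the plugin class needed to interpret it.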

 

To answer your other questions, 6 GB of heap for Jira seems like a good amount for most enterprise-level customers, which your instance of Jira seems to be. Lots of admins believe that increasing the Java heap is a good way to solve performance problems. While that can work in some cases, in other cases a larger and larger heap can cause very long garbage collection times for Java applications like Jira. Depending on your GC method, the performance of that node can be affected, so there can be a trade-off here. If Java has to do a 'stop-the-world' collection on a very large heap, it tends to cause applications like Jira to hang completely until that GC event is complete. On small heaps that doesn't take quite so long, but on very large heaps it can take much longer to finish. It might help to take a look at Using Garbage Collection Logs to Analyze JIRA Application Performance. This would be helpful for understanding whether your node(s) are seeing long GC times. And if that is a problem, we do have a guide on .
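
Since JMX monitoring was mentioned, here is a minimal Java sketch of how accumulated GC time can be read from the standard `java.lang.management` MXBeans and compared with JVM uptime. `GcOverheadProbe` and its methods are hypothetical names. The sketch reads the JVM it runs in; for a Jira node, the same MXBean attributes would have to be read over a remote JMX connection. Note that for concurrent collectors the reported collection time is not purely pause time, so treat the result as a rough indicator only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverheadProbe {

    /** Total milliseconds the JVM reports for garbage collection since startup. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Share of uptime attributed to GC, clamped to the range 0.0 to 1.0. */
    static double gcFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) gcMillis / uptimeMillis);
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.printf("max heap: %d MB, GC time: %d ms of %d ms uptime (%.2f%%)%n",
                maxHeapMb, gc, uptime, 100.0 * gcFraction(gc, uptime));
    }
}
```

A GC share that climbs during the peak hours described above would point toward heap or collector tuning, while a low share would suggest the CPU load comes from somewhere else.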

I still have the same error with my plugin (Requirement Yogi), on any fresh installation of Jira DC 8.5.4 with 2 nodes.

The class is serializable (as JSON, as XML, and via Serializable). I'm starting to suspect that TapeLocalQCacheOpQueue.java needs to be able to deserialize cache contents without access to plugin code, which is a serious new constraint.

Peter Kollar July 12, 2021

Hello @Adrien Ragot _Requirement Yogi_ 

 

It's been more than a year. Did you solve this problem? If so, how?

 

thanks

Adrien Ragot _Requirement Yogi_ July 12, 2021

I don't remember. I note that I still have serializable classes in my code, so I must have resolved the problem another way. Perhaps the load on the cluster was unreasonable anyway and reducing it removed the error. I think I remember ensuring that all our cache keys were simple Java classes (String), but I don't think that has an impact on this specific problem.
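
As an illustration of the "simple Java classes (String)" idea for cache keys, here is a minimal Java sketch that flattens a composite key into a plain `String`, so that reading the key never requires a plugin-defined class. `StringCacheKeys`, the tenant/rule fields, and the `:` separator are hypothetical choices for the example, and, as noted above, it is not confirmed that this approach affects the specific error in this thread.

```java
final class StringCacheKeys {

    private static final char SEPARATOR = ':';

    /** Encodes a (tenantId, ruleId) pair as a single String key, e.g. "acme:42". */
    static String encode(String tenantId, long ruleId) {
        return tenantId + SEPARATOR + ruleId;
    }

    /** Extracts the tenant part; the last separator is used so tenant IDs may contain ':'. */
    static String tenantIdOf(String key) {
        return key.substring(0, key.lastIndexOf(SEPARATOR));
    }

    /** Extracts the numeric rule ID that follows the last separator. */
    static long ruleIdOf(String key) {
        return Long.parseLong(key.substring(key.lastIndexOf(SEPARATOR) + 1));
    }

    public static void main(String[] args) {
        String key = encode("acme", 42L);
        System.out.println(key + " -> tenant=" + tenantIdOf(key) + ", rule=" + ruleIdOf(key));
    }
}
```

Because `java.lang.String` is loaded by the JVM itself, a key in this form can be deserialized by any class loader; the trade-off is that the encoding and parsing logic has to be kept consistent by hand.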
