java.rmi.NotBoundException and "Abandoning sending because cache does not exist on destination node" error

Denis Malyshkin January 30, 2019

One of our objects has a cache (com.atlassian.cache.Cache) which is initialized when the object is created.

When the object is created on one node, it sends an OBJ_CREATED event to the other nodes so that they create the same object as well.

Sometimes the object is changed, and this leads to cache invalidation, which is done with a "cache.removeAll()" call.
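
A simplified sketch of this pattern (class, method, and cache names are illustrative, not our exact code):

import java.io.Serializable;

import com.atlassian.cache.Cache;
import com.atlassian.cache.CacheManager;
import com.atlassian.cache.CacheSettingsBuilder;

public class SingleGitManager {

    private final Cache<String, Serializable> cache;

    public SingleGitManager(final CacheManager cacheManager, final long id) {
        // The cache is created together with the object, at runtime, with a per-object name.
        this.cache = cacheManager.getCache(
                "com.example.SingleGitManager_" + id + ".cache",
                null,
                new CacheSettingsBuilder().build());
    }

    public void onObjectUpdated() {
        // This call produces the REMOVE_ALL replication message seen in the logs below.
        cache.removeAll();
    }
}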

The issue is that when the object update is performed just after the object creation, we get the following errors in the logs:

====================================================

2019-01-30 12:53:57,961 localq-reader-14 WARN anonymous 636x48x1 1k8m4mi 92.255.190.27,172.31.59.203 /secure/SetupLicense.jspa [c.a.j.c.distribution.localq.LocalQCacheOpReader] Runtime exception: UnrecoverableFailure occurred when processing: LocalQCacheOp{cacheName='com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache', action=REMOVE_ALL, key=null, value=null, creationTimeInMillis=1548852837937} from cache replication queue: [queueId=queue_i06d7b12498aecfbe4_4_766da1cf38b8faaaa3868bd585dd2264, queuePath=/var/atlassian/application-data/jira/localq/queue_i06d7b12498aecfbe4_4_766da1cf38b8faaaa3868bd585dd2264], failuresCount: 1/2, error: java.rmi.NotBoundException: com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache

2019-01-30 12:53:57,973 localq-reader-14 WARN anonymous 636x48x1 1k8m4mi 92.255.190.27,172.31.59.203 /secure/SetupLicense.jspa [c.a.j.c.distribution.localq.LocalQCacheOpReader] Abandoning sending because cache does not exist on destination node: LocalQCacheOp{cacheName='com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache', action=REMOVE_ALL, key=null, value=null, creationTimeInMillis=1548852837937} from cache replication queue: [queueId=queue_i06d7b12498aecfbe4_4_766da1cf38b8faaaa3868bd585dd2264, queuePath=/var/atlassian/application-data/jira/localq/queue_i06d7b12498aecfbe4_4_766da1cf38b8faaaa3868bd585dd2264], failuresCount: 2/2. Removing from queue. Error: com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache
java.rmi.NotBoundException: com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache
        at sun.rmi.registry.RegistryImpl.lookup(RegistryImpl.java:227)
        at sun.rmi.registry.RegistryImpl_Skel.dispatch(RegistryImpl_Skel.java:115)
        at sun.rmi.server.UnicastServerRef.oldDispatch(UnicastServerRef.java:468)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:300)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at sun.rmi.transport.Transport$1.run(Transport.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
        at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:283)
        at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:260)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:375)
        at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:119)
        at com.atlassian.jira.cluster.distribution.localq.rmi.BasicRMICachePeerProvider.lookupRemoteCachePeer(BasicRMICachePeerProvider.java:64)
        at com.atlassian.jira.cluster.distribution.localq.rmi.BasicRMICachePeerProvider.create(BasicRMICachePeerProvider.java:39)
        at com.atlassian.jira.cluster.distribution.localq.rmi.CachingRMICachePeerManager.getCachePeerFor(CachingRMICachePeerManager.java:58)
        at com.atlassian.jira.cluster.distribution.localq.rmi.CachingRMICachePeerManager.withCachePeer(CachingRMICachePeerManager.java:91)
        at com.atlassian.jira.cluster.distribution.localq.rmi.LocalQCacheOpRMISender.send(LocalQCacheOpRMISender.java:63)
        at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpReader.run(LocalQCacheOpReader.java:84)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

====================================================

How can we avoid these errors?  Am I right that cache syncing stops in that case (see the "Abandoning sending" message)?

What are we doing wrong? How do we properly invalidate the cache?

 

Thank you.

1 answer

Answer accepted
Andy Heinzer
Atlassian Team
February 1, 2019

I am concerned that the errors in your stack trace reference the page

/secure/SetupLicense.jspa

To me, this indicates that this particular node might not yet have an appropriate license for Jira Software Data Center and/or Jira Service Desk Data Center applied to it.  That is very strange, because if you're running multiple nodes in a Data Center cluster, each of them is expected to use the same database settings, which contain the license.

Can you try navigating directly to that Jira node and going to /secure/SetupLicense.jspa in a web browser to see if there are any errors/prompts, etc.?

Also, what version of Jira Data Center is this?

Denis Malyshkin February 2, 2019

Andrew, thank you for the answer.

It was a test instance of Jira Software Data Center on AWS, created according to https://www.atlassian.com/enterprise/data-center/aws and deployed to our existing VPC.

I used the "10 user Jira Software Data Center license, expires in 3 hours" license from https://developer.atlassian.com/platform/marketplace/timebomb-licenses-for-testing-server-apps/ to test our plugin.

The license had not yet expired during the testing.

I'll set up the test environment again and try to perform the action you recommended.

 

Thank you for the help.

Denis Malyshkin February 2, 2019

Andrew, I navigated to the /secure/SetupLicense.jspa URL on the node. The result is the following: https://monosnap.com/file/xw7SkRmsCxeg9LaUCKcCz06sSzZcZW

I've investigated the issue a bit more. According to the logs and our code logic, it seems that the issue is the following.

Our code performs 2 sequential operations:

  1. create cache
  2. cache.removeAll()

It seems that the error occurs when the interval between the two operations is too short. I tested that by adding an explicit 10-second delay between these operations in our code. The errors completely disappeared.

But such a solution is not acceptable on a production system.

My concerns/questions are:

  1. Does the "Abandoning sending" message in the errors mentioned in my question mean that cache replication is not performed for that cache?
  2. If yes, is it a permanent disabling of replication for that particular cache, or a disabling just for the specified operation? I.e., will subsequent cache changes be replicated?
  3. Are there any ways to avoid these warnings without explicit delays? I'm afraid that such errors in customer logs will cause unnecessary stress for our support and for Atlassian's.

 

Thank you for your help.

Maciej Swinarski
Atlassian Team
February 7, 2019

Hi @Denis Malyshkin

A java.rmi.NotBoundException means that the other node is up but this cache is not available on that node; this can be a temporary state or a permanent one (in which case something is wrong) - let's cover both below.

Jira assumes remote caches are (and should be) created when the plugin starts.

So usually the code will look something like this:

import java.io.Serializable;

import com.atlassian.cache.Cache;
import com.atlassian.cache.CacheManager;
import com.atlassian.cache.CacheSettingsBuilder;
import com.atlassian.plugin.spring.scanner.annotation.imports.ComponentImport;
import org.springframework.stereotype.Component;

@Component
public class MyComponent {

    private final Cache<Serializable, Serializable> myRemoteCache;

    public MyComponent(@ComponentImport final CacheManager cacheManager) {
        // Created on plugin start, so the cache exists on every node before any replication message arrives.
        this.myRemoteCache = cacheManager.getCache(
                "my.remote.cache.name",
                null,
                new CacheSettingsBuilder()
                        .replicateViaInvalidation()
                        .build());
    }
}
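
With a cache created this way on plugin start, a later invalidation is just a call on the already-existing cache, for example (onSettingsChanged is an illustrative name):

    // Somewhere later in MyComponent: removeAll() cleans the local cache immediately,
    // and the REMOVE_ALL message is replicated to the other nodes asynchronously.
    public void onSettingsChanged() {
        myRemoteCache.removeAll();
    }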

Let's take a 2-node cluster with a cache replication event on Node1, and see when we could run into this exception.

  1. Both nodes are up and on both nodes the plugin is initialised and the cache is created (on node start): java.rmi.NotBoundException cannot happen.
  2. Node1 is up and Node2 is down (by down I mean Node1 can't connect to the RMI port on Node2): java.rmi.NotBoundException cannot happen.
  3. Node1 is up and Node2 is starting - the RMI port is open before all plugins are up (so before MyComponent was created on Node2): java.rmi.NotBoundException can happen, but this is a temporary situation. Node1 will retry n times (information in the log) to deliver these cache replication messages, assuming Node2 is starting and that this cache will be available soon, once the plugin is fully up.
  4. Both nodes are up but on Node2 the cache was not created: java.rmi.NotBoundException can happen.
    Possible explanations:
    1. Different versions of Jira are running on the nodes and one version is missing this plugin - this can happen when doing a "zero downtime upgrade" and is usually not a problem (in the case of the recommended remote caches replicated via invalidation).
    2. The remote cache is not created on node start but during the lifetime of the node - this is an error in the code, or it would need another mechanism (a cluster lock) to synchronise the creation of the cache on all nodes; not recommended anyway.

 

First I would like to understand which of the cases above you are seeing.

You say you are doing the following:

  1. create cache
  2. cache.removeAll()

 

  1. When does this sequence happen in the lifetime of your application?
  2. You are saying that delaying those operations affects whether you see this bug. I can't see why that would be the case, but I need you to answer the previous question first. "create cache" creates a cache on nodeX, then "cache.removeAll()" synchronously cleans this cache on nodeX and asynchronously cleans it on the other nodes; "create cache" on nodeX has nothing to do with creating caches on other nodes (and this exception is about other nodes), so adding a delay should not have any effect.
  3. If you are doing those 2 operations in sequence, I understand you are using caches replicated via copy - I would strongly advise you to reconsider the design and switch to caches replicated via invalidation; caches replicated via copy are generally not recommended (see the settings sketch below).
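
For reference, the difference between the two modes is just the replication setting chosen on CacheSettingsBuilder (a minimal sketch of the two settings objects):

import com.atlassian.cache.CacheSettings;
import com.atlassian.cache.CacheSettingsBuilder;

// Replicated via invalidation (recommended): only "this entry changed" messages cross
// the network, and each node reloads values from the shared source itself.
CacheSettings invalidating = new CacheSettingsBuilder().replicateViaInvalidation().build();

// Replicated via copy (generally not recommended): the cached values themselves are
// serialized and shipped to the other nodes.
CacheSettings copying = new CacheSettingsBuilder().replicateViaCopy().build();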

Please also have a look at this page to see the description of asynchronous cache replication in Jira: https://confluence.atlassian.com/enterprise/jira-data-center-cache-replication-954262828.html

Maciej Swinarski
Atlassian Team
February 7, 2019

Replying to your questions:

  1. Does the "Abandoning sending" message in the errors mentioned in my question mean that cache replication is not performed for that cache?

No, this means nodeX abandoned sending this particular message to nodeY. NodeX abandons sending a message to NodeY only when it assumes this plugin has already started there. It will still try to deliver the next replication messages for this cache.

  2. If yes, is it a permanent disabling of replication for that particular cache, or a disabling just for the specified operation? I.e., will subsequent cache changes be replicated?

It is disabled just for the specified operation, so subsequent cache changes will be replicated.

  3. Are there any ways to avoid these warnings without explicit delays? I'm afraid that such errors in customer logs will cause unnecessary stress for our support and for Atlassian's.

Explicit delays are definitely not needed; more details in the answer above.

Denis Malyshkin February 7, 2019

Maciej, thank you very much for the detailed description of Jira's cache behavior.

It seems that our case is 4.2 in your list above.

As I wrote in the initial question description, the cache is bound to our object, which is created at runtime. When the object is created, it sends our OBJ_CREATED event to the other nodes to let them know and to create the same objects on them as well (and so the objects on the other nodes create their own caches). Thus the detailed list of operations for a 2-node cluster is:

  1. Node 1 creates an object with the cache
  2. Node 1 sends an OBJ_CREATED event
  3. Node 2 receives the OBJ_CREATED event and creates the corresponding object with the cache
  4. Node 1 invalidates the cache by removeAll() call

The error occurs at step 4. A delay after step 2 eliminates the error.

Thus it seems that our issue is that step 3 may occur after step 4, due to the async nature of events and cache replication. Are the events and the cache replication messages sent through the same queue?  Our assumption was that an event and subsequent cache replications are received on another node in the same order as they were generated on the transmitting node. Is that a wrong assumption?

If you are doing those 2 operations in sequence, I understand you are using caches replicated via copy

Our cache creation code:

cacheFactory.getCache(<Unique cache name generation>,
        new DataLoader(),
        new CacheSettingsBuilder()
                .expireAfterAccess(1, TimeUnit.DAYS)
                .flushable()
                .maxEntries(cacheSize)
                .build());

CacheSettingsBuilder by default creates a cache with replicateViaInvalidation, so we do not use replication via copy.

Summarizing your answer, we should:

  • Either use a single cache created on the plugin startup (instead of the cache bound to a particular object).
  • Or use some syncing mechanism (like cluster locks) to be sure that cache has been created on all nodes before any changes to the cache.

Is the above summary correct?

Maciej Swinarski
Atlassian Team
February 7, 2019
  1. Node 1 creates an object with the cache
  2. Node 1 sends an OBJ_CREATED event
  3. Node 2 receives the OBJ_CREATED event and creates the corresponding object with the cache
  4. Node 1 invalidates the cache by removeAll() call

 

There is one missing step. How do you guarantee that step 3 is done before step 4?

Those operations are asynchronous. In step 2 you send an OBJ_CREATED event (via ClusterMessagingService?). If you would like to be sure in step 4 that step 3 was done, you would need to send an OBJ_CREATED_ACK from Node 2 to Node 1.

Then take into account having more nodes: you would need to track the number of nodes. Now, on top of that, add handling of misbehaving nodes (long GC, networking problems, etc.). With all those non-trivial mechanisms in place you would basically re-introduce synchronous operations across all nodes in the cluster, which simply can't work on a production cluster.

Our assumption was that an event and subsequent cache replications are received on another node in the same order as they were generated on the transmitting node. Is that a wrong assumption?

Cache replication messages from a single thread are guaranteed to be received on another node in the same order; but cluster events (via ClusterMessagingService?) are sent via another channel (the database), so there is no guarantee about the ordering of cluster messages relative to cache replication messages.
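
For illustration, an OBJ_CREATED event sent through ClusterMessagingService would look roughly like this (a sketch; the class name and channel are illustrative):

import com.atlassian.jira.cluster.ClusterMessagingService;

public class ObjCreatedPublisher {

    private static final String CHANNEL = "OBJ_CREATED";

    private final ClusterMessagingService messagingService;

    public ObjCreatedPublisher(final ClusterMessagingService messagingService) {
        this.messagingService = messagingService;
    }

    public void publish(final String objectId) {
        // Goes through the database-backed messaging channel, independently of the
        // local cache replication queue shown in the logs above.
        messagingService.sendRemote(CHANNEL, objectId);
    }
}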

 

  • Either use a single cache created on the plugin startup (instead of the cache bound to a particular object).
  • Or use some syncing mechanism (like cluster locks) to be sure that cache has been created on all nodes before any changes to the cache.

I really discourage option 2. Synchronous operations across the cluster will eventually fail in a way that affects the whole cluster (High Availability is gone).

Definitely go with option 1. 
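
A minimal sketch of option 1, assuming each object's cached data can be keyed by the object's id (all names here are illustrative, not an existing API):

import java.io.Serializable;
import java.util.concurrent.TimeUnit;

import com.atlassian.cache.Cache;
import com.atlassian.cache.CacheManager;
import com.atlassian.cache.CacheSettingsBuilder;
import com.atlassian.plugin.spring.scanner.annotation.imports.ComponentImport;
import org.springframework.stereotype.Component;

@Component
public class GitManagerDataCache {

    private final Cache<String, Serializable> cache;

    public GitManagerDataCache(@ComponentImport final CacheManager cacheManager) {
        // A single cache for all objects, created on plugin start on every node,
        // so it is already bound on the destination node before any replication message arrives.
        this.cache = cacheManager.getCache(
                "com.example.gitmanager.data.cache",
                null,
                new CacheSettingsBuilder()
                        .replicateViaInvalidation()
                        .expireAfterAccess(1, TimeUnit.DAYS)
                        .build());
    }

    public Serializable get(final String objectId) {
        return cache.get(objectId);
    }

    public void put(final String objectId, final Serializable value) {
        cache.put(objectId, value);
    }

    // Called when an object's parameters change; only that object's entry is invalidated,
    // instead of calling removeAll() on a per-object cache created at runtime.
    public void invalidate(final String objectId) {
        cache.remove(objectId);
    }
}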

I don't really know your use case, but I am curious why you would want to run cache.removeAll() on a cache replicated via invalidation. Is this a cache of values from a shared resource between the nodes (database, a shared folder, a file)?

Denis Malyshkin February 7, 2019

via ClusterMessagingService?

Yes, of course.

I don't really know your use case, but I am curious why you would want to run cache.removeAll() on a cache replicated via invalidation. Is this a cache of values from a shared resource between the nodes (database, a shared folder, a file)?

Yes, this cache contains values from a shared resource. When the resource parameters are changed, we call cache.removeAll() to be sure that there are no incorrect values left in the cache.

We caught the NotBoundException error when the resource parameters were changed just after the resource was created.

We may try to avoid such cases on creation, but that does not guarantee the error will not come back under other circumstances. For example, a user might update the resource parameters very quickly after creation while delivery of the creation event is delayed for some reason.

That is why I asked here, to find the most appropriate way of using the cache.

Thank you very much for helping us understand this case!

