What Jira operations cause synchronous activity in RMISynchronousCacheReplicator?

dmitri October 24, 2016

We suffered a brief network partition in our JIRA Data Center setup. Among other exceptions, we noticed RMISynchronousCacheReplicator in some stack traces on HTTP threads. When do user-facing threads cause synchronous activity? Can this be configured to be asynchronous?

Stack trace from our incident:

java.net.ConnectException: Connection timed out
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619)
        at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
        at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:130)
        at net.sf.ehcache.distribution.RMICachePeer_Stub.remove(Unknown Source)
        at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.replicateRemovalNotification(RMISynchronousCacheReplicator.java:243)
        at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.notifyElementRemoved(RMISynchronousCacheReplicator.java:229)
        at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementRemoved(RegisteredEventListeners.java:144)
        at net.sf.ehcache.event.RegisteredEventListeners.notifyElementRemoved(RegisteredEventListeners.java:124)
        at net.sf.ehcache.Cache.notifyRemoveInternalListeners(Cache.java:2322)
        at net.sf.ehcache.Cache.removeInternal(Cache.java:2305)
        at net.sf.ehcache.Cache.remove(Cache.java:2207)
        at net.sf.ehcache.Cache.remove(Cache.java:2125)
        at net.sf.ehcache.constructs.EhcacheDecoratorAdapter.remove(EhcacheDecoratorAdapter.java:154)
        at com.atlassian.cache.ehcache.LoadingCache.remove(LoadingCache.java:208)
        at com.atlassian.cache.ehcache.DelegatingCache.remove(DelegatingCache.java:135)
        at com.atlassian.jira.avatar.CachingTaggingAvatarStore.getByIdTagged(CachingTaggingAvatarStore.java:104)
        at com.atlassian.jira.avatar.AvatarManagerImpl.getByIdTagged(AvatarManagerImpl.java:117)
        at com.atlassian.jira.avatar.AvatarManagerImpl.lambda$transformToJIRAAvatar$480(AvatarManagerImpl.java:421)
        at com.atlassian.fugue.Option$Some.fold(Option.java:421)
        at com.atlassian.fugue.Option.flatMap(Option.java:287)
        at com.atlassian.jira.avatar.AvatarManagerImpl.transformToJIRAAvatar(AvatarManagerImpl.java:421)
        at com.atlassian.jira.avatar.AvatarManagerImpl.processAvatarData(AvatarManagerImpl.java:376)
        at com.atlassian.jira.avatar.AvatarManagerImpl.readAvatarData(AvatarManagerImpl.java:488)
        at com.atlassian.jira.web.servlet.AvatarToStream.sendAvatar(AvatarToStream.java:29)
        at com.atlassian.jira.web.servlet.AbstractAvatarServlet.defaultDoGet(AbstractAvatarServlet.java:82)
        at com.atlassian.jira.web.servlet.ViewUserAvatarServlet.defaultDoGet(ViewUserAvatarServlet.java:39)
        at com.atlassian.jira.web.servlet.AbstractAvatarServlet.doGet(AbstractAvatarServlet.java:39)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:622)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)

1 answer

Answer accepted
crf
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
October 24, 2016

When do user-facing threads cause synchronous activity?

Essentially, any write operation to a synchronous cache, or the initial access to a replicated cache that has not yet bootstrapped values from another node.  At one point in time the Users and Groups caches behaved this way, but I think that was fixed in 7.0.  I have 7.1.0 checked out and can tell from the code that it is definitely fixed there, so the only cache I know of that would do this is the TaskManager's.

Can this be configured to be asynchronous?

It definitely can be configured in the code:

    https://docs.atlassian.com/atlassian-cache-api/2.11.3/atlassian-cache-api/apidocs/index.html

I could be wrong, but my understanding is that we use asynchronous replication by default and that only caches that explicitly opt in to being synchronous will replicate in that manner.  The only such cache I'm aware of offhand is the one for the TaskManager, which is one of the few value-replicating caches in JIRA (most replicate invalidation only).
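
For reference, a cache opts in to its replication behaviour when it is created through the Atlassian Cache API. Roughly, it looks like the sketch below; the method names follow the CacheSettingsBuilder from the docs linked above and the cache name is made up, so treat this as an illustration rather than JIRA's actual code:

import com.atlassian.cache.CacheSettings;
import com.atlassian.cache.CacheSettingsBuilder;

// Illustrative sketch only: settings for a cache that opts in to asynchronous,
// invalidation-only replication.
final CacheSettings settings = new CacheSettingsBuilder()
        .remote()                        // replicate across the cluster
        .replicateViaInvalidation()      // ship invalidations, not values
        .replicateAsynchronously()       // don't block the thread doing the write
        .build();
// These settings are then passed to CacheManager.getCache(name, loader, settings)
// when the cache is created.

A cache that instead asks for value-copying, synchronous replication is the kind that can block an HTTP thread the way you saw.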

You should be able to tell from a thread dump which cache was doing synchronous replication.  Without knowing which caches were involved, it isn't really possible to fairly evaluate what the potential consequences of changing them from synchronous to asynchronous might be.  It could be that the cache's author was just being paranoid, or it could be that asynchronous replication destabilized behaviour for their code, so they specified synchronous replication as a workaround.


dmitri October 25, 2016

Thanks. Indeed, it looked to be a removal from a cache related to avatars.

crf
Atlassian Team
October 25, 2016

Hrmmmm... That isn't a cache I would expect to need to be synchronous, so I may be mistaken about the default?  I don't work on JDC anymore, so this is mostly from memory.  I do know that avatars were moved from the core to a plugin, so there may be something that changed during that process as well.  I'd probably need a stack trace to figure out whether it's peculiar to the avatar cache or not.

dmitri October 25, 2016

I added the full stack trace.

crf
Atlassian Team
October 25, 2016

From the stack trace and a bit of digging, my guesses are:

  1. I don't see any custom settings on this cache, so we seem to be using synchronous replication by default.  There may be an opportunity to improve performance here by making cache replication asynchronous instead, but that could also cause instability if parts of the system don't respond as well to eventual consistency semantics, so it would need a lot of testing.
  2. This particular cache is doing a removal (and thus a replicated event) on every read request, which is just plain dumb.  You should report this to support.  The nasty code is this (from CachingTaggingAvatarStore):

@Override
public Avatar getByIdTagged(final Long avatarId) {
    try {
        return taggedAvatars.get(avatarId, () -> tagAndRetrieve(avatarId)).orElse(null);
    } finally {
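        // Note: this runs on every read, even on a cache hit, so each request
        // triggers a replicated removal of the untagged avatar.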
        avatars.remove(avatarId);
    }
}


That `remove` probably isn't necessary at all.  The point is that once an avatar is "tagged" with the metadata that confirms it came from JIRA, we don't need to hold onto the untagged version of it anymore.  But on a cache hit, that untagged one is already gone, so the remove on the untagged avatar is just wasteful.  It could be moved inside the loading function or possibly removed altogether, depending on what the surrounding code does (I didn't dig in further than that).
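
For what it's worth, here is a rough, untested sketch of what moving that eviction inside the loading function might look like, keeping the same taggedAvatars/avatars fields and the tagAndRetrieve helper from the snippet above (the real fix is up to the team that owns this code):

@Override
public Avatar getByIdTagged(final Long avatarId) {
    // Sketch only: the loading function runs solely on a cache miss, so the
    // replicated removal of the untagged avatar fires once per miss instead
    // of on every read.
    return taggedAvatars.get(avatarId, () -> {
        try {
            return tagAndRetrieve(avatarId);
        } finally {
            avatars.remove(avatarId);
        }
    }).orElse(null);
}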


crf
Atlassian Team
October 25, 2016

In short, I think you saw these stack up not because they are synchronous, but because these replications are unnecessary and excessively frequent.

dmitri October 25, 2016

Can you suggest the right support address for this?

crf
Atlassian Team
October 25, 2016

I think https://support.atlassian.com/servicedesk/customer/portal/22 is the correct place.  (And if it isn't, they'll know what is and forward it on).

dmitri October 25, 2016

Thanks so much.
