JIRA's GC problems

Dmitrii Apanasevich
Rising Star
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 16, 2015

Hi!

We have JIRA 5.0.4 instance with lot of issues (~500K).

Today I have faced with concurrent mode failure:

430135.067: [CMS2015-02-17T10:37:10.868+0300: 430138.228: [CMS-concurrent-mark: 5.335/13.843 secs] [Times: user=53.87 sys=1.92, real=13.84 secs] 
 (concurrent mode failure): 8388608K->8388607K(8388608K), 39.5041120 secs]430174.572: [Class Histogram
 num     #instances         #bytes  class name
----------------------------------------------
   1:      38454698     2420770448  [C
   2:      19789035     2315535216  [Ljava.lang.Object;
   3:      38469185     1231013920  java.lang.String
   4:      23047100     1106260800  org.apache.lucene.document.Field
   5:      34057895     1089852640  java.util.HashMap$Entry
   6:      17937773      430506552  java.util.ArrayList
   7:       4308072      364018032  [Ljava.util.HashMap$Entry;
   8:       4240606      203549088  java.util.HashMap
   9:       5898413      141561912  java.lang.Long
  10:        193169      112915568  [B
  11:        148400      108985416  [I
  12:       1770331       99138536  org.ofbiz.core.entity.GenericValue
  13:            37       79666304  [Ljava.util.Collection;
  14:       1790656       57300992  java.util.Vector
  15:        348579       57289648  <constMethodKlass>
  16:       2021766       48522384  java.util.LinkedList$Entry
  17:        348579       47422664  <methodKlass>
  18:         38630       42971760  <constantPoolKlass>
  19:        560925       31411800  com.atlassian.jira.issue.DocumentIssueImpl
  20:       1881905       30110480  java.lang.Integer
  21:         38630       28972744  <instanceKlassKlass>
  22:       1784142       28546272  java.util.HashMap$EntrySet
  23:        414247       25311624  <symbolKlass>
  24:         33303       24114552  <constantPoolCacheKlass>
  25:        474338       15178816  java.lang.ThreadLocal$ThreadLocalMap$Entry
  26:        584090       14018160  org.apache.lucene.search.ScoreDoc
  27:        560926       13462224  org.apache.lucene.document.Document
  28:         23258       13071016  <methodDataKlass>
  29:        360203       11526496  java.util.concurrent.ConcurrentHashMap$HashEntry
  30:        278817       11152680  java.lang.ref.SoftReference
  31:        217163       10423824  org.apache.velocity.runtime.parser.Token
  32:        236301        9452040  org.apache.lucene.index.TermInfo
  33:        367843        8828232  java.util.concurrent.locks.ReentrantReadWriteLock$Sync$HoldCounter
  34:         89345        7862360  java.lang.reflect.Method
  35:         60222        7186440  [S
  36:          2003        7022208  [J
  37:        164389        6575560  java.util.LinkedHashMap$Entry
  38:        250402        6009648  org.apache.lucene.index.Term
  39:        249530        5988720  java.util.LinkedList
...

This happened several times before I restarted JIRA. 

gclog.jpg

JVM settings are:

JAVA_OPTS="$JAVA_OPTS "-server" "-Xincgc" "-XX:NewSize=2048m" "-XX:MaxNewSize=2048m" "-XX:PermSize=512M" "-XX:+CMSClassUnloadingEnabled" "-XX:+CMSClassUnloadingEnabled


JVM_MINIMUM_MEMORY="10g"


JVM_MAXIMUM_MEMORY="10g"


JVM_EXTRA_ARGS="-XX:+PrintGCDateStamps -XX:ReservedCodeCacheSize=128M -Xloggc:/var/log/jira/gc.log -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintClassHistogramBeforeFullGC -XX:+PrintClassHistogramAfterFullGC -XX:SurvivorRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=16 -XX:ParallelCMSThreads=12 -XX:+CMSScavengeBeforeRemark -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60"

 

What happened with Old Gen? Why there are so many lucene objects?

atlassian-jira.log doesn't show anything, there are usual warnings and errors, like "RHS of #set statement is null". No info about indexing. No exceptions.

 

Can anybody help to find where is the problem?

 

Thanks in advance!

2 answers

1 accepted

1 vote
Answer accepted
crf
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
February 17, 2015
1:      38454698     2420770448  [C
   2:      19789035     2315535216  [Ljava.lang.Object;
   3:      38469185     1231013920  java.lang.String
   4:      23047100     1106260800  org.apache.lucene.document.Field

 

This indicates to me that there is a very large search running that is likely trying to pull your entire Lucene database into memory at once.  This usually comes from poorly behaved plugins, but there are ways to construct very nasty JQL that could also do this.

The old gen fills because there is a soft values cache that tries to hold onto Lucene data for as long as a particular state of the Lucene index is still in use.  This is a huge performance benefit, but the fact that it is a soft values cache means that the JVM will only release the values when the last reference closes or when memory is otherwise exhausted and the GC is forced to release some of it.  Normally this is not a problem because in-flight searches don't last long enough for it to build up.

Some things you can try:

  • Check atlassian-jira-slow-queries.log to see if there are any JQL logged as taking an unreasonably long time.  It's hard to say exactly what is reasonable, but certainly any value over 30000 ms is going to cause serious problems.
  • Check your activity log for the requests around the time memory started increasing to see if you can find a culprit.
  • Generate a thread dump to see if there is a particular thread or set of threads that seems to be stuck in Lucene for a particularly long time.  You may be able to track down a plugin or JQL function that is triggering this from the stack traces.
  • Generate a heap dump to see if a single thread is holding on to the memory.

 

 

Dmitrii Apanasevich
Rising Star
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 18, 2015

Thanks for your answer Chris! slow-queries log shows me lot of queries from suspicious plugin. There are about 15 queries a duration of 500 ms for only one http request. There was 3 such requests at the time memory started increasing. And some other slow queries. I have disabled that plugin. Could you please explain how to see which thread is holding on to the memory? I have created heap dump and open it in visualvm. What is the next step?

crf
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
February 18, 2015

The slow query log triggers for anything 400ms or longer, but while a 500ms query is "slow", a single one on a page probably would not be all that noticeable. Did disabling the plugin actually improve things? The heap dump tools vary considerably. I usually use Eclipse MAT and don't really know my way around VisualVM. However, there should be some options to find the largest objects by the size of heap that they retain. MAT calls this "dominators" but I don't think VisualVM has any special name for it. When there is a single thread performing a long-running and excessively large search, it usually shows up here.

crf
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
February 18, 2015

Just re-read your comment... 15 * 500ms would indeed be very suspicious. If the plugin has a configuration page where you can enter the JQL queries it uses, then you may be able to improve the actual queries instead of disabling the plugin. For example, there are some expressions like "filter IN (...)" or "parent(...)" that are particularly expensive because they implicitly perform nested searches. Sometimes these questions can be asked another way that avoids the problem.

Dmitrii Apanasevich
Rising Star
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 19, 2015

Yes, that plugin is configurable. I'll rewrite search logic and also change JQL queries used in them. But that was nor the real cause of problem. Thanks a lot for the tip about Eclipse MAT. Powerful tool. Today CMS fails again and using MAT i definitely found a JQL function which leads to increase memory. Your answers were very helpful!

crf
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
February 19, 2015

Happy to help! Eclipse MAT is very powerful, but it isn't very user-friendly, which may make it harder to get started with. Most of the developers at Atlassian tend to prefer YourKit for this kind of thing. I would be interested in knowing which JQL function was causing the problem in this case.

Dmitrii Apanasevich
Rising Star
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 19, 2015

It was our custom JQL function to perform JQL subquery. Nested search was used unlimited filter. Foolish mistake.

0 votes
philleicht
Rising Star
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 17, 2015

Hey Dmitrii,

the following might not be the solution to your problem, but my frank idea of what would make sense in this situation.
Considering the size of your instance I assume it belongs to a larger corporation and has by now gained significant business value and importance. Due to this, I would suggest to involve an Atlassian Expert specialized in performance tuning to thoroughly analyse your instance and its resource footprint to give a proper consultation on how to adjust the configuration in a sustainable manner. Fighting the symptom might work for now, but it might not guarantee stability over the long run.

Being part of a platinum expert with several experts in this field that also have experience with instances of such volume, feel free to ping me when this is an approach you want to consider.

I hope I don't come off as greedy for the sales opportunity, but genuinely supportive with your troublesome situation.

Regards,
Philipp





MattS
Rising Star
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 19, 2015

Philipp, I think it's ok to remind people of what Atlassian Experts can help with. I tell people who I work (ServiceRocket) or or put it in my name field and then add a link to http://experts.atlassian.com That way we don't all have to chime and say "me too, me too!"

MattS
Rising Star
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 19, 2015

And ideas from a core JIRA dev trump most experts anyway :)

philleicht
Rising Star
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 19, 2015

Hey Matt, thanks for your support. For me, personally, it's a bit of a balancing act to mention experts and the possible benefits on getting one on board but not give the impression of using this platform to hunt for leads. Of course there are always Devs from Atlassian looking into this and would naturally be able to answer/fix the problem. However, as you said, I think sometimes it makes sense to suggest to seek an expert to get bespoke service to not only tackle the problem but also avoid it from re-occuring in the future and unneccessarily risking tremendous business impacts (if applicable). I intentionally didn't mention my company as I only wanted to make the suggestion of considering getting an expert on board (and that we *would* be capable :)) instead of - as you say - shout "come to <company name>, you get cookies on top!". Using the company name in the name has some side-effects that I'm not sure I would like to take. But it's a thing I'm still considering. :) Regards, Philipp

Suggest an answer

Log in or Sign up to answer