So here's an interesting problem I've been chasing. If we leave our host running for about three months, the system eventually runs out of memory and crashes. (30GB instance with an Xmx of 20GB and an off-server DB.)
The memory usage climbs to 29GB and floats there until the Linux out-of-memory (OOM) killer selects Jira and kills it. (I've since configured our service to restart automatically.)
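If you want to confirm that the OOM killer was the culprit, the kernel log is one place to check (a rough sketch; the exact message wording varies by distro and kernel version):

dmesg -T | grep -i -E "out of memory|killed process"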
Here are the investigation steps I used, in case this is helpful:
First, let's print out the heap histogram on the live running service using jcmd. This runs really quickly:

<PATH_TO_JDK>/bin/jcmd <JAVA_PID> GC.class_histogram | more
 num     #instances         #bytes  class name (module)
   1:      13413784      643861632  java.lang.ThreadGroup (java.base@11.0.16.1)
   2:       6956154      577988424  [B (java.base@11.0.16.1)
Look who's #1, java.lang.ThreadGroup.
Next, we can perform a heap dump of all live objects, also using jcmd, and open it in Eclipse Memory Analyzer. With our 20GB Xmx heap it took about a minute to finish writing the dump, which is not bad at all.
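For reference, the dump command looks roughly like this (the output path is just an example; by default GC.heap_dump writes only live objects, and you can pass -all to include unreachable ones):

<PATH_TO_JDK>/bin/jcmd <JAVA_PID> GC.heap_dump /tmp/jira-heap.hprof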
Eclipse Memory Analyzer points to java.lang.ThreadGroup as the suspect as well, but when we drill into it, we see tons of references to ClassGraph.
Now, to see where ClassGraph is being used, I just do a grep on my plugins:
grep -R "ClassGraph" *
Binary file installed-plugins/plugin.7509544786347732491.groovyrunner-7.7.0.jar matches
Thinking maybe it's some custom ScriptRunner code that's causing it, I set up a brand new Jira locally without any custom ScriptRunner code (just the built-in scripts).
I was able to reproduce the same leak by accessing ScriptRunner's Listener settings page and printing out the live heap histogram... sure enough, java.lang.ThreadGroup keeps growing.
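If you want to watch the growth in real time, a quick-and-dirty loop over the histogram does the job (just a sketch, hitting the settings page between samples; adjust the interval to taste):

while true; do
  date
  <PATH_TO_JDK>/bin/jcmd <JAVA_PID> GC.class_histogram | grep 'java.lang.ThreadGroup'
  sleep 60
done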
I've sent this info to Adaptavist Support, and if I hear any updates, I'll share them here. If you've experienced something similar, I'd love to hear about it!
Update from Adaptavist: the bug is fixed! Once it ships, you can check for the release version here.
Hey @David Yu
I'm Reece, technical lead for ScriptRunner for Jira.
I admire your investigation skills. This does indeed appear to be a very slow leak, slow enough that nobody has made the connection before now, so kudos to you!
I believe I have reproduced this locally and have a heap dump. My initial hunch is that this is a bug in the ClassGraph library: it appears not to be explicitly destroying its thread groups, nor setting those groups as daemons.
ScriptRunner repeatedly makes calls to ClassGraph in one part of the codebase, which is not the standard usage pattern and may explain why this bug in the library had gone unidentified.
I see you have raised a support ticket, I'll chat with the rest of the team tomorrow and get that escalated for you.
Thank you once again for your persistence in finding reproduction steps for us, you've likely saved me days/weeks of effort.
Cheers!
Reece