Confluence stops completely. Not finding significant information in the logs

Tracy January 23, 2023

Our Confluence has been going down 1-2 times per week for a while now. I'm not seeing much in the logs that would pinpoint why. Today I checked the catalina.out, conf_access_log, and atlassian-synchrony-proxy log files, but nothing jumped out at me.

One interesting thing I noted is that there is consistently a gap in the logging between the time it went down and the time we started Confluence back up.

As an example, today in the catalina.2023-01-23.log file there was an entry at 23-Jan-2023 19:41:13.830 and no further entries until 23-Jan-2023 20:04:37. Our log monitor alerted us to the outage at 2:57 PM EST. Confluence was brought back up at 23-Jan-2023 20:04:37.062.

Any thoughts as to why this is happening? Is there something specific in the logs I should be looking for? 

Thanks in advance for any assistance you can provide.

2 answers

2 votes
Robert Wen_ReleaseTEAM_
Community Leader
January 23, 2023

Hello @Tracy ! Welcome to the Atlassian Community!

This is only a hunch, but if you're on Linux, have you checked /var/log/messages to see if the oom-killer killed the Confluence process to reclaim memory?
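
If that file isn't there (some distros only keep the kernel log in the systemd journal), a couple of standard commands should still show whether the kernel killed the process. Just a sketch - adjust the time window to when the outage happened:

# look for oom-killer activity in the kernel ring buffer
dmesg -T | grep -iE "killed process|out of memory"

# on systemd-based distros with no /var/log/messages, check the journal instead
journalctl -k --since "yesterday" | grep -iE "oom|killed process"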

If not, feel free to detail specifics of your system.

Tracy January 25, 2023

I have checked, and I do not have a /var/log/messages file.

We just had another outage about an hour ago. These are the last entries in the catalina.out file prior to the outage:

 

25-Jan-2023 18:17:34.115 WARNING [http-nio-8090-exec-13] com.sun.jersey.spi.container.servlet.WebComponent.filterFormParameters A servlet request, to the URI xxxx, contains form parameters in the request body but the request body has been consumed by the servlet or a servlet filter accessing the request parameters. Only resource methods using @FormParam will work as expected. Resource methods consuming the request body by other means will not work as expected.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f672068f000, 16384, 0) failed; error='Not enough space' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid645.log

 

I checked the hs_err_pid645.log file but there were no timestamps to correlate to the outage.

1 vote
Nic Brough -Adaptavist-
Community Leader
January 24, 2023

Welcome to the Atlassian Community!

Confluence is written in Java, and although it is better at memory handling than C-based languages, it still has many (frankly <swear word> annoying) quirks. As a coder, I don't believe I should have to think about memory most of the time (so I never try to code in C-based rubbish), but Java and the other "more modern" languages still make me think about it. (Yes, Python, I'm looking at you. Stop letting me break stuff.)

To diagnose this problem, I would have to guess a bit, and take a gamble on: 

  • 70%: What @Robert Wen_ReleaseTEAM_ said (and the pointer to "read the logs for the application, the application server, and what the operating system was doing" is definitely the next step)
  • 29%: Java is doing a big "garbage collection". If a JVM runs out of memory, it looks for any memory allocation that is no longer in use and tries to grab it back. But while it does that, everything else it is doing stops dead. Very dead. It's not "I'm busy, go away", it's "stop the world". (If you want to see whether these pauses line up with your outages, there is a GC-logging sketch after this list.)
  • 1%: It's something else
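
If you want to check whether those stop-the-world pauses coincide with the outages, GC logging will show them. The right flag depends on which Java version your Confluence runs on, and the log paths below are only examples - they go wherever your other JVM options live (setenv.sh on a default install):

# Java 11 and later (unified GC logging)
-Xlog:gc*:file=/opt/atlassian/confluence/logs/gc.log:time,uptime

# Java 8
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/atlassian/confluence/logs/gc.log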

 

As Robert said, we really do need the logs, especially when the service was not responding.

Nic Brough -Adaptavist-
Community Leader
January 25, 2023

"There is insufficient memory" is the "smoking gun".

The process simply ran out of working space in the server's memory.
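
errno=12 here is the operating system refusing to hand the JVM any more native memory, so it isn't only the Java heap - the whole box (or its commit limit) has run out of headroom. Before changing anything, it is worth a quick look at what the machine actually has available. These are plain Linux commands, nothing Confluence-specific:

free -h                        # physical memory and swap, human-readable
swapon --show                  # is any swap configured at all?
grep -i commit /proc/meminfo   # CommitLimit / Committed_AS, useful if there is no swap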

There is a huge amount of underlying investigation you need to do here, but a good starting point is to perform a very simple test and do a little monitoring.

You say your server is dying once or twice a week - that's a very fuzzy metric, but it is good enough for your first round of testing.

Start with a quick look at Admin -> System -> System info.  Look for the memory usage, specifically the maximum heap size. 

You are going to need to increase that maximum.  Talk to the server admins - you are going to need to get them to change the maximum heap, while not breaking the server operating system or other stuff running on it.  

People often make the mistake of going too far too fast. "Our Jira was running with a 2 GB heap, it went wrong, so we increased it to 8 GB" - whilst that will probably solve, or at least help with, "out of memory" errors, there is a very good chance it will introduce others.

Increase the heap by a moderate amount, remembering to leave room for everything else, and monitor the results.   

The monitoring is then "does the gap between crashes get longer, or does it even stop crashing?". Even just "hey, it now feels like it crashes every 8-10 days, rather than 1-2 times a week" tells us we are looking at the right thing.
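
One low-effort way to gather that data is to record memory usage on a schedule, so you can line it up with the next crash. A minimal sketch - the log path is just an example:

# crontab entry: append a timestamp and a memory snapshot every 5 minutes
*/5 * * * * (date; free -m) >> /var/tmp/confluence-memory.log 2>&1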

("Leaving room" is not a simple subject either but an oversimplified example:  Imagine a machine with 8Gb of physical RAM.  The only thing installed on it is Jira, even the database is on another machine.  If it is a Unix-like operating system, you can go up to a 7Gb heap size because the OS can run fine in 1Gb with a bit of tuning.  But I'd recommend giving it at least 2Gb, so you don't have to worry.  Never try to run Windows with less than 4Gb - do not allow Jira more than 4Gb heap on an 8Gb Windows system)

Without your numbers for the current heap size, I can't give you a real number, but I will say "never go up by more than 25% of what you have now". As a system gets bigger, reduce that percentage a bit. If your heap is currently 2 GB, go to 2.5 GB, then 3, then 3.5. If it's 4 GB, go to 5 GB, then 6. If your heap is 16 GB, go to 18 GB, then 20.

(Also compare the -Xms and -Xmx settings; ideally these should be set to the same number, as it helps the service start faster and reduces the work the JVM does in some places.)
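
On a default Confluence install those flags normally live in <confluence-install>/bin/setenv.sh (setenv.bat on Windows), set through CATALINA_OPTS. A sketch of what the change might look like, assuming a current 2 GB heap being raised to 2.5 GB as above - your numbers will differ, and Confluence needs a restart to pick it up:

# <confluence-install>/bin/setenv.sh - example values only
CATALINA_OPTS="-Xms2560m -Xmx2560m ${CATALINA_OPTS}"
export CATALINA_OPTS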

Tracy January 26, 2023

First, let me thank you for the very detailed answer above! I'm rather new to Confluence administration, so the details are very helpful.

Here's a snapshot of several pieces of information I got from the command line. The ones at the bottom of the page are the output of 'top' at several different times. [screenshot: 1-26-2023 11-54-20 AM.jpg]

Here's the config page (blanked out the server base URL just in case I shouldn't be sharing that): [screenshot: 1-26-2023 11-57-20 AM.jpg]

And here are screenshots of the memory stats in Confluence at two different points in time: [screenshots: 1-26-2023 12-09-50 PM.jpg, 1-26-2023 12-09-06 PM.jpg]

 

Anything else you need that might be helpful?

