We have 2 nodes in our Jira cluster. Most of the time I see uneven load between the nodes. The first node always has more load than the second node (e.g. 200% vs. 10%), even though they are in the same subnet and have the same configuration. The load balancer uses "sticky sessions". Do any of you have the same problem?
The question is rather what causes that load. Are you sure it is due to the load balancer distributing users to that node and "pretty much ignoring" the second node?
All things being equal, even if we assumed it is purely down to user distribution, you would still end up over 100% on both nodes if the users were spread evenly (your 200% + 10% split evenly is still roughly 105% per node).
Rather, I would suspect something else is eating up the CPU - e.g. background jobs, automation, buggy/looped code, a terrible plugin, etc. I truly doubt you will get over 200% purely because of the number of users (and if you do, that node is severely under-specced, especially once you account for failover: if the other node goes down, all of its users shift onto the one remaining node).
Sticky sessions are somewhat problematic for user distribution in general (coupled with the "remember me" token often missing for users, so they get logged out when you shift them to another node). Seeing higher traffic on one node because it holds the majority of the users is relatively common, but reported load that high is not.
You could look into lowering the session timeout (so the sticky session gets invalidated and the user is balanced to a node again after re-logging), or see if you can get any stats from your load balancer - you might find some tweaks to make it distribute users more evenly. Still, I highly doubt it is due to the users alone; I think you have something rotting in there and eating up that CPU. The toy sketch below illustrates the pinning effect.
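Just to illustrate that "pinning" effect with a toy model (this is not Jira or load-balancer code - the behaviour, node names and numbers are assumptions purely for the sake of the sketch): users who get stuck to one node while the other is unavailable stay there until their session dies, and a shorter session lifetime gives the balancer a chance to spread them out again.

```python
# Toy model (nothing Jira- or LB-specific, numbers made up): users pinned to
# node1 while node2 was unreachable stay pinned for as long as their sticky
# session lives. A shorter session lifetime forces a re-login, at which point
# the balancer can spread them out again.
import random

def simulate(session_lifetime, ticks=600, users=200, node2_up_at=100):
    sessions = {}                                    # user -> (node, expires_at)
    for tick in range(ticks):
        available = ["node1"] if tick < node2_up_at else ["node1", "node2"]
        for user in range(users):
            if random.random() < 0.2:                # user sends a request
                node, expires_at = sessions.get(user, (None, -1))
                if node not in available or expires_at < tick:
                    # Session gone (user re-logs in) -> pick the least-loaded node.
                    counts = {n: 0 for n in available}
                    for n, exp in sessions.values():
                        if n in counts and exp >= tick:
                            counts[n] += 1
                    node = min(counts, key=counts.get)
                    sessions[user] = (node, tick + session_lifetime)
                # Otherwise the sticky cookie keeps routing the user to `node`.
    live = {"node1": 0, "node2": 0}
    for node, expires_at in sessions.values():
        if expires_at >= ticks:
            live[node] += 1
    return live

random.seed(1)
print("long session lifetime :", simulate(session_lifetime=1000))  # all stuck on node1
print("short session lifetime:", simulate(session_lifetime=60))    # roughly even again
```

But again, even a perfect 50/50 split would only halve the problem here - it does not explain 200% on a single node.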
Hi. Thanks for the detailed answer.
You say something is eating up the CPU on the first node, but wouldn't the load balancer then direct new users to the second node if the first one was heavily loaded? Very few users are directed to the second node.
We have background jobs, but they mostly run at night. We also have a lot of automation rules. Do they always run on the same node?
wouldn't the load balancer then direct new users to the second node if the first one was heavily loaded?
It depends on how your load balancer is set up. This is not so much about Jira as about the individual LB solution and its configuration. If you browse through https://confluence.atlassian.com/enterprise/traffic-distribution-with-atlassian-data-center-895912660.html you'll see there are many factors - especially sticky sessions: once you have a session, you will generally only be routed to that one node, unless it becomes unavailable and the LB "has to" route you elsewhere. Otherwise it just sees the sticky cookie, and that's all it wants to hear. It could just as well be that your LB does no balancing logic at all and simply routes everyone to the first node in the list; it depends entirely on its configuration.
Rather than guessing at user ratios, you can likely get better stats/monitoring on user distribution from the LB itself. A rough sketch of the sticky routing logic follows below.
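To put the sticky logic into pseudo-code (a generic sketch, not how any particular LB product implements it - the "ROUTE" cookie name and the policy names are made up for illustration):

```python
# Generic sketch of a cookie-based sticky load balancer's routing decision.
# Cookie name and policy names are invented for the example.
from itertools import cycle

NODES = ["node1", "node2"]
_round_robin = cycle(NODES)

def pick_node(request_cookies, healthy_nodes, policy="round_robin"):
    pinned = request_cookies.get("ROUTE")        # sticky cookie, if present
    if pinned in healthy_nodes:
        # Sticky wins: the LB does not look at how loaded that node is.
        return pinned
    # No valid sticky cookie -> fall back to the balancing policy.
    if policy == "first_available":              # degenerate "balancing"
        return healthy_nodes[0]
    while True:                                  # round robin over healthy nodes
        candidate = next(_round_robin)
        if candidate in healthy_nodes:
            return candidate

# A user pinned to node1 keeps landing on node1, however busy it is:
print(pick_node({"ROUTE": "node1"}, healthy_nodes=["node1", "node2"]))  # node1
# Only when node1 becomes unavailable does the sticky user get moved:
print(pick_node({"ROUTE": "node1"}, healthy_nodes=["node2"]))           # node2
```

The point being: once the cookie is set, the load on the target node plays no part in the decision, so a busy node stays busy.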
We also have a lot of automation rules. Do they always run on the same node?
I can't tell for sure since that's more of an Atlassian dev area, but to my knowledge they mostly run on the same node as the trigger. The "execute this rule immediately" option even runs inside the user's HTTP thread. As for background triggers, I think they will mostly run on the same node (and each node has its own thread pool for rule execution), but I also think it is possible for them to run on another node entirely (scheduled rules, for example). So it's "mostly", but they definitely can run on any node.
All in all, I would start by finding out which threads consume a lot of CPU; based on the stack traces you can usually come up with a suspect or two and narrow things down a bit. It often comes down to recognizing packages and locks, though - it's not always clear-cut, and you might end up blaming something that is a symptom rather than the cause.
Ideally, you might want to generate a support zip for each of the nodes and raise a support ticket with Atlassian - they know where and what to look for, can explain what is eating that CPU and why, and will help you smooth it out. They're generally pretty good at this, as they dig through insane amounts of this data on a daily basis.
Thread dumps + top are also pretty easy to capture with the script here: https://bitbucket.org/atlassianlabs/atlassian-support/src/master/ - and that data is really useful to look at under high load. The support zip's native thread dumps are somewhat lacking, so I usually prefer to run the tdump/top script separately from the support zip; it's just better data for thread analysis (e.g. in Watson: https://drauf.github.io/watson - though again, you might draw false conclusions if you haven't dug through Jira thread dumps before, so an Atlassian ticket will be a faster and more comprehensive health check). See the sketch below for a quick way to correlate the two outputs yourself.
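If you want to connect the two data sets yourself, something along these lines does the job (a rough sketch - the file names are placeholders, and it assumes the usual `top -H -b -n 1 -p <jira_pid>` batch output plus a jstack-style dump where each thread header carries `nid=0x...`):

```python
#!/usr/bin/env python3
# Match the hottest threads from `top -H -b -n 1 -p <jira_pid>` output against
# the nid=0x... entries in a Java thread dump, so you can read the stacks that
# are actually burning CPU. Usage: match_hot_threads.py <top_file> <dump_file>
import re
import sys

def hot_threads(top_file, limit=10):
    """Return [(tid, cpu_percent), ...] sorted by CPU, from batch-mode `top -H` output."""
    rows, cols = [], None
    for line in open(top_file):
        parts = line.split()
        if "PID" in parts and "%CPU" in parts:        # column header line
            cols = (parts.index("PID"), parts.index("%CPU"))
            continue
        if cols and len(parts) > max(cols) and parts[cols[0]].isdigit():
            try:
                rows.append((int(parts[cols[0]]), float(parts[cols[1]].replace(",", "."))))
            except ValueError:
                pass                                   # skip malformed lines
    return sorted(rows, key=lambda r: r[1], reverse=True)[:limit]

def stacks_by_nid(dump_file):
    """Return {native_thread_id: stack_text} from a jstack-style thread dump."""
    stacks, nid, buf = {}, None, []
    for line in open(dump_file):
        m = re.search(r'^"[^"]+".*\bnid=(0x[0-9a-f]+)', line)
        if m:                                          # start of a new thread entry
            if nid is not None:
                stacks[int(nid, 16)] = "".join(buf)
            nid, buf = m.group(1), [line]
        elif nid is not None:
            buf.append(line)
    if nid is not None:
        stacks[int(nid, 16)] = "".join(buf)
    return stacks

if __name__ == "__main__":
    top_file, dump_file = sys.argv[1], sys.argv[2]
    stacks = stacks_by_nid(dump_file)
    for tid, cpu in hot_threads(top_file):
        print(f"=== tid {tid} ({cpu}% CPU) ===")
        print(stacks.get(tid, "  (no matching nid in the thread dump)\n"))
```

The thread IDs that `top -H` reports are the same native IDs the JVM prints as `nid` (just in decimal rather than hex), which is what makes the matching possible. Whatever stacks keep floating to the top under load - GC, a specific plugin package, an endless loop - are your first suspects.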
Whether the nodes are within the sizing recommendations (https://confluence.atlassian.com/enterprise/jira-sizing-guide-461504623.html) would be the first thing to check.