
Want to know how Atlassians monitor their enterprise deployments?

Don Domingo Atlassian Team Nov 13, 2018

At Atlassian, we believe in our own products – that's why we use them, even at the enterprise level. Doing so gives us first-hand experience of how they perform at scale, and to learn from it we monitor each instance closely.

We also believe in transparency. That's why we published reference architectures that describe how we monitor some of our Data Center deployments. As you start forming your own monitoring strategies, you can use these references to guide your decisions. https://confluence.atlassian.com/enterprise/how-atlassians-monitor-their-enterprise-deployments-947849816.html

How do you monitor your enterprise deployments? Use this thread to share your best practices with the community.

2 comments

We have a number of different ways we monitor Jira.

1. We have a script that creates, updates and deletes an issue every minute on each node of our 4-node cluster. The time each operation takes tells us what users are experiencing.

2. We poll the Health Check REST endpoint every minute to monitor all of the health checks:
https://confluence.atlassian.com/jirakb/how-to-retrieve-health-check-results-using-rest-api-867195158.html

3. We use Jolokia to expose JVM JMX metrics as REST resources, so we can monitor metrics such as Full GC activity.

4. We have various services that check Jira log files for certain critical messages, including the number of ERROR and WARN messages per minute.

5. We have custom scripts that extract metrics from the database, and we also collect the database server's own metrics.

6. We monitor the Jira node server metrics as we would for any other machine: CPU utilization, free disk space, disk IO, network IO.

7. We have a scraping script that monitors the outgoing mail queue size and flushes the queue as necessary.
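As an illustration of item 1, here is a minimal sketch of a create/update/delete probe that times each step against one node. It is not the script described above, only one way it might look. It assumes the standard Jira REST issue resource (`/rest/api/2/issue`), Basic authentication, an issue type named "Task", and a node URL and project key that are placeholders.

```python
import base64
import json
import time
import urllib.request


def basic_auth(user, password):
    """Build a Basic auth header value (assumption: the probe account uses Basic auth)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def jira_request(base_url, auth_header, method, path, payload=None):
    """Send one JSON request to a Jira node; return the decoded body, or None if empty."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Authorization": auth_header, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def probe_node(base_url, auth_header, project_key, request=jira_request):
    """Create, update and delete one issue on a node; return seconds taken per step."""
    timings = {}

    start = time.perf_counter()
    created = request(base_url, auth_header, "POST", "/rest/api/2/issue", {
        "fields": {
            "project": {"key": project_key},
            "summary": "Synthetic monitoring probe",
            "issuetype": {"name": "Task"},  # assumption: this issue type exists
        }
    })
    timings["create"] = time.perf_counter() - start
    key = created["key"]

    start = time.perf_counter()
    request(base_url, auth_header, "PUT", f"/rest/api/2/issue/{key}",
            {"fields": {"summary": "Synthetic monitoring probe (updated)"}})
    timings["update"] = time.perf_counter() - start

    start = time.perf_counter()
    request(base_url, auth_header, "DELETE", f"/rest/api/2/issue/{key}")
    timings["delete"] = time.perf_counter() - start

    return timings


# Hypothetical usage, run once a minute per node by a scheduler:
#   auth = basic_auth("probe-user", "secret")
#   print(probe_node("http://jira-node1.example.com:8080", auth, "JPT"))
```

Addressing each node directly, rather than going through the load balancer, is what lets the timings be attributed to a specific node. The sketch does no cleanup if the update step fails, so a real probe would want error handling around each call.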


Then we have a dashboard that combines all the metrics, and alerts that are triggered by levels and duration. The alerts are processed by a custom service that decides who to contact depending on on-call schedules and personal preferences. The alert info also contains links to Wiki pages about handling the problem.
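To make "triggered by levels and duration" concrete, the following is a rough sketch of such a rule: it fires only when a metric has stayed above a threshold for a minimum length of time. The class name, field names and sample values are all hypothetical; the thread does not describe how the custom service is actually implemented.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class AlertRule:
    threshold: float     # level above which a sample counts as a breach
    min_duration: float  # seconds the breach must persist before alerting


def should_alert(rule: AlertRule, samples: Iterable[Tuple[float, float]]) -> bool:
    """Return True if the latest run of samples has been above the threshold
    for at least rule.min_duration seconds.

    samples are (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start = None
    last_ts = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # a new breach begins here
        else:
            breach_start = None    # metric recovered, reset the run
        last_ts = ts
    if breach_start is None:
        return False
    return last_ts - breach_start >= rule.min_duration


# Hypothetical usage: alert if issue creation takes over 2 s for 2 minutes.
#   rule = AlertRule(threshold=2.0, min_duration=120)
#   should_alert(rule, [(0, 1.0), (60, 3.0), (120, 3.1), (180, 2.9)])
```

Requiring a duration as well as a level keeps a single slow sample from paging whoever is on call.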

Which software do you use for the dashboard that combines all the metrics? I have been playing with Tableau, but I'm not sure it will be the final dashboard, as I have hundreds of millions of rows to process (access logs, app logs, etc.).

Example: on my Confluence dashboard (Tableau) I have:

- the top slow pages, which get looked at each day,

- the users that consumed the most CPU, with thread ID,

- the pages that consume the most server-side resources and were viewed the most per day, to see who uses automation to refresh pages too often.

All of that data is sent from Apache NiFi to MySQL.
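For readers without Tableau or NiFi, a "top slow pages" list can be approximated with a short script. This is a plain-Python stand-in, not the pipeline described above. It assumes a hypothetical access log layout in which each line ends with the request duration in milliseconds, for example `10.0.0.5 - alice [13/Nov/2018:10:00:00 +0000] "GET /display/DOC/Home HTTP/1.1" 200 5120 850`; the pattern would need adjusting to the real log format.

```python
import re
from collections import defaultdict

# Assumed layout: ... "METHOD /path PROTOCOL" status bytes duration_ms
LINE_RE = re.compile(
    r'"(?:GET|POST|PUT|DELETE|HEAD) (\S+) [^"]*" \d{3} \S+ (\d+)$'
)


def top_slow_pages(lines, n=10):
    """Return the n paths with the highest mean duration as
    (path, mean_ms, hits) tuples, slowest first."""
    totals = defaultdict(lambda: [0, 0])  # path -> [sum of ms, hit count]
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed layout
        path = match.group(1).split("?", 1)[0]  # ignore query strings
        entry = totals[path]
        entry[0] += int(match.group(2))
        entry[1] += 1
    averages = [(path, total / hits, hits) for path, (total, hits) in totals.items()]
    return sorted(averages, key=lambda row: row[1], reverse=True)[:n]


# Hypothetical usage:
#   with open("access.log") as fh:
#       for path, mean_ms, hits in top_slow_pages(fh, n=20):
#           print(f"{mean_ms:8.0f} ms  {hits:6d} hits  {path}")
```

At hundreds of millions of rows, the same aggregation is better pushed into the database or a streaming job; the sketch only shows the shape of the calculation.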

Martin

We use a custom app for combining the metrics on different pages.

Ah ok.

For example, our Tableau dashboard (this is one of the metrics for Confluence) looks like this; we have every IP address plus the name associated with that computer. Just parsing the access log has saved us from a lot of performance problems.

image.png

Don Domingo Atlassian Team Nov 15, 2018

Thanks for sharing, Matt! Also, I found this part of your strategy interesting:

1. We have a script that creates, updates and deletes an issue every minute on each node of our 4-node cluster. The time each operation takes tells us what users are experiencing.

I'll check with our team what they think about it, and whether we're doing something similar on any of our production or test instances. For one of our Confluence DC instances, we monitor the site's Apdex and alert support whenever its response time degrades to 4 seconds.
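For anyone unfamiliar with Apdex, the standard score counts requests at or under a target time T as satisfied, those between T and 4T as tolerating, and the rest as frustrated. A small sketch of that formula follows; the value of T and the sample data are assumptions, and the thread does not say how the Confluence instance above computes or thresholds its score.

```python
def apdex(response_times, t):
    """Compute the Apdex score for response times (seconds) and target time t.

    score = (satisfied + tolerating / 2) / total
    satisfied:  response time <= t
    tolerating: t < response time <= 4 * t
    """
    if not response_times:
        raise ValueError("at least one response time is required")
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical usage with a 1-second target:
#   apdex([0.5, 0.8, 2.0, 5.0], t=1.0)
```

A score of 1.0 means every request met the target, and the score falls toward 0 as more requests land in the frustrated band.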

Yes, it's not a common approach, but we like it because it gives us a sense of what our customers are experiencing in Jira. We use an internal Jira user as the service account, with the internal user directory first in the list of directories. So the absolute values for that metric are probably a little better than what our AD users experience.

One odd thing about creating and deleting an issue each minute for years on end is that our issue keys have grown large, e.g. JPT-4612542. But that hasn't broken anything yet. OpenJDK has even larger numbers in their issue keys.
