At Atlassian, we believe in our own products – that's why we use them, even at the enterprise level. Doing so gives us first-hand experience of how they perform at scale; to do this, we monitor each instance closely.
We also believe in transparency. That's why we published reference architectures that describe how we monitor some of our Data Center deployments. As you start forming your own monitoring strategies, you can use these references to guide your decisions. https://confluence.atlassian.com/enterprise/how-atlassians-monitor-their-enterprise-deployments-947849816.html
How do you monitor your enterprise deployments? Use this thread to share your best practices with the community.
Then we have a dashboard that combines all the metrics, plus alerts that are triggered by level and duration. The alerts are processed by a custom service that decides who to contact based on on-call schedules and personal preferences. The alert info also contains links to wiki pages about handling the problem.
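For anyone curious what a "level plus duration" rule means in practice: the alert fires only when the metric stays above a threshold continuously for some minimum time, which filters out short spikes. A minimal sketch (the `Sample` type and thresholds are illustrative, not Matt's actual service):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    ts: float      # sample time, seconds since epoch
    value: float   # metric value at that time

def should_alert(samples, threshold, min_duration):
    """Fire only if the metric stayed above `threshold` for at
    least `min_duration` seconds without dipping below it."""
    breach_start = None
    for s in samples:
        if s.value > threshold:
            if breach_start is None:
                breach_start = s.ts          # breach begins
            if s.ts - breach_start >= min_duration:
                return True                  # sustained long enough
        else:
            breach_start = None              # dip resets the clock
    return False
```

A brief spike that recovers before `min_duration` elapses never pages anyone, which is usually what you want for noisy metrics.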
Which software do you use for the dashboard that combines all the metrics? I have been playing with Tableau, but I'm not sure it will be the final dashboard, as I have hundreds of millions of rows to process (access logs/app logs/etc.)
Example: on my Confluence dashboard (Tableau) I have:
- the top slow pages, which get reviewed each day,
- the users that consumed the most CPU, with thread IDs,
- the pages that consume the most server-side time and are viewed most often per day, to see who uses automation to refresh pages too often.
All of this data gets sent from Apache NiFi to MySQL.
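The "top slow pages" metric above can be pulled straight out of the access log. A small sketch, assuming an Apache log format that appends `%D` (response time in microseconds) after the byte count — adjust the regex to your own LogFormat:

```python
import re
from collections import defaultdict

# Assumed line shape (hypothetical):
# 10.0.0.1 - alice [01/Jan/2024:00:00:00 +0000] "GET /pages/viewpage.action?pageId=1 HTTP/1.1" 200 4521 873000
LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+" \d{3} \d+ (?P<micros>\d+)')

def top_slow_pages(lines, n=10):
    """Return the n slowest paths as (mean_seconds, path), slowest first."""
    total = defaultdict(float)
    hits = defaultdict(int)
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        path = m.group("path").split("?")[0]   # drop the query string
        total[path] += int(m.group("micros")) / 1e6
        hits[path] += 1
    return sorted(((total[p] / hits[p], p) for p in total), reverse=True)[:n]
```

In a NiFi-to-MySQL pipeline like Martin's, this aggregation could just as well be a SQL `GROUP BY` over the loaded rows; the parsing step is the part worth getting right.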
Martin
We use a custom app to combine the metrics on different pages.
Ah ok.
For us, for example, Tableau (showing one of the metrics for Confluence) looks like this: we have every IP address plus the name associated with that computer. Just parsing the access log has saved us from a lot of performance problems.
Thanks for sharing, Matt! Also, I found this part of your strategy interesting:
1. We have a script that creates, updates and deletes an issue every minute on each node of our 4 node cluster. The time for each of those tells us what users are experiencing
I'll check with our team what they think about it, and whether we're doing something similar on any of our production or test instances. For one of our Confluence DC instances, we monitor the site's Apdex and alert support whenever its response time reaches 4 seconds.
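For readers unfamiliar with Apdex: it scores a batch of response times against a target threshold T, counting requests under T as satisfied, those under 4T as tolerating (half credit), and the rest as frustrated. A minimal sketch, with T = 1 s assumed so that the 4-second alert above corresponds to the frustration boundary:

```python
def apdex(response_times, t=1.0):
    """Apdex score in [0, 1]: satisfied <= T, tolerating <= 4T,
    frustrated above 4T. Returns None for an empty sample."""
    if not response_times:
        return None
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)
```

A score of 1.0 means every request was satisfying; alerting on the raw response time (as described above) and on a falling Apdex score catch slightly different failure modes.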
Yes, it's not a common approach, but we like it because it gives us a sense of what our customers are experiencing in Jira. We use an internal Jira user as the service account, with the internal user directory first in the list of directories. So the absolute values for that metric are probably a little better than what our AD users experience.
One odd thing about creating and deleting an issue each minute for years on end is that our issue keys have grown large, e.g. JPT-4612542. But that hasn't broken anything yet. OpenJDK has even larger numbers in their issue keys.
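The create/update/delete probe described above could be sketched roughly like this, using the standard Jira REST API issue endpoints. The instance URL, project key, and service-account credentials here are placeholders, and Matt's real script may differ in the details:

```python
import time
import requests

AUTH = ("jpt-monitor", "app-password")   # hypothetical service account

def timed(fn, *args, **kwargs):
    """Run one call and return (elapsed_seconds, result)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return time.monotonic() - start, result

def probe_node(node_url):
    """Create, update, then delete a probe issue on one cluster node,
    returning the wall-clock time of each step in seconds."""
    fields = {"project": {"key": "JPT"}, "summary": "probe",
              "issuetype": {"name": "Task"}}
    t_create, resp = timed(requests.post, f"{node_url}/rest/api/2/issue",
                           json={"fields": fields}, auth=AUTH)
    key = resp.json()["key"]                     # e.g. "JPT-4612542"
    t_update, _ = timed(requests.put, f"{node_url}/rest/api/2/issue/{key}",
                        json={"fields": {"summary": "probe updated"}},
                        auth=AUTH)
    t_delete, _ = timed(requests.delete, f"{node_url}/rest/api/2/issue/{key}",
                        auth=AUTH)
    return {"create": t_create, "update": t_update, "delete": t_delete}
```

Run once a minute against each node's URL (bypassing the load balancer), the three timings give a per-node, end-to-end view of what a real user's write path feels like.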