Maximizing business value from your log files (insights in to app/add-on usage)

At Atlassian Summit 2019 in Las Vegas, I was asked to present during the technical track for TAM (Technical Account Manager) day. The topic: the solution we've put together to centralize numerous logs, and extract significant business value from those logs. There was reasonable interest from a number of attendees on the day and general interest in sharing the solution from other sources, so I was asked to share it with the wider community. Apologies for the delay, but here goes!

 

Maximizing business value from your log files

Topics:

  1. Intro to WPP/Uhub
  2. Logs. Lots of logs.
  3. Business Benefits
  4. ELK
  5. What's next?

Intro to WPP/Uhub

To help set some context, a quick introduction to WPP - our parent/holding company. 

WPP

WPP is the world's largest advertising holding company, with 130,000+ staff in 112 countries/3,600 offices. It consists of 21 major brands with 400+ sub-brands. 

wppbrands.jpg

Uhub

Uhub is an internal WPP brand that tightly integrates a set of industry-leading products (cornerstoned by the Atlassian family of products) into a single, WPP-specific platform, supported by and for agency staff.

Nine dedicated full-time and four ancillary staff provide 24x5 support for the entire environment, including business analysis.

Applications are all self-hosted (on AWS), running Data Center editions where available.

apps.jpg

Given WPP's size and complexity, Conway's Law is quite fitting:

Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.

While some projects are handled by a single brand/office, there is a significant amount of collaboration between offices and brands. To accommodate this, the Uhub Atlassian platform is a massively multi-tenanted environment that allows projects to be run in a manner reflecting the needs of the businesses.

Uhub refers to each combination of a brand and geography as an "OpCo" (operating company) - sometimes this is one brand in one city, sometimes a brand covering an entire region (APAC/NA/LATAM/EMEA etc), which could itself be a collection of 10-15 offices and hundreds of staff. Uhub currently supports over 300 OpCos from practically every corner of the world. The OpCo is the basis of many parts of our environment, from group/project naming conventions to billing management. Each project/space has a home OpCo (usually the office that owns the core relationship with the client), and then staff from that OpCo and/or many others may be granted access to work on that project.

To highlight the huge amount of collaboration the Uhub platform enables, we undertook a process (around September 2018) where we took every Jira project (~10,000 at that time) and determined its "home" OpCo, then expanded the permissions (roles/schemes) and all groups to determine which users were working on each project. As every user also belongs to an OpCo, we were able to show where there was crossover between projects and users. Plotting this on a chord graph (using https://observablehq.com/@d3/chord-diagram ) we were able to visualize how much collaboration was actually happening (each line represents a user working on another team's project).
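As a rough illustration of that process, the crossover counting can be sketched in a few lines of Python (the data shapes and names here are hypothetical, not our actual schema - in practice the inputs came from expanding Jira roles/schemes and groups):

```python
from collections import defaultdict

# Hypothetical inputs: each project's home OpCo, the users on each
# project (after expanding roles/schemes/groups), and each user's OpCo.
project_home = {"PROJ1": "OpCoA", "PROJ2": "OpCoB"}
project_users = {"PROJ1": {"alice", "bob"}, "PROJ2": {"alice", "carol"}}
user_opco = {"alice": "OpCoA", "bob": "OpCoB", "carol": "OpCoC"}

def crossover_matrix(project_home, project_users, user_opco):
    """Count, for each (home OpCo, user's OpCo) pair, how many project
    memberships exist - the matrix behind a chord diagram."""
    matrix = defaultdict(int)
    for project, home in project_home.items():
        for user in project_users.get(project, ()):
            matrix[(home, user_opco[user])] += 1
    return dict(matrix)

m = crossover_matrix(project_home, project_users, user_opco)
```

The resulting matrix feeds straight into a chord-diagram library; off-diagonal entries are the cross-OpCo collaboration lines.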

Each of the marks around the outside indicates an OpCo - these had to be blurred for this post.

chord-01.jpg

 

And isolating just one OpCo shows how many different teams are working with them.

chord-02.jpg

.... stick with me, this is going somewhere!!!!

 

Logs. Lots of logs.

With practically every Atlassian product in production, Data Center where available, and multiple environments (prod/UAT/migration platforms etc), there are quite a few servers, and each server has multiple log files that contain valuable information (nginx as our reverse proxy, various application logs, GC logs, auth/syslogs, as well as your standard system metrics - disk, CPU, RAM, etc).

When trying to investigate an issue, SSHing in to the relevant server to then grep/cut/awk your way through logs to find the piece of information you needed was both time consuming and inaccurate, and granting SSH access to everyone would have raised compliance concerns.

We wanted to go from logs for a few, to logs for all!

We identified 3 categories of personas who could benefit from the information in the logs:

  • Support Team
    • Easy access to logs from all sources to provide better ad-hoc support and proactive monitoring
    • Improved audit/compliance support
  • OpCo/office leads
    • Monitor their team's usage by product/add-on/user
  • C Suite
    • Monitor overall return on platform investment
    • Easy to view/consume dashboards that can be dropped in to reports

 

dashboard.jpg

 

Business Benefits

 

Below is a sample of the dashboards we use daily.

Slide15.JPG

While practically every monitoring platform will track these key metrics, we originally had CPU/RAM etc in one system and the rest of our logs in another - this made correlation difficult; even variations in graph sizes meant time scales weren't 1:1.

Slide16.JPG

By parsing the URIs from nginx, we were able to identify various actions/add-ons and tag requests. This gave us a very valuable understanding of the usage patterns of our add-ons. In the lead-up to renewal time, we used this information to understand the ROI from each add-on, and were able to promote use cases to teams where functions were being under-utilized compared to other similar teams.

 

Slide17.JPG

 

By looking for aberrations like the above, you can identify possible problems. This one was from a round of planned patching, so was nothing to worry about, but we have identified some unexpected behavior from similar spikes.

 

Slide18.JPG

 

A common check during audit time is showing who logged in to which server, when, and from where - by ingesting the auth logs, this is all now available from a single dashboard. The circled area is where I was on leave, so I should not have had any sessions during the middle of that period - if there were any, they would be easy to spot.

 

Slide19.JPG

 

The above dashboard is a combination of nginx and Jira application logs - combining multiple data sources makes it easier to spot possible cause and effect.

Slide20.JPG

These types of logs can help you validate changes you expect, not just identify problems. One of our projects was to ensure that our Crowd DC nodes were getting relatively well balanced traffic (so we didn't have one working hard while the other sat there doing very little). The recommendation is to use sticky sessions on the load balancer in front of the application nodes, but with each application constantly talking to Crowd, these sessions often "stuck" to the node that came online first and rarely swapped to another node, causing significant imbalance. We identified that only the admin functions of Crowd needed sticky sessions, so adjusted our LB config so that non-admin traffic was not sticky. This change was implemented in the middle of the highlighted area, and we almost instantly saw a change in traffic balance between our two nodes.

 

 

Slide21.JPG

Monitoring dependent systems helps find unexpected changes (a significant dip in emails may mean one system is not sending notifications due to a config error or mass change in notification schemes), but also validating expected changes like Jira 8’s email batching feature – and with an average weekday send of 30-60K emails, we’re looking forward to that one!

To get email logs from SES to ELK, each email triggers an SNS notification that goes into an SQS queue, which is then picked up by a Lambda function, adjusted, and submitted to Elasticsearch.
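A minimal sketch of the unwrapping step that Lambda performs - field names follow the SES notification format, but the surrounding handler, error handling, and Elasticsearch submission are omitted, and the exact fields we keep are illustrative:

```python
import json

def transform_ses_record(sqs_record):
    """Unwrap an SES notification (an SNS envelope delivered via SQS)
    into a flat document suitable for indexing in Elasticsearch."""
    # The SQS message body is the SNS envelope; its "Message" field is
    # the SES notification, itself a JSON string.
    sns_envelope = json.loads(sqs_record["body"])
    notification = json.loads(sns_envelope["Message"])
    mail = notification["mail"]
    return {
        "timestamp": mail["timestamp"],
        "source": mail["source"],
        "destination": mail["destination"],
        "notificationType": notification["notificationType"],
    }

# Illustrative sample of a delivery notification as it arrives from SQS.
sample = {"body": json.dumps({"Message": json.dumps({
    "notificationType": "Delivery",
    "mail": {"timestamp": "2019-07-09T10:15:30.000Z",
             "source": "jira@example.com",
             "destination": ["user@example.com"]}})})}
doc = transform_ses_record(sample)
```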

Slide22.JPG

There's some very valuable information in the various application audit logs, however I think it's fair to say that the search experience is not exactly up to scratch. We use a Lambda function that pulls the most recent audit log entries from the Jira/Confluence/Crowd REST APIs, then submits them to Elasticsearch. With each JSON object/array automatically turned into its own filterable/searchable/graphable field, it's MUCH easier to get the insights that help you.
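A simplified sketch of that Lambda, assuming the Jira Server audit REST endpoint (`/rest/api/2/auditing/record`) and bearer-token auth - the URL, auth scheme, and index name are assumptions to adjust for your own versions, and the Confluence/Crowd equivalents use different paths:

```python
import json
import urllib.request

# Assumed endpoint - check the REST API docs for your product version.
JIRA_AUDIT_URL = "https://jira.example.com/rest/api/2/auditing/record"

def fetch_audit_records(url, token, offset=0, limit=100):
    """Pull a page of audit records from the Jira REST API."""
    req = urllib.request.Request(
        f"{url}?offset={offset}&limit={limit}",
        headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["records"]

def to_es_bulk(records, index="jira-audit"):
    """Render records as an Elasticsearch _bulk payload (NDJSON:
    one action line, then one document line, per record)."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"

payload = to_es_bulk([{"summary": "User added to group"}])
```

POSTing that payload to the cluster's `_bulk` endpoint indexes each audit record as its own document, with nested fields mapped automatically.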

Slide23.JPG

Above we have a large spike in workflow changes.

Slide24.JPG

You can easily filter down to just a subset of changes (filtering on one dashboard graph will apply the same filter to all other widgets on that dashboard). We quickly identified that the change was valid - a bulk update triggered by a business structural change.

 

Slide25.JPG

We leverage the platform for a number of other purposes - given we ingest staff information from their home directories, we have access to their corporate avatars, so we were able to develop a gravatar system that combines avatars from multiple user stores. We're also planning to extend JSD to display user metadata for the requester/participants (phone number, country, job title etc) that will provide more context and allow our team to support users quicker and better.

 

Slide26.JPG

Searches don't have to be run only from the UI and on demand. Using watchers and OpsGenie, we now get near-real-time alerts when various conditions are met in our logs - examples:

  • We had an issue in our Jira environment where nodes would get themselves into recurring full GCs - working with Premier Support, they asked us to adjust some runtime parameters the next time it happened, to capture a full heap dump before the next GC. To minimize disruption to our users, this meant getting it actioned ASAP, so we configured an alert: if 3 full GCs were identified within a 5-minute period, an OpsGenie alert would be triggered.
  • Monitoring is also in place for abnormal numbers of failed login attempts (could be a network issue with our user directories and/or a brute-force attempt)
  • Monitoring highly privileged groups for membership changes
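The logic behind that first alert can be sketched as a simple sliding-window check - Python for illustration only; in practice this runs as a watcher query over the indexed GC logs:

```python
from datetime import datetime, timedelta

def should_alert(gc_timestamps, threshold=3, window=timedelta(minutes=5)):
    """Return True if `threshold` or more full-GC events fall within
    any sliding window of `window` length."""
    events = sorted(gc_timestamps)
    for i in range(len(events) - threshold + 1):
        # If the i-th and (i+threshold-1)-th events are close enough,
        # the whole run of `threshold` events fits in the window.
        if events[i + threshold - 1] - events[i] <= window:
            return True
    return False

now = datetime(2019, 7, 9, 10, 0)
burst = [now, now + timedelta(minutes=2), now + timedelta(minutes=4)]
spread = [now, now + timedelta(minutes=10), now + timedelta(minutes=20)]
```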

ELK

Sounds good - right?

But how?!?

"ELK" is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana, however it might be better to think of them as LEK - Logstash, Elasticsearch, Kibana - as that's the order they often work in.

  • Logstash (send) - monitors/parses/sends files from your server(s)
  • Elasticsearch (store) - a nosql database that is great to store/analyze time series logs
  • Kibana (show) - a visualization product that works on top of Elasticsearch

Logstash

You tell Logstash what files you want to monitor, and how you want them processed (the structure). If the files are already in JSON, you don't need to do anything as the JSON structure will be used to store the data.
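For example, a minimal file input for JSON logs needs little more than the json codec (the path here is illustrative):

```
input {
  file {
    # Illustrative path - point this at your application's JSON logs.
    path => "/var/atlassian/application-data/jira/log/*.json"
    codec => "json"
  }
}
```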

Instead of writing/tracking potentially hugely complicated regular expressions, Logstash has "grok" patterns - which are really just abstractions of regular expressions. If you know that a field in a log file is an IPv6 address, with regex you'd need the following pattern:

 

((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?

but as a grok pattern, this is simply:

%{IPV6}

Grok means that the default Jira application log file can be parsed using the two lines below:

APPNAME [a-zA-Z0-9\.\@\-\+_%\:]+
ATLASSIANLOGJIRA ^%{TIMESTAMP_ISO8601:timestamp} (%{APPNAME:applicationModule}%{SPACE})?%{LOGLEVEL:severity}%{SPACE}%{GREEDYDATA:syslog_message}
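For a quick sanity check outside Logstash, those two patterns translate roughly to a Python regular expression - the translation is approximate (grok's TIMESTAMP_ISO8601 is more permissive) and the sample log line is illustrative:

```python
import re

# Rough Python equivalents of the APPNAME and ATLASSIANLOGJIRA grok
# patterns above, for testing outside Logstash.
APPNAME = r"[a-zA-Z0-9.@\-+_%:]+"
JIRA_LOG = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}[.,]\d+)\s+"
    rf"(?:(?P<applicationModule>{APPNAME})\s+)?"
    r"(?P<severity>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+"
    r"(?P<syslog_message>.*)$")

# Illustrative sample line in the default Jira log format.
line = "2019-07-09 10:15:30,123 JiraTaskExecutionThread-1 INFO Background task complete"
m = JIRA_LOG.match(line)
```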

https://grokdebug.herokuapp.com/ has been a huge help in getting our patterns setup correctly - bookmark it!

 

grok.png

 

As mentioned above, we also wrote a number of URI-parsing patterns that identify what add-on/action a user is using - and we then tag that log line with some extra metadata.

 

if "/plugins/servlet/project-config/" in [request]  {
grok {
match => { "request" => "^/plugins/servlet/project-config/%{WORD:projectKey}/roles" }
break_on_match => false
add_field => {
"AddOn" => "Jira Core"
"AddOn_Action" => "Project Roles"
}
}
grok {
match => { "request" => "^/plugins/servlet/project-config/%{WORD:projectKey}/summary" }
break_on_match => false
add_field => {
"AddOn" => "Jira Core"
"AddOn_Action" => "Project Summary"
}
}
grok {
match => { "request" => "^/plugins/servlet/project-config/%{WORD:projectKey}/workflows" }
break_on_match => false
add_field => {
"AddOn" => "Jira Core"
"AddOn_Action" => "Project Workflows"
}
}
.................. truncated ................
}

The 3 examples above will tag users managing project roles/summaries/workflows.

We've started a bitbucket.org repo at https://bitbucket.org/uhub/elastic and published a number of Jira related (software/core + addons) URI patterns - https://bitbucket.org/uhub/elastic/src/master/stubs/jira/ 

The plan is to expand on these existing patterns - some (e.g. Tempo) target the previous major version and, given the changes in the latest major release, may not work correctly. We have also identified that TM4J and Tempo now use React/Angular or similar, and the core action a user takes on a page sits in the URI after the # - this information is not made available to the web-server logs, so that level of granularity is missing.

The repo is meant to be collaborative, so if you'd like to contribute (whether you're an add-on vendor or an end user), please let me know.

NB: There are some example logstash configs at https://bitbucket.org/uhub/elastic/src/master/conf.d/ and they rely on these patterns https://bitbucket.org/uhub/elastic/src/master/patterns/

We deployed our ELK environment using https://cloud.elastic.co - they provide a 30-day $0 trial (I believe this is still valid) - so if you'd like to get your hands dirty and report back on what valuable information you've gained from your logs, reply below!

 

 

CCM
