Data Center: What we did to optimize performance and scale in 2020

Performance and scale are always top of mind for us here on the Data Center team, just as they are for many of you. For us, the performance, scalability, and reliability of your products are not just nice-to-haves, but requirements.

So we wanted to share with you all of the investments that we’ve made in this space over the last year and some of the new capabilities we have coming soon.

Jira Software and Jira Service Management 

As your Jira Software and Jira Service Management Data Center instances grow, indexing starts to take longer. Our focus in 2020 was to reduce the time it takes to index and, in fact, make indexing even faster than before.

Improved the time it takes to index nodes

This year, we introduced document-based replication to lower the index update distribution time in Jira Software and Jira Service Management clusters. Now, index replication isn’t delayed and remains consistent across your cluster. We saw the average index time go from 2.5 seconds down to 6 milliseconds!

By making this fix, we immediately saw the index consistency across nodes improve. It became more stable and faster than it was before. This also helped distribute changes in the index more quickly across nodes.

Instead of every node repeating the same database operations to get the correct data into its index, the originating node serializes the prepared document and propagates it to the other nodes in your cluster. This reduces both the impact of apps and the document-related load on your database.
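To make the idea concrete, here is a minimal Python sketch of document-based replication. It is an illustration under stated assumptions, not Jira's implementation: the `Node` class, `build_document`, `replicate_document`, and the in-memory `DATABASE` are hypothetical names, and JSON stands in for whatever serialization format Jira actually uses.

```python
import json

# Hypothetical in-memory "database" of issues; db_reads counts how often it is hit.
DATABASE = {"DEMO-1": {"summary": "Fix login", "status": "Open"}}
db_reads = 0

def build_document(issue_key):
    """Load an issue from the database and turn it into an index document."""
    global db_reads
    db_reads += 1
    return {"key": issue_key, **DATABASE[issue_key]}

class Node:
    def __init__(self, name):
        self.name = name
        self.index = {}  # issue key -> indexed document

    def apply_serialized(self, payload):
        """Apply a document prepared on another node, without touching the database."""
        doc = json.loads(payload)
        self.index[doc["key"]] = doc

def replicate_document(origin, peers, issue_key):
    """Build the document once, serialize it, and send it to every peer."""
    doc = build_document(issue_key)  # the only database read
    origin.index[issue_key] = doc
    payload = json.dumps(doc, sort_keys=True)
    for peer in peers:
        peer.apply_serialized(payload)

nodes = [Node("node-1"), Node("node-2"), Node("node-3")]
replicate_document(nodes[0], nodes[1:], "DEMO-1")
print(db_reads)                                        # database reads for three nodes
print(all(n.index == nodes[0].index for n in nodes))   # do all index copies match?
```

In this sketch the database is read once regardless of cluster size, and every node ends up with the same document because they all apply the same serialized payload.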

Additionally, we saw a number of other improvements.

  • The cluster can handle higher throughput, allowing better horizontal scaling.

  • Database traffic was reduced, thus reducing the chance of a database failure.

  • All index copies receive exactly the same data during updates.

Before making this fix, in heavy load situations, the index replication process could take up to 30 minutes depending on the number of custom fields, apps, and load size.

Cut the average reindex time by 50% by tuning custom fields

We know that custom fields are an important part of your teams' workflows. However, the amount of time it took to reindex your instance was impacted by the number and configuration of custom fields in it.

In 8.10, we made two large improvements to reduce this performance overhead:

  • Custom fields with local context are indexed only for relevant issues.

  • Empty custom fields are skipped during the indexing process.
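The two rules above can be sketched as a simple filter. This is a hedged illustration only: the `CustomField` class, the `projects` attribute (standing in for a field's context), and the issue dictionary layout are hypothetical and do not reflect Jira's internal data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CustomField:
    name: str
    # Projects this field's context applies to; None means a global context.
    projects: Optional[frozenset] = None

def fields_to_index(issue, custom_fields):
    """Return only the custom field values worth writing to the index."""
    indexed = {}
    for cf in custom_fields:
        # Rule 1: skip fields whose local context does not cover this issue.
        if cf.projects is not None and issue["project"] not in cf.projects:
            continue
        value = issue["values"].get(cf.name)
        # Rule 2: skip empty fields.
        if value in (None, "", []):
            continue
        indexed[cf.name] = value
    return indexed

fields = [
    CustomField("Team"),                           # global context
    CustomField("Severity", frozenset({"OPS"})),   # local context: OPS project only
    CustomField("Notes"),                          # global, but empty on this issue
]
issue = {"project": "DEV", "values": {"Team": "Core", "Severity": "High", "Notes": ""}}
print(fields_to_index(issue, fields))
```

Here only `Team` is indexed for the DEV issue: `Severity` is out of context and `Notes` is empty, so neither adds work to the reindex.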

Enhanced cluster monitoring

A clustered architecture is critical for organizations that need high availability. However, for it to be effective, you need to understand the status of each of the nodes in your cluster.

In 8.6, we added a cluster overview page that indicates the status of your nodes - which you can also track in your audit logs if you have advanced auditing enabled.

With this new page, you can see if your nodes are Active, No Heartbeat, or Offline. This enables you to make informed decisions about the administration of your cluster. But we didn’t stop there. If a node doesn’t have a heartbeat, it is now automatically removed from your cluster so that it doesn’t consume resources or impact performance.
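As a rough illustration of heartbeat-based status tracking, consider the Python sketch below. The thresholds, the `shut_down` flag, and the rule that distinguishes Offline from No Heartbeat are all assumptions made for the example; the article does not state how Jira decides between the three statuses or how long it waits before removing a node.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)
REMOVAL_TIMEOUT = timedelta(days=2)

def node_status(node, now):
    """Classify a node as Active, No Heartbeat, or Offline."""
    if node["shut_down"]:
        return "Offline"  # assumption: the node was stopped deliberately
    if now - node["last_heartbeat"] > HEARTBEAT_TIMEOUT:
        return "No Heartbeat"
    return "Active"

def prune(nodes, now):
    """Drop No Heartbeat nodes that have been silent longer than REMOVAL_TIMEOUT."""
    return {
        name: node for name, node in nodes.items()
        if not (node_status(node, now) == "No Heartbeat"
                and now - node["last_heartbeat"] > REMOVAL_TIMEOUT)
    }

now = datetime(2021, 1, 1, 12, 0)
nodes = {
    "node-1": {"last_heartbeat": now - timedelta(seconds=30), "shut_down": False},
    "node-2": {"last_heartbeat": now - timedelta(hours=1), "shut_down": False},
    "node-3": {"last_heartbeat": now - timedelta(days=3), "shut_down": False},
    "node-4": {"last_heartbeat": now - timedelta(hours=2), "shut_down": True},
}
statuses = {name: node_status(node, now) for name, node in nodes.items()}
remaining = prune(nodes, now)
print(statuses)
print(sorted(remaining))
```

With these made-up thresholds, `node-3` is the only node removed, because it is the only one that has been silent for longer than the removal window.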

Extra performance boosts

  • License checks are more lightweight and average response time improved.

  • Your instance no longer freezes when you remove a user from a group or project, because the lock that caused the freeze has been eliminated. The removal process went from 12 seconds to approximately 1 second.

And even more performance boosts for Jira Software

  • Sequential project creation is 65% faster in synthetic tests.

  • Improved performance and stability by optimizing epic searches.

  • Sped up the Favorite Filter gadget by restricting the number of issues that load.

Confluence

In Confluence Data Center, we focused on making significant changes to the architecture to make indexing faster. We also prioritized improving performance in other areas of the product.

Optimized resource consumption with split-indexing

As your Confluence instance gets larger, so does the index. In Confluence 7.9, we made an architectural change to the index and split it in two: one for content and one for changes. For every piece of content that is indexed, there is at least one change indexed - more if there are lots of edits to a piece of content. It’s uncommon that your teams would need to search both content and changes at the same time.

By splitting the indexes, we’ve seen performance increase and a reduction in both memory and CPU consumption.
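A toy example helps show why the split pays off. The sketch below is an assumption-laden illustration, not Confluence's index format: `SplitIndex` and its methods are hypothetical, and a plain term-to-ID map stands in for a real search index.

```python
from collections import defaultdict

class SplitIndex:
    """Two separate term -> ID maps: one for content, one for change records."""

    def __init__(self):
        self.content = defaultdict(set)
        self.changes = defaultdict(set)

    def index_page(self, page_id, text):
        for term in text.lower().split():
            self.content[term].add(page_id)

    def index_change(self, page_id, version, comment):
        for term in comment.lower().split():
            self.changes[term].add((page_id, version))

    def search_content(self, term):
        # Only the content index is consulted; change entries are never scanned.
        return self.content.get(term.lower(), set())

    def search_changes(self, term):
        return self.changes.get(term.lower(), set())

idx = SplitIndex()
idx.index_page(1, "Release plan")
# One page, several edits: the change index grows faster than the content index.
idx.index_change(1, 1, "created")
idx.index_change(1, 2, "fixed typo")
idx.index_change(1, 3, "fixed links")

print(idx.search_content("release"))
print(idx.search_changes("fixed"))
```

Because a content search never touches the change entries, the structure it has to load and scan stays small even when pages are edited many times.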

Improved the reindexing experience

Previously, reindexing required you to take your instance down until the changes were fully propagated. For some teams, that could take up to 48 hours to complete. Having your instance down - for even just a few minutes - could limit your teams' ability to be productive.

We improved the process by adding a new trigger in the admin UI that starts the reindex at runtime and automatically propagates the new index across your nodes. The best part? When you reindex using the new UI, you don’t experience any downtime.

Optimized the performance of advanced user permissions

We introduced advanced user permissions in 7.3 to help you manage, troubleshoot, and audit permissions in your instance. However, we saw that as their databases grew, some customers experienced performance challenges because of how permissions were checked.

So, we introduced denormalized permissions. We also added database tables for space permissions and separated different types of permissions into their own tables so that your database could handle permissions more efficiently.

By making these changes, we improved the performance of searches, dashboard renderings, lists of visible spaces, and macros that list visible spaces. This not only makes page rendering faster but also decreases database load. We even extended these performance enhancements at the page level to make it easier for you to check page permissions.
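The general idea behind denormalization can be sketched in a few lines of Python. This is not Confluence's schema: `memberships`, `space_grants`, and the helper functions are hypothetical, and dictionaries stand in for database tables.

```python
# Normalized data: answering "which spaces can this user see?" means combining
# group memberships with per-space grants on every request.
memberships = {"alice": {"dev", "staff"}, "bob": {"staff"}}
space_grants = {"ENG": {"dev"}, "HR": {"staff"}, "OPS": {"admins"}}

def visible_spaces_normalized(user):
    groups = memberships.get(user, set())
    return {space for space, allowed in space_grants.items() if groups & allowed}

def build_denormalized():
    """Precompute a flat user -> visible spaces table (rebuilt when grants change)."""
    return {user: visible_spaces_normalized(user) for user in memberships}

visible = build_denormalized()

def visible_spaces_denormalized(user):
    # A single lookup replaces the per-request join.
    return visible.get(user, set())

print(sorted(visible_spaces_denormalized("alice")))
print(sorted(visible_spaces_denormalized("bob")))
```

The trade-off is the usual one for denormalization: reads become a single lookup, while the precomputed table has to be kept up to date whenever memberships or grants change.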

Improved performance with changes to cache architecture

In 7.6, we updated the cache architecture and saw dramatically improved performance under high load conditions. In fact, it was 4 times faster under simulated high loads.

Previously, when deployed in a clustered architecture, Confluence used a distributed cache - evenly partitioning data across all the nodes in the cluster rather than replicating the data. To improve cluster resilience and to unlock additional horizontal scaling capabilities, we switched some specific caches to local caching with remote invalidation.
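Local caching with remote invalidation is a general pattern, and a minimal Python sketch of it looks like this. The `Cluster` and `CacheNode` classes are hypothetical, and the in-process `broadcast_invalidate` call stands in for whatever cluster messaging Confluence actually uses.

```python
class Cluster:
    """Hypothetical message bus that delivers invalidations to every other node."""

    def __init__(self):
        self.nodes = []

    def broadcast_invalidate(self, sender, key):
        for node in self.nodes:
            if node is not sender:
                node.cache.pop(key, None)

class CacheNode:
    def __init__(self, cluster, database):
        self.cluster = cluster
        self.database = database  # shared source of truth
        self.cache = {}           # local to this node
        cluster.nodes.append(self)

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.database[key]  # local miss: load from the source
        return self.cache[key]

    def put(self, key, value):
        self.database[key] = value
        self.cache[key] = value
        # Peers drop their stale copy and reload it on next access.
        self.cluster.broadcast_invalidate(self, key)

db = {"title": "v1"}
cluster = Cluster()
a, b = CacheNode(cluster, db), CacheNode(cluster, db)

a.get("title"); b.get("title")   # both nodes now hold a local copy
a.put("title", "v2")             # b's copy is invalidated remotely
print("title" in b.cache)
print(b.get("title"))
```

Reads are served from the node's own memory, and only a small invalidation message crosses the network on writes, which is why this pattern tends to scale better than fetching partitioned data from other nodes.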

Extra performance boosts

  • Used External Process Pool to improve stability of HTML conversion for Word and Office documents that are viewed with the Office Word and Office Excel macros.

Bitbucket

In Bitbucket Data Center, we added a new data management capability to help you clean up your instance.

Automatically decline stale pull requests

Pull requests have a source and a target branch and, typically, the target branch is Main. Whenever the target branch changes, all pull requests to Main need to be recalculated to review the differences between the source and target branch. This is a computationally expensive operation that can consume a lot of memory and CPU.

We released automatic decline of stale pull requests in 7.7. This capability enables you to decline any open pull requests that are considered stale, which helps to lower the resources you’re using.
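As a rough sketch of the idea, the Python below declines open pull requests that have seen no activity within a cutoff. Defining "stale" as a period of inactivity, the four-week threshold, and the `decline_stale` function and record layout are all assumptions for illustration; they are not Bitbucket's API or its exact rules.

```python
from datetime import datetime, timedelta

def decline_stale(pull_requests, now, max_inactivity):
    """Mark open pull requests with no recent activity as declined."""
    declined = []
    for pr in pull_requests:
        if pr["state"] == "OPEN" and now - pr["last_activity"] > max_inactivity:
            pr["state"] = "DECLINED"
            declined.append(pr["id"])
    return declined

now = datetime(2021, 1, 1)
prs = [
    {"id": 1, "state": "OPEN", "last_activity": now - timedelta(days=2)},
    {"id": 2, "state": "OPEN", "last_activity": now - timedelta(days=60)},
    {"id": 3, "state": "MERGED", "last_activity": now - timedelta(days=90)},
]
declined_ids = decline_stale(prs, now, max_inactivity=timedelta(weeks=4))
print(declined_ids)
```

Every declined pull request is one fewer diff to recalculate the next time the target branch changes.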

Improved the performance of nested groups

We've moved some gears and sprockets, and made Crowd handle nested group hierarchies in a more efficient way. This small change brings significant performance improvements for user authentication, permission checks, and the User groups screen.

Crowd Data Center

Last, but not least, we prioritized optimizing full-time synchronization in Crowd Data Center.

Managing your user groups is important to maintain the security of your instance and the productivity of your teams. To help keep your user groups synced, we introduced the new canonicality checker.

The canonicality checker pre-fetches your user and group names and shares them during the membership synchronization. We also optimized the existing non-shared mode of the checker.

With the canonicality checker, we saw:

  • Memory consumption decreased during full sync of memberships by ~300MB.

  • Synchronization time was shortened by ~1h 40m when compared to the updated non-shared mode and approximately 2 hours compared to the old non-shared mode.

  • Improved overall sync time by 86% (Canonically) and 98% (Batched).

What’s next

As you can see, we did a lot to optimize and improve the performance and scalability of our products and we plan on continuing to focus on supporting these needs for you in the new year. Here’s a sneak peek at some of the features we’re working on.

Data management capabilities

When your teams come to rely on products for their day-to-day activities, it’s not surprising that they generate a lot of data. At the enterprise level, that is only compounded. That’s why we’re adding more data management - or clean-up - capabilities to our products, which will help you manage your data more effectively, reduce resource consumption, and ultimately improve the overall experience of your teams.

Access-based synchronization

Another key performance update coming from Crowd is access-based synchronization. With this improvement, only those users who have access to a given application will be synchronized, saving time in the synchronization process and improving performance by reducing the amount of user data that needs to be processed.
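Conceptually, this amounts to filtering the directory before synchronizing it. The Python sketch below assumes that access is granted through group membership and uses a hypothetical `users_to_sync` helper; the feature was still in development when this was written, so treat this as an illustration of the concept rather than a description of how Crowd implements it.

```python
def users_to_sync(directory_users, app_groups):
    """Keep only users who belong to at least one group that grants access."""
    return [u["name"] for u in directory_users if u["groups"] & app_groups]

directory_users = [
    {"name": "alice", "groups": {"jira-users", "staff"}},
    {"name": "bob", "groups": {"staff"}},
    {"name": "carol", "groups": {"jira-admins"}},
]
# Hypothetical groups that grant access to one application.
jira_access_groups = {"jira-users", "jira-admins"}

synced = users_to_sync(directory_users, jira_access_groups)
print(synced)
```

In this example one of the three users is never sent to the application at all, which is where the time and processing savings would come from.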


If you’re interested in learning more about other Data Center features that we’re working on, check out the Data Center roadmap.

3 comments

Gonchik Tsymzhitov
Community Leader
April 16, 2021

Thanks. Could you share customer stories from after they started using DBR (document-based replication)?

Gaby Cardona
Atlassian Team
April 22, 2021

@Gonchik Tsymzhitov - thanks for the comment, and apologies for the slight delay in responding. The PM was out of office and I wanted to confirm with him.

We did an analysis with 17 customers and saw the median replication time improved from 2.5 seconds to 6 milliseconds. 

Daimler's replication time dropped from 400-600 seconds to 200 milliseconds. 

I'll keep you updated if I hear any additional customer stories too.

Gonchik Tsymzhitov
Community Leader
April 22, 2021

@Gaby Cardona thanks for keeping us updated.

Do you have data for the 80th percentile, rather than just the median?
