Is there a way to limit the number of active repositories?

Chad Barnes (Rising Star)
April 19, 2016

Fisheye has a concept of repository passivation. As with many things Fisheye, it is a mystery to us. There is an MBean called "activeCount" that seems to indicate how many repositories are "active" at any given time. We've noted a significant impact on our JVM heap when the "activeCount" approaches our actual number of repositories (~700). The active count will happily stay in the 600-700 range even when the heap is full and the GC is struggling to free memory (1-minute pauses, or more).

However, sometimes, when the system is running relatively calm, the "activeCount" drops to 0 and 6-8 GiB of heap is freed up. Then, until the activeCount gradually rises back to several hundred, the system runs much better and the GC is able to keep up without the huge stop-the-world collections.

Is activeCount related to repository passivation?

If so, is there a way to limit the number of active repositories such that we can limit the amount of heap the active repositories consume?

Fisheye 3.7.1

2 answers

0 votes
Grzegorz Lewandowski (Atlassian Team)
April 19, 2016

There is an MBean called "activeCount" that seems to indicate how many repositories are "active" at any given time.

That is correct. The passivation manager passivates open repository handles, but it can't simply force-close a repository while other threads are using it. It tries to passivate repositories under the following conditions:

  1. The maximum number of active repositories is exceeded. That number is calculated using the following formula (see Marek's comment; it can be adjusted by setting repcache.memcache.total_size):

    0.3 * Runtime.getRuntime().maxMemory() / (5 * 1024 * 1024)

  2. On a timer. A timer task runs every minute to:
    1. check which repositories can be passivated
    2. look for high GC spikes and try to passivate some of the repositories; look for the following debug log entry:

      Passivating 3 repos due to GC load

But as I said before, the passivation mechanism can only request a repository release; it can't force it. Repositories are closed with log entries like:

DB close on REPO_NAME

And that can only happen when all threads have released the handle; look for:

acquire engine on REPO_NAME, count= HOW_MANY_CALLERS_HOLDS_THE_REPO
release engine on REPO_NAME, count= HOW_MANY_CALLERS_STILL_HOLDS_THE_REPO
(the count has to reach 0 in order to release the repo)

You can check how many repositories are awaiting closure via the needingPassivationCount counter, which you can read from the MBean you pointed to.
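If it helps, here's a rough way to poll those counters over remote JMX from outside the process. This is only a sketch, not anything FishEye ships: the JMX URL/port is a placeholder and the code just scans for whichever MBean exposes a needingPassivationCount attribute, so check jconsole for the exact ObjectName on your instance.

import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PassivationCounters {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX port; use whatever com.sun.management.jmxremote.port your instance runs with.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Scan all registered MBeans and print the one(s) exposing the passivation counters.
            for (ObjectName name : conn.queryNames(null, null)) {
                for (MBeanAttributeInfo attr : conn.getMBeanInfo(name).getAttributes()) {
                    if ("needingPassivationCount".equals(attr.getName())) {
                        System.out.println(name
                                + " activeCount=" + conn.getAttribute(name, "activeCount")
                                + " needingPassivationCount=" + conn.getAttribute(name, "needingPassivationCount"));
                    }
                }
            }
        } finally {
            connector.close();
        }
    }
}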

the active count will happily stay in the 600-700 range

I wonder why there are so many open repositories. Is it due to index polling, or are most of those repositories held by HTTP threads? If it's the former, you should probably think about moving to SCM hooks and disabling polling, which will cause a repository to be indexed only when there's something new to index.

 

Note: in order to see any of the log entries I mentioned here, you have to enable debug logging.

Chad Barnes (Rising Star)
April 20, 2016

Thanks for all that good insight. I'll pass it on to my system administrators.

Regarding the SCM hooks, we do have those in place. We keep hourly polling in Fisheye in case the hooks don't fire correctly or haven't been set up on a repository yet. We have considered moving to daily polling.

I've also responded with some additional information about our current state to Marek's post.

Chad Barnes (Rising Star)
April 20, 2016

For reference, here's a current snapshot of the MBeans.

#activeCacheSize 3181457210
#activeCount 370
#cacheUsage 0.5555555475034958
#gcPassivations 253
#maxActivePassivations 0
#maxActiveRepos 1092
#maxCacheSize 5726623061
#needingPassivationCacheSize 0
#needingPassivationCount 0
#perRepoCacheSize 8598533
#runnableCount 666

0 votes
Marek Parfianowicz (Atlassian Team)
April 19, 2016

Hello Chad,

Yes, the activeCount is related to repository passivation. FishEye starts to passivate repositories when it detects that the time spent on GC exceeds 15% of CPU time. It will usually passivate up to 5 repositories at once. If you enable debug logging, you should find messages like "Passivating NN repos due to GC load" in the FishEye log.
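For illustration only, here is one way to estimate that kind of GC-time fraction from inside a JVM using the standard GarbageCollectorMXBean. It's just a sketch of the idea, not FishEye's actual implementation, and it compares GC time against wall-clock time rather than CPU time:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcLoadCheck {
    public static void main(String[] args) throws InterruptedException {
        long wallStart = System.nanoTime();
        long gcStart = totalGcMillis();
        Thread.sleep(60_000);                                      // sample over one minute
        double gcMillis = totalGcMillis() - gcStart;
        double elapsedMillis = (System.nanoTime() - wallStart) / 1_000_000.0;
        double gcFraction = gcMillis / elapsedMillis;
        System.out.printf("GC load over the last minute: %.1f%%%n", gcFraction * 100);
        if (gcFraction > 0.15) {
            System.out.println("Above the ~15% threshold mentioned above");
        }
    }

    // Sum of the cumulative collection times reported by all collectors, in milliseconds.
    private static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionTime();
        }
        return total;
    }
}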

 

FishEye has a parameter named maxTotalCacheSize. It is the total of the cache sizes of all active (non-passivated) repositories. By default it is 33% of Runtime.maxMemory(); it can also be set using the "repcache.memcache.total_size" property in config.xml.

<config>
  <properties>
    <property name="repcache.memcache.total_size" value="(size in bytes)"/>
  </properties>
</config>

 

Another parameter is the maximum number of active repositories, which is calculated as follows: maxActive = maxTotalCacheSize / 5MB. It can also be set using the "repcache.passivate.soft_max_active" property in config.xml.
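As a rough sanity check of these defaults (a sketch only, not FishEye source; note Grzegorz's formula above uses 0.3 while 33% is quoted here), a hypothetical 16 GiB heap works out like this:

public class FishEyeDefaultLimits {
    public static void main(String[] args) {
        long maxMemory = 16L * 1024 * 1024 * 1024;                 // e.g. a 16 GiB heap (-Xmx16384m)
        long maxTotalCacheSize = maxMemory / 3;                    // ~33% of the heap -> about 5.7 GB
        long maxActive = maxTotalCacheSize / (5L * 1024 * 1024);   // divided by 5 MB -> about 1092 repositories
        System.out.println("maxTotalCacheSize=" + maxTotalCacheSize + ", maxActive=" + maxActive);
    }
}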

 

However, before tuning these parameters I would suggest having a look at this problem:

... and the GC is struggling to free memory (1-minute pauses, or more)

Having 1-minute pauses suggests that your memory settings are wrong. Could you tell us what JVM options you use (-Xms, -Xmx, garbage collector type, etc.)? I'm asking because it's not uncommon to use the wrong garbage collector or the wrong GC settings.

 

References:

How FishEye uses memory - passivation

 

Cheers
Marek 

Chad Barnes (Rising Star)
April 20, 2016

Thanks for the response.

I've included an image of some JMX metrics for a 24-hour period. In it there are some interesting points:

1) Overnight (very low end-user usage), full GCs continue to hit in groups. Mostly just hourly polling is happening overnight. Some of our nasty repos may still be processing very large or complex change sets. These full GC groupings essentially deny our users service for minutes.

2) The active repositories metric stays high even though the system seems to be struggling with memory. But at about 11:00 they all get passivated; memory loosens up and the full GCs go away for a while.

2016-04-20_7-25-38.png

-XX:-CreateMinidumpOnCrash
-XX:ReservedCodeCacheSize=350m
-XX:+UseG1GC
-XX:+DisableExplicitGC
-XX:-UseBiasedLocking
-XX:InitiatingHeapOccupancyPercent=25
-XX:MonitorBound=32768
-XX:+PerfDisableSharedMem
-XX:+AlwaysPreTouch
-XX:-UseAdaptiveSizePolicy
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCCause
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-Xms16384m
-Xmx16384m

These arguments have evolved through trial and error, starting with no arguments, moving to G1GC (from the throughput collector), and then incorporating some options that have really helped us with other heavy Java web apps.

We've got close to 700 repositories. They range from small, unused repos and properly formed repos (the red-bean trunk/tags/branches layout) to not-so-properly formed ones, up to huge repos with millions of files and very heavy branching and renaming. Our total Fisheye repository cache is approaching 2TB.

All repositories are set to poll each hour as a fail-safe in case a specific SCM trigger didn't run correctly.

 

Thanks for your assistance.

Kevin Radke April 20, 2016

Some additional details.

Our largest Fisheye index file is 145GB for a 55GB repository. I frequently see Fisheye creating 100GB+ of temp files in var/tmp while processing branch operations. Disk I/O has hit 8Gb/s during these operations. Memory allocation has been over 4GB/s during the same timeframe.

For a single repository I'll see Fisheye making nearly 100 requests per second to the Subversion server. This rate isn't limited even if the connection throttle is set to 1 per second.

The gc.log shows that Fisheye has gone through 26TB of memory in the last 44 hours.

Eventually we would want to scale this to over 3,000 repositories.  Our largest repository is around 650GB, but Fisheye is not configured to index it (yet).

2016-04-20_8-17-04.png

Kevin Radke April 20, 2016

Forgot to attach the GC count table.  Most GCs are less than 7s...

2016-04-20_8-47-35.png

Kevin Radke April 20, 2016

And here is the 24-hour Mission Control info that corresponds to the part of the gc log analyzed in my comments above.

2016-04-20_9-07-33.png

Chad Barnes (Rising Star)
April 20, 2016

We had a 12-minute pause (3 back-to-back pauses) due to GC today. When this happened, "activeCount" went from 400 down to ~375. Here's a 24-hour sample including the pause.

2016-04-20_17-02-15.png

Hard to see, but the day was littered with 20-40s pauses.

Marek Parfianowicz (Atlassian Team)
April 20, 2016

Hi Chad,

You wrote yesterday: "Some of our nasty repos may still be processing very large or complex change sets." Do you see any correlation between processing large changesets, GC overload, and repository passivation? When a single commit contains, let's say, 1 million paths to be processed (which is not uncommon for large SVN monorepos), it may indeed cause a significant system load.

Could you also explain why you use -XX:MonitorBound?

You also wrote: "We had a 12-minute pause (3 back-to-back pauses) due to GC today." Could you check in the FishEye/Crucible logs what was happening at that time? Was it the indexing of a large commit?

Cheers
Marek 

Marek Parfianowicz (Atlassian Team)
April 20, 2016

Hi Chad, 

Disk I/O has hit 8Gb/s during these operations. 

I'm curious - what disk(s) do you use?

For a single repository I'll see Fisheye making nearly 100 requests per second to the Subversion server. This rate isn't limited even if the connection throttle is set to 1 per second.

Could you give more details about that? Does FishEye access a remote SVN server via http:// or svn://, or do you use local repositories via file://? Did you set the throttling for all repositories?

Cheers
Marek 

Kevin Radke April 21, 2016

We use our corporate SAN array for storage: multiple 10G links to a large disk array. It appears InfinityDB frequently reads in its whole index file and then writes it back out. When those files are >100GB, that tends to produce a large amount of sustained I/O. In local testing in the past we have saturated a locally connected SSD for hours doing I/O to the InfinityDB files.

We are accessing the svn server through http(s); the svn protocol showed no performance improvement in our testing, and file:// access is out of the question due to the data size (>20TB). The protocol doesn't seem to have the biggest performance impact: running the command shown in FishEye returns the data from the Subversion server in only a few seconds, but then Fisheye can take multiple minutes to process the result, churning through 4GB of memory per second. (No additional Subversion server requests are seen during this processing.)

I am processing the svn access logs and filtering on a single repository accessed from the IP address of the FishEye server. As a test, I re-indexed a small repository that contained around 3,000 revisions and set the connection throttling to 1/sec. FishEye performed well over a million operations against the svn server for this specific repository while indexing, many times asking the exact same question multiple times in a row at a high rate. Once FishEye is done with these repeated high-rate queries, it allocates large amounts of memory while not contacting the Subversion server. It then repeats this entire process, mostly re-fetching the entire repository history for each branch and tag.

The -XX:MonitorBound option is set because we have seen large Java applications that use lots of Java monitors cause exceedingly long garbage collection pauses while cleaning up millions of orphaned monitors. It does not affect an application that uses a sane number of monitors; it just ensures the cleanup process starts before the number gets exceedingly large. (I haven't specifically analyzed FishEye to see how it uses monitors; it is just one of our standard Java options.)

It is very difficult to tune the GC for FishEye since it periodically makes these insane memory and I/O requests. Since both the UI and the indexing are done in the same JVM, one has to balance the large amount of memory required against the response time needed by users. Splitting the data-collection processing from the user interface would make this a much easier task.

Thanks!

Kevin

 

Chad Barnes (Rising Star)
April 21, 2016

Thanks again for your responses, Marek. To be clear, I work with Kevin and he maintains our Atlassian-hosting servers.

Kevin Radke May 4, 2016

So I defined the following in our config.xml file for our QA server.

<properties>
  <property name="repcache.memcache.total_size" value="1073741824"/>
</properties>

I then enabled all 700+ repositories and started them.  The active count grew to 600 and stayed there.  No users accessing this system (other than me), so there should be nothing holding repositories open.  There are 4 initial indexing threads and 8 incremental threads.  All indexes were up to date as of a few weeks ago so there should be no initial indexes running.

Using the formula above, shouldn't the active count eventually settle down to around 200?
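(For reference: 1073741824 / (5 * 1024 * 1024) ≈ 205, which is where the ~200 figure comes from.)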

JVM heap is set to 31G on this machine with 64G of memory.  I see the garbage collector collecting around 24G of memory every 5 seconds.  The background part of G1GC is able to keep up, so there are no long pauses in this unloaded test.

My next step is to enable debug logs to see if those provide any useful information.

Kevin Radke May 4, 2016

OK, I verified these parameters don't seem to change anything when placed in the app/config.xml and/or data/config.xml file. Can you show a full example config.xml file?

 

2016-05-04_13-20-27.png
