My workplace is running a Bamboo 3.1.4 instance under CentOS 5. Recently, after over a year of fairly stable usage, Bamboo has started returning 500 errors and the logs are full of various "too many open files" exceptions. Here are the things I've tried so far:
Up the per-user file limits to 102400 for both hard and soft. Running `su bamboo -` followed by `ulimit -n` shows 102400, so it's in effect. `lsof -p <bamboo-jetty pid> | wc -l` shows between 900 and 1500 file descriptors open at any given time.
Up the system-wide open file limit to 300000. Checking the contents of /proc/sys/fs/file-max shows that's also taken effect.
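One thing that may be worth ruling out: `ulimit -n` in a fresh shell reports the limit that shell would get, while a process that is already running keeps whatever limit it inherited at start-up. Below is a minimal sketch for reading the limit a running process actually has. It assumes a Linux kernel that exposes `/proc/<pid>/limits` (older kernels may not), and the function name `max_open_files` is made up for illustration.

```shell
# Hypothetical helper: print the soft and hard "Max open files" limits
# of an already-running process, as recorded in /proc/<pid>/limits.
# Assumes Linux with /proc/<pid>/limits available.
max_open_files() {
    awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}

# Example: limits of the current shell (substitute the Bamboo JVM's pid).
max_open_files "$$"
```

If the two numbers printed for the Bamboo JVM are lower than what `ulimit -n` reports in a new shell, the raised limit has not reached the running process.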
I've also seen the Bamboo error message in the UI mentioning that too many files were open, which recommends running a full reindex. I have done that as well, but we're still having the problem.
Anyone out there have any other ideas?
We ran into this exact issue in our system and spent the better part of 6 months going back and forth between Atlassian Support and the internet trying to figure out the problem before we found the cause.
Our "Too Many Open Files" problem wasn't with Bamboo; it was with Crucible/Fisheye specifically. But given what we found the cause to be, I imagine yours most likely revolves around the same issue.
Our problem was the Atlassian tools' interaction with Perforce. Perforce, at least for us, does not like having a lot of different users downloading repositories from it at the same time (4-5 simultaneous interactions). It ends up ignoring a few of the requests coming from the Atlassian tools, which causes those tools to error out.
For us the problem was this. We had set up Fisheye with several "repositories" for different areas of the same Perforce server. We probably had 6-7 different scanning setups in Fisheye for the different areas in Perforce, and each one scanned every 2-3 minutes to see if there were changes in its area. We hit the "Too Many Open Files" problem when several of those repositories re-scanned at the same time, and our only remedy was to restart Fisheye. Once we found that out, we changed Fisheye to have a single scanning repository for our entire Perforce server, and the problem hasn't shown up in several months. (It actually sped things up for us, too.)
But since you are working with Bamboo, I would imagine (especially if you are using Perforce) that you have several plans trying to access the same Perforce server at the same time.
We use Bamboo as well, and during our initial setup of Bamboo, which I helped do for our company, we found out that if we needed to build 10 or so configurations of a single product, each one done in its own job that executed simultaneously, then half the jobs would fail because of Perforce connection issues, and we'd have to re-run them. What we've done to fix that is to have a "pre-build" stage that pulls the repository, zips it up, and then shares it with the other jobs that execute at the same time in the next stage.
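The pre-build stage described above could look roughly like this in a script task. This is only a sketch: it uses `tar` with gzip rather than zip, the function name `package_workspace` and the paths are placeholders, and the resulting archive would still need to be set up as a shared artifact in the plan configuration.

```shell
# Hypothetical helper: pack an already-synced checkout into a single archive
# so later jobs can unpack it instead of each syncing from Perforce.
#   $1 = directory holding the checkout
#   $2 = archive file to create (keep it outside the directory being packed)
package_workspace() {
    tar -czf "$2" -C "$1" .
}

# Example with placeholder paths:
#   package_workspace ./checkout ./source.tar.gz
# A downstream job would then unpack it with:
#   mkdir -p src && tar -xzf source.tar.gz -C src
```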
This, though, is only feasible if your repository is less than a few hundred megabytes. If the repository you are building is several gigabytes, then sharing an artifact that large can definitely slow things down.
If you are in fact building a very large repository (several gigabytes), another thing you can do, which would only mitigate the problem rather than solve it, is to have your simultaneous jobs wait for different intervals in an inline script task. Simply sleep for 20-30 seconds or so to keep them from pulling from Perforce at the same time...
... OR ...
You can simply do your builds in different stages so they don't execute at the same time.
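For the staggered-delay option, one way to give each job a different wait is to derive it from a per-job index. A minimal sketch, assuming you assign each job its own index by hand; `stagger_delay` is a made-up name, not a Bamboo feature:

```shell
# Hypothetical helper: compute a start delay so that job N waits N * step seconds.
#   $1 = job index (0, 1, 2, ...)
#   $2 = seconds between job starts
stagger_delay() {
    echo $(( $1 * $2 ))
}

# Example for the job with index 2 and a 20 second step:
#   sleep "$(stagger_delay 2 20)"   # waits 40 seconds before syncing
```

A fixed per-job offset keeps the spacing predictable, at the cost of the last job always starting latest.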
I know this was a wall of information, but hopefully it's helpful
After I was done typing my answer, I thought of something that might be worth trying as well...
The "Too Many Open Files" problem at its base simply means that there are too many open file descriptors. You were on the right track when you tried the "ulimit -n" option, but what might have been the problem for us all along was not necessarily the Atlassian tools running out of file descriptors, but in fact the Perforce server running out of file descriptors and reporting that back to the Atlassian tool...
If, based on my explanation, you think it could be your repository server, try checking "ulimit -n" for your repository server (Perforce, SVN, etc.), not necessarily the Atlassian tool.
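To compare a process's actual descriptor usage against its limit, on either the Atlassian side or a Linux-hosted repository server, counting the entries under `/proc/<pid>/fd` is one option. Note that `lsof -p` output also lists entries that are not file descriptors (memory-mapped files, the current directory), so its line count can differ from the real descriptor count. A sketch, assuming Linux and permission to read the target process's `/proc` entry; `count_fds` is an illustrative name:

```shell
# Hypothetical helper: count the file descriptors a process currently holds
# by listing /proc/<pid>/fd (one entry per open descriptor).
# Assumes Linux and read access to the target process's /proc entry.
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Example: descriptors held by the current shell (substitute the server's pid).
count_fds "$$"
```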
Yes, we use Perforce and we have way more than 10 or so plans accessing Perforce, possibly simultaneously. What's strange is that this hasn't been a problem until now - we've actually been running Bamboo in this setup for a couple of years, and it hasn't been touched since we upgraded to Bamboo 3.2.
All of the actual building is on one of 3 remote agents, so that's the maximum number of simultaneous source checkouts, but certainly there could be many more connections from the Bamboo server itself as it validates client info and checks for changes.
Thanks for the info - I'll investigate that a bit more and see if I can narrow it down to Perforce.
They're on two different servers, one of which (the Perforce proxy that the Bamboo server is connecting to) is on a Windows machine... so no ulimit and no sharing of file descriptors. The exception is happening on the Bamboo server, which in turn does no builds - it delegates all builds to remote build agents (which are also Windows machines). The only thing on our CentOS server is the Bamboo server itself.
Yes, it shows the increased values. Paste from the Bamboo logs:
08-Oct-2012 08:58:52 core file size (blocks, -c) 0
08-Oct-2012 08:58:52 data seg size (kbytes, -d) unlimited
08-Oct-2012 08:58:52 scheduling priority (-e) 0
08-Oct-2012 08:58:52 file size (blocks, -f) unlimited
08-Oct-2012 08:58:52 pending signals (-i) 72704
08-Oct-2012 08:58:52 max locked memory (kbytes, -l) 32
08-Oct-2012 08:58:52 max memory size (kbytes, -m) unlimited
08-Oct-2012 08:58:52 open files (-n) 102400
08-Oct-2012 08:58:52 pipe size (512 bytes, -p) 8
08-Oct-2012 08:58:52 POSIX message queues (bytes, -q) 819200
08-Oct-2012 08:58:52 real-time priority (-r) 0
08-Oct-2012 08:58:52 stack size (kbytes, -s) 10240
08-Oct-2012 08:58:52 cpu time (seconds, -t) unlimited
08-Oct-2012 08:58:52 max user processes (-u) 72704
08-Oct-2012 08:58:52 virtual memory (kbytes, -v) unlimited
08-Oct-2012 08:58:52 file locks (-x) unlimited
08-Oct-2012 08:58:53 Finished task 'Ulimit'
Curious what came of this since we are seeing the same issue, but writing to bamboo logs.
[lbfrg@vxpit-elforg02 ~]$ /usr/sbin/lsof -p 32583 | wc -l
java.io.FileNotFoundException: /opt/lforge/atlassian_data/bamboo_home/xml-data/build-dir/logs/TEST-MIDWAY-CHEC/download-data/build_logs/TEST-MIDWAY-CHEC-1534.log (Too many open files)
[lbfrg@vxpit-elforg02 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 46669
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 80000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 46669
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited