How to reduce disk usage, or find interesting things while investigating a filesystem?

Hi, awesome community! 

I hope you are doing well. 

 

In this article, I'd like to share how I use a small utility, fdupes.

The project's home page is on GitHub.

Let's define the use case:

1. We have a huge directory, {jira_home}/data/attachments/ or {confluence_home}/attachments. (This will not work properly for Bamboo and Bitbucket.)

In my case, they are ~750GB and ~180GB.

All those instances are on-premises.

We need to analyze the existing disk usage for duplicates and check whether they can be replaced by symlinks or hard links.

 

So, without any more stalling, here we go.

1. Install fdupes if it is not already on your system.

On RHEL/CentOS-based and Fedora-based systems:

yum install fdupes
dnf install fdupes [On Fedora 22 onwards]

Debian-based:

sudo apt-get install fdupes

or 

sudo aptitude install fdupes

macOS:

brew install fdupes

 

2. Next, change to the directory that contains the attachments directory and run:

fdupes --recurse --size --summarize ./attachments/

3. You can see progress like this:

[screenshot: fdupes progress output]

 

4. Finally, I got this result:

[screenshot: fdupes summary result]

 

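5. Only after the analysis do I suggest replacing duplicates with hard links, and I strongly recommend trying it on test files before doing it in production. Sometimes it is better to do this per Jira project key directory. Check the man page of your version first: in upstream fdupes, --hardlinks only makes files that are already hard-linked count as duplicates, while the option that actually replaces duplicates with hard links is --linkhard in jdupes and in some distribution builds of fdupes.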
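Before touching production, the hard-link step can be rehearsed on throwaway files. The sketch below is not what fdupes does internally; replace_with_hardlink is a hypothetical helper name, and it assumes both files live on the same filesystem, since hard links cannot cross filesystems.

```shell
# Replace DUP with a hard link to KEEP, but only if both are byte-identical.
# Hypothetical helper for experiments on test files; not part of fdupes.
replace_with_hardlink() {
  keep=$1
  dup=$2
  # Already the same file (same inode): nothing to do.
  if [ "$keep" -ef "$dup" ]; then
    return 0
  fi
  # Refuse to link files whose contents differ.
  if ! cmp -s -- "$keep" "$dup"; then
    echo "skip: $dup differs from $keep" >&2
    return 1
  fi
  # Create the link under a temporary name, then rename it over the duplicate,
  # so the duplicate's path never disappears if something fails midway.
  tmp="$dup.hardlink.$$"
  ln -- "$keep" "$tmp" && mv -f -- "$tmp" "$dup"
}
```

After replace_with_hardlink a b, both names point at the same data, so deleting one name leaves the content readable through the other. The flip side is that modifying the file in place through one name changes it for both, which is worth checking against how your instance writes attachments.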

 

Conclusion: 

While doing that analysis, I found that things like Exocet, Clone Plus, a custom post-function that clones issues, and mail handler problems generate most of the duplicates. Of course, the cause can also be a workflow bottleneck, plain duplicate uploads, or incorrectly parsed emails.

And please, don't be fanatical about it, because disk space is about the cheapest thing nowadays. (I hope) ;)
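To see which projects account for most of the duplicates, the plain fdupes output can be grouped by directory. This is a minimal sketch under two assumptions: that fdupes --recurse prints each duplicate set as consecutive paths separated by a blank line, and that the first path component under the attachments root is the Jira project key. The function name files_in_dup_sets_per_project is made up for this example.

```shell
# Read fdupes output on stdin and count, per top-level directory under ROOT,
# how many files belong to some duplicate set. Hypothetical helper.
files_in_dup_sets_per_project() {
  root=${1%/}
  awk -v root="$root" '
    /^$/ { next }                          # blank line = boundary between sets
    {
      path = $0
      # Strip the attachments root so the project key comes first.
      if (index(path, root "/") == 1) path = substr(path, length(root) + 1)
      sub(/^\/+/, "", path)                # drop the leading slash(es)
      split(path, parts, "/")
      count[parts[1]]++                    # parts[1] is assumed to be the project key
    }
    END { for (key in count) print count[key], key }
  ' | sort -rn
}
```

A possible invocation is fdupes --recurse ./attachments/ | files_in_dup_sets_per_project ./attachments, which prints one "count key" line per project, largest first. A project that stands out is a good place to look for a cloning add-on or a misbehaving mail handler.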

Also, I suggest reading: Hierarchical File System Attachment Storage and Jira attachments structure

 

I will be happy if community members post their own statistics here.

Maybe disk usage deduplication will be implemented one day ;)

 

P.S. For Windows, please have a look at the jdupes utility.

P.P.S. If you want to do this for all files, please have a look at kvdo, because it works at a lower level. I hope that article will be interesting for you (VDO, the new Linux compression layer).

P.P.P.S. If you are using NetApp ;), it is out-of-the-box functionality.

 

Cheers,

Gonchik Tsymzhitov

4 comments

Matt Doar
Community Leader
March 16, 2019

Interesting tool, thanks for the article

What are the advantages of using the tool? As you say, storage is cheap.

How much space did you recover? 1%, 5%?

I don't advise duplicating attachments in Jira with plugins, workflows or anything. It just confuses people and goes against the DRY principle

Gonchik Tsymzhitov
Community Leader
March 16, 2019

The first time, I reduced usage by 17% from ~340GB on an old Jira instance before migrating it into the main one.

Then I started to investigate in depth why so many duplicate files were being generated, because end users can't upload a lot of duplicates in a short time. I hope it is only automation ;)

About cheap: yes, nowadays ±100GB is OK. Just check the Docker, m2, Gradle, and npm dependency directories.

JP _AC Bielefeld Leader_
Rising Star
March 16, 2019

Gonchik,

I would rather go for deduplication using a SAN system that supports this function transparently. Assume the same file is attached to different pages in Confluence & the users aren't aware the file exists more than once because they don't see the page or space. Then this file or page containing the file is deleted by someone: What happens? Is the file still available on the other pages with your solution?

It makes much more sense with Jira on closed, older issues. If we have issues that need to be documented after they are closed, we move them to a Confluence page and also move the attachments of the issue. Simple principle: Jira is Process, Confluence is Documentation...

For us we introduced an automatic attachment purger on older versions of attachments in Confluence. The most up to date file is never purged.

Gonchik Tsymzhitov
Community Leader
March 16, 2019

Jan-Peter, 

The functionality of a SAN system is good, but I think a lot of companies don't have that function.

That is why I highlighted kvdo (https://www.redhat.com/en/blog/look-vdo-new-linux-compression-layer). For example, those who use Proxmox can read these stats: https://forum.proxmox.com/threads/virtual-data-optimizer-vdo.42838/

 

Also, checking those stats will sometimes help you find some interesting automation. It was very interesting to me :)

 

Could you upload your stats into this thread, please?

About cleaning up older versions of attachments: thanks for pointing that out, I will share my solution for that :)
