How can I tell how much LFS storage space is used by each repository where it's enabled?

Dave Thomas
Contributor
August 18, 2018

On Bitbucket Server:

Looking at the LFS storage under git-lfs/storage, I see that one of the directories is 52GB in size. I'm not sure whether each directory under there corresponds to a single repository or whether all repositories using LFS share directories.

At any rate, I'd like to figure out how much LFS storage space is being consumed by each repository where I've enabled LFS. It seems like a logical question any system admin would want to ask, and yet the documentation seems to be completely silent on this.

Thanks for any insight!

Answer accepted
Michael Walker
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
August 21, 2018

Hello Dave,

The directory $BITBUCKET_HOME/shared/data/git-lfs/storage stores the LFS files for all repos. It is organized by git hashes, not by repository.

The simplest way to identify the amount of LFS space used by any one repo is to fetch its files and then look at the resulting folder size. This can be done with "git lfs fetch --all", as discussed at the end of Fetching extra Git LFS history. Once the fetch completes, navigate to the clone's .git/lfs/objects directory and run "du -hd 0" to find the total size of that directory and all of its subdirectories.
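For anyone who would rather script the measurement than run du by hand, here is a minimal Python sketch of the same idea. The function names are made up for illustration, and it assumes "git lfs fetch --all" has already been run in the clone. Note that it adds up file sizes, whereas du reports disk blocks, so the two numbers can differ slightly.

```python
import os
from pathlib import Path


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root (0 if root is missing)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks so nothing outside the tree is counted.
            if not path.is_symlink():
                total += path.stat().st_size
    return total


def lfs_objects_size(clone_dir):
    """Size in bytes of the LFS objects stored inside a local clone."""
    return directory_size_bytes(Path(clone_dir) / ".git" / "lfs" / "objects")
```

Calling lfs_objects_size("/path/to/clone") returns a byte count, and returns 0 for a clone that has no LFS objects directory at all.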

Hopefully this helps!

Dave Thomas
Contributor
August 22, 2018

I appreciate the advice...truly. We have hundreds of repositories, though. Cloning every repository and fetching its LFS contents doesn't really seem like a viable way for a large enterprise to manage space on Bitbucket Server. Is there really no other way?

Michael Walker
Atlassian Team
August 22, 2018

Dave,
I am working with some of my colleagues to go over any other possibilities; however, this appears to be the only way so far. Having said that, the process can be scripted so that you can "set it and forget it". This can be achieved by following the high-level steps below.

  1. Identify all projects on the system by retrieving the project keys ("key") via the REST API.
  2. Identify all repos within each project by retrieving the repo slugs ("slug") via the REST API.
  3. Use the project keys and repo slugs from steps 1 and 2 to clone each repo in a loop, one at a time.
  4. Make sure all LFS objects are downloaded: "git lfs fetch --all".
  5. After cloning, check whether the <repo>/.git/lfs/objects directory exists and, if so, record its size on disk.
  6. Delete the local clone and move on to the next repo in the list. "Rinse and repeat."
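
The steps above could be scripted roughly as follows. This is a sketch under several assumptions, not a tested tool: the function names are invented for illustration, authentication uses a personal access token sent as a Bearer header, the REST collections are paged with the usual "values" / "isLastPage" / "nextPageStart" fields, and git on the machine running the script already has credentials for the HTTP clone URLs.

```python
import json
import subprocess
import tempfile
import urllib.request
from pathlib import Path


def paged_values(fetch_page):
    """Yield every item of a paged REST collection.

    fetch_page(start) must return one decoded JSON page as a dict.
    """
    start = 0
    while True:
        page = fetch_page(start)
        yield from page.get("values", [])
        if page.get("isLastPage", True):
            return
        start = page["nextPageStart"]


def make_fetcher(base_url, path, token):
    """Build a fetch_page function for one REST collection (token auth assumed)."""
    def fetch_page(start):
        url = f"{base_url}{path}?start={start}&limit=100"
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_page


def http_clone_url(repo):
    """Pick the HTTP clone link out of a repository's JSON, if it has one."""
    for link in repo.get("links", {}).get("clone", []):
        if link.get("name") == "http":
            return link["href"]
    return None


def measure_repo(clone_url):
    """Steps 3 to 6: clone, fetch all LFS objects, measure, then delete."""
    with tempfile.TemporaryDirectory() as workdir:
        # --no-checkout avoids writing a working tree we do not need.
        subprocess.run(["git", "clone", "--quiet", "--no-checkout",
                        clone_url, workdir], check=True)
        subprocess.run(["git", "lfs", "fetch", "--all"],
                       cwd=workdir, check=True)
        objects = Path(workdir) / ".git" / "lfs" / "objects"
        if not objects.is_dir():
            return 0
        return sum(p.stat().st_size for p in objects.rglob("*") if p.is_file())


def report(base_url, token):
    """Steps 1 and 2, then measure each repo; returns {"KEY/slug": bytes}."""
    sizes = {}
    projects = make_fetcher(base_url, "/rest/api/1.0/projects", token)
    for project in paged_values(projects):
        key = project["key"]
        repos = make_fetcher(base_url, f"/rest/api/1.0/projects/{key}/repos", token)
        for repo in paged_values(repos):
            url = http_clone_url(repo)
            if url:
                sizes[f"{key}/{repo['slug']}"] = measure_repo(url)
    return sizes
```

Only one clone exists on disk at a time, so the scratch space needed is bounded by the largest repository, at the cost of a long total run time on a big instance.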

Dave Thomas
Contributor
August 27, 2018

This solution is woefully inadequate for enterprise customers who have thousands of repositories. I'm going to mark it as the accepted answer, however, since it seems to be the only one available.

Charles Pikscher
Contributor
January 28, 2020

If I make a request to <server>/rest/git-lfs/storage/<proj>/<repo>/<fake oid>, I get an error message that includes the path to where the file would have been stored if it existed. So if you iterate over all projects/repos, you can build up a mapping from the git repositories to the directories on the server in shared/data/git-lfs/storage.

At least in my case (BB Server 6.7.2), I got a perfect one-to-one mapping. So even though the documentation says that "all repositories share this object store", it appears that each repo gets its own directory. I guess there could be a collision, but that seems rare, and you could deal with that on a case-by-case basis. Or there is something I've missed.

Anyhow, it would be nice if Atlassian could tell us what the hashing function is or confirm my guess.

Or I am really missing something. As an example, I get 1.9GB when I pull down a repo and run "git lfs fetch --all", but the server directory I think corresponds to it is 6.3GB. Bad or missing garbage collection, maybe. I don't think it's packing on the local side; the number of files in .git/lfs/objects matches the output of git lfs ls-files.

Bryan Turner
Atlassian Team
January 28, 2020

@Charles Pikscher

Just a little heads up, if the server is leaking paths on disk in error messages, that's a bug that's likely to get fixed. No error from the server should ever report a path on disk; it's a potential security issue. So that mechanism for finding the path on disk is likely to stop working at some point (likely quite soon).

That said, there's no need to hassle with fake REST requests you expect to fail. The layout of the LFS storage on the server is straightforward and built using repository hierarchy IDs. That means it is _not possible_ to get the usage size per repository--unless the repository is a top-level repository (i.e. not a fork) which has never been forked. All of the LFS objects for every repository in a hierarchy are stored together.

The repository hierarchy ID for any repository is readily available at `/rest/api/1.0/projects/<key>/repos/<slug>`. If you have access to the repository you can find its hierarchy ID, and if you have access to the server you can use that ID to find the LFS objects shared amongst every repository in that hierarchy.

If per-repository numbers that differentiate between forks are required, then the answer provided previously remains the closest way to approximate it.

Best regards,
Bryan Turner
Atlassian Bitbucket

Charles Pikscher
Contributor
January 29, 2020

Thanks, that was the insight I was looking for.

My case is likely somewhat unusual, in that forks are seldom used in my organization. That explains why I got such a nice mapping.

I looked at `/rest/api/1.0/projects/<key>/repos/<slug>`, but I did not see a hierarchy ID. There is a numeric id field that indicates where the repo is stored under shared/data/repositories/, but I don't see a hash that indicates where the LFS files are stored under shared/data/git-lfs/storage/. Is that available somewhere else?

Also, FYI to all, there is no garbage collection of LFS right now. So the delta between the server and the "pull repo down and measure it" method could be large if you have a developer who is, um, let's say "prone to mistakes".

https://jira.atlassian.com/browse/BSERV-9246

Dave Thomas
Contributor
January 29, 2020

I'm not following you, Bryan. When I do a GET on /rest/api/1.0/projects/<key>/repos/<slug>, I get the repository ID as Charles mentioned, but I also don't see anything there that's a hierarchy ID. I don't see it mentioned in the documentation, and I don't see it in the results when I try it against our Bitbucket Data Center instance that's running the latest version.

It looks like the top-level directory for a storage hierarchy looks something like this: /shared/data/git-lfs/storage/022e25516213ddd4f082. That long number at the end is the hierarchy ID you referred to, I guess. On our system, I know we have 274 repositories with LFS enabled, but I only see 91 directories under git-lfs/storage, so I guess there are 91 separate hierarchies, with the rest being forks that are folded into these.

Could you please elaborate on the proper method of determining the hierarchy ID?  I confirmed that I can currently see it in the error message that Charles mentioned, but I do *not* see it in the REST api output.

Bryan Turner
Atlassian Team
January 30, 2020

My apologies, @Dave Thomas and @Charles Pikscher. A bug I thought had been fixed apparently hasn't been, and so the hierarchy ID isn't in the REST response.

There's another way to find the necessary information, though. Once you have the repository ID, you can use that to navigate to the repository's directory on disk. For most repositories, that should contain a `repository-config` file, which will have a "hierarchy" value in it. That's the repository's hierarchy ID. (If the repository was created prior to Bitbucket Server 4.12 and hasn't been renamed or moved to a new project, it may not have a `repository-config`.)
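
To illustrate the `repository-config` approach, here is a rough Python sketch that groups repositories by hierarchy ID and sums the LFS storage used by each hierarchy. It rests on assumptions not confirmed in this thread: that the file contains a line of the form `hierarchy = <id>`, that each repository lives in a directory named after its numeric ID under shared/data/repositories, and that the directory under shared/data/git-lfs/storage is named after the hierarchy ID. The function names are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed file format: a git-config style line such as "hierarchy = 022e25516213ddd4f082".
HIERARCHY_LINE = re.compile(r"^\s*hierarchy\s*=\s*(\S+)\s*$", re.MULTILINE)


def read_hierarchy_id(repo_dir):
    """Return the "hierarchy" value from repo_dir/repository-config, or None."""
    config = Path(repo_dir) / "repository-config"
    if not config.is_file():
        return None  # older repositories may not have this file
    text = config.read_text(encoding="utf-8", errors="replace")
    match = HIERARCHY_LINE.search(text)
    return match.group(1) if match else None


def repos_by_hierarchy(repositories_root):
    """Map hierarchy ID -> repository directory names (None = unknown)."""
    groups = defaultdict(list)
    for repo_dir in sorted(Path(repositories_root).iterdir()):
        if repo_dir.is_dir():
            groups[read_hierarchy_id(repo_dir)].append(repo_dir.name)
    return dict(groups)


def hierarchy_usage(repositories_root, storage_root):
    """Map hierarchy ID -> (repository directory names, LFS bytes on disk)."""
    usage = {}
    for hierarchy_id, repo_ids in repos_by_hierarchy(repositories_root).items():
        if hierarchy_id is None:
            continue  # no repository-config; check the database instead
        lfs_dir = Path(storage_root) / hierarchy_id
        size = 0
        if lfs_dir.is_dir():
            size = sum(p.stat().st_size for p in lfs_dir.rglob("*") if p.is_file())
        usage[hierarchy_id] = (repo_ids, size)
    return usage
```

Because a fork shares its hierarchy with the repository it was forked from, the byte count belongs to the whole group of repositories listed for that hierarchy, not to any single one of them.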

Otherwise, the only other way to get the hierarchy ID is to check the `repository` table in the database.

Again, apologies for the misinformation on it being in the REST payload--but it will be there in the future. (See BSERV-12174; I'll have that change up for review internally later today.)

Dave Thomas
Contributor
August 22, 2018

answer was meant to be a reply... moved.
