Git: purging binary blobs from a git repo
Context
Some time ago, at work, I ran into a frustrating problem with managing some test environments: the git repository used as their source had grown into a monster.
There were two main issues:
-> Cloning the repo took ages, which meant that building a new test environment was slow, since every build started with a git clone.
-> Each test environment took hundreds of megabytes of storage space, which quickly exhausted the disk space allocated to the test server in question.
This ended up wasting time, not only because test environments were slow to build, but also because their large size made managing and deleting them a constant chore.
After a bit of investigation, it turned out that some binary blobs had been committed into the repository a long time ago. These blobs were applications invoked by the scripts, but the binaries themselves were version controlled elsewhere, in another repo, so it made no sense for them to be here. Simply doing an rm -rf and committing the result wouldn’t have helped, since git keeps everything in its history, meaning that even deleted files remain stored in old commits.
There was now a clear goal: locate all of the binary blobs and remove them from every commit in all of the repository history, so that git clone would execute much faster, and the resulting repo would occupy far less space.
Precautions
This kind of cleanup is only possible by rewriting the entire commit history of the repository, so coordination with other users of the repository is absolutely essential. This section mostly covers the case where the repository is actively worked on through a collaborative tool like Bitbucket or GitLab.
-> Merge or reject all outstanding pull / merge requests before starting. Any pending PR/MRs based on old commits will break after the commit history is rewritten.
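-> Notify other users. When the process is done, everyone will need to reset their local master to the rewritten remote branch (git fetch followed by git reset --hard origin/master, or simply a fresh clone) rather than run a plain git pull, which would merge the old history back in. Any local branches affected by the binary blob removal have to be fixed manually, for example by rebasing them onto the new history.
Basically, the rewritten commits get new ids, so git no longer sees the old commits as ancestors of the new ones. Pushing something based on an outdated local branch will therefore either cause conflicts or reintroduce the deleted binaries.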
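For collaborators, the resync could look roughly like the sketch below. It assumes the remote is called origin and the branch is master (pass other names as arguments); the function name resync_branch is made up for this example. Be aware that git reset --hard throws away local commits and uncommitted changes on that branch, so anything worth keeping should be saved elsewhere first.

```shell
# Hypothetical helper: bring a local branch in line with the rewritten remote history.
# WARNING: discards local commits and uncommitted changes on that branch.
resync_branch() {
    remote="${1:-origin}"   # assumed remote name
    branch="${2:-master}"   # assumed branch name
    git fetch "$remote" &&
    git checkout "$branch" &&
    git reset --hard "$remote/$branch"
}

# Usage (inside an existing clone):
#   resync_branch origin master
```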
Locating the binary blobs
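There are many techniques for this, and the most suitable one will depend on your specific situation. If you’re lucky enough to know that the binary files were never deleted in any commit, a command like this in the working tree will likely suffice. It lists everything of a megabyte or more (-a makes du report individual files, not just directories, and --exclude keeps git's own object store out of the results):
cd <repo location> && du -ah --exclude=.git . | grep -E "^[0-9.,]+[MG]"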
If you have script files big enough to compete with the binary blobs in size, you might need to get fancy with a find command. Unfortunately, none of this fully solved the problem in my case, because I knew that some binary blobs had been deleted some commits ago, while others had not.
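As one possible take on the find approach, here is a minimal sketch that lists regular files above a size threshold while skipping the .git directory. The function name and the 1 MiB default are assumptions for illustration, and the k size suffix is supported by GNU and BSD find but is not strictly POSIX.

```shell
# Hypothetical helper: list files larger than a threshold (in KiB), ignoring .git
list_large_files() {
    dir="${1:-.}"
    min_kib="${2:-1024}"    # assumed default threshold: 1 MiB
    find "$dir" -name .git -prune -o -type f -size "+${min_kib}k" -print
}

# Usage:
#   list_large_files <repo location> 1024
```

Piping the resulting paths into file is one way to separate large scripts from actual binaries.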
For locating files that are already deleted but still exist in old commits, there is nothing but pain ahead. The only way is to iterate through every commit and look for the buried binaries.
So, depending on the number of commits, you might need to script these steps (I needed to):
git rev-list HEAD
This prints a plain list of the ids of every commit reachable from HEAD, e.g.:
40db5b2c3155e0056d3f2fb38a672ad2eb29bb87
bf3da7913687524f683906bc6cb6420515bc5725
09f2872610c7b3ca5133395e8f4a83650ae88042
fb9c8d0bc7d1e878ec6a95980fccc852e72d651d
For each commit, do:
git ls-tree -r <commit-id>
This should print every file in the tree as it looked in that commit, one line per file, with the blob id in the third column:
100644 blob 0c03f81996732db9c2b469b3c37cff1b0591df8b Makefile
100644 blob 9383f3dd61988784435a0fce671a98352f2cf616 README
100644 blob f97a69b8497848184d7f1a2dfc39b0116d790e00 compat.h
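The two steps above can be scripted along these lines. This is only a sketch: the function name is invented, and it relies on the -l option of git ls-tree, which adds the object size as an extra column. Each blob is reported once, under the first path it was seen at, with the largest blobs first.

```shell
# Sketch: print "<size-in-bytes> <blob-id> <path>" for every unique blob
# found in the commits reachable from HEAD, largest first.
list_blobs_by_size() {
    git rev-list HEAD | while read -r commit; do
        # -r recurses into subdirectories, -l adds the object size
        git ls-tree -r -l "$commit"
    done |
    awk -F '\t' '{
        # fields before the tab: mode, type, object id, size
        split($1, m, " ")
        if (m[2] == "blob" && !seen[m[3]]++) print m[4], m[3], $2
    }' |
    sort -rn
}

# Usage (inside the repository):
#   list_blobs_by_size | head -n 20
```

Walking every commit like this is slow on long histories, so it may be worth running once and saving the output to a file.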
You can use git show <blob-id> to recover a file from the list, and then test whether that file is a binary or not:
root@debian-test:~/git/demo# (master) git show 0c03f81996732db9c2b469b3c37cff1b0591df8b > $HOME/out
root@debian-test:~/git/demo# (master) /bin/file $HOME/out
/root/out: makefile script, ASCII text
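When there are many blobs to check, writing each one to a temporary file gets tedious. The sketch below is an alternative that is not what I used at the time: it treats a blob as binary if it contains at least one NUL byte, which is a common heuristic rather than a precise test. The function name is hypothetical.

```shell
# Heuristic sketch: a blob is considered binary if it contains a NUL byte.
# Returns 0 for binary, 1 for text, 2 if the object cannot be read.
is_binary_blob() {
    blob_id="$1"
    total=$(git cat-file -s "$blob_id") || return 2
    # Count the bytes that remain after stripping NUL bytes
    without_nul=$(( $(git cat-file blob "$blob_id" | tr -d '\000' | wc -c) ))
    [ "$without_nul" -lt "$total" ]
}

# Usage:
#   is_binary_blob 0c03f81996732db9c2b469b3c37cff1b0591df8b && echo "binary"
```

If file is available, piping the blob into it (git cat-file blob <blob-id> | file -) gives a more descriptive answer without the temporary file.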
Executing the rewrite
Make sure you’re on the correct branch, and that you’ve got a clear list of the items you wish to purge from the git repo. The filter-branch command below only rewrites the commits reachable from the branch you have checked out.
If the file you want to purge is visible in the working tree, git checkout master will be perfectly fine. If you needed to dig up long-removed files from older branches, you will need git checkout <latest branch that contained old file>. To figure out which branch that is, look in git log for the commit id you got from git rev-list, or run git branch --contains <commit-id>.
# Remove the binary from every commit (add a line for each binary to be purged)
git filter-branch -f --tree-filter 'rm -rf path/to/binary' HEAD
# Remove backup refs that filter-branch creates
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
# Expire reflogs and prune unreachable data
git reflog expire --expire=now --all
git gc --prune=now
# Force push the cleaned history
git push --force
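Before telling everyone to resync, it can be worth checking that the purge actually took effect. A minimal sketch, with a made-up function name: it succeeds only if no commit reachable from any ref still touches the given path. Since --all includes the refs/original backups, it is only meaningful once those have been deleted as shown above.

```shell
# Sketch: succeed only if no commit on any ref still touches the given path
path_is_purged() {
    [ -z "$(git log --all --oneline -- "$1")" ]
}

# Usage:
#   path_is_purged path/to/binary && echo "gone from history"
#   git count-objects -vH    # size-pack shows the packed repository size
```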
In my case, the repo size went down drastically, from hundreds of megabytes to under 100. Once this stage is done, it’s time to inform everyone else that they must switch to master and bring their local repository in line with the rewritten history.
Deploying a new test environment went from annoyingly slow to pretty much instant, and it was no longer necessary to watch the free disk space like a hawk.
This process (the git filter-branch in particular) may take a good while to execute depending on the size of the repository’s history.
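If the tree filter proves too slow, the git filter-branch documentation describes an index filter variant that removes the path from the index instead of checking out every commit. I did not use it here, so treat the following as a sketch: the function name and the PURGE_PATH variable are made up, and the path is passed through the environment so that the filter does not have to embed it in quotes.

```shell
# Sketch: purge one path from every commit reachable from HEAD using
# --index-filter, which avoids checking out each commit's tree.
purge_path() {
    # PURGE_PATH is a hypothetical variable, expanded when the filter runs
    PURGE_PATH="$1" git filter-branch -f --index-filter \
        'git rm -r -q --cached --ignore-unmatch -- "$PURGE_PATH"' HEAD
}

# Usage:
#   purge_path path/to/binary
```

The cleanup of refs/original, the reflog expiry and the git gc step are still needed afterwards, exactly as with the tree filter.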
Conclusion
Rewriting git history isn’t as scary as it sounds; the key is good coordination with the rest of the team. In my case, the repo in question went from painfully slow clones to fast test environment builds in an afternoon. If your repo is bloated with old binaries, it’s worth looking into a cleanup.