TheJach.com

Jach's personal blog

(Largely containing a mind-dump to myselves: past, present, and future)
Current favorite quote: "Supposedly smart people are weirdly ignorant of Bayes' Rule." William B Vogt, 2010

Removing crap from a git repository's history

When I google "git remove from history" (because I frequently forget the exact sequence of commands as I don't have to remove history very often), this is the first result. It almost works. Don't use it, use the second result. (To further be in favor of the second link, the first is from 2009, the second is from github itself and they're pretty good at keeping their material up-to-date with recent gits.)

My current git version is 1.7.3.4; not the bleeding edge, but if you're using 1.7 at the end of 2011 you're generally in good shape. Anyway, here's the "I don't know, I don't wanna know" version of getting rid of crap you don't want, with some commentary in between:


$ du -sh .git
946M .git


As you can see, the git repo I'm using is huge. GitHub soft-limits free users to 300MB; if I want people to fork, it needs to get much smaller. Fortunately, almost all that size comes from a glaring thirdparty/ directory and its history over four branches. (This git repo comes from a Perforce one.)
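If you don't already know which directory is the hog, here's a rough sketch of one way to find out. It builds a throwaway repo (the `repo` temp dir, the `thirdparty/blob.bin` file and the `largest` variable are all made up for the demo) and then lists the biggest blobs in history. The `--batch-check` format string is an assumption about your git being much newer than my 1.7.3.4, which doesn't have it.

```shell
# Throwaway repo so the sketch is safe to run anywhere.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir thirdparty
head -c 200000 /dev/urandom > thirdparty/blob.bin   # stand-in for vendored binaries
echo "hello" > README
git add .
git commit -q -m "import"

# Walk every object reachable from any ref, label each with its type and size,
# keep only blobs, and show the five biggest along with their paths.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k2,2nr |
  head -n 5)
echo "$largest"
```

In a real repo only the pipeline at the end matters; run it from the top of the working tree.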

So let's kill it!


$ git filter-branch --prune-empty --tree-filter 'rm -rf thirdparty/' HEAD


Why tree-filter instead of the faster index-filter? Who cares! I don't wanna know, I want this binary crap gone!
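For those who do wanna know: a minimal sketch of what the index-filter version might look like, run against a throwaway repo so nothing real gets hurt. The demo repo, its file names and the `remaining` variable are made up for illustration. The idea is that index-filter edits the index directly instead of checking out every commit, which is where the speed comes from. `FILTER_BRANCH_SQUELCH_WARNING` only matters on recent gits, which nag you about filter-branch being deprecated.

```shell
# Throwaway repo with some junk in thirdparty/ and two commits.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir thirdparty
echo junk > thirdparty/junk.bin
echo keep > README
git add .
git commit -q -m "import everything"
echo more >> README
git commit -q -am "real work"

# Recent gits print a deprecation warning and pause before running; this skips that.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Drop thirdparty/ from the index of every commit reachable from HEAD.
# --ignore-unmatch keeps git rm from failing on commits that never had the directory.
git filter-branch --prune-empty --index-filter \
  'git rm -r -q --cached --ignore-unmatch thirdparty/' HEAD

# Count commits on the rewritten branch that still touch thirdparty/.
remaining=$(git rev-list HEAD -- thirdparty/ | wc -l)
echo "commits still touching thirdparty/: $remaining"
```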

If you have multiple branches where the vile may rest, you have to switch to each one and rerun the command. filter-branch refuses to run again while the backup it left behind in .git/refs/original is still there, so remove that before rerunning (it will complain about it if you forget).


$ git checkout otherbranch
$ rm -rf .git/refs/original
$ git filter-branch --prune-empty --tree-filter 'rm -rf thirdparty/' HEAD
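If hopping between branches gets tedious, here's a sketch of what a one-shot version might look like, again in a throwaway repo (the repo, the `v1` tag and the `remaining` variable are invented for the demo). As I understand the filter-branch docs, `-- --all` rewrites every ref, `--tag-name-filter cat` moves tags over to the rewritten commits instead of leaving them pointing at the old ones, and `-f` overwrites an existing refs/original backup rather than refusing to start. I haven't tried this on the big repo above, so treat it as an assumption, not a recipe.

```shell
# Throwaway repo: two branches and a tag, all containing thirdparty/.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir thirdparty
echo junk > thirdparty/junk.bin
echo keep > README
git add .
git commit -q -m "import everything"
git tag v1
git checkout -q -b otherbranch
echo more >> README
git commit -q -am "work on otherbranch"

# Recent gits print a deprecation warning and pause before running; this skips that.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Rewrite every branch and tag in one pass, with the same tree-filter as above.
git filter-branch -f --prune-empty --tag-name-filter cat \
  --tree-filter 'rm -rf thirdparty/' -- --all

# Count commits reachable from branches or tags that still touch thirdparty/.
# (refs/original still holds the old history until you delete it.)
remaining=$(git rev-list --branches --tags -- thirdparty/ | wc -l)
echo "commits still touching thirdparty/: $remaining"
```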


Etc. Also note that if you have any tags that contained the evilness, make sure you delete those tags or they'll hoard it even after you complete this process. Anyway, presuming your tags are gone and you've killed the data for all the branches you care about:


$ du -sh .git
986M .git
$ git reflog expire --expire=now --all
$ git gc --prune=now
Counting objects: 116947, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (33198/33198), done.
Writing objects: 100% (116947/116947), done.
Total 116947 (delta 91775), reused 94657 (delta 76507)
Removing stale temporary file .git/objects/pack/tmp_pack_5HLLcY
$ du -sh .git
712M .git
$ git gc --aggressive --prune=now
Counting objects: 116947, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (109654/109654), done.
Writing objects: 100% (116947/116947), done.
Total 116947 (delta 92214), reused 17437 (delta 0)
$ du -sh .git
551M .git


What's the problem here? At first it was even bigger than before! Now it's at least manageable but it should be smaller...

Oh look. I forgot to delete my origin remote tracker and .git/refs/remotes. Let's do that and re-gc.


$ git remote rm origin
$ rm -rf .git/refs/remotes
$ git gc --prune=now
Counting objects: 105938, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (16254/16254), done.
Writing objects: 100% (105938/105938), done.
Total 105938 (delta 82503), reused 105817 (delta 82398)
$ du -sh .git
87M .git


Huzzah! Much better. Anyway, there you go. It's a freaky process because there are so many places where your data could still be hiding, and it's much more of a migration problem than a typical-use problem. (Typically, if you do something that calls for history deletion, like accidentally committing a password file, you can amend it before anyone notices.)
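Since the hiding places are the whole problem, here's a small sketch for listing the two that bit me (backup refs and remote-tracking refs) and deleting them through git rather than with rm. The throwaway repo and the `trunk` ref names are faked for the demo; `before` and `after` are just illustration variables. One reason to prefer `git update-ref -d` is that `rm -rf .git/refs/...` only catches loose ref files, while a gc may already have moved refs into .git/packed-refs.

```shell
# Throwaway repo with one commit.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
echo keep > README
git add .
git commit -q -m "first"

# Fake the leftovers: a filter-branch backup ref and a remote-tracking ref.
git update-ref refs/original/refs/heads/trunk HEAD
git update-ref refs/remotes/origin/trunk HEAD

# List refs that can keep old objects alive.
before=$(git for-each-ref --format='%(refname)' refs/original/ refs/remotes/)
echo "$before"

# Delete each one via git, which handles loose and packed refs alike.
echo "$before" | while read -r ref; do
  if [ -n "$ref" ]; then
    git update-ref -d "$ref"
  fi
done

after=$(git for-each-ref --format='%(refname)' refs/original/ refs/remotes/)
echo "left over: ${after:-none}"
```

After that, the reflog expire and gc steps from earlier still apply.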

I hope someone finds this useful for when the first google result fails them and instead of clicking the second one they click the Nth one that this blog shows up as. (Okay so neither the first nor second tells you our little secret of --prune-empty with --tree-filter or if it really matters ;) You don't know and don't wanna know!)


Posted on 2011-12-20 by Jach

Tags: git, programming, tips

Permalink: https://www.thejach.com/view/id/225

Trackback URL: https://www.thejach.com/view/2011/12/removing_crap_from_a_git_repositorys_history


Anonymous, February 03, 2012 09:49:34 AM:
Now try running "git gc --aggressive"
It's a good idea to squeeze your repo as much as possible before publishing it.

Sam, February 06, 2013 09:09:46 AM:
Here's my bad-ass version, called git-gc-all-ferocious!

#!/bin/sh -ev
git remote rm origin || true
git branch -D in || true
(
cd .git
rm -rf refs/remotes/ refs/original/ *_HEAD logs/
)
git for-each-ref --format="%(refname)" refs/original/ | xargs -n1 --no-run-if-empty git update-ref -d
git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc --aggressive "$@"

No doubt I will have to add further crud to it as I discover new ways git tries to hold on to unwanted objects!