git gc | Git garbage collection for orphaned, dangling objects

Introduction

Leveraging Git for version control involves storing your project’s history in the Git object database. Within this database, there are four types of objects: trees, commits, blobs, and tags.
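To make the four object types concrete, here is a minimal sketch that creates one of each in a throwaway repository and asks Git to report its type with git cat-file -t. The scratch directory, the file name file.txt, the tag name v1, and the user identity are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch: create a commit, tree, blob, and annotated tag, then print each type.
set -e
repo=$(mktemp -d)                          # throwaway repository location
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the commit
git config user.email "user@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "Init commit."
git tag -a v1 -m "Annotated tag"           # annotated tags are stored as tag objects

commit_type=$(git cat-file -t HEAD)
tree_type=$(git cat-file -t 'HEAD^{tree}')
blob_type=$(git cat-file -t HEAD:file.txt)
tag_type=$(git cat-file -t v1)

echo "$commit_type $tree_type $blob_type $tag_type"   # prints: commit tree blob tag
```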

Objects can become unreachable in several ways as your project grows and evolves. An unreachable object is one that cannot be reached from any branch, tag, or other reference in the repo’s history. Unreachable objects were historically called ‘orphaned’, but they are now typically called ‘dangling’ objects. If you perform a history-altering operation like git reset or git rebase, your commits may become inaccessible, and you may inadvertently create dangling objects.

Dangling objects can accumulate and become quite large over the lifetime of a project, creating performance or storage-related issues. To remedy this, git gc (garbage collection) periodically scrubs the repository of unreachable histories and packs (compresses) loose Git objects more efficiently.

What is gc in Git

Git gc is a maintenance command designed to clean up your repository. It removes orphaned and inaccessible commits, also known as dangling objects, and repacks the objects that remain.

Unless Git is configured otherwise, several commonly used commands, such as git commit and git merge, check whether the repository needs maintenance and run garbage collection automatically when it does.

While the automated task is usually sufficient, Git also allows us to manually run garbage collection using the git gc command. This command lets you leverage properties and settings to fine-tune the garbage collection process.
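As a rough illustration of how the automatic behavior can be tuned, the sketch below adjusts the gc.auto setting in a throwaway repository. According to the Git documentation, gc.auto is the approximate number of loose objects that triggers automatic packing (6700 by default), and a value of 0 disables automatic garbage collection. The threshold of 100 used here is an arbitrary value for demonstration only.

```shell
#!/bin/sh
# Sketch: inspect and change the automatic garbage collection threshold.
set -e
repo=$(mktemp -d)              # throwaway repository location
cd "$repo"
git init -q

git config gc.auto 100         # hypothetical lower threshold, for illustration
auto_threshold=$(git config gc.auto)

git gc --auto                  # only does work if the thresholds are exceeded

git config gc.auto 0           # 0 turns automatic garbage collection off
auto_disabled=$(git config gc.auto)

echo "threshold was $auto_threshold, now $auto_disabled"
```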

Git gc Example

In its simplest form, you can run git gc in the terminal like this:

$ git gc

This command will repack loose Git objects and delete unreachable objects only if they qualify.

Git will only delete objects if they have been unreachable for at least two weeks. This is a default safety mechanism to avoid unwanted and irreversible deletions.
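One way to watch the repacking happen is to compare git count-objects -v before and after a run. The sketch below does this in a throwaway repository with a single commit. The file name and user identity are placeholders.

```shell
#!/bin/sh
# Sketch: count loose and packed objects before and after git gc.
set -e
repo=$(mktemp -d)                          # throwaway repository location
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the commit
git config user.email "user@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "Init commit."

# "count" is the number of loose objects; "in-pack" is the number of packed objects.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
git gc -q
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before: $loose_before, loose after: $loose_after, packed: $packed_after"
```

The objects themselves are still there after the run. They have moved from individual loose files into a pack file.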

Next, we’ll use a real example to demonstrate how dangling objects are created and how to override the default time constraint and prune all unreachable objects using git gc.

Deleting all unreachable objects in Git

To get a better feel for how the git gc command works, we’ll use an example repo called gcex and run a simulation. After initializing our repo and making a commit, let’s take the following steps and review the output:

  1. Create a new branch called feature and check it out.
$ git checkout -b feature
  2. Make a single commit on the feature branch. Also, create a new file called danglingBlob, then stage and unstage it.
$ git commit -am 'This will be a dangling commit.'
[feature 127d323] This will be a dangling commit.
 1 file changed, 3 insertions(+), 1 deletion(-)

$ touch danglingBlob
$ git add danglingBlob
$ git restore --staged danglingBlob
  3. Switch back to the master branch, and delete the feature branch without merging.
$ git checkout master
$ git branch -D feature
Deleted branch feature (was 127d323).
  4. Review the output of git reflog and git fsck.
$ git reflog
8077596 (HEAD -> master) HEAD@{0}: checkout: moving from feature to master
127d323 HEAD@{1}: commit: This will be a dangling commit.
8077596 (HEAD -> master) HEAD@{2}: checkout: moving from master to feature
8077596 (HEAD -> master) HEAD@{3}: commit (initial): Init commit.

$ git fsck
Checking object directories: 100% (256/256), done.
dangling blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Take a look at the git fsck output. Notice we have a dangling blob but no dangling commit. That’s because an object must truly be unreachable to be considered dangling. As seen in the reflog output, we can still reach the feature commit from there.

Our feature commit is not dangling yet. Therefore, running git gc, even with the --prune=now flag, will not delete this commit.

First, we must expire all unreachable commits from the reflog using the following command:

$ git reflog expire --expire-unreachable=now --all
$ git reflog
8077596 (HEAD -> master) HEAD@{0}: checkout: moving from master to feature
8077596 (HEAD -> master) HEAD@{1}: commit (initial): Init commit.

We have now officially created a dangling commit and blob, which we can verify by again running the file system check utility:

$ git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (6/6), done.
dangling blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
dangling commit 127d32351bcfee463d893ff2f2470192553f5fb8

At this point, Git still allows us to see the diff associated with the commit and the content within the blob. We can verify by running git show 127d323.

This is a handy fact to be aware of. If you somehow orphan some important objects, you still have some recovery options as long as garbage collection hasn’t pruned them yet.
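One such recovery option is to point a new branch at the dangling commit, which makes it reachable again. The following is a minimal sketch in a throwaway repository, not the only way to recover. The branch name recovered, the file name, and the user identity are placeholders, and the script reads the default branch name from Git rather than assuming master.

```shell
#!/bin/sh
# Sketch: create a dangling commit, then rescue it with a new branch.
set -e
repo=$(mktemp -d)                          # throwaway repository location
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the commits
git config user.email "user@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "Init commit."
base=$(git symbolic-ref --short HEAD)      # master or main, depending on configuration

git checkout -q -b feature
echo "two" >> file.txt
git commit -q -am "This will be a dangling commit."
git checkout -q "$base"
git branch -q -D feature
git reflog expire --expire-unreachable=now --all

# fsck reports the commit as dangling; pull its hash out of the report.
lost=$(git fsck 2>/dev/null | awk '/^dangling commit/ {print $3}')

# Pointing a branch at the commit makes it reachable again.
git branch recovered "$lost"
recovered_tip=$(git rev-parse recovered)
still_dangling=$(git fsck 2>/dev/null | awk '/^dangling commit/ {print $3}')

echo "recovered $recovered_tip"
```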

Let’s now run git gc with the --prune=now flag to complete the demonstration. This flag tells Git to disregard the two-week safety net and delete all dangling objects regardless of age.

$ git gc --prune=now
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 3 (delta 0), pack-reused 0

$ git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (3/3), done.
Verifying commits in commit graph: 100% (1/1), done.

No more dangling commits. All Git objects created within the feature branch are now deleted and unrecoverable.
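If you would like to replay the whole simulation without touching a real project, the sketch below runs the same steps in a throwaway repository and then uses git cat-file -e to check whether the commit and the blob still exist. The file names and user identity are placeholders, the hashes will differ from the ones shown above, and the default branch name is read from Git rather than assumed.

```shell
#!/bin/sh
# Sketch: replay the walkthrough and confirm the objects are gone afterwards.
set -e
repo=$(mktemp -d)                          # throwaway repository location
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the commits
git config user.email "user@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "Init commit."
base=$(git symbolic-ref --short HEAD)      # master or main, depending on configuration

git checkout -q -b feature
echo "two" >> file.txt
git commit -q -am "This will be a dangling commit."
lost_commit=$(git rev-parse HEAD)

touch danglingBlob
git add danglingBlob                       # writes the blob into the object database
lost_blob=$(git hash-object danglingBlob)
git restore --staged danglingBlob          # the index no longer refers to the blob

git checkout -q "$base"
git branch -q -D feature
git reflog expire --expire-unreachable=now --all
git gc -q --prune=now

# cat-file -e exits with a non-zero status when the object does not exist.
if git cat-file -e "$lost_commit" 2>/dev/null; then commit_state=present; else commit_state=pruned; fi
if git cat-file -e "$lost_blob" 2>/dev/null; then blob_state=present; else blob_state=pruned; fi

echo "commit: $commit_state, blob: $blob_state"
```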

Git gc configuration properties

As mentioned earlier, garbage collection configuration properties have preset values that govern their behavior. One example is the two-week minimum for deleting unreachable objects.

To modify these values, use the git config [property] [value] syntax.

Here’s a list of the properties most likely to be relevant:

gc.pruneExpire

The gc.pruneExpire git config determines how long dangling objects will be preserved before pruning during automated or manual garbage collection. Defaults to two weeks.

gc.reflogExpire

The gc.reflogExpire git config determines how long a branch’s reflog entries are kept before they expire and are deleted. Defaults to 90 days.

gc.reflogExpireUnreachable

The gc.reflogExpireUnreachable git config determines how long reflog entries that are not reachable from the current tip of the branch are kept. Defaults to 30 days, a shorter window because these entries are less likely to be needed.
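As a quick illustration of the git config syntax, the sketch below sets all three properties in a throwaway repository and reads them back. The durations are arbitrary example values, not recommendations.

```shell
#!/bin/sh
# Sketch: set the three expiry properties locally and read them back.
set -e
repo=$(mktemp -d)              # throwaway repository location
cd "$repo"
git init -q

# Example values only; pick durations that suit your own workflow.
git config gc.pruneExpire "1 week ago"
git config gc.reflogExpire "60 days ago"
git config gc.reflogExpireUnreachable "14 days ago"

prune_expire=$(git config gc.pruneExpire)
reflog_expire=$(git config gc.reflogExpire)
reflog_unreachable=$(git config gc.reflogExpireUnreachable)

git config --get-regexp '^gc\.'   # lists every gc.* setting in effect
```

Without the --global flag, these settings apply only to the current repository.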

git gc prune

Garbage collection utilizes the git prune command under the hood when executing its tasks. Git pruning is the actual deletion of dangling objects from the repository.

Garbage collection handles tasks besides just pruning, and git prune can be thought of as a child of the git gc command.

Once a Git object is pruned, it is permanently deleted and cannot be recovered.
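Because pruning is irreversible, it can help to preview it first. The sketch below writes an unreferenced blob into a throwaway repository, lists what git prune would remove using its --dry-run option, and only then prunes for real. The blob content is a placeholder.

```shell
#!/bin/sh
# Sketch: preview a prune with --dry-run before deleting anything.
set -e
repo=$(mktemp -d)              # throwaway repository location
cd "$repo"
git init -q

# Write a loose blob that no commit, tree, or index entry refers to.
blob=$(echo "scratch data" | git hash-object -w --stdin)

# --dry-run only reports what would be removed; nothing is deleted.
would_prune=$(git prune --dry-run --expire=now)
echo "would prune: $would_prune"

if git cat-file -e "$blob" 2>/dev/null; then after_dry_run=present; else after_dry_run=gone; fi

# Now prune for real. The blob cannot be recovered after this.
git prune --expire=now
if git cat-file -e "$blob" 2>/dev/null; then after_prune=present; else after_prune=gone; fi

echo "after dry run: $after_dry_run, after prune: $after_prune"
```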

git gc aggressive

When Git goes about the garbage collection process, it also takes loose objects and repacks them into files called pack files. Pack files are essentially a compression optimization, similar to creating zip files or other archive types.

The git gc --aggressive flag tells Git to do a much more thorough job when attempting to optimize a repository.

By default, git gc limits the resources it spends when looking for the best optimization. This is because throwing away all existing deltas and recomputing them to shrink the repo is computationally expensive. For this reason, Git places sensible limits on the depth and scope of repacking.

git gc --aggressive significantly expands these limits, at the cost of much more time. Even the Git documentation notes that the option is probably not worth using on a given repository without running tailored performance benchmarks on it.
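For reference, the limits that the aggressive mode uses are exposed as the gc.aggressiveDepth and gc.aggressiveWindow settings. The sketch below sets them to the values the Git documentation lists as defaults (50 and 250), purely to make the knobs visible, then runs an aggressive collection on a small throwaway repository. The file name, commit messages, and user identity are placeholders, and a repository this small will not show any meaningful size difference.

```shell
#!/bin/sh
# Sketch: run an aggressive collection and inspect the resulting pack.
set -e
repo=$(mktemp -d)                          # throwaway repository location
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the commits
git config user.email "user@example.com"

for i in 1 2 3; do
  echo "line $i" >> file.txt
  git add file.txt
  git commit -q -m "Commit $i"
done

# Documented defaults, set explicitly here only for illustration.
git config gc.aggressiveDepth 50
git config gc.aggressiveWindow 250

git gc -q --aggressive

pack_count=$(git count-objects -v | awk '/^packs:/ {print $2}')
loose_count=$(git count-objects -v | awk '/^count:/ {print $2}')
git count-objects -v                       # size-pack shows the packed size in KiB
```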

git gc failed to run repack

Sometimes when running the git gc command, you may get the following error:

$ git gc
...
error: failed to run repack

This error is typically a permissions issue caused by some process that has a handle on a particular file that the garbage collection process is trying to delete. Usually, the culprit is the code editor itself, and the problem can be resolved by simply closing and reopening the editor.

However, the rogue handle could have its origins in any other process, so a reboot should clear things up if all else fails.

Summary

The git gc command is a utility offered by Git to run the garbage collection process manually. We went over what garbage collection is and the specific Git objects it’s looking for when attempting to clean up a repository.

The automated garbage collection process is usually enough to keep most repositories optimized for everyday use. However, in cases where a more surgical approach is required, git gc is a great option.

Overall, it’s highly recommended to have a firm understanding of Git’s garbage collection process. If nothing else, it will help you avoid accidental deletions and help you understand better when objects are and are not still recoverable.

Next steps

If you’re interested in learning more about how Git works under the hood, check out our Baby Git Guidebook for Developers, which dives into Git’s code in an accessible way. We wrote it for curious developers to learn how Git works at the code level. To do this, we documented the first version of Git’s code and discussed it in detail.

We hope you enjoyed this post! Feel free to shoot me an email at jacob@initialcommit.io with any questions or comments.

References

  1. Git SCM Docs, git gc – https://git-scm.com/docs/git-gc
