You’re reading Ry’s Git Tutorial
In Rewriting History, I talked about the internal representation of a Git repository. I may have misled you a bit. While the reflog, interactive rebasing, and resetting may be more complex features of Git, they are still considered part of the porcelain, as is every other command we’ve covered. In this module, we’ll take a look at Git’s plumbing: the low-level commands that give us access to Git’s true internal representation of a project.
Unless you start hacking on Git’s source code, you’ll probably never need to use the plumbing commands presented below. But, manually manipulating a repository will fill in the conceptual details of how Git actually stores your data, and you should walk away with a much better understanding of the techniques that we’ve been using throughout this tutorial. In turn, this knowledge will make the familiar porcelain commands even more powerful.
We’ll start by inspecting Git’s object database, then we’ll manually create and commit a snapshot using only Git’s low-level interface.
If you’ve been following along from the previous module, you already have everything you need. Otherwise, download the zipped Git repository from the above link, uncompress it, and you’re good to go.
Examine Commit Details
First, let’s take a closer look at our latest commit with the git cat-file plumbing command.
git cat-file commit HEAD
The commit parameter tells Git that we want to see a commit object, and as we already know, HEAD refers to the most recent commit. This will output the following, although your IDs and user information will be different.
tree 552acd444696ccb1c3afe68a55ae8b20ece2b0e6
parent 6a1d380780a83ef5f49523777c5e8d801b7b9ba2
author Ryan <[email protected]> 1326496982 -0600
committer Ryan <[email protected]> 1326496982 -0600

Add .gitignore file
This is the complete representation of a commit: a tree, a parent, user data, and a commit message. The user information and commit message are relatively straightforward, but we’ve never seen the tree or parent values before.
A tree object is Git’s representation of the “snapshots” we’ve been talking about since the beginning of this tutorial. Each one records the state of a directory at a given point, without any notion of time or author. To tie trees together into a coherent project history, Git wraps each one in a commit object and specifies a parent, which is just another commit. By following the parent of each commit, you can walk through the entire history of a project.
Notice that each commit refers to one and only one tree object. From the git cat-file output, we can also infer that trees use SHA-1 checksums for their IDs. This will be the case for all of Git’s objects.
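To see the tree and parent links without touching my-git-repo, here is a minimal sketch that builds a throwaway repository and follows a parent pointer by hand. The repository location, file name, user identity, and commit messages are all made up for illustration.

```shell
set -e
# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "first" > file.txt
git add file.txt
git commit -q -m "First commit"
echo "second" >> file.txt
git commit -q -a -m "Second commit"

# Raw commit object: tree, parent, author, committer, blank line, message.
git cat-file commit HEAD

# Pull the tree and parent IDs out of the raw object.
tree_id=$(git cat-file commit HEAD | sed -n 's/^tree //p')
parent_id=$(git cat-file commit HEAD | sed -n 's/^parent //p')

# Following the parent ID by hand lands on the previous commit.
# The message is everything after the first blank line.
parent_msg=$(git cat-file commit "$parent_id" | sed -n '/^$/,$p' | sed '1d')
tree_type=$(git cat-file -t "$tree_id")
echo "parent message: $parent_msg (tree object type: $tree_type)"
```

Walking the parent field repeatedly like this is, in spirit, what git log does for us.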
Examine a Tree
Next, let’s try to inspect a tree using the same git cat-file command. Make sure to change 552acd4 to the ID of your tree from the previous step.
git cat-file tree 552acd4
Unfortunately, trees contain binary data, which is quite ugly when displayed in its raw form. So, Git offers another useful plumbing command:
git ls-tree 552acd4
This will output the contents of the tree, which looks an awful lot like a directory listing:
100644 blob 99ed0d431c5a19f147da3c4cb8421b5566600449 .gitignore
040000 tree ab4947cb27ef8731f7a54660655afaedaf45444d about
100644 blob cefb5a651557e135666af4c07c7f2ab4b8124bd7 blue.html
100644 blob cb01ae23932fd9704fdc5e077bc3c1184e1af6b9 green.html
100644 blob e993e5fa85a436b2bb05b6a8018e81f8e8864a24 index.html
100644 blob 2a6deedee35cc59a83b1d978b0b8b7963e8298e9 news-1.html
100644 blob 0171687fc1b23aa56c24c54168cdebaefecf7d71 news-2.html
...
By examining the above output, we can presume that “blobs” represent files in our repository, whereas trees represent folders. Go ahead and examine the about tree with another git ls-tree to see if this really is the case. You should see the contents of our about folder.
So, blob objects are how Git stores our file data, tree objects combine blobs and other trees into a directory listing, and commit objects tie trees into a project history. These are the only types of objects that Git needs to implement nearly all of the porcelain commands we’ve been using, and their relationship can be summed up as follows: commits point to trees, and trees point to blobs and to other trees.
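The blob/tree relationship is easy to check in a scratch repository. The sketch below assumes a made-up layout (an index.html file plus an about folder) and only uses git ls-tree to walk from the top-level tree into the subtree.

```shell
set -e
# Throwaway repository with a hypothetical layout.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

mkdir about
echo "<p>Home</p>" > index.html
echo "<p>About</p>" > about/index.html
git add .
git commit -q -m "Add pages"

# HEAD^{tree} names the tree object that the latest commit points to.
git ls-tree 'HEAD^{tree}'

# Columns are: mode, type, object ID, name.
index_type=$(git ls-tree HEAD index.html | awk '{print $2}')
about_type=$(git ls-tree HEAD about | awk '{print $2}')
about_id=$(git ls-tree HEAD about | awk '{print $3}')

# The folder is itself a tree object that we can list in the same way.
git ls-tree "$about_id"
about_entries=$(git ls-tree --name-only "$about_id")
echo "index.html is a $index_type, about is a $about_type"
```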
Examine a Blob
Let’s take a look at the blob associated with blue.html (be sure to change the following to the ID next to blue.html in your tree output).
git cat-file blob cefb5a6
This should display the entire contents of blue.html, confirming that blobs really are plain data files. Note that blobs are pure content: there is no mention of a filename in the above output. That is to say, the name blue.html is stored in the tree that contains the blob, not the blob itself.
You may recall from The Basics that an SHA-1 checksum ensures an object’s contents are never corrupted without Git knowing about it. Checksums work by using the object’s contents to generate a character sequence that is, for all practical purposes, unique. This not only functions as an identifier, it also guarantees that an object won’t be silently corrupted (the altered content would generate a different ID).
When it comes to blob objects, this has an additional benefit. Since two blobs with the same data will have the same ID, Git can share a single blob across multiple trees. For example, a file that hasn’t been changed since it was created has only one associated blob in the repository, and all subsequent trees refer to it. By not creating duplicate blobs for each tree object, Git vastly reduces the size of a repository. With this in mind, we can revise our picture of Git’s objects: several trees may point to the same blob.
However, as soon as you change a single line in a file, Git must create a new blob object because its contents will have changed, resulting in a new SHA-1 checksum.
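We can watch blob sharing happen in a scratch repository. This is a small sketch with hypothetical file names: two files with identical contents end up with one blob ID, an untouched file keeps its blob across commits, and an edited file gets a new one.

```shell
set -e
# Throwaway repository; a.txt and b.txt are made-up names.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "same content" > a.txt
echo "same content" > b.txt
git add a.txt b.txt
git commit -q -m "Two identical files"

# <commit>:<path> resolves to the blob ID recorded in that commit's tree.
id_a=$(git rev-parse HEAD:a.txt)
id_b=$(git rev-parse HEAD:b.txt)

# Change only b.txt and commit again.
echo "changed" >> b.txt
git commit -q -a -m "Edit b.txt"
id_a_after=$(git rev-parse HEAD:a.txt)
id_b_after=$(git rev-parse HEAD:b.txt)

echo "a.txt before/after: $id_a $id_a_after"
echo "b.txt before/after: $id_b $id_b_after"
```

The ID depends only on the contents, so the filename never enters into it.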
Examine a Tag
The fourth and final type of Git object is the tag object. We can use the same git cat-file command to show the details of a tag.
git cat-file tag v2.0
This will output the commit ID associated with v2.0, along with the tag’s name, author, creation date, and message. The straightforward relationship between tags and commits completes our picture of Git’s objects: a tag object simply points to a commit.
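As a quick check, the sketch below creates an annotated tag in a throwaway repository and reads it back. The tag names and messages are made up, and the lightweight tag is included only for contrast: it resolves straight to the commit rather than to a tag object.

```shell
set -e
# Throwaway repository; tag names and messages are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "Release commit"

# An annotated tag (-a) is stored as its own object in the database.
git tag -a v2.0 -m "Stable release"
# A lightweight tag is only a reference to the commit.
git tag lightweight-example

tag_type=$(git cat-file -t v2.0)
light_type=$(git cat-file -t lightweight-example)

# The raw tag object: object, type, tag, tagger, blank line, message.
git cat-file tag v2.0
tagged_commit=$(git cat-file tag v2.0 | sed -n 's/^object //p')
echo "v2.0 is a $tag_type pointing at $tagged_commit"
```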
Inspect Git’s Branch Representation
We now have the tools to fully explore Git’s branch representation. Using the -t flag, we can determine what kind of object Git uses for branches.
git cat-file -t master
That’s right, a branch is just a reference to a commit object, which means we can view it with a normal git cat-file.
git cat-file commit master
This will output the exact same information as our original git cat-file commit HEAD. It seems that both the master branch and HEAD are simply references to a commit object.
Using a text editor, open up the .git/refs/heads/master file. You should find the commit checksum of the most recent commit, which you can view with git log -n 1. This single file is all Git needs to maintain the master branch; all other information is extrapolated through the commit object relationships discussed above.
The HEAD reference, on the other hand, is recorded in .git/HEAD. Unlike the branch tips, HEAD is not a direct link to a commit. Instead, it refers to a branch, which Git uses to figure out which commit is currently checked out. Remember that a detached HEAD state occurred when HEAD did not coincide with the tip of any branch. Internally, all this means to Git is that .git/HEAD doesn’t contain a local branch. Try checking out an old commit. Afterward, .git/HEAD should contain a commit ID instead of a branch. This tells Git that we’re in a detached HEAD state.
Regardless of what state you’re in, the git checkout command will always record the checked-out reference in .git/HEAD. Let’s get back to our master branch before moving on:
git checkout master
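The same experiment can be scripted. This sketch assumes a throwaway repository whose branch is forced to be called master and which uses Git's default file-based reference storage, so that .git/HEAD and .git/refs/heads/master are plain text files.

```shell
set -e
# Throwaway repository; commit messages are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is named master
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "First commit"
git commit -q --allow-empty -m "Second commit"

tip=$(git rev-parse HEAD)
previous=$(git rev-parse HEAD~1)

# On a branch, HEAD names the branch and the branch file names the commit.
head_attached=$(cat .git/HEAD)
branch_file=$(cat .git/refs/heads/master)

# Checking out an old commit puts a bare commit ID into .git/HEAD.
git checkout -q HEAD~1
head_detached=$(cat .git/HEAD)

# Returning to master restores the branch reference.
git checkout -q master
head_restored=$(cat .git/HEAD)

echo "attached: $head_attached"
echo "detached: $head_detached"
```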
Explore the Object Database
While we have a basic understanding of Git’s object interaction, we have yet to explore where Git keeps all of these objects. In your my-git-repo repository, open the folder .git/objects. This is Git’s object database.
Each object, regardless of type, is stored as a file, using its SHA-1 checksum as the filename (sort of). But, instead of storing all objects in a single folder, they are split up using the first two characters of their ID as a directory name, resulting in an object database that looks something like the following.
00  10  28  33  3e  51  5c  6e  77  85  95  f7
01  11  29  34  3f  52  5e  6f  79  86  96  f8
02  16  2a  35  41  53  63  70  7a  87  98  f9
03  1c  2b  36  42  54  64  71  7c  88  99  fa
0c  26  30  3c  4e  5a  6a  75  83  91  a0  info
0e  27  31  3d  50  5b  6b  76  84  93  a2  pack
For example, an object whose ID begins with 7a52bb8 is stored in a folder called 7a, using the remaining characters (52bb8...) as a filename. This gives us an object ID, but before we can inspect items in the object database, we need to know what type of object it is. Again, we can use the -t flag:
git cat-file -t 7a52bb8
Of course, change the object ID to an object from your database (don’t forget to combine the folder name with the filename to get the full ID). This will output the type of the object, which we can then pass to a normal call to git cat-file.
git cat-file blob 7a52bb8
My object was a blob, but yours may be different. If it’s a tree, remember to use git ls-tree to turn that ugly binary data into a pretty directory listing.
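To make the folder-plus-filename rule concrete, here is a small sketch that writes one blob into a throwaway repository and rebuilds its path by hand. It uses git hash-object -w, a plumbing command not covered above, to store the file, and note.txt is a made-up name.

```shell
set -e
# Throwaway repository; note.txt is a hypothetical file.
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "hello" > note.txt
# hash-object -w computes the ID and writes the blob into the database.
id=$(git hash-object -w note.txt)

# The first two characters name the folder, the rest name the file.
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
path=".git/objects/$dir/$file"
ls -l "$path"

# Going the other way: folder name + filename = full ID.
type=$(git cat-file -t "$dir$file")
content=$(git cat-file blob "$dir$file")
echo "$path holds a $type containing: $content"
```

The file on disk is compressed, which is why we read it through git cat-file instead of opening it directly.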
Collect the Garbage
As your repository grows, Git may automatically transfer your object files into a more compact form known as a “pack” file. You can force this compression with the garbage collection command, but beware: this command cannot be undone. If you want to continue exploring the contents of the .git/objects folder, you should do so before running the following command. Normal Git functionality will not be affected.
git gc
This compresses individual object files into a faster, smaller pack file and removes dangling commits (e.g., from a deleted, unmerged branch).
Of course, all of the same object IDs will still work with git cat-file, and all of the porcelain commands will remain unaffected. The git gc command only changes Git’s storage mechanism, not the contents of a repository. Running git gc every now and then is usually a good idea, as it keeps your repository small and fast.
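If you would rather not garbage-collect my-git-repo yet, the effect can be observed in a throwaway repository. This sketch uses git count-objects, which reports how many loose (unpacked) object files exist, before and after the collection. File names and messages are made up.

```shell
set -e
# Throwaway repository; names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "First commit"
echo "two" >> file.txt
git commit -q -a -m "Second commit"

# The first field of count-objects is the number of loose object files.
loose_before=$(git count-objects | awk '{print $1}')
git gc --quiet
loose_after=$(git count-objects | awk '{print $1}')

# The objects now live in a pack file under .git/objects/pack.
pack_count=$(find .git/objects/pack -name '*.pack' | wc -l | tr -d ' ')

# Object IDs keep working exactly as before.
head_type=$(git cat-file -t HEAD)
echo "loose objects: $loose_before -> $loose_after, pack files: $pack_count"
```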
Add Files to the Index
Thus far, we’ve been discussing Git’s low-level representation of committed snapshots. The rest of this module will shift gears and use more “plumbing” commands to manually prepare and commit a new snapshot. This will give us an idea of how Git manages the working directory and the staging area.
Create a new file called news-4.html in my-git-repo and add the following HTML.
<p>Last week, a coalition of Asian designers, artists, and advertisers announced the official color of Asia:
>Return to home page
Then, update the index.html “News” section to match the following list of news items.
- Blue Is The New Hue
- Our New Rainbow
- A Red Rebellion
- Middle East's Silent Beast
Instead of git add, we’ll use the low-level git update-index command to add files to the staging area. The index is Git’s term for the staged snapshot.
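Running git update-index on news-4.html without any flags will throw an error: Git won’t let you add a new file to the index without explicitly stating that it’s a new file, which is what the --add flag is for.
git update-index --add news-4.html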
We’ve just moved the working directory into the index, which means we have a snapshot prepared for committal. However, the process won’t be quite as simple as a mere git commit.
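Here is the staging step as a self-contained sketch in a throwaway repository. The file contents are placeholders rather than the tutorial's real HTML, and the failed attempt is wrapped in an if so the script can carry on after the expected error.

```shell
set -e
# Throwaway repository; file contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "<p>Home</p>" > index.html
git add index.html
git commit -q -m "Add home page"

# Change a tracked file and create a brand new one.
echo "<p>News</p>" >> index.html
echo "<p>Fourth news item</p>" > news-4.html

# A tracked file can be staged directly.
git update-index index.html

# A new file is rejected unless --add is given.
if git update-index news-4.html 2>/dev/null; then
  added_without_flag=yes
else
  added_without_flag=no
fi
git update-index --add news-4.html

# Both files are now part of the staged snapshot.
staged=$(git diff --cached --name-only)
echo "$staged"
```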
Store the Index in the Database
Remember that all commits refer to a tree object, which represents the snapshot for that commit. So, before creating a commit object, we need to add our index (the staged tree) to Git’s object database. We can do this with the following command.
git write-tree
This command creates a tree object from the index and stores it in .git/objects. It will output the ID of the resulting tree (yours may be different; the examples below abbreviate it as 5f44809).
You can examine your new snapshot with git ls-tree. Keep in mind that the only new blobs created for this commit were index.html and news-4.html. The rest of the tree contains references to existing blobs.
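The claim about reused blobs can be verified directly. In this sketch, which uses a throwaway repository with placeholder files, we write the index out as a tree and compare blob IDs between the last commit and the new tree.

```shell
set -e
# Throwaway repository; file contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "<p>Blue</p>" > blue.html
echo "<p>Home</p>" > index.html
git add .
git commit -q -m "Initial pages"
head_before=$(git rev-parse HEAD)

# Stage one edited file and one new file with the plumbing command.
echo "<p>News</p>" >> index.html
echo "<p>Fourth news item</p>" > news-4.html
git update-index --add index.html news-4.html

# Store the staged snapshot as a tree object.
tree_id=$(git write-tree)
tree_type=$(git cat-file -t "$tree_id")
git ls-tree "$tree_id"

# The untouched file points at the very same blob as before.
old_blue=$(git rev-parse HEAD:blue.html)
new_blue=$(git ls-tree "$tree_id" blue.html | awk '{print $3}')
old_index=$(git rev-parse HEAD:index.html)
new_index=$(git ls-tree "$tree_id" index.html | awk '{print $3}')

# Writing a tree does not move HEAD.
head_after=$(git rev-parse HEAD)
```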
So, we have our tree object, but we have yet to add it to the project history.
Create a Commit Object
To commit the new tree object, we need to manually figure out the ID of the parent commit.
git log --oneline -n 1
This will output the following line, though your commit ID will be different. We’ll use this ID to specify the parent of our new commit object.
3329762 Add .gitignore file
The git commit-tree command creates a commit object from a tree and a parent ID, while the author information is taken from your Git configuration (user.name and user.email) unless the GIT_AUTHOR_* and GIT_COMMITTER_* environment variables override it. Make sure to change 5f44809 to your tree ID, and 3329762 to your most recent commit ID.
git commit-tree 5f44809 -p 3329762
This command will wait for more input: the commit message. Type 4th news item and press Enter to create the commit message, then Ctrl-Z and Enter for Windows or Ctrl-D for Unix to specify an “End-of-file” character to end the input. Like the git write-tree command, this will output the ID of the resulting commit object.
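In a script, the commit message can be piped in rather than typed. The following sketch runs in a throwaway repository with placeholder files and shows that the new commit object exists while HEAD stays where it was.

```shell
set -e
# Throwaway repository; file contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "<p>Home</p>" > index.html
git add index.html
git commit -q -m "Add home page"

echo "<p>Fourth news item</p>" > news-4.html
git update-index --add news-4.html
tree_id=$(git write-tree)
parent_id=$(git rev-parse HEAD)

# The message arrives on standard input; end of input ends the message.
commit_id=$(echo "4th news item" | git commit-tree "$tree_id" -p "$parent_id")

# The object is in the database and records our tree and parent...
git cat-file commit "$commit_id"
commit_type=$(git cat-file -t "$commit_id")
recorded_parent=$(git cat-file commit "$commit_id" | sed -n 's/^parent //p')

# ...but no branch refers to it yet, so HEAD has not moved.
head_after=$(git rev-parse HEAD)
```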
You’ll now be able to find this commit in .git/objects, but neither HEAD nor the branches have been updated to include this commit. It’s a dangling commit at this point. Fortunately for us, we know where Git stores its branch information.
Since we’re not in a detached HEAD state, HEAD is a reference to a branch. So, all we need to do to update HEAD is move the master branch forward to our new commit object. Using a text editor, replace the contents of .git/refs/heads/master with the output from git commit-tree in the previous step.
If this file seems to have disappeared, don’t fret! This just means that the git gc command packed up all of our branch references into a single file. Instead of .git/refs/heads/master, open up .git/packed-refs, find the line with refs/heads/master, and change the ID to the left of it.
Now that our master branch points to the new commit, we should be able to see the news-4.html file in the project history.
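Editing reference files by hand works, but Git also ships a plumbing command for it: git update-ref. It is not what this module uses, so treat the sketch below as an alternative rather than the tutorial's method. It has the advantage of behaving the same whether or not the references have been packed. Everything runs in a throwaway repository with placeholder files.

```shell
set -e
# Throwaway repository; file contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is named master
git config user.name "Example User"
git config user.email "user@example.com"

echo "<p>Home</p>" > index.html
git add index.html
git commit -q -m "Add home page"

# Stage, write the tree, and create the commit with plumbing only.
echo "<p>Fourth news item</p>" > news-4.html
git update-index --add news-4.html
tree_id=$(git write-tree)
parent_id=$(git rev-parse HEAD)
commit_id=$(echo "4th news item" | git commit-tree "$tree_id" -p "$parent_id")

# Move master forward. The last argument makes Git refuse the update
# unless the branch currently points at the expected parent.
git update-ref refs/heads/master "$commit_id" "$parent_id"

# The new commit is now part of the project history.
git log --oneline -n 2
head_now=$(git rev-parse HEAD)
latest_subject=$(git log --format=%s -n 1)
```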
The last four sections explain everything that happens behind the scenes when we execute git commit -a -m "Some Message". Aren’t you glad you won’t have to use Git’s plumbing ever again?
After this module, you hopefully have a solid grasp of the object database that underlies almost every Git command. We examined commits, trees, blobs, tags, and branches, and we even created a commit object from scratch. All of this was meant to give you a deeper understanding of Git’s porcelain commands, and you should now feel ready to adapt Git to virtually any task you could possibly demand from a version control system.
As you migrate these skills to real-world projects, remember that Git is merely a tool for tracking your files, not a cure-all for managing software projects. No amount of intimate Git knowledge can make up for a haphazard set of conventions within a development team.
Thus concludes our journey through Git-based revision control. This tutorial was meant to prepare you for the realities of distributed software development—not to transform you into a Git expert overnight. You should be able to manage your own projects, collaborate with other Git users, and, perhaps most importantly, understand exactly what any other piece of Git documentation is trying to convey.
Your job now is to take these skills and apply them to new projects, sift through complex histories that you’ve never seen before, talk to other developers about their Git workflows, and take the time to actually try all of those “I wonder what would have happened if…” scenarios. Good luck!
git cat-file <type> <object-id>
- Display the specified object, where <type> is one of commit, tree, blob, or tag.
git cat-file -t <object-id>
- Output the type of the specified object.
git ls-tree <tree-id>
- Display a pretty version of the specified tree object.
git gc
- Perform a garbage collection on the object database.
git update-index [--add] <file>
- Stage the specified file, using the optional --add flag to denote a new untracked file.
git write-tree
- Generate a tree from the index and store it in the object database. Returns the ID of the new tree object.
git commit-tree <tree-id> -p <parent-id>
- Create a new commit object from the given tree object and parent commit. Returns the ID of the new commit object.