Understanding Git — Branching

This is the second post in my Understanding Git series, so be sure to read the first post, which covers git’s data model, before you start this one.

Let’s start where we left off last time: at git’s data model. This time we will simplify it by displaying only the commit objects and giving them symbolic names instead of checksums (just to make it easier to follow), so we get a graph like this:

Git data model simplified by displaying only commit objects

Those familiar with graph theory will notice that this is a Directed Acyclic Graph (DAG). That means the edges between graph nodes (in git’s case, commits) are directed, and if you start from one node and travel through the graph following the direction of the edges, you can never return to the node you started from (there are no “round trips”). In git, each edge points from a commit to its parent.

It is fairly intuitive that we can distinguish three branches in our example graph. We’ll mark them as red (containing commits A, B, C, D, E), blue (containing commits A, B, F, G) and green (containing commits A, B, H, I, J).
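To make the "no round trips" property concrete, here is a minimal Python sketch. The `PARENTS` dictionary is a hypothetical stand-in for the figure (the parent links are an assumption about how the drawn commits connect), and `has_cycle` is an illustrative helper, not anything git itself exposes.

```python
# Hypothetical toy graph: each commit name maps to the list of its parents.
# The parent links are assumed from the figure, not read from a real repository.
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["B"], "G": ["F"],
    "H": ["B"], "I": ["H"], "J": ["I"],
}


def has_cycle(parents):
    """Return True if following parent edges can lead back to a node already on the path."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in parents}

    def visit(node):
        state[node] = IN_PROGRESS
        for parent in parents[node]:
            if state[parent] == IN_PROGRESS:
                return True  # we walked back onto our own path
            if state[parent] == UNSEEN and visit(parent):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in parents)


print(has_cycle(PARENTS))                     # False
print(has_cycle({"X": ["Y"], "Y": ["X"]}))    # True
```

A real commit graph can never look like the second example, because a commit's checksum depends on its parents, so a parent cannot be created after its child.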

Git data graph containing three branches

So that’s one way of defining a branch: associating it with the list of commits it contains. However, this is not the way git does it. Git uses a simpler and cheaper solution. Instead of keeping an up-to-date list of all the commits belonging to a branch, git only keeps track of the last commit on the branch. Knowing the last commit of a branch, it is trivial to reconstruct the full list of commits on that branch just by following the directed edges of the commit graph. For example, to define our blue branch, we only need to know that its last commit is G. If we need a list of all the commits the blue branch contains, we can follow the directed edges starting from G.

Knowing the last commit on the blue branch, we can easily reconstruct its whole commit list
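The reconstruction described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `PARENTS` and `BRANCH_TIPS` are hypothetical stand-ins for the figure, and `commits_on_branch` is a made-up helper name, not git's own implementation.

```python
# Hypothetical toy graph: commit name -> list of parent commit names.
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["B"], "G": ["F"],
    "H": ["B"], "I": ["H"], "J": ["I"],
}

# A branch is stored as nothing more than the name of its last commit.
BRANCH_TIPS = {"red": "E", "blue": "G", "green": "J"}


def commits_on_branch(tip, parents):
    """Collect every commit reachable from `tip` by following parent edges."""
    seen = []
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue  # already collected (matters once merges add second parents)
        seen.append(commit)
        stack.extend(parents[commit])
    return seen


print(commits_on_branch(BRANCH_TIPS["blue"], PARENTS))  # ['G', 'F', 'B', 'A']
```

Note that the three tips are the only per-branch data stored; commits A and B are shared by all three branches without being listed anywhere.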

And this is how git manages branches: by keeping pointers to commits. So let’s see it “in action”.

First, we will initialise an empty repository

git init

and take a look at the .git directory

tree .git/

.git/
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

This time we will focus on the refs sub-directory. It stands for references and this is where git keeps the branch pointers.

Since we haven’t committed anything yet, the heads and tags sub-directories under refs are empty, so we will create and commit a few files.

echo "Hello World" > helloEarth.txt
git add .
git commit -m "Hello World Commit"

echo "Hello Mars" > helloMars.txt
git add .
git commit -m "Hello Mars Commit"

echo "Hello Saturn" > helloSaturn.txt
git add .
git commit -m "Hello Saturn Commit"

If we run git branch now, we see this output

* master

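meaning we are now on the master branch (which git created automatically upon our first commit). On newer setups the default branch may be named main instead, depending on the init.defaultBranch setting.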

If we take another look at .git/refs

└── refs
    ├── heads
    │   └── master
    └── tags

we see there is a file in the refs/heads sub-directory, and it is named master, just like our branch. It is a text file, so we can use cat to take a look at it

cat .git/refs/heads/master

and we see it contains a checksum

c641e4f0d19df0570667977edff860fed8f6c05a

and if we do

git log

we see it is the checksum of our last commit:

commit c641e4f0d19df0570667977edff860fed8f6c05a (HEAD -> master)
Author: zspajich <zspajich@gmail.com>
Date:   Mon Feb 12 16:28:44 2018 +0100

    Hello Saturn Commit

(Note: checksums will have different values on your computer)

So there we have it — a branch in git is just a text file containing a checksum of the last commit on that branch. In other words — a pointer to a commit.

A branch in git is just a pointer to a commit object
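Because a branch is just a small text file, reading it programmatically is trivial. The Python sketch below builds a throwaway directory that imitates the refs/heads layout shown above and reads the pointer back; `read_branch_tip` is a hypothetical helper name. It assumes the ref is stored as a loose file, which is what we see in this post; git can also keep refs in a single .git/packed-refs file, so in a real repository `git rev-parse master` is the more reliable way to ask.

```python
import tempfile
from pathlib import Path


def read_branch_tip(git_dir, branch):
    """Return the checksum stored in <git_dir>/refs/heads/<branch>."""
    ref_file = Path(git_dir) / "refs" / "heads" / branch
    return ref_file.read_text().strip()


# Imitate the layout from the post in a temporary directory (not a real repository).
with tempfile.TemporaryDirectory() as tmp:
    heads = Path(tmp) / "refs" / "heads"
    heads.mkdir(parents=True)
    (heads / "master").write_text("c641e4f0d19df0570667977edff860fed8f6c05a\n")

    tip = read_branch_tip(tmp, "master")
    print(tip)  # c641e4f0d19df0570667977edff860fed8f6c05a
```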

If we now create and check out a new feature branch

git checkout -b feature

and take another look at .git/refs

tree .git/refs

sure enough, we see another file called feature

└── refs
    ├── heads
    │   ├── feature
    │   └── master
    └── tags

and if we take a look at its checksum (pointer)

cat .git/refs/heads/feature

we see it’s the same as in the master file (branch)

c641e4f0d19df0570667977edff860fed8f6c05a

since we haven’t made any new commits on that branch.

Creating a new branch means creating a new pointer to the current commit

So that’s how fast and cheap creating a new branch in git is. Git just creates a text file and fills it with the checksum of the current commit.
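To show how little work that is, here is a rough Python imitation of the file-level effect of creating a branch. It is a sketch under assumptions: `create_branch` is a hypothetical helper working on a temporary directory, and real git does more (it validates the branch name and records the change in the reflog, for example).

```python
import tempfile
from pathlib import Path


def create_branch(git_dir, name, commit_sha):
    """Create refs/heads/<name> containing commit_sha; refuse to overwrite an existing branch."""
    ref_file = Path(git_dir) / "refs" / "heads" / name
    if ref_file.exists():
        raise FileExistsError(f"branch {name!r} already exists")
    ref_file.parent.mkdir(parents=True, exist_ok=True)
    ref_file.write_text(commit_sha + "\n")
    return ref_file


with tempfile.TemporaryDirectory() as tmp:
    sha = "c641e4f0d19df0570667977edff860fed8f6c05a"
    master_file = create_branch(tmp, "master", sha)
    feature_file = create_branch(tmp, "feature", sha)  # new pointer, same commit

    same_pointer = master_file.read_text() == feature_file.read_text()
    branch_names = sorted(p.name for p in (Path(tmp) / "refs" / "heads").iterdir())
    print(branch_names, same_pointer)  # ['feature', 'master'] True
```

No commit objects are copied or touched; the cost is one file holding a 40-character checksum, no matter how large the history is.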

But now that we have two branches, one question arises: how does git know which of the two we currently have checked out? Well, there is one more special pointer (whose name will probably sound familiar to you) called HEAD. It is special because it (usually) doesn’t point to a commit object but to a ref (branch), and git uses it to track which branch is currently checked out.

If we look inside HEAD

cat .git/HEAD

we see it currently points to the feature ref file (branch).

ref: refs/heads/feature

Special HEAD pointer tracks current ref/branch
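The two-step lookup (HEAD names a ref, the ref names a commit) can be sketched in Python as follows. `resolve_head` is a hypothetical helper, and the directory is a temporary imitation of .git, not a real repository. The "usually" in the text is covered by the second case: when HEAD holds a raw checksum instead of a `ref:` line, git calls this a detached HEAD.

```python
import tempfile
from pathlib import Path

PREFIX = "ref: "
HEADS = "refs/heads/"


def resolve_head(git_dir):
    """Return (branch name or None, checksum) for the HEAD file in git_dir."""
    git_dir = Path(git_dir)
    content = (git_dir / "HEAD").read_text().strip()
    if content.startswith(PREFIX):
        ref = content[len(PREFIX):]                    # e.g. refs/heads/feature
        sha = (git_dir / ref).read_text().strip()      # follow the pointer to the commit
        name = ref[len(HEADS):] if ref.startswith(HEADS) else ref
        return name, sha
    return None, content                               # detached HEAD: a raw checksum


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sha = "c641e4f0d19df0570667977edff860fed8f6c05a"
    (root / "refs" / "heads").mkdir(parents=True)
    (root / "refs" / "heads" / "feature").write_text(sha + "\n")

    (root / "HEAD").write_text("ref: refs/heads/feature\n")
    attached = resolve_head(root)

    (root / "HEAD").write_text(sha + "\n")
    detached = resolve_head(root)

    print(attached)
    print(detached)
```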

If we ran

git checkout master

and take a look at HEAD

cat .git/HEAD

we would see

ref: refs/heads/master

so HEAD would now point to the master branch.

HEAD points to master ref after checkout on master branch
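As far as the pointers are concerned, switching branches only rewrites the HEAD file. The Python sketch below imitates just that part in a temporary directory; `point_head_at` is a hypothetical helper, and it deliberately ignores everything else a real git checkout does, such as updating the working tree and the index.

```python
import tempfile
from pathlib import Path


def point_head_at(git_dir, branch):
    """Rewrite HEAD so that it references refs/heads/<branch>."""
    git_dir = Path(git_dir)
    if not (git_dir / "refs" / "heads" / branch).is_file():
        raise ValueError(f"no such branch: {branch}")
    (git_dir / "HEAD").write_text(f"ref: refs/heads/{branch}\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    heads = root / "refs" / "heads"
    heads.mkdir(parents=True)
    sha = "c641e4f0d19df0570667977edff860fed8f6c05a"
    (heads / "master").write_text(sha + "\n")
    (heads / "feature").write_text(sha + "\n")
    (root / "HEAD").write_text("ref: refs/heads/feature\n")

    point_head_at(root, "master")
    head_after = (root / "HEAD").read_text().strip()
    print(head_after)  # ref: refs/heads/master
```

The branch files themselves are left untouched, which matches what we saw above: only the content of HEAD changes.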

So that’s git’s branch model. It is very simple, but it is important to know in order to understand the many git operations that work on the commit graph (merge, rebase, checkout, revert, …).

In the next part of this series we will look at something we have skipped so far: git’s staging area. We all know we have to stage our changes before committing them, but what exactly is the staging area, or index, as it is sometimes called? We’ll see in the next post.