2 The Basic Git Workflow

Basic Git use includes how to create a repository, track changes in the files within your repository, view your repository history, and view line by line changes in modified files. The title for each section in this chapter will be a basic Git command with the text, code blocks, and figures in each section describing the use and result for each command.

2.0.1 git init

If you have an empty folder that you would like make a git repository git init is the appropriate command. In the example below I have an empty folder named “git_practice” on S drive under S:/RTS/Reimer/Research_best_practices. The terminal session below shows 3 commands and the output received after each command. The first command, git status is a general diagnostic of the Git repository. This command is commonly used when working in the terminal.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice
$ git status
fatal: not a git repository (or any of the parent directories): .git

The second command uses git init to make the working directory into a Git repository.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice
$ git init
Initialized empty Git repository in //dfg.alaska.local/DSF/Anchorage/RTS/Reimer/Research_Best_Practices/git_practice/.git/

The third command used git status to verify the working directory is now a git repository.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git status
On branch {main}

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore
        git_practice.Rproj

nothing added to commit but untracked files present (use "git add" to track)

The git init command creates a git repository in your project directory. To help conceptualize the use of Git we will be developing a series of figures to demonstrate how files are introduced into the workflow and move through the Git work space once there. At this point, a few files are in the working directory and while a repository has been created we have yet to commit anything to the repository. The rounded rectangle in Figure 2.1 indicates the contents of your working directory at this point in time. At the moment they are represented by an empty rectangle which indicates that the files in your working directory have yet to be staged. We will build this figure as we complete tasks within the Git work space.

Figure 2.1: The Git workspace after you have initilized a repository.

2.0.2 git add

In the last git status report shown above 2 files (.gitignore and git_practice.Rproj) were noted that could be added to the repository. Before we do that let’s create third file named fib_seq.R which contains a brief header and a single line of code fib_seq <- c(0, 1). The Fibonacci sequence is the sequence created when each value in the sequence is defined as the sum of the 2 previous values in the sequence and the vector c(0, 1) initializes the sequence. We will add to this sequence to practice the use of Git. The terminal session below shows 5 commands necessary to add or stage the files in our working directory. You stage a file when you would like Git to keep track of changes in that file through time or when that file is an important piece of the analysis that would be required for a peer to recreate your work. As usual, git status is a good place to start and allows us to verify which files can be added/staged.

$ git status
On branch {main}

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore
        fib_seq.R
        git_practice.Rproj

nothing added to commit but untracked files present (use "git add" to track)

We can then use git add to stage each file one at a time. Notice that I made a typo while attempting to add the git_practice.Rproj file. I survived this catastrophe with a warning, which is typical of mistakes in the terminal.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git add .gitignore

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git add fib_seq.R

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git_add git_practice.Rproj
bash: git_add: command not found

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git add git_practice.Rproj

A second call to git status allows us to verify all files are staged.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git status
On branch {main}

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   .gitignore
        new file:   fib_seq.R
        new file:   git_practice.Rproj

Use of git add stages files you would like to track in your git repository. The rectangle in Figure 2.2 is now filled which indicates that files within your working directory are staged and ready to be committed.

Figure 2.2: The Git workspace after you have staged files in your working directory which you intend to add to your local repository.

Notice Git tells you how to unstage a file if you added one inadvertently. On occasion there are files in your working directory which you do not want Git to track. Examples might be .pdf files for literature you referenced while conducting the analysis, word documents you produced for operational planning and reporting, or extremely large outputs. It’s fine to exclude these sort of files but before you do so consider… “Would a future researcher need access to this file to recreate my work?”. If they would you should track them in the repository.

It can be cumbersome to have a long list of files which Git recognizes as present in your working directory but you are not actively tracking. The solution is to open the file .gitignore and include the names of the files you do not want to track. You can use wildcards if you prefer not to track all files of a certain type and/or specify folders if you don’t want to track anything in certain sub-directories. For example, *.xlsx would ignore all .xlsx files in your working directory while posts/ would ignore all of the files in the folder posts within your working directory.

2.0.3 git commit

In the git status response above 3 files were staged. Let’s commit those files in the terminal. When you make a commit you are asking Git to keep a copy of your staged files at that particular point in time. In the terminal session below we use git status to verify all of the files we want to commit are staged.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git status
On branch {main}

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   .gitignore
        new file:   fib_seq.R
        new file:   git_practice.Rproj

The command git commit is then run, with the option -m included to specify the commit message. Notice the -m option is specified twice. The first time -m is specified Git knows the text string following is the commit title. The second time -m is specified Git knows the text string following is the commit description. Commit titles are required while commit descriptions are highly encouraged. A good practice is for the title to be brief (less that 50 characters) so that it displays well in most formats. There is no length limit for the description and this is the place to provide some explanation beyond what you can capture in the title. I’ve purposely been verbose with the commit above to demonstrate a long description. Please avoid non-informative commit messages and avoid the tendency to fatigue and include less and less informative messages as the analysis proceeds.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git commit -m "Initialize Fibonacci sequence" -m "Sequential additions to the Fibonacci sequence will provide a simple way to demonstrate several cycles of the git workflow including add/commit, push/pull, collaborate, fork, branch, merge, merge conflicts, exc. I even got my wife a GitHub account for this. Wish us luck!"
[{main} e17181f] Initialize Fibonacci sequence
 Date: Sun Jul 2 12:59:06 2023 -0800
 3 files changed, 21 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 fib_seq.R
 create mode 100644 git_practice.Rproj

The git status output after the commit shows that all of the files in the working directory are included in the repository in their current state.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git status
On branch {main}
nothing to commit, working tree clean

The git log command provides a summary of the commit. Two important parts are the commit ID and the commit message. The commit ID is a code which can be used to reference the commit in the future. Git assigned a long ID to each commit (e17181fa781b2e30096e1c7d31443aac18d527e5 for this commit) but its common to using only the first 7 characters of the commit ID (e17181f) to refer to the commit.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git log
commit e17181fa781b2e30096e1c7d31443aac18d527e5 (HEAD -> {main})
Author: Adam Reimer <adam.reimer@alaska.gov>
Date:   Sun Jul 2 12:59:06 2023 -0800

    Initialize Fibonacci sequence

    Sequential additions to the Fibonacci sequence will provide a simple way to demonstrate several cycles of the git workflow including add/commit, push/pull, collaborate, fork, branch, merge, merge conflicts, exc. I even got my wife a GitHub account for this. Wish us luck!

Use of git commit includes the files you would like to track in your git repository. The working directory in Figure 2.3 now shows an empty rectangle (as the files are no longer staged) but also includes a entry in the local repository identified with a short ID.

Figure 2.3: The Git workspace after you have committed your staged files to your local repository.

The git workflow described so far forms the basis on reproducible research using Git. We will calculate the next several values in the Fibonacci sequence to practice this workflow. The same sequence described above is repeated:

A change is made to fib_seq.R (in this case a new line fib_seq[3] <- fib_seq[1] + fib_seq[2] is added) and saved to the working directory. The working directory now represents a more recent snapshot of time that the local repository.
The changed file is staged.
The staged file is committed.
Repeat steps 1 through 3.

The process looks like:

2.0.3.1 When to Commit?

Saves and a commits serve different purposes. As we all know, save can and should be used frequently… many times an hour and/or any time you are stepping away from your work. This use is agnostic to whether the analyst is or is not using a git workflow.

In contrast, commits are made for two reasons. First, a commit should be made whenever the analysis is at a point which you may want to revisit. Examples include: adding new data, adding a new feature to the analysis, or any time the code was run and the results were distributed. Any one of these tasks may have resulted in a new ‘version’ in the traditional workflow but they don’t have to be major updates. A second reason to commit is when the changes are substantive enough that the line-by-line change may be difficult to track if you did not commit until the new data/features are complete. These commits snapshot significant steps in a new feature’s development or prior to experimenting with a new feature.

The most important thing to note regarding commits is that, unlike save, there is no temporal component. While saves are designed to minimize the risk of lost work and should be frequent in time, commits are designed to record importance stages of the analysis and should be frequent with respect to progress. A difficult feature may take days to code but represent a single commit, provided the actual changes to the code are modest. Efficiency in commit frequency will pay off when the repository is being revisited at a later date and each commit is a snapshot of the analysis that is important or informative to the reviewer. Note that the commits demonstrated in this text are purposely pedantic but a good rule of thumb is that if you can’t describe why you are committing your files it may be an unnecessary commit.

2.0.4 git log

To view our commit history in the terminal use git log. Note that the commit ID, author, date, commit title and commit description are all shown in the log. Note on large projects calling git log can result in a substantial amount out output which can be difficult to exit (type q at any time and you will exit)¹.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git log
commit 11cf98ff67b8ec4f8cd7f2c1650a176d5875fdcf (HEAD -> {main})
Author: Adam Reimer <adam.reimer@alaska.gov>
Date:   Sun Jul 2 14:53:05 2023 -0800

    Fourth entry in the Fibonacci sequence

    Long and informative message goes here.

commit 3bb6c98bb0048bad7bda489bd8d40be24fb66acf
Author: Adam Reimer <adam.reimer@alaska.gov>
Date:   Sun Jul 2 14:31:04 2023 -0800

    Third entry in fib_seq

    This message is not necessary for such a simple commit, but descriptions are an important part of reproducible research I’m writing a long message to set a good example.  Have better content in yours.

commit e17181fa781b2e30096e1c7d31443aac18d527e5
Author: Adam Reimer <adam.reimer@alaska.gov>
Date:   Sun Jul 2 12:59:06 2023 -0800

    Initialize Fibonacci sequence

    Sequential additions to the Fibonacci sequence will provide a simple way to demonstrate several cycles of the git workflow including add/commit, push/pull, collaborate, fork, branch, merge, merge conflicts, exc. I even got my wife a GitHub account
for this. Wish us luck!

2.0.5 git diff

To see the difference between two commits use git diff. With no additional arguments git diff will show the changes in the working directory relative to the last commit. Our working directory has no changes as illustrated by git status.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git status
On branch {main}
nothing to commit, working tree clean

If we add a commit ID to the git diff command the output will show how the tracked files have changed relative to the point in time when commit 3bb6c9 was made.

amreimer@DFGSXQDSF223076 MINGW64 /s/RTS/Reimer/Research_Best_Practices/git_practice ({main})
$ git diff 3bb6c9
diff --git a/fib_seq.R b/fib_seq.R
index 9c118d5..4ce1d70 100644
--- a/fib_seq.R
+++ b/fib_seq.R
@@ -2,4 +2,5 @@
 #Adam Reimer

 fib_seq <- c(0, 1)
-fib_seq[3] <- fib_seq[1] + fib_seq[2]
\ No newline at end of file
+fib_seq[3] <- fib_seq[1] + fib_seq[2]
+fib_seq[4] <- fib_seq[2] + fib_seq[3]
\ No newline at end of file

Git GUI’s are superior to the terminal when looking at line-by line differences but for completeness we will discuss how to read the output. The section --- a/fib_seq.R to +++ b/fib_seq.R identifies the files that were modified where --- a/ and +++ b refer to the previous and the current versions of the file respectively. The line @@ -2,4 +2,5 @@ tells us that the output is showing the original file starting on the second line and displaying the the next 4 lines (three unmarked lines and the line with a negative symbol) while the new version of the file is also shown starting from the second line but displaying the next 5 lines (the three unmarked lines and the two lines with an addition symbol). This makes sense because a single line was added to the new version.