Jeromy Anglim's Blog: Psychology and Statistics


Monday, November 29, 2010

Getting Started with Git, EGit, Eclipse, and GitHub: Version Control for R Projects

This post covers (a) installing Git using the Eclipse plugin EGit, (b) uploading repositories to GitHub, and (c) resources on Git, Git and LaTeX, and Git and R. The focus is on version control for people working on R, Sweave, and LaTeX projects.

Overview

Version control works really well with R, Sweave, and LaTeX projects because their source files are plain text, so changes can be tracked and compared line by line.

Benefits of Version Control

There are many benefits to version control for the data analyst. Version control allows you to:

  • Rewind a project or a file to a previous state, which in turn encourages experimentation
  • Ensure there is a record of changes
  • Facilitate collaboration
  • Facilitate backup
  • Show changes between versions of a file
  • Facilitate code sharing and reproducibility
  • and much more... See this question on StackOverflow for further discussion.

I also found that adopting version control facilitated several conceptual benefits. It encouraged greater consideration of:

  • the distinction between source and derived files
  • the nature of dependencies:
    • dependencies between elements of code
    • dependencies between files within a project
    • and dependencies with files and programs external to the repository
  • the nature of a repository and how repositories should be divided
  • the nature of committing and documenting changes and project milestones
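
The distinction between source and derived files can be made concrete with an ignore list. The sketch below is illustrative only: the two extension sets are assumptions about a typical Sweave and LaTeX project rather than a list taken from any tool, and `classify` and `gitignore_lines` are hypothetical helper names.

```python
from pathlib import PurePosixPath

# Assumed extensions for files regenerated by running LaTeX or Sweave.
DERIVED_EXTENSIONS = {".aux", ".log", ".toc", ".out", ".bbl", ".blg", ".pdf"}
# Assumed extensions for files that are written by hand.
SOURCE_EXTENSIONS = {".r", ".rnw", ".tex", ".bib"}


def classify(filename):
    """Return 'source', 'derived', or 'unknown' based on the file extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in DERIVED_EXTENSIONS:
        return "derived"
    if suffix in SOURCE_EXTENSIONS:
        return "source"
    return "unknown"


def gitignore_lines(filenames):
    """Build sorted .gitignore patterns for the derived files in filenames."""
    suffixes = {
        PurePosixPath(name).suffix  # keep the original case for the pattern
        for name in filenames
        if classify(name) == "derived"
    }
    return sorted("*" + suffix for suffix in suffixes)


project_files = ["analysis.R", "report.Rnw", "report.aux", "report.log", "report.pdf"]
for line in gitignore_lines(project_files):
    print(line)
```

An extension-based rule is only a starting point. A `.tex` file that Sweave produces from an `.Rnw` file is itself derived, even though `.tex` files are usually source, so the decision sometimes depends on how a file was created rather than on its name.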

Choosing a Version Control System and Workflow

Why Git?

There are many version control systems (see Plastic SCM for a discussion). I've chosen to use Git for the following reasons:

  • Git can work well with Eclipse and Windows using EGit and many other tools
  • Git is one of the most popular version control systems
  • Git enables uploading to GitHub
  • Git has good documentation and support material
  • Experts, who know a lot more about version control than I do, use Git (e.g., Hadley Wickham); the designer of Git is Linus Torvalds.

Finally, the choice between version control systems matters far less than the choice between using a version control system and not using one.

Why EGit?

EGit is a Git plugin for Eclipse. I use Eclipse and StatET to write R code and Sweave documents. I found EGit a particularly easy tool for getting started with Git and version control. The documentation is straightforward and the interface is easily integrated into my Eclipse workflow.

Getting Started with EGit and Git in Eclipse

There are many ways to interact with Git.

Installing EGit in Eclipse involves using the update manager. Vogella.de has a tutorial.

To get started with your first Git repository in Eclipse, check out the EGit User Guide. When I was first getting started I used a simple R project rather than a Java Hello World application.
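
For readers who want to see what creating a first repository looks like outside the Eclipse menus, the sketch below lists the command-line steps that roughly correspond to it. It is not a description of what EGit does internally. The helper names `first_commit_commands` and `run_all` and the commit message are hypothetical, and actually running the commands assumes the `git` command-line client is installed and a user name and email are configured.

```python
import shlex
import subprocess


def first_commit_commands(message):
    """Return the git commands for a first commit, as argument lists."""
    return [
        ["git", "init"],                   # create an empty repository
        ["git", "add", "."],               # stage every file in the folder
        ["git", "commit", "-m", message],  # record the staged snapshot
    ]


def run_all(commands, project_dir):
    """Run each command inside project_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=project_dir, check=True)


# Print the commands instead of running them, so nothing is changed on disk.
for command in first_commit_commands("Initial commit of R project"):
    print(shlex.join(command))
```

Calling `run_all(first_commit_commands("Initial commit of R project"), project_dir)` on a small R project folder would perform the three steps in order.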

GitHub

GitHub is one of several sites for sharing Git repositories (for example, see Hadley Wickham's baby names analysis, or my own example of using Sweave to write Multiple Choice Questions). It also has many useful social networking features.

Uploading a repository to GitHub from Eclipse

While the above tutorial briefly mentions SSH Configuration, it does not go into detail. When setting up my SSH key, I did the following:

  1. Click Eclipse -- Window -- Preferences -- General -- Network Connections -- SSH2 -- Key Management
  2. Click on Generate DSA Key
  3. Type in a passphrase (i.e., a long and robust password)
  4. Click Save Private Key (I saved it to a new folder under my user account)
  5. Go to this new folder and open "id_dsa.pub" as a plain text file and copy the contents of the file to the clipboard.
  6. Go to github.com -- Account Settings -- SSH Public Keys and click Add another public key.
  7. Paste the public key into the box and give it a name
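
Step 5 is an easy place to copy the wrong file, for example the private key instead of the public one. The sketch below checks that a piece of text looks like a public key before it is pasted. It assumes the public key file uses the single-line OpenSSH layout (key type, base64 data, optional comment); if the file uses another layout, the check does not apply. `looks_like_public_key` is a hypothetical helper, and the example key is fabricated filler, not a usable key.

```python
import base64
import struct


def looks_like_public_key(text):
    """Return True if text is one line of the form: type, base64 blob, optional comment."""
    lines = text.strip().splitlines()
    if len(lines) != 1:
        return False  # private keys span many lines
    fields = lines[0].split(None, 2)
    if len(fields) < 2:
        return False
    key_type, blob_b64 = fields[0], fields[1]
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except ValueError:
        return False  # the second field is not valid base64
    if len(blob) < 4:
        return False
    # The blob starts with a 4-byte big-endian length, then the key type name.
    (name_length,) = struct.unpack(">I", blob[:4])
    embedded_type = blob[4:4 + name_length]
    return embedded_type == key_type.encode("utf-8")


# Fabricated blob with the right header but meaningless key material.
fake_blob = struct.pack(">I", 7) + b"ssh-dss" + b"\x00" * 8
example = "ssh-dss " + base64.b64encode(fake_blob).decode("ascii") + " me@example"
print(looks_like_public_key(example))
```

The check only confirms the layout. It does not verify that the key is valid or that it matches the saved private key.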

Using gist.github

Gists provide a quick way to get started with GitHub. Gists are useful for storing and sharing snippets of code. The result can be embedded into blog posts. To get formatted R code, give the file name a ".r" file extension (e.g., "test.r"); thanks to Hadley Wickham for the tip.

A simple example of an embedded gist is shown below:

Interesting R GitHub repositories

Good examples of people sharing R projects on GitHub include:

See the suggestions on Stats.SE. I also have a GitHub account in case you are interested.

Additional Resources

General Git

Git and LaTeX

Git and R