Jeromy Anglim's Blog: Psychology and Statistics


Wednesday, May 28, 2014

Customising ProjectTemplate in R

This post talks about my workflow for getting started with a new data analysis project using the ProjectTemplate package.

Update (24th August 2016)

Over the last two years, I have been refining this customised version of ProjectTemplate.
I have more detailed information about the latest version here.

Video at Melbourne R Users July 4th 2017

Overview of ProjectTemplate

ProjectTemplate is an R package which facilitates data analysis, encourages good data analysis habits, and standardises many data analytic steps. After many years of refining a data analysis workflow in R, I realised that I'd basically converged on something similar to ProjectTemplate anyway. However, my approach was not quite as systematic, and it took more effort than necessary to get started on a new project. Thus, since late 2013, I've been using ProjectTemplate to organise my R data analysis projects.
While I have found ProjectTemplate to be an excellent tool, I realised that when I created a new data analysis project based on ProjectTemplate, I was repeatedly making a large number of customisations to the initial set of files and folders. Thus, I've now set up a repository to store these customisations so that I can get started on a new data analysis project more efficiently. The purpose of this post is to document these modifications.
This post assumes a reasonable knowledge of R and ProjectTemplate. If you're not familiar with ProjectTemplate, you could check out the ProjectTemplate website, focusing particularly on the Getting Started section. If you're really keen, you could also watch an hour-long video on ProjectTemplate, RStudio, and GitHub.

General setup

I have a copy of my customised version of the ProjectTemplate directory and file structure on GitHub in the AnglimModifiedProjectTemplate repository. Specifically, it has:
  1. Modifications to global.dcf as described below
  2. A blank readme.md
  3. A couple of directories that I don't use removed (e.g., diagnostics, logs, profiling)
  4. An initial rmd file in the reports directory with the customisations mentioned below
  5. An .Rproj RStudio project file to enable easy launching of RStudio
  6. An additional output directory for storing tabular, text, and other output
Thus, whenever I want to start a new data analysis project, I can download and extract the zip file of the repository on GitHub.
After creating a project folder, the following steps can be skipped when using my customised template:
  • Open RStudio and create RStudio Project in existing directory
  • Create ProjectTemplate folder structure with library(ProjectTemplate); create.project()
  • Move ProjectTemplate files into folder
  • Modify global.dcf
  • Setup rmd reports
I also document below a few additional points about subsequent steps including:
  • Setting up the data directory
  • Updating the readme file
  • Setting up the git repository
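
As a rough sketch of the "download and extract" step, the following base R code copies an already extracted template folder into a new project folder. The function name `new_project_from_template` and the folder names are hypothetical, and the stand-in template is built in a temporary directory so the example runs on its own; in practice `template_dir` would point at the extracted copy of the repository.

```r
# Hypothetical helper: copy an extracted template into a new project folder.
new_project_from_template <- function(template_dir, project_dir) {
  stopifnot(dir.exists(template_dir))
  if (dir.exists(project_dir)) {
    stop("Project directory already exists: ", project_dir)
  }
  dir.create(project_dir, recursive = TRUE)
  # Include hidden files (e.g. .gitignore) but skip "." and ".."
  items <- list.files(template_dir, all.files = TRUE, no.. = TRUE,
                      full.names = TRUE)
  copied <- file.copy(items, project_dir, recursive = TRUE)
  invisible(all(copied))
}

# Demo with a stand-in template built in a temporary location
template <- tempfile("template-")
dir.create(file.path(template, "config"), recursive = TRUE)
writeLines("as_factors: off", file.path(template, "config", "global.dcf"))
writeLines("", file.path(template, "README.md"))

project <- tempfile("project-")
new_project_from_template(template, project)
list.files(project, recursive = TRUE)
```

The helper refuses to overwrite an existing folder, which seems the safer default when starting a new project.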

Modifying global.dcf

My preferred starting global.dcf settings are:
data_loading: on
cache_loading: off
munging: on
logging: off
load_libraries: on
libraries: psych, lattice, Hmisc
as_factors: off
data_tables: off
A little explanation:
  • as_factors: I do quite a bit of string processing, particularly on metadata and on output tables. I find the automatic conversion of strings into factors really annoying, so I set this to off.
  • load_libraries: I always have additional libraries so it makes sense to have this on.
  • libraries: There are many common packages that I use, but I almost always make use of the comma-separated list of packages above.
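
To check the settings from within R, the file can be read with base R's `read.dcf()`, assuming it is a plain "field: value" file as its extension suggests. This is only a sketch for inspecting the file, not a description of how ProjectTemplate parses it internally. The settings are written to a temporary file here so the example is self-contained; in a real project the path would be config/global.dcf.

```r
# Write the settings above to a temporary file (stand-in for config/global.dcf)
dcf_path <- tempfile(fileext = ".dcf")
writeLines(c(
  "data_loading: on",
  "cache_loading: off",
  "munging: on",
  "logging: off",
  "load_libraries: on",
  "libraries: psych, lattice, Hmisc",
  "as_factors: off",
  "data_tables: off"
), dcf_path)

# read.dcf() returns a character matrix with one column per field
settings <- read.dcf(dcf_path)
config <- as.list(settings[1, ])

# Split the comma-separated package list into a character vector
packages <- strsplit(config$libraries, ",\\s*")[[1]]

config$as_factors
packages
```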

Setup rmd files

Basics of such files

The first line in the first chunk is always:
```{r}
library(ProjectTemplate); load.project()
```
This loads everything required to get started with the project.

Setup data folder

ProjectTemplate automatically names each resulting data.frame based on the file name. This is convenient. However, the file names of supplied raw data often need to be changed, or the original data format may not be well suited to importing. In that case, I store the raw data in a separate folder called raw-data and then export or create a copy in the desired format with the desired name in the data folder.
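
A minimal sketch of this raw-data to data step, using base R only. The file names are hypothetical and the folders are created in a temporary location so the example runs on its own; the idea is simply to save a copy under the short name wanted for the data frame.

```r
# Hypothetical folder layout inside a temporary project directory
project <- tempfile("project-")
dir.create(file.path(project, "raw-data"), recursive = TRUE)
dir.create(file.path(project, "data"))

# Stand-in for a raw file supplied with an awkward name
raw_file <- file.path(project, "raw-data", "Survey Export (final) 2014.csv")
write.csv(data.frame(id = 1:3, score = c(10, 12, 9)), raw_file,
          row.names = FALSE)

# Read the raw file and save a copy under the desired name in the data folder
survey_raw <- read.csv(raw_file, stringsAsFactors = FALSE)
write.csv(survey_raw, file.path(project, "data", "survey.csv"),
          row.names = FALSE)

list.files(file.path(project, "data"))
```

Keeping the original file untouched in raw-data means the copy in data can always be regenerated.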

Overriding default data import options

Some data files cannot be imported using the default data import rules. Of course, you can change the file to comply with the rules. Alternatively, I think the standard solution is to add a file in the lib directory (e.g., data-override.r) that imports the data files. Give each imported data frame the same name that ProjectTemplate would have given it.
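
As an illustration, such an override file might look something like the sketch below. The awkward file (two lines of notes before the header, semicolon-separated fields) is a made-up stand-in written to a temporary file, and the object name `survey` is an assumption about the name the data frame would otherwise have had.

```r
# Hypothetical contents of lib/data-override.r.
# Stand-in for a file that needs non-default import options:
# two lines of notes before the header, and semicolon-separated fields.
awkward_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Exported by survey tool",
  "Do not edit",
  "id;group;score",
  "1;a;10",
  "2;b;12"
), awkward_file)

# Import with explicit options, using the name the data frame
# would otherwise have had (assumed here to be `survey`)
survey <- read.table(awkward_file, sep = ";", skip = 2, header = TRUE,
                     stringsAsFactors = FALSE)
str(survey)
```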

Update readme

I rename the file to README.md to make it clear that it is a markdown formatted file. I can then add a little information about the project.

Setup git repository

If using GitHub, I create a new repository on GitHub.

Output folder

A common workflow for me is to generate table, text, and figure output from the script, which is then incorporated into a manuscript document. While I really like Sweave and RMarkdown, I often find it more practical to write a manuscript in Microsoft Word. I use the output folder to store tabular output, standard text output, and figures.
In the case of tabular output, there is the task of ensuring the table is formatted appropriately (e.g., desired number of decimal places, cell alignment, cell borders, font, cell merging, etc.). I typically find this easiest to do in Excel. Thus, I have a file called output-processing.xlsx. I import the tabular data into this file and apply relevant formatting. This can then be incorporated into the manuscript. Here are a few more notes about Table conversion in MS Word.
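
A minimal sketch of writing tabular output to the output folder so it can be brought into Excel for formatting. The helper name `write_output_table` and its arguments are hypothetical, and the demo writes to a temporary folder rather than a real project's output directory.

```r
# Hypothetical helper: round numeric columns and save a table as CSV
write_output_table <- function(x, name, output_dir = "output", digits = 2) {
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  numeric_cols <- vapply(x, is.numeric, logical(1))
  if (any(numeric_cols)) {
    x[numeric_cols] <- lapply(x[numeric_cols], round, digits = digits)
  }
  path <- file.path(output_dir, paste0(name, ".csv"))
  write.csv(x, path, row.names = FALSE)
  invisible(path)
}

# Demo: descriptive statistics from a built-in dataset
descriptives <- data.frame(
  variable = c("mpg", "wt"),
  mean = c(mean(mtcars$mpg), mean(mtcars$wt)),
  sd = c(sd(mtcars$mpg), sd(mtcars$wt))
)
out_path <- write_output_table(descriptives, "descriptives",
                               output_dir = file.path(tempdir(), "output"))
read.csv(out_path)
```

Rounding in R only sets the stored precision; alignment, borders, and merged cells would still be handled in the spreadsheet as described above.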