Jeromy Anglim's Blog: Psychology and Statistics


Monday, September 12, 2016

Suggestions for how R and RStudio could improve auto-completion and usability of R

RStudio has improved the power of auto-completion in R and generally increased usability. However, there remains potential to improve discoverability and usability further. There are also coding practices that R package authors can adopt, both to work better with auto-complete and to make the features of their packages more discoverable. Having used and taught R for the last ten years, I outline in this post what I see as the major areas for potential improvement.
R has a reputation as being efficient once you know how it works, but difficult to learn.
Auto-completion increases coding productivity.
  • Users don't have to memorise the precise spelling of the name of every function, argument name, data frame variable, and argument value. It also helps to resolve the issue of the wide range of coding conventions in R (camelCase, dot.names, under_score_names, etc.). 
  • It means that users can focus more on coding and less on looking up help files for the precise phrasing of some low level feature, or constantly typing dput(names(mydata)) to get lists of variable names.
  • New users may also know what they are looking for, but not know how to obtain it. Auto-completion can facilitate this.
My general conclusion is that auto-completion needs to be taken more seriously in R. RStudio has done a great job of implementing auto-completion. I also think that the R language and package authors could incorporate features to work better with IDEs that implement auto-complete. 

Auto-completion of arguments that take a character variable

Many functions have multi-category options (e.g., method of correlation, missing data procedure for a table, type of factor analysis rotation). It would be good to have auto-completion on these values.

Example 1: If I have missing data when computing a correlation matrix with cor, then I use the "use" argument to specify how the missing data should be handled. It would be good if code completion operated on the available options. That said, at least RStudio automatically shows the argument description, which lists the options.

Example 2: The options for useNA and exclude arguments of table

Example 3: The rotation argument for factanal does not list available rotations. The help only states that the default argument is "varimax" and that there are other rotations in some other packages, although the help file does show "promax" as another option.

Recommendation:

  • Package authors should ensure that the help files list all argument options in the "arguments" section of the help file. If using "see details", at least list the permissible option names in the arguments section and use "see details" for actually putting the details of what each of the arguments means. RStudio displays the argument information in auto-complete. Often a user just wants to be reminded of the precise spelling for the argument option or wishes to get an overview of the choices.
  • It should be possible to enable auto-completion on the available options. I imagine this would involve the specification of additional language features in R which would then be detected by IDEs like RStudio.
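
As a rough illustration of what base R already allows: when a function declares its permitted values as the default vector of an argument and validates them with match.arg(), the options can be read from the function's formals. The helper arg_choices() and the function my_rotate() below are hypothetical names introduced for this sketch; they are not part of R or RStudio.

```r
# Hypothetical helper (not part of base R): evaluate the default expression
# of an argument, which is where match.arg()-style functions conventionally
# declare their permitted values.
arg_choices <- function(fun, arg) {
  eval(formals(fun)[[arg]])
}

arg_choices(cor, "method")    # choices declared for cor(method = )
arg_choices(table, "useNA")   # choices declared for table(useNA = )

# A made-up function that follows the same convention
my_rotate <- function(x, rotation = c("varimax", "promax", "none")) {
  rotation <- match.arg(rotation)  # errors on anything not in the list
  rotation
}
my_rotate(1)             # nothing supplied, so the first choice is used
my_rotate(1, "promax")
```

This only works where the author has followed the convention. The "use" argument of cor, for instance, declares a single default string, so its other options cannot be recovered this way, which is the kind of gap described above.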

Auto-completion for nested ellipsis arguments

Ellipsis arguments (...) allow for flexibility. However, they also decrease usability because users are less clear on what arguments can be passed to a function. This is particularly true for arguments to methods like print and summary.

Example 1: I'm running a factor analysis
fit <- factanal(matrix(rnorm(1000), ncol = 10), 2)

The code for printing the loadings has several arguments, including "sort" and "cutoff", i.e.,
print(fit, sort = TRUE, cutoff = .5)


But auto-complete doesn't see these arguments, even though RStudio generally does a pretty good job of finding arguments. It seems that these arguments belong to "print.loadings" as opposed to "print.factanal". Thus, if you go:
loads <- fit$loadings
Then, pressing tab after 
print(loads, 
will show the cutoff and sort arguments.
However, it seems that RStudio is only able to go one layer deep.

I imagine that this is a hard one to solve.
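
In the meantime, the arguments hidden behind the ellipsis can be inspected by hand. The following is a minimal sketch using getS3method() from base R to compare the formal arguments of the two print methods; the variable names are just for illustration.

```r
# The factanal and loadings print methods are separate S3 methods;
# getS3method() retrieves them even though they are not exported.
print_factanal <- getS3method("print", "factanal")
print_loadings <- getS3method("print", "loadings")

names(formals(print_factanal))  # arguments visible at the top level
names(formals(print_loadings))  # arguments that only arrive via ...
```

Of course, this requires the user to already know that the loadings have their own print method, which is exactly the discoverability problem.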

Auto-completion of variable names in data frames

There is limited auto-completion support in RStudio for names in data frames, although it has improved. You can type mydata[, {tab} and get the variable names. However, typing mydata[,c(" {tab} does not complete them.

Recommendations:
  • RStudio should also auto-complete variable names after mydata[,c(" (i.e., after quotation marks). Presumably that is how many users select variables, and it is at that point that they realise they can't remember the precise spelling and need to tab complete.

Auto-completion on formulas

Many functions in R use formulas. Most notable are model fitting functions like lm and glm. However, there is no support in RStudio for auto-completing variable names in formulas. Some of the impediments to this: Formulas come before listing the data.frame in most functions (e.g., lm). So if there are multiple data frames in the workspace, then it would be a little tricky to know which to list. 
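
One way to think about the multiple data frame problem is to narrow the candidates using the variables already typed in the formula. The sketch below is purely illustrative: candidate_frames() is a hypothetical helper, and nothing here reflects how RStudio actually works.

```r
# Hypothetical helper: list the data frames in an environment that contain
# every variable named so far in a (partially written) formula.
candidate_frames <- function(formula, envir = globalenv()) {
  vars <- all.vars(formula)                 # variable names in the formula
  objs <- mget(ls(envir), envir = envir)    # all objects in the environment
  dfs <- Filter(is.data.frame, objs)        # keep only data frames
  names(Filter(function(d) all(vars %in% names(d)), dfs))
}

# Example workspace with two data frames
ws <- new.env()
assign("cars_df", mtcars, envir = ws)
assign("iris_df", iris, envir = ws)

candidate_frames(mpg ~ wt, envir = ws)   # only cars_df has both mpg and wt
```

An IDE could then offer the remaining variable names from whichever data frames are still candidates.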

Auto-completion in the Hadleyverse (e.g., ggplot2) and other functions where a data frame is one argument and variable names are another

Hadley Wickham's packages are awesome. However, they have a particular coding style. In particular, a data frame is commonly one argument (e.g., the first) and variable names are specified as a separate argument; often this is done without quotation marks and in a slightly separate context to the specification of the data frame. For example, in the following context:
ggplot(mydata, aes(my_very_long_variable_name))
There is no auto-completion in RStudio for the variable my_very_long_variable_name.

Similar coding rules apply to a wide range of functions where variable names are specified in a separate argument to the data.frame (e.g., see many of the dplyr and tidyr functions, but also base R functions like subset and reshape).   These functions would be so much easier to use if there was auto-completion of variable names in these contexts. 

One approach would just be to show auto-completion of variable names of data.frames in more places. However, this could get noisy. Another approach would require a deeper understanding of the language. Presumably this could be done on an ad hoc basis. For example, RStudio could hard code ggplot2 features to know when auto-completion on variable names should occur. Otherwise, perhaps there could be a convention for how package authors could speak to IDEs that want auto-completion information, and a more general way of indicating that auto-completion software should look at the preceding data.frame for the variables.

Auto-completion for function arguments that take lists

There are many functions that have an argument that takes a named list:
  • nls(..., control = list(...))
  • ProjectTemplate::load.project(override.config = list(...))
There is no auto-completion on what are the allowed named elements.

Recommendations:
  • Package authors: should include the list of permissible argument names in the argument section of the help file so that auto-completion software could quickly show this information.
  • R language: There should be a way to specify the permissible arguments, which could then be incorporated into some form of auto-complete in RStudio.
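
Some functions already come close to this by pairing the list argument with a constructor function. For nls, the control list can be built with nls.control(), and the constructor's formals give the permitted element names. A minimal sketch:

```r
# The formals of the constructor are the permitted names for the control list
names(formals(nls.control))

# Building the list through the constructor means the element names are
# ordinary function arguments rather than free-form list names.
ctrl <- nls.control(maxiter = 100)
ctrl$maxiter
```

Not every function with a list argument has such a constructor, so this is a convention package authors could adopt rather than a general solution.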

Some other issues

The following are some other related issues that link with the issue of auto-completion. 

Make more model fit information accessible from the fit object

An attractive feature of SPSS and related software is that you get a lot of output and there is often a GUI that allows you to select the output that you want. R model output tends to be brief, and if you want additional output, you need to ask for it. This is good also, but how to obtain the additional output could be more intuitive. For example, there is a lot of different information that you might want to obtain from a multiple regression (influence statistics, standardized coefficients,  zero-order correlations between predictors and outcome, and so on). One of the challenges is that the model in R is often of the form: (1) return fit, (2) run function or method on that fit object. However, for a new user, it is often difficult to discover what are the available functions and methods that are required to derive a relevant bit of information from an R fit object. 

It would be nice if it was as simple as typing fit. {tab} and you would get a big list of things that you might want to obtain.
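
For comparison, here is roughly what a user currently has to know in order to explore an lm fit. This is a sketch using the built-in mtcars data; the choice of model is arbitrary.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

names(fit)               # components stored in the fit object
methods(class = "lm")    # S3 methods registered for lm objects

# A few of the things a user might be looking for
head(influence.measures(fit)$infmat)  # influence statistics
cor(mtcars[, c("mpg", "wt", "hp")])   # zero-order correlations
```

Each of these lives in a different place, and none of them is suggested by the fit object itself.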

Avoid printing output to the screen that can not easily be extracted

R generally makes reproducible analysis easier to perform. A common use case is to take the output of a function and use that output in a subsequent function. This can be as simple as creating a table that combines different elements (e.g., coefficients from multiple models along with fit statistics).

However, some functions print the statistics you want to the screen, but these numbers are not readily available. In general, this means that the print function is performing the calculations and printing them to the screen, without ever storing the results in an object.

Example 1: The print method for factanal prints proportion variance explained for each factor. This is calculated in the print function but is not accessible. If you didn't know how to calculate this yourself, you would have to know that getAnywhere(print.factanal) is the incantation for seeing how R calculates it, and then you'd have to extract the code that does it.
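
A minimal sketch of computing this by hand, assuming the usual definition for standardised variables (sum of squared loadings for each factor divided by the number of variables). It uses the ability.cov data that ships with R rather than the random data above, and the variable names are just for illustration.

```r
# ability.cov ships with R and is used in the factanal help examples
fit <- factanal(factors = 2, covmat = ability.cov)

loads <- unclass(fit$loadings)          # plain matrix: variables x factors
ss_loadings <- colSums(loads^2)         # sum of squared loadings per factor
prop_var <- ss_loadings / nrow(loads)   # proportion of variance per factor
cum_var <- cumsum(prop_var)             # cumulative proportion

round(rbind(ss_loadings, prop_var, cum_var), 3)
```

The point is that a user has to reconstruct this calculation, rather than simply extracting the numbers from the fit.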

In contrast, when you run summary on an lm fit, you can explore the object and extract things like adjusted r-squared. E.g.,
fit <- lm(y ~ x, mydata)
sfit <- summary(fit)
sfit$ (tab)

This will show the elements that have been calculated. Depending on trade-offs for computation time, it might even be simpler if more of these relevant summary statistics were calculated with the fit, so that a user only has to fit the model and can then extract the relevant information with fit$ (tab).

Recommendation
  • Package authors should try to ensure that for every important bit of output in a print function, there should be a standard way of extracting that information into an object. For example, the summary method for lm returns the adjusted r-squared.
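
To make the lm case concrete, the following sketch uses the built-in mtcars data to pull stored values out of summary objects and combine them into a small table. The models are arbitrary.

```r
fit <- lm(mpg ~ wt, data = mtcars)
sfit <- summary(fit)

names(sfit)            # everything summary() computed and stored
sfit$adj.r.squared     # adjusted R-squared as a number
sfit$coefficients      # coefficient table as a matrix

# Combine pieces from two models into one small table
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
data.frame(
  model = c("wt", "wt + hp"),
  adj_r2 = c(sfit$adj.r.squared, summary(fit2)$adj.r.squared)
)
```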

Many different object exploration operators

There are many different operators for exploring objects:
  • $ (dollar) to extract named elements of a list (particularly used for output of statistical functions, variables in data.frames and general lists of things).
  • :: (double colon) to extract functions and other objects in a package (e.g., mypackage::foo())
  • ::: (triple colon) to extract hidden functions
  • @ (at symbol) to extract elements of S4 class objects
  • . (period) which is a notational rule relevant to understanding S3 methods (e.g., print.lm)
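
A short sketch showing each of these in use. The Person class is a made-up S4 class defined only for this illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)
fit$coefficients            # $   : named element of a list
stats::sd(mtcars$mpg)       # ::  : exported object from a package
f <- stats:::print.lm       # ::: : object that the package does not export
is.function(f)

# @ : slot of an S4 object (Person is a made-up class)
setClass("Person", representation(name = "character"))
p <- new("Person", name = "Ada")
p@name

# . : by convention, generic.class names an S3 method
is.function(getS3method("print", "lm"))
```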

Many rules for examining source code

Being able to see the source code is a nice feature in R. But equally, you need to know quite a bit to actually look at source code: for example, getAnywhere, double colon versus triple colon, and how to deal with compiled code.
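
A brief sketch of the different routes, using only base R functions:

```r
# Exported functions: type the name without parentheses
sd

# S3 methods that are not exported: look them up by name or by generic/class
getAnywhere("print.factanal")
src <- getS3method("print", "factanal")
body(src)

# Some functions are implemented in compiled code, so R shows only a stub
sum
is.primitive(sum)
```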