Jeromy Anglim's Blog: Psychology and Statistics


Sunday, July 17, 2011

Correlation Resources: SPSS, R, Causality, Interpretation, and APA Style Reporting

This post provides links to a range of resources on the use and interpretation of correlations. I wanted a single page of additional resources that would be useful both for my students who are keen to learn more and for anyone else who is interested. Specifically, this post provides links to: (a) introductory book-style chapters on correlation, (b) resources related to assorted issues in correlation (i.e., discussion of causal inference, correlation with various variable types, range restriction, statistical power, correlation interpretation, and significance testing), (c) tutorials on computing correlations using SPSS and R, and (d) tips for reporting correlations in APA Style.

Introductions to correlation

The following provide general textbook-style overviews of correlation:

Assorted Issues

Correlation and Causation

Knowing how to reason about causality in the behavioural and social sciences is a really important skill.

Types of variables

The prototypical correlation example is based on two continuous, normally distributed variables. However, in practice there are many other types of variables that you might wish to correlate. The following pages provide suggestions for how to analyse some other common scenarios:

Range restriction

Statistical Power

Statistical power within the context of correlation is the probability of obtaining a statistically significant correlation in a study, given that a true (non-zero) correlation exists in the population.

  • This earlier post provides (a) some simple rules of thumb for power analysis for correlations, (b) instructions for calculating statistical power using free software called G*Power, and (c) links to additional reading on the important topic of statistical power.
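To make the definition concrete, the power of a two-tailed test of a correlation can be approximated using the Fisher z transformation. The following is a minimal sketch in Python (chosen only for illustration; it is not a tool discussed in this post). The function name `correlation_power` is hypothetical, and the result is a normal approximation rather than the exact calculation that dedicated power-analysis software performs.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate power of a two-tailed test of H0: rho = 0.

    Assumes the Fisher z approximation: atanh(sample r) is roughly
    normal with mean atanh(rho) and standard error 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    shift = abs(atanh(r)) * sqrt(n - 3)     # expected value of the test statistic
    # Probability that the statistic falls beyond either critical value.
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)


# Power to detect a population correlation of .30 at several sample sizes.
for n in (30, 85, 200):
    print(f"r = .30, n = {n}: approximate power = {correlation_power(0.30, n):.2f}")
```

With this approximation, a population correlation of .30 needs a sample of roughly 85 to reach the conventional 80% power, and power rises with both the sample size and the size of the true correlation.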

Interpretation

When I first learnt about the correlation coefficient, I found it challenging to truly grok what a particular value meant. Learning the standard interpretation was easy. The challenging part was understanding the practical and theoretical implications for a correlation of a given size.

  • The following are some of the standard interpretations of a correlation:

    • Pearson's correlation is an index of the direction and strength of linear association between two variables.
    • The square of the correlation between X and Y is the proportion of variance shared between X and Y (e.g., if r = .50, then the two variables share .50 * .50 = .25, or 25%, of their variance).
    • If X and Y were standardised (i.e., rescaled so that both variables had a mean of zero and a standard deviation of one), then the correlation would be the same as the regression coefficient of X predicting Y or Y predicting X. Thus, for example, if r = .25 you could say that "a value one standard deviation greater on X predicts a value .25 standard deviations greater on Y".
  • Strategies for building an intuition of what a correlation means:

    • Play with the Regression by Eye simulation. The simulation generates a scatterplot, and you are asked to indicate which of a set of correlations corresponds to the scatterplot. It helps to build a mapping between the graphical intuitiveness of a scatterplot and the numeric summary of the linear association in the scatterplot (i.e., the correlation coefficient).
    • Memorise some of the rules of thumb for describing correlation effect sizes (see this discussion by Andy Field), but don't take the rules of thumb too seriously.
    • Try to build up a frame of reference for correlations in different contexts by reading results sections. Meta-analyses can also be particularly useful in this regard.
    • Read the article 'Meyer, G. J., et al. (2001). Psychological Testing and Psychological Assessment: A Review of Evidence and Issues. American Psychologist, 56(2), 128-165.' (PDF), which provides large tables of meta-analytic correlations for a wide range of medical and psychological domains sorted by the size of the correlation. Studying these tables can help build an intuition and a context for interpretation of correlations.
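The standard interpretations listed above can also be checked numerically. The following is a minimal sketch in Python (chosen only for illustration), using made-up data and hypothetical helper names (`pearson_r`, `standardise`, `ols_slope`). It computes r, squares it to get the shared variance, and confirms that the regression slope between the standardised variables equals r in both directions.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardise(values):
    """Rescale values to have a mean of zero and a standard deviation of one."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def ols_slope(x, y):
    """Least-squares slope for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up data for illustration only.
hours = [2, 4, 5, 7, 8, 10, 11, 13]
score = [50, 58, 52, 65, 61, 70, 64, 78]

r = pearson_r(hours, score)
zx, zy = standardise(hours), standardise(score)

print(f"r = {r:.3f}")
print(f"shared variance (r squared) = {r ** 2:.3f}")
print(f"standardised slope, X predicting Y = {ols_slope(zx, zy):.3f}")
print(f"standardised slope, Y predicting X = {ols_slope(zy, zx):.3f}")
```

On the raw (unstandardised) variables the two regression slopes generally differ from each other and from r; it is only after standardising that all three coincide.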

Graphical approaches

As with most statistical techniques, there are various ways of representing the data. The correlation coefficient provides a very brief summary of the association between two variables. However, graphical representations of association are much richer.

The following are some general heuristics that I find useful when plotting data that might also be represented as a correlation:

  • Use scatterplots to explore features of the association (e.g., presence of outliers, linearity, distributional properties, spread of data around any trend line, etc.);
  • If one of the variables is positively skewed, consider plotting the corresponding axis on a log scale;
  • If there are a lot of data points (e.g., n > 1000), adopt a different strategy such as using some form of partial transparency (e.g., see use of the alpha property in ggplot2), or sampling the data;
  • If one of the variables takes on a limited number of discrete categories, consider using a jitter or a sunflower plot;
  • If there are three or more variables, consider using a scatterplot matrix;
  • Fitting some form of trend line is often useful;
  • Adjust the size of the plotting character to the sample size (for bigger n, use a smaller plotting character).
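Several of these heuristics amount to preparing the data before handing it to whatever plotting tool you use. The following is a rough sketch in Python (chosen only for illustration) with simulated data and hypothetical helper names (`jitter`, `subsample`, `log_scale`). It does not draw anything; it only shows the jittering, sampling, and log-scaling steps, and the default of 1000 points simply mirrors the n > 1000 example above.

```python
import math
import random


def jitter(values, amount=0.2, rng=None):
    """Add uniform noise in [-amount, amount] so overlapping points separate."""
    rng = rng or random.Random()
    return [v + rng.uniform(-amount, amount) for v in values]


def subsample(pairs, max_points=1000, rng=None):
    """Return at most max_points randomly chosen (x, y) pairs."""
    rng = rng or random.Random()
    pairs = list(pairs)
    if len(pairs) <= max_points:
        return pairs
    return rng.sample(pairs, max_points)


def log_scale(values):
    """Log10-transform strictly positive values (e.g., a positively skewed variable)."""
    return [math.log10(v) for v in values]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Simulated data for illustration only.
ratings = [rng.randint(1, 5) for _ in range(5000)]         # discrete 1-5 scale
income = [rng.lognormvariate(10, 1) for _ in range(5000)]  # positively skewed

points = subsample(zip(jitter(ratings, rng=rng), log_scale(income)), rng=rng)
print(len(points), "points ready to plot")
```

Jitter is for display only: any correlation should still be computed on the original, unjittered values.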

Significance tests on correlations

There is a wide range of possible significance tests that can be performed on correlations. The following links provide some suggestions and links for different scenarios.
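As an illustration of two common scenarios, the sketch below tests a single correlation against zero and compares correlations from two independent samples. It is written in Python (chosen only for illustration), the function names are hypothetical, and both tests rely on the Fisher z normal approximation. Statistical packages typically report a t test for a single correlation, so p values will differ slightly in small samples.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

NORMAL = NormalDist()


def fisher_z_test(r, n, confidence=0.95):
    """Approximate test of H0: rho = 0, with a confidence interval for rho.

    Returns (z statistic, two-tailed p value, (lower, upper)).
    """
    se = 1 / sqrt(n - 3)          # standard error on the Fisher z scale
    z_r = atanh(r)                # Fisher z transform of r
    z_stat = z_r / se
    p_value = 2 * (1 - NORMAL.cdf(abs(z_stat)))
    z_crit = NORMAL.inv_cdf(1 - (1 - confidence) / 2)
    # Build the interval on the z scale, then transform back to the r scale.
    ci = (tanh(z_r - z_crit * se), tanh(z_r + z_crit * se))
    return z_stat, p_value, ci


def compare_independent_correlations(r1, n1, r2, n2):
    """Approximate test of H0: rho1 = rho2 for two independent samples.

    Returns (z statistic, two-tailed p value).
    """
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z_stat = (atanh(r1) - atanh(r2)) / se
    p_value = 2 * (1 - NORMAL.cdf(abs(z_stat)))
    return z_stat, p_value


z, p, (lower, upper) = fisher_z_test(0.30, 100)
print(f"r = .30, n = 100: z = {z:.2f}, p = {p:.4f}, 95% CI [{lower:.2f}, {upper:.2f}]")

z_diff, p_diff = compare_independent_correlations(0.50, 80, 0.20, 90)
print(f"r1 = .50 vs r2 = .20: z = {z_diff:.2f}, p = {p_diff:.4f}")
```

Other scenarios, such as comparing correlations measured on the same sample, need different tests because the correlations are not independent.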

Statistical Software

Calculating a correlation coefficient and its associated statistical significance is a standard task that almost any statistical package can perform. Many psychology students are taught to use SPSS. It is a proprietary (i.e., you can't run it at home without a paid licence) data analysis system with a strong emphasis on a GUI and on making it easy to perform various standardised analyses common in the social sciences.

My preferred tool for performing data analysis is R. It is open source (thus, you can run it at home for free) and is often described as the lingua franca of statistics. It generally requires a more sophisticated understanding of statistics and computing to use effectively. Thus, for the interested psychology student or researcher I have this introduction to R for researchers in psychology.

Below I list resources for performing correlation analysis in SPSS and R.

SPSS

R

R makes it easy to perform correlations on datasets. Specifically, the following links provide example syntax:

Reporting Correlations in APA Style