10.5 Partial correlations

10.5.1 Introduction

Recall that a correlation coefficient quantifies the strength of the relationship between two variables. That is a standard (zero-order) Pearson’s correlation coefficient, which is ubiquitous in just about any statistical analysis involving continuous variables.

There are many more types of correlation coefficient out there, some of which we talk about in this subject and others which we won’t touch. Here we’ll look at two extensions of Pearson’s correlation coefficient in particular: partial correlations and semipartial correlations. Both measures are useful specifically in regression contexts. This page covers partial correlations only; we move on to semipartial correlations on the next page.

10.5.2 Partial correlations

A partial correlation quantifies the relationship between variables X and Y, while controlling for variable Z. In essence, you can imagine that the effect of Z is ‘partialled out’ - or removed - from the correlation between X and Y. This is useful in situations just like the one described above - when you want to control for the effect of a certain variable when calculating a correlation coefficient.
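One way to see what ‘partialling out’ means is to remove the linear effect of Z from both X and Y with simple regressions, and then correlate what is left over (the residuals). The sketch below uses simulated data, not the flow dataset; the variable names, sample size and coefficients are all made up for illustration.

```r
# Simulated data (hypothetical): z influences both x and y
set.seed(1)
n <- 500
z <- rnorm(n)
x <- 0.6 * z + rnorm(n)
y <- 0.5 * z + 0.3 * x + rnorm(n)

# Remove the linear effect of z from each variable
x_resid <- resid(lm(x ~ z))
y_resid <- resid(lm(y ~ z))

# Zero-order correlation, and the correlation with z partialled out
r_zero_order <- cor(x, y)
r_partial <- cor(x_resid, y_resid)

r_zero_order
r_partial
```

Because z feeds into both x and y in this simulation, the correlation between the residuals should come out smaller than the zero-order correlation.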

The formula for a partial correlation is:

\[ r_{xy.z} = \frac{r_{xy} - (r_{xz} \times r_{yz})}{\sqrt{(1-r^2_{xz})(1-{r^2_{yz}})}} \]

That is, you need to know the standard correlation between variables X and Y ($r_{xy}$), the correlation between X and Z ($r_{xz}$) and the correlation between Y and Z ($r_{yz}$), and then plug them into the formula above.
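As a quick sketch, the formula can be written as a small R function. The function name `partial_r()` and the three correlation values below are hypothetical, chosen only to show the arithmetic; they are not taken from the flow data.

```r
# Partial correlation between x and y, controlling for z,
# computed from the three zero-order correlations
partial_r <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Hypothetical correlations, for illustration only
example_partial <- partial_r(r_xy = 0.50, r_xz = 0.30, r_yz = 0.40)
example_partial
# approximately 0.43
```

Here the numerator is 0.50 - (0.30 × 0.40) = 0.38 and the denominator is the square root of 0.91 × 0.84, so the partial correlation is somewhat smaller than the zero-order correlation of .50.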

Let’s come back to the flow data from Week 10 once again. Imagine we want to test the correlation between Gold-MSI scores and flow proneness. However, we might suspect that openness to experience plays a role in the relationship between these variables - in other words, we expect openness to experience to affect both Gold-MSI scores and flow proneness. One thing we could do is calculate a partial correlation between Gold-MSI scores and flow proneness while controlling for openness.

If we know the correlation between:

MSI scores and flow proneness,
MSI scores and openness,
Flow proneness and openness,

Then we can calculate partial correlations between MSI scores and flow proneness, controlling for openness.

In R, we can calculate both partial and semi-partial correlations using the ppcor package. For demonstration’s sake, we will filter the dataset so it only contains the three variables we are interested in here - Gold-MSI scores, openness and flow proneness - though it is not necessary to do this.

To calculate a partial correlation with a significance test, we can use the pcor.test() function. The function needs three arguments:

x: The first variable to correlate.
y: The second variable to correlate.
z: The control variable.

The output of the call below looks similar enough to a standard correlation output, and you can interpret it as such. It shows the relationship between X (Gold-MSI) and Y (DFS Total, flow proneness), while controlling for the effect of openness to experience.

library(ppcor)
pcor.test(
  x = flow_data$GoldMSI,
  y = flow_data$DFS_Total,
  z = flow_data$openness
)

Compare this to the standard correlation below - we can see that even after controlling for openness the correlation is still significant, and only drops slightly, from r = .46 to a partial r of .42 (partial r(808) = .42, p < .001).

# Standard correlation
cor.test(
  x = flow_data$GoldMSI,
  y = flow_data$DFS_Total
)

## 
##  Pearson's product-moment correlation
## 
## data:  flow_data$GoldMSI and flow_data$DFS_Total
## t = 14.743, df = 809, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4041563 0.5127911
## sample estimates:
##       cor 
## 0.4601945

To run several partial correlations at once, the pcor() function will compute them given a dataset. It calculates the partial correlation between each pair of variables, controlling for every other variable included in the call to pcor(). In the example below, we use select() to pull only the variables we are interested in and pipe this to pcor() - basically, we are giving pcor() a data frame with only the three variables of interest⁵.

# The pipe (%>%) needs dplyr or tidyverse to be loaded
flow_data %>%
  dplyr::select(GoldMSI, openness, DFS_Total) %>%  # keep only the three variables
  pcor()

## $estimate
##             GoldMSI  openness DFS_Total
## GoldMSI   1.0000000 0.2154726 0.4207818
## openness  0.2154726 1.0000000 0.1181387
## DFS_Total 0.4207818 0.1181387 1.0000000
## 
## $p.value
##                GoldMSI     openness    DFS_Total
## GoldMSI   0.000000e+00 5.794901e-10 4.274538e-36
## openness  5.794901e-10 0.000000e+00 7.546911e-04
## DFS_Total 4.274538e-36 7.546911e-04 0.000000e+00
## 
## $statistic
##             GoldMSI openness DFS_Total
## GoldMSI    0.000000 6.272218 13.184932
## openness   6.272218 0.000000  3.381816
## DFS_Total 13.184932 3.381816  0.000000
## 
## $n
## [1] 811
## 
## $gp
## [1] 1
## 
## $method
## [1] "pearson"

To read this output, look at the $estimate part for the partial correlations and the $p.value part for the corresponding p-values. The partial correlation between Gold-MSI scores and flow proneness is where the row and column for DFS_Total and GoldMSI meet (.42).
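The $statistic, $n and $gp parts of the output fit together as well. As far as I understand ppcor, the test statistic is a t value with n - 2 - gp degrees of freedom, where gp is the number of control variables. The sketch below recomputes it by hand from the numbers printed in the pcor() output above, using base R only; treat it as a check on the logic rather than as the package's own code.

```r
# Values copied from the pcor() output above
r  <- 0.4207818  # $estimate for GoldMSI and DFS_Total
n  <- 811        # $n: sample size
gp <- 1          # $gp: number of control variables (openness)

# Degrees of freedom for the partial correlation
df <- n - 2 - gp

# t statistic and two-tailed p-value
t_stat  <- r * sqrt(df / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df = df)

df       # 808
t_stat   # about 13.18, in line with $statistic
p_value
```

Each control variable costs one degree of freedom relative to the standard correlation, which has n - 2 = 809.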

Note that ppcor loads a package called MASS, which is used for a number of advanced statistical functions. MASS has its own select() function, which masks the tidyverse select() if MASS/ppcor is loaded after tidyverse; that is, any select() call you make might actually be MASS::select() (which won’t work for what you probably want to do) instead of dplyr::select() (which is the column selector). The solution is to either load tidyverse after everything else (if you’re sure this won’t break some other package’s function), or make explicit calls to dplyr using dplyr::select(), which is what I’ve done here.