8.2 Regression hypotheses
Let’s move on from correlations to regressions, where we test whether one variable can predict another. To do that, let’s start by considering what it is we actually hypothesise in doing a linear regression.
8.2.1 Terminology
Let’s kick off the regression portion of this module with a bit of terminology. Linear regression involves fitting a line of best fit to the data, and using this line to make predictions.

- The line is called the line of best fit. The slope of the line is, well, the slope: it describes how much Y changes if X changes by one unit.
- The point at which the line crosses the y-axis is called the intercept (the y-intercept in full). The intercept is one of the two parameters of a regression line (the other being the slope).
If you think back to high school, you may have learned something like y = mx + c or y = ax + b in algebra to describe a straight line. In linear regression, we use the same concepts to both describe our line and make predictions using it (more on that later). The key difference is that we change the letters a bit:
\[ y = \beta_0 + \beta_1x + \epsilon_i \]
Let’s break this down:
- \(y\) is our outcome - the value we are trying to predict. The predicted values themselves (i.e. the line of best fit) are given by \(\beta_0 + \beta_1x\)
- \(\beta_0\) is our intercept (the c in y = mx + c)
- \(x\) simply refers to our independent variable
- \(\beta_1\) is the slope for the independent variable - in other words, how much y changes for every one-unit increase in x. We also call this B
- \(\epsilon_i\) is the error term - essentially the random variation in y that our line doesn’t capture (e.g. due to sampling). This error is assumed to be normally distributed.
Keep this in mind for now - we’ll come back to this later in the module!
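To make this a little more concrete, here’s a minimal sketch of what fitting a line of best fit can look like in code. Python with NumPy and SciPy is just an illustrative choice here (not necessarily the software used elsewhere in this module), and the "true" intercept and slope of 1.5 and 0.8 are made up for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate data that follow y = beta_0 + beta_1*x + error,
# with a made-up "true" intercept (beta_0) of 1.5 and slope (beta_1) of 0.8
x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(0, 1, size=100)

# Fit the line of best fit
res = stats.linregress(x, y)

print(res.intercept)  # estimate of beta_0 - should be close to 1.5
print(res.slope)      # estimate of beta_1 - should be close to 0.8
```

Notice that the estimates won’t be exactly 1.5 and 0.8 - the error term \(\epsilon_i\) adds random noise, and that randomness is exactly why we need hypothesis tests, which we turn to next.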
8.2.2 Hypothesis testing in regressions
When we conduct a linear regression, in part we’re testing if the two variables are correlated. However, we’re also testing whether we can predict our DV from our IV, so our statistical hypotheses are formulated around this idea. Our null and alternative hypotheses are therefore about the slope (\(\beta_1\)):
\(H_0: \beta_1 = 0\), i.e. the slope is 0
\(H_1: \beta_1 \neq 0\), i.e. the slope is not equal to 0
Consider the two graphs below. On the left is a graph where the line of best fit has a slope of 0 (the null hypothesis). No matter what value X takes, the value of Y is always the same (2.5 in this example) - in other words, X does not predict Y. The graph on the right, on the other hand, is an example of the alternative hypothesis in this scenario. Here, X clearly does predict Y - as X increases, Y increases as well.
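If you’d like to see these two scenarios numerically rather than graphically, here’s a small simulated sketch (again in Python with NumPy/SciPy as an illustrative choice; the values 2.5 and 1.0 are made up for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=50)

# Null-style data: Y does not depend on X at all (true slope = 0, Y hovers around 2.5)
y_null = 2.5 + rng.normal(0, 0.3, size=50)

# Alternative-style data: Y increases with X (true slope = 1)
y_alt = 2.5 + 1.0 * x + rng.normal(0, 0.3, size=50)

print(stats.linregress(x, y_null).slope)  # estimated slope close to 0
print(stats.linregress(x, y_alt).slope)   # estimated slope clearly above 0
```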
But how do we actually test this? The answer is something we’ve seen before - we do a t-test!
We came across t-tests in the context of comparing two means against each other. The logic here is exactly the same, except now we compare two slopes: the slope we actually observe (B) minus the slope under the null hypothesis (0). We can use this logic to calculate a t-statistic using the formula below:
\[ t = \frac{B}{SE_B} \]
Where B is the observed slope and \(SE_B\) is the standard error of B. This is the same form as the t-test formula, except that the numerator is simply B (because B - 0 = B). We can do this for each predictor, and then use the same t-distribution as before (with n - 2 degrees of freedom in a simple regression) to test whether this slope is significant - in other words, whether our predictor (IV) significantly predicts our outcome (DV).
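To see this formula in action, here’s a small sketch using simulated data (again Python/SciPy as an illustrative choice, not the module’s own materials). It calculates \(t = \frac{B}{SE_B}\) by hand and compares it with the p-value the library reports for the slope:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=40)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=40)

res = stats.linregress(x, y)

# t = B / SE_B: the observed slope divided by its standard error
t_by_hand = res.slope / res.stderr

# Two-tailed p-value from a t-distribution with n - 2 degrees of freedom
p_by_hand = 2 * stats.t.sf(abs(t_by_hand), df=len(x) - 2)

print(t_by_hand, p_by_hand)
print(res.pvalue)  # linregress's own p-value for the slope - should match p_by_hand
```

If this t-statistic falls far enough into the tails of the t-distribution (i.e. the p-value is below our alpha level), we reject the null hypothesis that the slope is 0.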