8.3 The regression equation

Now that we have a sense of how to place our line of best fit, what does that line actually tell us about the relationship between our predictor and outcome? Here, we will take a closer look at what the slope of the line indicates.

8.3.1 Slopes and intercepts

Coming back to our gestation versus birth weight example once again, let’s now estimate a line of best fit through the data using our program of choice:
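To give a rough idea of what this step looks like in code, here is a minimal sketch in Python (the file name babies.csv and the column names gestation and birth_weight are hypothetical placeholders - substitute whatever your own data use):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data (hypothetical file and column names)
babies = pd.read_csv("babies.csv")
x = babies["gestation"]       # gestation length in weeks (predictor)
y = babies["birth_weight"]    # birth weight (outcome)

# Estimate the line of best fit (slope and intercept) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Plot the raw data with the fitted line overlaid
plt.scatter(x, y)
plt.plot(x, intercept + slope * x, color="red")
plt.xlabel("Gestation (weeks)")
plt.ylabel("Birth weight")
plt.show()
```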

As we can see, there is clearly a positive relationship between gestation and birth weight - as we expected from eyeballing the data. But recall that we’re interested in seeing whether our predictor (gestation) significantly predicts the outcome (birth weight). How do we tell if this is the case? The slope of the line of best fit tells us this, because it quantifies how much the outcome changes with each unit increase in the predictor.

An intuitive way to think about it is this: if the slope here was steep, it would tell us that every extra week of gestation would lead to a fairly noticeable increase in birth weight. If the slope was instead very gradual or flat, every extra week of gestation would lead to barely any change in birth weight - meaning that gestation would not predict birth weight very well.

8.3.2 The regression equation

If you think back to high school, you may have learned something like y = mx + c in algebra to describe a straight line, where:

  • m denotes the slope, and
  • c denotes the intercept.

In linear regression, we use the same concepts to both describe our line and make predictions using it (more on that later). The key difference is that we change the letters a bit:

\[ y = \beta_0 + \beta_1x + \epsilon \]

Let’s break this down:

  • y is our outcome value - the observed value of the dependent variable. The line of best fit itself (i.e. our predicted value) is the \(\beta_0 + \beta_1x\) part of the equation
  • \(\beta_0\) is our intercept (the c in y = mx + c)
  • \(x\) simply refers to our independent variable
  • \(\beta_1\) is the slope for the independent variable (the m in y = mx + c) - in other words, how much y increases for every unit increase of x. We also call this B (see the worked example just after this list)
  • \(\epsilon\) is error (i.e. our residuals) - the gap between each observed y and the predicted value on the line of best fit. This is essentially random variation (due to sampling), and it should be normally distributed.
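
To make the equation a little more concrete, here is a purely illustrative example (these numbers are made up, not estimates from the gestation data). Suppose the fitted intercept were \(\beta_0 = -5\) and the slope were \(\beta_1 = 0.3\), with gestation measured in weeks and birth weight in pounds. The predicted birth weight for a baby born at 40 weeks would then be:

\[ \hat{y} = \beta_0 + \beta_1x = -5 + (0.3 \times 40) = 7 \text{ pounds} \]

If that particular baby actually weighed 7.4 pounds, the leftover 0.4 pounds would be its residual (\(\epsilon\)).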

Keep this in mind for now - we’ll come back to this later in the module!

8.3.3 Hypothesis testing in regressions

When we conduct a linear regression, we are partly testing whether the two variables are correlated. More specifically, though, we’re testing whether our predictor significantly predicts our outcome, so our statistical hypotheses are formulated around this idea. Our null and alternative hypotheses are therefore about the slope (\(\beta_1\)):

  • \(H_0: \beta_1 = 0\), i.e. the slope is 0

  • \(H_1: \beta_1 \neq 0\), i.e. the slope is not equal to 0

Consider the two graphs below. On the left is a graph where the line of best fit has a slope of 0 (the null). No matter what value X is, the value of Y is always the same (2.5 in this example) - in other words, X does not predict Y. The graph on the right side, on the other hand, is an example of the alternative hypothesis in this scenario. Here, X does clearly predict Y - as X increases, Y increases as well.

But how do we actually test this? The answer is something we’ve seen before - we do a t-test!

We came across t-tests in the context of comparing two means against each other. The logic here is exactly the same, except now we compare two slopes with each other - the slope we actually observe (B) against the slope under the null hypothesis, which would be 0. We can use this logic to calculate a t-statistic using the formula below:

\[ t = \frac{B}{SE_B} \]

Where B = the observed slope and \(SE_B\) = the standard error of B. This is essentially the same formula as for a t-test, except that the numerator is just B (because it is B - 0). We can do this for each predictor, and then compare the result against a t-distribution (just as we did for t-tests) to test whether the slope is significant - in other words, whether our predictor (IV) significantly predicts our outcome (DV).
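
As a rough sketch of what this looks like in practice (again using the hypothetical babies.csv file and column names from earlier), we can get B, \(SE_B\), and the resulting t and p values for the slope in Python like this:

```python
import pandas as pd
from scipy import stats

# Hypothetical file and column names - substitute your own data
babies = pd.read_csv("babies.csv")
x = babies["gestation"]
y = babies["birth_weight"]

# linregress gives the observed slope (B) and its standard error (SE_B)
result = stats.linregress(x, y)

# t-statistic for the slope: t = B / SE_B
t = result.slope / result.stderr

# Degrees of freedom for a simple regression are n - 2
df = len(x) - 2
p = 2 * stats.t.sf(abs(t), df)   # two-tailed p; matches result.pvalue

print(f"B = {result.slope:.3f}, SE_B = {result.stderr:.3f}")
print(f"t({df}) = {t:.2f}, p = {p:.4f}")
```

Dedicated regression functions (and point-and-click packages) report these same t and p values for each predictor in their output tables, so you will rarely need to compute them by hand.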