3.3 Variability

The other important part of describing data is how spread out it is. Is our data tightly bunched together, or is it very spread out? Knowing this helps us understand where most of our data falls, as well as what it looks like.

3.3.1 The variability of data

The other key way of describing data is its spread, or how its values are distributed. The way data is distributed can give key insights into how that data should be treated.

Consider the graphs below.

You can see that all three graphs peak at around the same point, but look very different outside of that. The orange line is narrow, while the red line is considerably more spread out. Even though they share a peak, the graphs have very different spreads, or distributions.

We saw on the last page that we can quantify how far values are spread apart by finding the range. However, this isn’t always a good idea - two datasets with the exact same range can look wildly different. Therefore, we need other ways of quantifying how spread out our data is.
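To make the point about range concrete, here is a small sketch using two made-up vectors (`tight` and `wide` are invented for illustration, not data from this book). Both have the same minimum, maximum and mean, but their values sit in very different places.

```r
# Two made-up datasets with the same minimum (0), maximum (100) and mean (50)
tight <- c(0, 49, 50, 50, 50, 50, 51, 100)
wide  <- c(0, 10, 25, 40, 60, 75, 90, 100)

# The range is identical for both: 100
range_tight <- diff(range(tight))
range_wide  <- diff(range(wide))

# Proportion of values that fall within 5 units of the mean (50)
near_tight <- mean(abs(tight - 50) <= 5)  # 0.75
near_wide  <- mean(abs(wide - 50) <= 5)   # 0
```

The range alone cannot tell these two datasets apart, even though most of `tight` is bunched around 50 while none of `wide` is.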

3.3.2 Standard deviation

Standard deviation (\(\sigma\), or SD) describes how spread out our data is within our sample, expressed in the same units as the data itself. Data that is spread out widely (like the red curve above) will have a large standard deviation; likewise, data that has a narrow spread will have a small standard deviation. We’ll touch on this a bit more in the following pages, but for now just remember what a standard deviation is for.

To calculate standard deviation, we first calculate variance, which is another measure of spread:

\[ Variance = \frac{\Sigma (x_i - \bar{x} )^2}{n - 1} \]

Or, in human terms:

Take each data point (\(x_i\))
Subtract the mean (\(\bar{x}\), x with the bar) from each data point and square that difference
Add all of the squared differences together
Divide by \(n - 1\), where \(n\) is the number of data points

And then to calculate standard deviation, we simply take the square root of the variance.

\[ SD = \sqrt{Variance} \]

Or, in full formula form:

\[ SD = \sqrt{\frac{\Sigma (x_i - \bar{x} )^2}{n - 1}} \]
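If it helps to see the formula as code, the sketch below walks through the same steps by hand and then checks the result against R's built-in `var()` and `sd()`. The vector `x` is made-up example data, and the intermediate variable names are just for illustration.

```r
# Made-up example data (not from the text)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n          <- length(x)        # number of data points
deviations <- x - mean(x)      # subtract the mean from each data point
squared    <- deviations^2     # square each difference
sum_sq     <- sum(squared)     # add them all up

variance_manual <- sum_sq / (n - 1)       # divide by n - 1
sd_manual       <- sqrt(variance_manual)  # square root of the variance

# Both comparisons return TRUE: the manual steps match the built-in functions
all.equal(variance_manual, var(x))
all.equal(sd_manual, sd(x))
```

In practice you would just call `sd()` directly; the manual version is only there to show what the function is doing.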

Standard deviations (SD) should be reported alongside means when results are written up (consult an APA guide).

To calculate standard deviations in R, use the sd() function. Once again, this has an na.rm argument that you can set to TRUE to ignore missing values.

sd(vector_a, na.rm = TRUE)

## [1] 1.75119

# Tidyverse form (assumes dplyr is loaded, e.g. via library(tidyverse))
df_a %>%
  summarise(
    sd = sd(column_a, na.rm = TRUE)
  )
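To see why `na.rm` matters, here is a minimal sketch with a made-up vector (`heights` is hypothetical, not one of the objects used above) that contains a single missing value.

```r
# Made-up vector with one missing value
heights <- c(160, 172, NA, 181, 168)

# Without na.rm, the missing value makes the result NA
sd_with_na <- sd(heights)

# With na.rm = TRUE, the NA is dropped before the SD is calculated
sd_without_na <- sd(heights, na.rm = TRUE)

is.na(sd_with_na)     # TRUE
is.na(sd_without_na)  # FALSE
```

If your output is unexpectedly `NA`, a missing value in the data is a likely culprit.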

3.3.3 Standard error, and the SDoTM

Imagine that I have a population of 100 regular people (shown on the left). I take a sample of 10 people, measure their heights and then calculate the mean height of that one sample. I then repeat this process over and over again, and plot where each sample’s mean falls. Because every sample is slightly different, the mean of each sample will be slightly different too; this is sampling error. Some sample means will be lower than the true population mean, while some will be higher. Eventually, we might end up with something like this:

The distribution of these sample means is called the sampling distribution of the mean (SDoTM), shown on the right. This gives us a sense of where the population mean (the parameter that we are interested in) might lie. With enough samples, the peak of this sampling distribution of the mean will settle around the population mean. As you can see in our hypothetical example, the peak of the sampling distribution of the mean sits pretty close to the original population mean, meaning our estimate is pretty good.
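The repeated-sampling idea can be sketched as a short simulation. Everything here is an assumption for illustration: the population is simulated rather than real, and the mean of 170 cm, SD of 10 cm, seed and 5000 repetitions are arbitrary choices.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 100 simulated heights in cm (not real data)
population <- rnorm(100, mean = 170, sd = 10)

# Draw 10 people at random, record the sample mean, and repeat 5000 times
sample_means <- replicate(5000, mean(sample(population, size = 10)))

pop_mean    <- mean(population)    # the true population mean
centre_sdtm <- mean(sample_means)  # centre of the sampling distribution

# Individual sample means vary, but their centre sits close to pop_mean
range(sample_means)
c(population = pop_mean, sampling_distribution = centre_sdtm)
```

Running `hist(sample_means)` afterwards would draw the sampling distribution of the mean for this simulated population.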

The standard error of the mean (standard error; SE) is another measure of variability - this time, it is the spread of sample means across the sampling distribution of the mean (in other words, the standard deviation of that distribution). This tells us how far a sample mean is likely to fall from the population mean. If our sampling distribution is wide, our standard error will be large, and that means that we won’t have a very precise estimate of the population mean. However, if we have a small standard error, our sample mean is likely to be close to the population mean.

Standard error is calculated using the formula below:

\[ SE = \frac{SD}{\sqrt n} \]

Where SD = standard deviation, and n = sample size.
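As far as I know, base R does not have a dedicated function for the standard error of the mean, so one option is to write a small helper yourself. The function name `standard_error()` and the `scores` vector below are hypothetical, made up for this sketch.

```r
# Hypothetical helper: standard error of the mean, SE = SD / sqrt(n)
standard_error <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]  # optionally drop missing values first
  sd(x) / sqrt(length(x))       # n is the number of values used
}

# Made-up data with n = 9, so sqrt(n) = 3
scores <- c(12, 15, 11, 18, 14, 16, 13, 17, 15)

se_scores <- standard_error(scores)

# Matches applying the formula directly
all.equal(se_scores, sd(scores) / 3)
```

Because \(n\) sits under a square root in the denominator, larger samples give smaller standard errors for the same standard deviation.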

Practice: You have a dataset of 400 people. You know that the mean of the DV is 760, with a standard deviation of 40. Calculate the standard error for this sample.