3.2 Counts and central tendencies

Once we understand what our data looks like, we can then move to describing the general properties of the data. Such general properties are called descriptive statistics. Reporting descriptive statistics is crucial for many aspects of quantitative research.

3.2.1 Basic features

There are a couple of basic features of any dataset that should be looked at and noted:

| Name (APA Symbol) | Definition | When to report? |
|---|---|---|
| Count (n) | The number of data points. | The number of participants should always be reported - not just for the sample as a whole, but for each analysis done. |
| Range | In the context of writing up statistics, this is usually the minimum and maximum values. | Reporting these values is often useful as a range when writing up demographic variables, e.g. age or years of training. |
| Percentages | | Use primarily for categorical data, e.g. sex or groups. |

We can use R to find some of these values, using either base R or tidyverse functions. For this page/module only, I will use the variable_a mock variable from Section 2.2.2 with a minor amendment:

vector_a <- c(4, 1, 6, 2, 3, 4)
vector_a
## [1] 4 1 6 2 3 4

For tidyverse usage, I’ll refer to df_a, which is just the same but as a one-column dataframe:

df_a
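
The construction of df_a is not shown on this page. As a minimal sketch, assuming the single column is named column_a (the name used in the summarise() calls in this section), it could be built from vector_a with base R. Using data.frame() rather than a tibble is my own assumption here.

```r
# Recreate the example vector so this block runs on its own
vector_a <- c(4, 1, 6, 2, 3, 4)

# Assumption: df_a is a one-column dataframe whose column is called column_a
df_a <- data.frame(column_a = vector_a)

# Inspect the result: six rows, one column
df_a
```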

To find the count, or the number of items in a vector, we can use the length() function.

length(vector_a)
## [1] 6

To find the minimum and maximum, we can use the min() and max() functions respectively. Specifying na.rm = TRUE will remove any missing data before calculation.

min(vector_a, na.rm = TRUE)
## [1] 1
max(vector_a, na.rm = TRUE)
## [1] 6
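
Because vector_a has no missing values, na.rm = TRUE makes no visible difference above. The sketch below uses a hypothetical vector, vector_na, invented purely for illustration, to show what changes when a missing value is present. It also shows base R's range(), which returns the minimum and maximum together.

```r
vector_a <- c(4, 1, 6, 2, 3, 4)

# Hypothetical vector with one missing value (not from the original data)
vector_na <- c(4, 1, NA, 2)

min(vector_na)                # NA: the missing value propagates
min(vector_na, na.rm = TRUE)  # 1: the NA is dropped before calculation

length(vector_na)             # 4: length() counts the NA as well
sum(!is.na(vector_na))        # 3: number of non-missing values

range(vector_a)               # 1 6: minimum and maximum in one call
```

When reporting n for an analysis, the non-missing count is usually the one that matches the data actually analysed.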

In tidyverse fashion, we can wrap this all in summarise() as follows:

df_a %>%
  summarise(
    n = n(),
    min = min(column_a, na.rm = TRUE),
    max = max(column_a, na.rm = TRUE)
  )

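Note here that rather than using length(), we use a function called n(). This function takes no arguments and returns the number of rows in the current group (here, the whole dataframe), so it plays the role that length() plays for a vector. It only works inside dplyr verbs such as summarise() and mutate().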
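
The table at the start of this section lists percentages, which the examples so far do not cover. One base R approach is to count each category with table() and convert the counts with prop.table(). The group variable below is hypothetical and invented for illustration only.

```r
# Hypothetical categorical variable (not from the original data)
group <- c("control", "treatment", "treatment", "control", "treatment")

# Count how many data points fall in each category
counts <- table(group)

# Convert counts to proportions, then to percentages
percentages <- prop.table(counts) * 100

counts
round(percentages, 1)
```

Reporting the counts alongside the percentages lets readers see the group sizes behind each figure.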

3.2.2 Central tendencies

While the range can be informative in some situations, it usually isn’t enough to draw deeper interpretations from raw data. One key way of describing data is in terms of central tendency - or where the ‘average’ value approximately is. There are three main types of central tendency, summarized in the table below.

| Name (APA Symbol) | Definition | When to use? | Things to note |
|---|---|---|---|
| Mean (M) | The sum of all values, divided by the number of data points. | Use if data is normally distributed. | Can be influenced by outliers, so generally unsuitable when data is skewed. |
| Median (Mdn) | The ‘middle’ data point, when sorted in order. | Use for skewed data, or for ordinal data. | Generally is less preferable to the mean, except for use in skewed/ordinal data. |
| Mode | The most frequent value. | Use for nominal data. | Unsuitable for most other types of data. |

To calculate a mean and median, use the mean() and median() functions respectively. Both functions also take the na.rm argument.

mean(vector_a, na.rm = TRUE)
## [1] 3.333333
median(vector_a, na.rm = TRUE)
## [1] 3.5
# Using summarise() and piping

df_a %>%
  summarise(
    mean = mean(column_a, na.rm = TRUE),
    median = median(column_a, na.rm = TRUE)
  )
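
To connect these functions back to the definitions in the table, the sketch below works out the mean and median of vector_a by hand, then adds one extreme value to show how differently the two respond. The value 100 and the name vector_outlier are hypothetical, chosen only for illustration.

```r
vector_a <- c(4, 1, 6, 2, 3, 4)

# Mean by hand: sum of all values divided by the number of data points
mean_by_hand <- sum(vector_a) / length(vector_a)   # 20 / 6 = 3.333333

# Median by hand: with an even number of values, average the two middle ones
sorted_a <- sort(vector_a)                         # 1 2 3 4 4 6
median_by_hand <- (sorted_a[3] + sorted_a[4]) / 2  # (3 + 4) / 2 = 3.5

# Add a single hypothetical outlier and compare
vector_outlier <- c(vector_a, 100)
mean(vector_outlier)    # 120 / 7 = 17.14286, pulled far above most values
median(vector_outlier)  # 4, barely moved from 3.5
```

This is the behaviour the table describes: the mean shifts substantially with one outlier, while the median stays close to the bulk of the data.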

Interestingly, R doesn’t offer a base function to calculate a statistical mode (the base mode() function instead reports how an object is stored). If you need this information, you either need to manually work this out or turn to a package that offers it. One such example is the fantastic DescTools package, which provides a function called Mode():

DescTools::Mode(vector_a)
## [1] 4
## attr(,"freq")
## [1] 2

The first number is the value of the mode, while the second number is the number of times the mode occurs (twice, in this case).
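
For readers who prefer to work the mode out manually, one possible base R approach is to tabulate the values and keep whichever occur most often. The helper name manual_mode() is hypothetical: it is not part of base R or DescTools, and this is only a sketch of one way to do it.

```r
vector_a <- c(4, 1, 6, 2, 3, 4)

# Hypothetical helper: returns every value tied for the highest frequency,
# so data with more than one mode yields more than one value.
manual_mode <- function(x) {
  x <- x[!is.na(x)]                # drop missing values first
  freq_table <- table(x)           # frequency of each distinct value
  top <- as.vector(freq_table) == max(freq_table)
  modes <- names(freq_table)[top]  # table() stores the values as names
  # Convert back to numbers when the input was numeric
  if (is.numeric(x)) as.numeric(modes) else modes
}

manual_mode(vector_a)                    # 4
manual_mode(c("a", "b", "b", "c", "c"))  # "b" "c": two modes
```

Returning all tied values is a design choice made here for illustration. If you need a single value, you would have to decide how to break ties.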