2.5 Wrangling data with dplyr and others

dplyr is a package within the tidyverse for manipulating and wrangling data. It is one of the most popular packages in R because it provides a suite of functions that are essential for working with data. Below is a brief overview of some of these functions, applied to the penguins dataset.

2.5.1 Selecting columns with select()

select() lets you select the columns you want from a dataset. Simply specify the columns that you want by name. Below we take the species and island columns from the penguins dataset:

penguins %>%
  select(species, island)

You can select columns in a number of ways:

  • Simply by name, e.g. select(species, island)
  • By their index or column number, e.g. select(1, 2)
    • select(1:4) will select columns 1 to 4
    • select(-1) will drop the first column and keep all the others; to select the last column, use select(last_col())
  • By selection helper functions, such as starts_with() and ends_with(), e.g. select(ends_with("mm")) will select all columns whose names end with “mm”

Combinations of the above also work. To remove columns, add a minus sign - in front of an argument to select(); this is compatible with all of the options above, e.g. select(-species) or select(-ends_with("mm")).
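To make the selection options concrete, here is a minimal sketch. It uses a small made-up data frame called toy (a hypothetical stand-in that borrows a few penguins column names) so that it runs without any extra data. Note that a negative index drops a column rather than counting from the end.

```r
library(dplyr)

# Hypothetical stand-in for penguins: same column names, made-up values
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  island = c("Torgersen", "Biscoe"),
  bill_length_mm = c(39.1, 46.1),
  body_mass_g = c(3750, 5000),
  year = c(2007, 2008)
)

first_two <- toy %>% select(1:2)              # columns 1 to 2
no_first  <- toy %>% select(-1)               # everything except column 1
last_only <- toy %>% select(last_col())       # only the last column
mm_only   <- toy %>% select(ends_with("mm"))  # names ending in "mm"
no_mm     <- toy %>% select(-ends_with("mm")) # everything except those

names(no_first)
names(last_only)
```

The same calls should work unchanged on the real penguins dataset, which simply has more columns.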

2.5.2 Filtering rows with filter()

filter() selects the rows that you want based on a certain condition. Here, we specify the column that we want to filter by and state the condition (== means "is equal to"). We can filter on multiple conditions, separated by commas, in which case only rows that meet all of them are kept; for example, keeping only Adelie penguins recorded in 2007:

penguins %>%
  filter(year == 2007, species == "Adelie")

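You can also filter out rows that meet a condition by adding an exclamation mark in front of it, which negates the whole condition. The example below keeps every row where the year is not 2007 (filter(year != 2007) is an equivalent spelling).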

penguins %>%
  filter(!year == 2007)
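As a small sketch of how conditions behave, the block below uses a made-up data frame called toy (hypothetical values, not the real penguins data). It checks that the two ways of negating a condition return the same rows, and also shows two base R operators not covered above: | for "or" and %in% for matching any of several values.

```r
library(dplyr)

# Hypothetical stand-in for penguins: made-up values
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  year = c(2007, 2008, 2007, 2009)
)

# Two equivalent ways to keep rows where year is not 2007
not_2007_a <- toy %>% filter(!year == 2007)
not_2007_b <- toy %>% filter(year != 2007)
all.equal(not_2007_a, not_2007_b)

# Rows that meet either condition
either <- toy %>% filter(species == "Gentoo" | year == 2009)

# Rows whose species matches any value in a set
several <- toy %>% filter(species %in% c("Adelie", "Gentoo"))
```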

drop_na() is another useful starting function. It comes from tidyr (another tidyverse package, loaded with library(tidyverse)) rather than dplyr, and it removes rows that contain NA (missing) values. If you call it with no arguments, it removes every row that has an NA in any column; if you specify a column, it only removes rows with an NA in that column. Below is an example of a pipe using these functions - going from selecting columns to filtering rows and finally dropping the rows with missing values.

penguins %>%
  select(species, body_mass_g, year) %>%
  filter(year == 2009) %>%
  drop_na()
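The difference between calling drop_na() with and without a column can be sketched as follows. The toy data frame is made up for illustration, and the block assumes tidyr is installed alongside dplyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for penguins: made-up values with some NAs
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3750, NA, 5000),
  sex = c("male", NA, NA)
)

# No arguments: drop rows with an NA in any column
all_complete <- toy %>% drop_na()

# One column named: drop rows only where that column is NA
mass_complete <- toy %>% drop_na(body_mass_g)

nrow(all_complete)
nrow(mass_complete)
```

Naming a column is usually the gentler option, since it keeps rows that are only missing values you do not need.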

2.5.3 Creating new columns with mutate()

mutate() is a function that lets you create new columns. This can be extremely useful for operations like recoding variables and transforming them. The nice thing about mutate() is that it returns a new data frame, so you can do all manner of operations without touching your original data (unless you assign the result back to it).

The basic workflow of mutate() looks like this:

data %>%
  mutate(
    new_column_1 = a_function(...),
    new_column_2 = another_function(...)
    )

This example code would create two new columns, named new_column_1 and new_column_2, with their respective values being whatever the functions return. For example, in the penguins dataset we have a variable called body_mass_g, which is the body mass of each penguin in grams. If we wanted to convert this to kilograms, we would need to divide each penguin’s value on this variable by 1000. mutate() makes this a piece of cake. Let’s also chain a select() command to only show the following variables: species, island, body_mass_g and sex.

penguins %>%
  select(species, island, body_mass_g, sex) %>%
  mutate(body_mass_kg = body_mass_g/1000)

You can see now that we have a new column called body_mass_kg that has our new transformed variable.

mutate() can also use functions to compute new columns, and this will likely make up the majority of your use of it. For example, let’s say that we want to make a variable that takes the natural logarithm of bill length (for whatever reason). We could do this as follows using the log() function within our mutate() call:

penguins %>%
  select(species, island, bill_length_mm, bill_depth_mm) %>%
  mutate(bill_length_log = log(bill_length_mm))
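Recoding, mentioned at the start of this section, can be sketched in the same way. The example below uses a made-up data frame called toy and an arbitrary 4 kg cutoff (both are assumptions for illustration only). It also shows that a column created earlier in a mutate() call can be used later in the same call, and that the original data frame is left unchanged.

```r
library(dplyr)

# Hypothetical stand-in for penguins: made-up values
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 5000, 3500)
)

recoded <- toy %>%
  mutate(
    body_mass_kg = body_mass_g / 1000,
    # Arbitrary illustrative cutoff; uses the column created just above
    size = if_else(body_mass_kg > 4, "heavy", "light")
  )

names(toy)      # original columns only
names(recoded)  # original columns plus the two new ones
```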

2.5.4 Summarising data with summarise() and group_by()

Finally, sometimes we will want to summarise data - for example, to calculate basic features such as descriptives or for plotting. To do that, we can use the function summarise() (or summarize() for American users).

summarise() works very similarly to mutate().

data %>%
  summarise(
    summary_1 = a_function(...),
    summary_2 = another_function(...)
    )

The difference is that while mutate() keeps every row of your data and adds columns to it, summarise() will instead collapse the data down to a row of summary values. To illustrate, let’s say we want to calculate a) how many penguins there are (with the function n(), which counts rows) and b) the mean body mass (with the mean() function).

penguins %>%
  summarise(
    n_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE)
  )

This is… good and all, but consider what we’ve just done. We’ve just calculated the number of penguins and the mean body mass across the entire dataset. However, that may not necessarily be meaningful, particularly in this instance where we have meaningful groups within the data. For example, the above mean collapses across years, which may not be appropriate.

Enter another function called group_by(). As the name implies, group_by() splits the data into groups based on a variable that you specify, so that subsequent operations are performed per group. group_by() works especially well with summarise(), because the idea is something like this:

data %>%
  group_by(variable) %>%                 # Tell R to group the subsequent output by this variable
  summarise(
    summary_1 = a_function(...),
    summary_2 = another_function(...)
    ) %>%
  ungroup()         # Tell R grouping is no longer needed

Let’s put this into practice by calculating the n and mean per year. Notice how the output now calculates n and the mean body mass per year, which is much more informative!

penguins %>%
  group_by(year) %>%
  summarise(
    n_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE)
  ) %>%
  ungroup()
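Grouping also makes the earlier contrast between mutate() and summarise() easy to see. In this minimal sketch (using a made-up data frame called toy, not the real penguins data), summarise() returns one row per group, whereas mutate() keeps every row and repeats the group mean alongside it.

```r
library(dplyr)

# Hypothetical stand-in for penguins: made-up values
toy <- data.frame(
  year = c(2007, 2007, 2008, 2008),
  body_mass_g = c(3000, 4000, 5000, NA)
)

# summarise(): collapses to one row per year
per_year <- toy %>%
  group_by(year) %>%
  summarise(
    n_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE)
  )

# mutate(): keeps all four rows and adds the yearly mean to each
with_means <- toy %>%
  group_by(year) %>%
  mutate(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
  ungroup()

nrow(per_year)
nrow(with_means)
```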

Naturally, group_by() can group by multiple variables. This is easy to do as well:

penguins %>%
  group_by(year, island) %>%
  summarise(
    n_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'year'. You
## can override using the `.groups` argument.

Suddenly this is much more informative - we can now do calculations/operations per year and island, which provides a lot more nuance.
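The message printed above appears because, with more than one grouping variable, summarise() by default drops only the last one (island) and leaves the result grouped by the rest (year). One way to be explicit is the .groups argument of summarise(); setting it to "drop" removes all grouping, so the message goes away and a trailing ungroup() is no longer needed. A minimal sketch with a made-up data frame called toy:

```r
library(dplyr)

# Hypothetical stand-in for penguins: made-up values
toy <- data.frame(
  year = c(2007, 2007, 2008, 2008),
  island = c("Biscoe", "Dream", "Biscoe", "Biscoe"),
  body_mass_g = c(3000, 4000, 5000, NA)
)

by_year_island <- toy %>%
  group_by(year, island) %>%
  summarise(
    n_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop"   # return an ungrouped result
  )

group_vars(by_year_island)  # no grouping variables remain
```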

2.5.5 Some other handy tidyverse functions

As stated at the start of this chapter, this book will only cover enough R functions to provide you with an understanding of what goes on in this book and how. Nonetheless, there are many tidyverse functions out there that are worth exploring and knowing about. Below is a brief list of some other functions from dplyr you may wish to keep in mind. To find out what a given function does in more detail, type ? followed by the function's name into the R console to open its documentation (e.g. ?arrange), or ?? followed by a search term to do a broad search (e.g. ??arrange).

Note that all of these are dplyr functions, so they are piped from a dataset as usual.

  • arrange() will sort your rows by a variable you specify. For example, if you wanted to sort the penguins dataset by island, you could use arrange(island) (or arrange(desc(island)) for descending order).
  • distinct() will give you the unique rows of a dataset, or the unique values of the columns you name. distinct(island), for instance, will give you a one-column table with each unique island name.
  • rename() will let you rename columns.
  • relocate() will let you rearrange the column order.
  • The slice() family of functions (e.g. slice_head() and slice_tail()) will subset rows, but mainly based on positions (e.g. first, last) rather than conditions.
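Each of the functions in this list can be tried out on a small example. The sketch below uses a made-up data frame called toy (hypothetical values, not the real penguins data); the new column name mass_g is likewise just an illustration.

```r
library(dplyr)

# Hypothetical stand-in for penguins: made-up values
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Biscoe", "Dream"),
  body_mass_g = c(3750, 5000, 3400, 3500)
)

sorted    <- toy %>% arrange(desc(body_mass_g))             # heaviest first
islands   <- toy %>% distinct(island)                       # unique island names
renamed   <- toy %>% rename(mass_g = body_mass_g)           # new_name = old_name
moved     <- toy %>% relocate(body_mass_g, .before = species)
first_two <- toy %>% slice_head(n = 2)                      # first two rows
```

Note that rename() takes the new name on the left and the old name on the right, which is easy to get backwards.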