2.3 Packages and functions
2.3.1 Functions
R works primarily with functions, which take one or more inputs and return an output. Functions in R are always written in name() format, i.e. the name of the function followed by brackets. Each function generally serves a very specific purpose, such as performing a particular calculation or manipulating data. Therefore, working with R requires us to use lots and lots of functions.
Most functions will have at least one, if not multiple arguments. Arguments define the options for a given function. For example, the class() function, which comes in base R, tells us the type of a variable. class() has one main argument, x - which is simply the name of the variable we want to know about. As an example, let us say we wanted to know what type of variable var_a was. We could write the following:
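A minimal sketch of that call, assuming var_a holds a number (the value 3.14 is made up for illustration):

```r
# Hypothetical value so the example runs on its own
var_a <- 3.14

# Name the argument explicitly: x is the object we are asking about
class(x = var_a)
```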
## [1] "numeric"
More simply, because class() only has one argument we can optionally write:
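In sketch form, again assuming a made-up numeric var_a:

```r
# Hypothetical value so the example runs on its own
var_a <- 3.14

# The argument name is left out; R matches var_a to x by position
class(var_a)
```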
## [1] "numeric"
Functions may have mandatory arguments, optional arguments or both. Mandatory arguments are ones that you need to provide in order for the function to run. Optional arguments usually come with default values that can be changed if needed. Many functions in R will have multiple arguments that are typically a mix of both. A key point arises here though: functions in R define arguments in a specific order. In other words, R expects you to input arguments in that order unless you explicitly name each argument, as we did for the first instance of class().
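To illustrate the point about argument order, here is a small sketch using base R's seq(), which is picked purely as an example and is not otherwise covered in this section. Its first three arguments are from, to and by.

```r
# Unnamed arguments are matched by position: from = 1, to = 10, by = 3
by_position <- seq(1, 10, 3)
by_position  # 1 4 7 10

# Named arguments can be given in any order and mean the same thing
by_name <- seq(by = 3, to = 10, from = 1)
by_name      # 1 4 7 10

# The same numbers, unnamed and in a different order, mean something else:
# from = 3, to = 10, by = 1
seq(3, 10, 1)  # 3 4 5 6 7 8 9 10
```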
The easiest way to find out information about what a function does and the arguments it requires is to type a question mark, ?, with the name of the function immediately afterwards. e.g.
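For example, to look up class() (any function name would work in its place):

```r
# Open the help page for class()
?class

# The same lookup written as an ordinary function call
help("class")
```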
A basic function is the round() function, which - as the name suggests - rounds a value you give it. round() has two arguments: x, which is the number or the name of an object we want to round, and digits, which specifies the number of digits we want to round to. x is a mandatory argument, but digits is optional and has a preset value of 0. Therefore, if we type in the following you can see what we get:
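For instance, with a made-up value of 4.3333 (an assumption, chosen only to be consistent with the output shown):

```r
# digits is not supplied, so round() falls back on its default of 0
round(4.3333)
```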
## [1] 4
If we want to round to 2 decimal places, for instance, we would need to explicitly define the digits argument and set it to a different value.
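Continuing with the same made-up value of 4.3333 (an assumption, not necessarily the number used originally):

```r
# Override the default by naming the digits argument
round(4.3333, digits = 2)
```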
## [1] 4.33
2.3.2 Packages
By default, R comes with lots of functions, many of which will be used throughout this book. The beauty of R though is that its functionality is essentially limitless; its open-source nature and strong community mean that new functions and capabilities are regularly made for R. These new functions extend what R is capable of doing, and are generally available in the form of packages. Put very simply, a package is a collection of code that provides extra functions in R. Some packages occasionally come with data too, such as the palmerpenguins package.
When you start a new R session, the first thing that you’ll want to do is load the packages that you need to use.
For now, we’ll load three packages: tidyverse and rstatix for their functions, and palmerpenguins for its data. tidyverse is a huge package that contains a group of other packages designed for data manipulation, visualisation and cleaning. rstatix allows for simple statistical tests to be performed in an easy way. palmerpenguins comes with a dataset for practising on, named penguins, that contains basic info on 3 species of penguins across 3 different islands.
To load a package, call the library() function and enter the name of the package in brackets:
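A sketch of the loading step, assuming all three packages have already been installed on your machine:

```r
# Assumes the packages are installed; if not, run this once beforehand:
# install.packages(c("tidyverse", "rstatix", "palmerpenguins"))

library(tidyverse)
library(rstatix)
library(palmerpenguins)
```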
## ── Attaching core tidyverse packages ───────────
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'rstatix'
##
## The following object is masked from 'package:stats':
##
## filter
2.3.3 Tidyverse
The tidyverse is a mega-collection of phenomenal packages that fundamentally change how to interface with R. The tidyverse provides packages for things like:
- Wrangling data
- Reading and writing data
- Making graphs
- Tidying and reshaping data
- Scraping the web
The tidyverse is probably the single most popular suite of packages for R because of the functionality it provides. All of the tidyverse packages are written in consistent syntax, generally use very easy language that has an emphasis on verbs (i.e. you’re telling R to do something) and integrate seamlessly with each other and R. The tidyverse is a philosophy of R just as much as it is a suite of functions, and is part of what makes R so powerful today.
Many aspects of the tidyverse are reliant on the pipe operator, %>%. This basically tells R to take a data frame or other output and pass it on to the function that comes directly afterwards. Any function that takes a data frame as its first argument can (theoretically) be piped, meaning that we can chain a series of functions together in one run in a readable way. See the example below:
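A sketch of such a chain follows. The three functions and the data are hypothetical placeholders, defined here only so that the code runs; the avoid = "that" argument is assumed from the non-piped version shown further down. The pipe itself comes from the magrittr package and is also available after library(tidyverse).

```r
library(magrittr)  # supplies %>%

# Hypothetical placeholder functions: each takes a data frame first
# and simply hands it back unchanged
function_1 <- function(data) data
function_2 <- function(data, do) data
function_3 <- function(data, avoid) data

# Made-up data for illustration
data <- data.frame(x = 1:3)

# Each %>% passes the result on its left in as the first argument
# of the function on its right
data %>%
  function_1() %>%
  function_2(do = "this") %>%
  function_3(avoid = "that")
```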
In the hypothetical example above, we first take data, and pass it to function 1. We then take the output of function 1 and pass that to function 2, which has the argument do = "this". Afterwards, we take the output of function 2 and pass it to function 3.
There are a number of clear benefits to the tidyverse way of doing things. These include:
- Functions are generally stated as verbs, which means that you’re always doing something with a function (and it’s clear what that something is)
- Piping avoids the cyclical hell of creating intermediate variables. Consider a non-tidyverse version of the code example above, written down below. This code is not only a bit of a pain to read, but is also clunky in that it generates several intermediate variables that aren’t all that useful (a lot of the time).
output_1 <- function_1(data)
output_2 <- function_2(output_1, do = "this")
output_3 <- function_3(output_2, avoid = "that")
- Piped code is generally quite easy to read.
- tidyverse also provides a consistent syntax for other packages! It provides a quasi-philosophy and style guide for developers to write their own ‘tidy’ packages. rstatix is a great example of this.
Throughout this book you will see a lot of tidyverse!
2.3.4 The here package
Another staple bit of code you will see throughout RPMP is the here package. The official vignette for here summarises what this package does:
The here package enables easy file referencing by using the top-level directory of a file project to easily build file paths. This is in contrast to using setwd(), which is fragile and dependent on the way you order your files on your computer.
Alternatively, read the below quote from R goddess Jenny Bryan:
If the first line of your #rstats script is setwd("C:\Users\jenny\path\that\only\I\have"), I will come into your lab and SET YOUR COMPUTER ON FIRE.
In short, here() allows us to locate files in a relative manner as opposed to an absolute one. This is super super useful for sharing your code and data with other people, and ensuring that your scripts will run no matter where they are.
We won’t get too into project-oriented workflows for RPMP. However, imagine you have a folder structure like this:
|-code
|-----rpmp_week1.Rmd
|-data
|-----w1_dataset.csv
|-output
|-RPMP.rproj
Normally, to locate a file on disk you would have to give the entire path to that file. That could be something like "C:\Users\Dan\Documents\Subjects\RPMP\data\w1_dataset.csv" - which is immensely unwieldy if we want to read in data, and which won’t work the moment I give my script to someone else, as their folder structure could be completely different!
The alternative with here() could be as simple as:
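A minimal sketch, assuming the here package is installed and the script is run from inside the project folder shown above:

```r
library(here)

# Build the path from the project's top-level folder:
# first the data folder, then the file inside it
here("data", "w1_dataset.csv")
```

The full path that comes back will differ from computer to computer, but it should always end in data/w1_dataset.csv.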
This tells R that I’m looking for something here in the data folder, specifically a file named w1_dataset.csv. As long as the relative positions are correct - i.e. the .csv file is in the data folder - R will know where to locate the file.
You will see a lot of here() in this version of the RPMP guide because as you may appreciate, there are a lot of data files stored in all manner of folders. We will talk specifically about using this function to read in data on the next page!