How to Use the R dplyr Package

Contents

  1. Introduction
  2. Installation
  3. Loading the dplyr package with library()
  4. Pipes in dplyr
  5. The five core verbs of dplyr
    1. filter()
    2. select()
      • select() - dropping one column
      • select() - dropping two or more columns
    3. mutate()
      • mutate_if()
      • mutate_at()
    4. summarise()
    5. arrange()
  6. Other useful functions in the dplyr package
    1. group_by()
    2. left_join()
    3. right_join()
    4. full_join()
    5. inner_join()
  7. An exercise in data wrangling - how to make a grouped boxplot
    1. melt()
    2. Generating the grouped boxplot with ggplot2

Introduction

The dplyr package is one of the core packages of the wider tidyverse in R. Its functions, known as "verbs", are used to manipulate data into a format suitable for subsequent analysis.

Installation

Before using dplyr it is necessary to install it, since it is not a part of base R. The dplyr package is hosted in the CRAN repository. Any package from the CRAN repository can be installed using the function install.packages().

In the case of dplyr, we pass its name as an argument to install.packages() and run it.

In [ ]:
install.packages("dplyr")

Alternatively, we can install the tidyverse collection of packages, which includes dplyr.

In [ ]:
install.packages("tidyverse")

Tip:

The following chunk of code checks which of the packages you want to use are missing and installs only those.

In [ ]:
list_of_packages <- c("dplyr", "ggplot2")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

Loading the dplyr package with library()

Now that the dplyr package is installed, we need to load it into our R session with the library() function. If it is installed correctly, the message "Attaching package: ‘dplyr’" will appear.

In [2]:
library("dplyr")

Pipes in dplyr

When reading R code that uses functions from dplyr, you will almost certainly encounter this string of symbols: %>%. This is the pipe operator, which passes the output of one function to the input of the next. For example:

In [3]:
iris$Sepal.Length %>% mean()
5.84333333333333

Here we have used the %>% operator to pipe the Sepal.Length column into the function mean(). Writing code in this way provides for a natural and logical flow of operations.
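
Because the pipe passes its left-hand side as the first argument of the function on its right, a chain of pipes can replace nested function calls. The following is a minimal sketch, assuming dplyr is loaded; the variable names piped and nested are only illustrative.

```r
library(dplyr)

# Piped version: read from left to right
piped <- iris$Sepal.Length %>% mean() %>% round(2)

# Equivalent nested version: read from the inside out
nested <- round(mean(iris$Sepal.Length), 2)

# Both approaches give the same value
identical(piped, nested)
```

The piped form tends to stay readable as more steps are added, which is why it is used throughout this guide.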

Tip: In RStudio, press Ctrl + Shift + M (Cmd + Shift + M on macOS) to insert the %>% operator.

The five core verbs of dplyr

filter()

The filter() function of dplyr is used to extract rows, based on a specified condition. For example, in the iris dataset, we want to extract only the rows belonging to the "setosa" species.

In [4]:
iris_setosa <- iris %>% filter(Species == "setosa")
head(iris_setosa)
A data.frame: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

The result is a dataframe with rows belonging only to the "setosa" species.
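
filter() also accepts several conditions at once. The sketch below is an illustration rather than part of the original example: the threshold of 5 and the variable names setosa_long and not_setosa are assumptions chosen for demonstration.

```r
library(dplyr)

# Conditions separated by commas must all be TRUE (logical AND)
setosa_long <- iris %>% filter(Species == "setosa", Sepal.Length > 5)

# %in% keeps rows whose value matches any element of a vector
not_setosa <- iris %>% filter(Species %in% c("versicolor", "virginica"))

nrow(setosa_long)
nrow(not_setosa)
```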

select()

Just as the filter() function extracts rows, the select() function extracts columns from a dataframe. Here we select two columns by name, Petal.Length and Sepal.Length.

In [5]:
iris_sepal_petal_length <- iris %>% select(Petal.Length, Sepal.Length)
head(iris_sepal_petal_length)
A data.frame: 6 × 2
  Petal.Length Sepal.Length
         <dbl>        <dbl>
1          1.4          5.1
2          1.4          4.9
3          1.3          4.7
4          1.5          4.6
5          1.4          5.0
6          1.7          5.4

select() - dropping one column

The select() function can also be used to drop columns from a dataframe. Suppose we want a dataframe with only numerical values. In the case of the iris dataset, the solution is to drop the Species column. We can use R's logical NOT operator, the ! symbol. The following code can be read as: "From the iris dataset, select all columns that are not the Species column".

In [6]:
iris_numeric <- iris %>% select(!Species)
head(iris_numeric)
A data.frame: 6 × 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

Note that the same result can be achieved by listing every column we want to keep, but it is not as elegant.

In [7]:
iris_numeric <- iris %>% select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
head(iris_numeric)
A data.frame: 6 × 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

select() - dropping two or more columns

Here we use the same logic as with dropping one column, except that we apply the ! operator to a vector of the columns we want dropped. As a reminder, c() is a function that returns a vector. In this example we drop the Sepal.Length, Sepal.Width and Species columns.

In [8]:
iris_numeric <- iris %>% select(!c(Sepal.Length, Sepal.Width, Species))
head(iris_numeric)
A data.frame: 6 × 2
  Petal.Length Petal.Width
         <dbl>       <dbl>
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4
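
Besides naming columns one by one, select() understands helper functions that pick columns by pattern or by type. The sketch below assumes a recent version of dplyr (1.0.0 or later); the variable names iris_petal and iris_numeric_only are only illustrative.

```r
library(dplyr)

# Keep every column whose name starts with "Petal"
iris_petal <- iris %>% select(starts_with("Petal"))

# Keep every numeric column (for iris, the same as dropping Species)
iris_numeric_only <- iris %>% select(where(is.numeric))

names(iris_petal)
names(iris_numeric_only)
```

Selecting by type is handy when a dataframe has many columns and listing them by hand would be error-prone.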

mutate()

The mutate() function adds new columns to a dataframe, holding the results of operations on existing columns. For example, in the iris_sepal_petal_length dataframe created in the select() example, the lengths are given in centimeters, and we would like to add columns with the lengths in inches.

In [9]:
iris_sepal_petal_length_inch <- iris_sepal_petal_length %>% 
  mutate(Sepal.Length.inches = Sepal.Length/2.54,
         Petal.Length.inches = Petal.Length/2.54)

head(iris_sepal_petal_length_inch)
A data.frame: 6 × 4
  Petal.Length Sepal.Length Sepal.Length.inches Petal.Length.inches
         <dbl>        <dbl>               <dbl>               <dbl>
1          1.4          5.1            2.007874           0.5511811
2          1.4          4.9            1.929134           0.5511811
3          1.3          4.7            1.850394           0.5118110
4          1.5          4.6            1.811024           0.5905512
5          1.4          5.0            1.968504           0.5511811
6          1.7          5.4            2.125984           0.6692913
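
Within a single mutate() call, later expressions can use columns created earlier in the same call. A small sketch of this behaviour follows; the column names length.ratio and length.ratio.rounded are hypothetical and chosen only for this illustration.

```r
library(dplyr)

iris_ratio <- iris %>%
  select(Sepal.Length, Petal.Length) %>%
  mutate(length.ratio = Sepal.Length / Petal.Length,      # new column
         length.ratio.rounded = round(length.ratio, 1))   # uses the column created one line above

head(iris_ratio)
```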

mutate_if()

The mutate_if() function checks whether a column meets a certain condition before applying a transformation to it. In the iris dataset, numerical values are stored as doubles (numbers with a decimal part). Now imagine we want to round every value in the iris dataset to a whole number. Let's first try applying round() to the whole dataframe.

In [ ]:
round(iris)

Error in Math.data.frame(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, : non-numeric variable(s) in data frame: Species
Traceback:

  1. Math.data.frame(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, . 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7,

Oh no, we have an error. The round() function cannot handle the non-numeric Species column. We could drop this column as we showed with select(), but instead we can use mutate_if() to check whether a column is numeric before trying to change it.

In [11]:
iris_int <- iris %>% mutate_if(is.double, round)
head(iris_int)
A data.frame: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1            5           4            1           0  setosa
2            5           3            1           0  setosa
3            5           3            1           0  setosa
4            5           3            2           0  setosa
5            5           4            1           0  setosa
6            5           4            2           0  setosa

mutate_at()

We see in the previous example that the Petal.Width column rounds down to zero. Let's round all the other numerical columns but keep the Petal.Width column unchanged. The mutate_at() function allows us to specify exactly which columns to apply an operation to.

In [12]:
iris_int <- iris %>% mutate_at(c("Sepal.Length", "Sepal.Width", "Petal.Length"), round)
head(iris_int)
A data.frame: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1            5           4            1         0.2  setosa
2            5           3            1         0.2  setosa
3            5           3            1         0.2  setosa
4            5           3            2         0.2  setosa
5            5           4            1         0.2  setosa
6            5           4            2         0.4  setosa
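
In dplyr 1.0.0 and later, mutate_if() and mutate_at() still work but are superseded by across(), which is used inside an ordinary mutate() call. The sketch below shows what the two examples above might look like in that style; the variable names iris_round_all and iris_round_some are only illustrative.

```r
library(dplyr)

# Counterpart of mutate_if(is.double, round)
iris_round_all <- iris %>% mutate(across(where(is.double), round))

# Counterpart of mutate_at(c("Sepal.Length", "Sepal.Width", "Petal.Length"), round)
iris_round_some <- iris %>%
  mutate(across(c(Sepal.Length, Sepal.Width, Petal.Length), round))

head(iris_round_some)
```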

summarise()

The summarise() function computes summary statistics from a dataframe. For example, we want the means of the petal and sepal lengths from our iris_sepal_petal_length dataframe.

In [13]:
iris_sepal_petal_length %>% 
  summarise(mean.Sepal.Length = mean(Sepal.Length),
            mean.Petal.Length = mean(Petal.Length))
A data.frame: 1 × 2
mean.Sepal.Length mean.Petal.Length
            <dbl>             <dbl>
         5.843333             3.758
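
summarise() is not limited to means: any function that reduces a column to a single value can be used, and n() counts the rows. The following sketch illustrates this; the column names and the variable sepal_summary are assumptions made for the example.

```r
library(dplyr)

sepal_summary <- iris %>%
  summarise(n = n(),                               # number of rows
            mean.Sepal.Length = mean(Sepal.Length),
            sd.Sepal.Length = sd(Sepal.Length),    # standard deviation
            max.Sepal.Length = max(Sepal.Length))

sepal_summary
```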

arrange()

The arrange() function sorts the rows of a dataframe based on the values in one (or more) of its columns. By default it sorts from the lowest to the highest value.

In [14]:
arranged_iris <- iris_sepal_petal_length %>% arrange(Sepal.Length)
head(arranged_iris)
A data.frame: 6 × 2
  Petal.Length Sepal.Length
         <dbl>        <dbl>
1          1.1          4.3
2          1.4          4.4
3          1.3          4.4
4          1.3          4.4
5          1.3          4.5
6          1.5          4.6

We could also arrange rows based on values in two or more columns. Rows are sorted by the first column, and ties are broken by the next one.

In [15]:
arranged_iris2 <- iris_sepal_petal_length %>% arrange(Sepal.Length, Petal.Length)
head(arranged_iris2)
A data.frame: 6 × 2
  Petal.Length Sepal.Length
         <dbl>        <dbl>
1          1.1          4.3
2          1.3          4.4
3          1.3          4.4
4          1.4          4.4
5          1.3          4.5
6          1.0          4.6

To arrange rows in descending order we can use the desc() function from the dplyr package.

In [16]:
arranged_iris3 <- iris_sepal_petal_length %>% arrange(desc(Sepal.Length))
head(arranged_iris3)
A data.frame: 6 × 2
  Petal.Length Sepal.Length
         <dbl>        <dbl>
1          6.4          7.9
2          6.7          7.7
3          6.9          7.7
4          6.7          7.7
5          6.1          7.7
6          6.6          7.6
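
Ascending and descending order can be mixed in one arrange() call. As a sketch (the variable name arranged_mixed is only illustrative), the following sorts by species first and then puts the longest sepals at the top within each species.

```r
library(dplyr)

# Species in factor-level order, then Sepal.Length from highest to lowest
arranged_mixed <- iris %>% arrange(Species, desc(Sepal.Length))

head(arranged_mixed)
```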

Other useful functions in the dplyr package

group_by()

Sometimes you want certain operations performed on groups within your dataset. Previously we used summarise() to get the column means across all of the iris data. Now we would like the means for each species. To do this, we first group the data by the Species column.

In [17]:
iris %>% 
  group_by(Species) %>% 
  summarise(mean.Sepal.Length = mean(Sepal.Length),
            mean.Petal.Length = mean(Petal.Length))
A tibble: 3 × 3
Species    mean.Sepal.Length mean.Petal.Length
<fct>                  <dbl>             <dbl>
setosa                 5.006             1.462
versicolor             5.936             4.260
virginica              6.588             5.552

Compare this result with the result of the summarise() function in section 5.4, summarise(). Note that grouping does not change how your data looks, only how it is interpreted by other functions.
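
One way to see that grouping only adds metadata is to inspect a grouped dataframe directly. The sketch below is illustrative; the variable names iris_grouped and iris_ungrouped are assumptions.

```r
library(dplyr)

iris_grouped <- iris %>% group_by(Species)

# Same number of rows and columns as the original data
dim(iris_grouped)

# The grouping variable is stored alongside the data
group_vars(iris_grouped)

# n() counts the rows in each group
iris_grouped %>% summarise(n = n())

# ungroup() removes the grouping when it is no longer needed
iris_ungrouped <- iris_grouped %>% ungroup()
group_vars(iris_ungrouped)
```

Calling ungroup() after a grouped operation helps avoid later verbs being applied per group by accident.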

left_join()

The left_join() function joins two dataframes based on matching values in a column they have in common. It returns all rows from the left dataframe and all columns from both dataframes. Rows in the left dataframe with no match in the right one get NA (missing) values in the new columns. As an example, we can look at two dataframes that ship with dplyr, band_members and band_instruments.

In [18]:
band_members
A tibble: 3 × 2
name  band
<chr> <chr>
Mick  Stones
John  Beatles
Paul  Beatles
In [19]:
band_instruments
A tibble: 3 × 2
name  plays
<chr> <chr>
John  guitar
Paul  bass
Keith guitar

We see that both dataframes have the name column in common, and it is by this column that we will join them.

In [20]:
#left dataframe is given priority
band_members %>% left_join(band_instruments)
Joining, by = "name"

A tibble: 3 × 3
name  band    plays
<chr> <chr>   <chr>
Mick  Stones  NA
John  Beatles guitar
Paul  Beatles bass

Notice that Mick has NA in the plays column, because he does not have a match in the right dataframe.
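
The "Joining, by" message appears because dplyr had to guess the key column. Naming it with the by argument makes the join explicit and silences the message. A minimal sketch, where the variable name members_instruments is only illustrative:

```r
library(dplyr)

# Name the key column explicitly instead of letting dplyr guess it
members_instruments <- band_members %>%
  left_join(band_instruments, by = "name")

members_instruments
```

Being explicit is a sensible habit when dataframes share more than one column name, since dplyr would otherwise join on all of them.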

right_join()

The right_join() function works similarly to left_join() (section 6.2), except that the right dataframe is given priority: all of its rows are kept, and rows with no match in the left dataframe have NA values in the columns that come from the left dataframe.

In [21]:
band_members %>% right_join(band_instruments)
Joining, by = "name"

A tibble: 3 × 3
name  band    plays
<chr> <chr>   <chr>
John  Beatles guitar
Paul  Beatles bass
Keith NA      guitar

full_join()

The full_join() function returns all rows and columns from both dataframes. Where a row has no match in the other dataframe, the missing values are filled with NA.

In [ ]:
band_members %>% full_join(band_instruments)

inner_join()

The inner_join() function returns only the rows that have a match in both dataframes, with the columns from both, and drops all rows without a match.

In [22]:
band_members %>% inner_join(band_instruments)
Joining, by = "name"

A tibble: 2 × 3
name  band    plays
<chr> <chr>   <chr>
John  Beatles guitar
Paul  Beatles bass
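
A quick way to compare the four joins is to count the rows each one returns for the same pair of dataframes. The sketch below does this for band_members and band_instruments; the variable name join_sizes is an assumption made for the example.

```r
library(dplyr)

# Number of rows returned by each type of join on the "name" column
join_sizes <- c(
  left  = nrow(left_join(band_members, band_instruments, by = "name")),
  right = nrow(right_join(band_members, band_instruments, by = "name")),
  full  = nrow(full_join(band_members, band_instruments, by = "name")),
  inner = nrow(inner_join(band_members, band_instruments, by = "name"))
)

join_sizes
```

The full join returns the most rows because it keeps unmatched rows from both sides, while the inner join returns the fewest because it keeps only the matches.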

An exercise in data wrangling - how to make a grouped boxplot

We will use the iris dataset for this example.

In [23]:
head(iris)
A data.frame: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Here we see a dataframe in what is called a wide format: every observation, in this case an individual iris plant, has its measurements in its own row, and every variable has its own column. In order to make a grouped boxplot we need to change this dataframe into a long format.

melt()

We can use the melt() function to convert the iris dataframe into a long format. In the long format, each observation has as many rows as there are measured variables, and each row holds the value of one variable for that observation. The melt() function is part of the reshape2 package, so we will first load it.

In [24]:
library(reshape2)
In [25]:
iris_long <- melt(iris)
head(iris_long)
Using Species as id variables

A data.frame: 6 × 3
  Species     variable value
    <fct>        <fct> <dbl>
1  setosa Sepal.Length   5.1
2  setosa Sepal.Length   4.9
3  setosa Sepal.Length   4.7
4  setosa Sepal.Length   4.6
5  setosa Sepal.Length   5.0
6  setosa Sepal.Length   5.4
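
The "Using Species as id variables" message shows that melt() guessed which column identifies each observation. The identifier and the names of the new columns can also be given explicitly. In the sketch below, the names measurement and cm and the variable iris_long_named are hypothetical choices for illustration.

```r
library(reshape2)

iris_long_named <- melt(iris,
                        id.vars = "Species",            # column kept as the identifier
                        variable.name = "measurement",  # holds the former column names
                        value.name = "cm")              # holds the measured values

# 150 plants x 4 measurements = 600 rows
dim(iris_long_named)
head(iris_long_named)
```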

Generating the grouped boxplot with ggplot2

The ggplot2 package contains powerful tools for visualizing data in R, and will be covered in depth in a subsequent guide. For now we will just showcase its capabilities.

In [26]:
library(ggplot2)
ggplot(iris_long, aes(x = variable, y = value, fill = Species)) +
  geom_boxplot()
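
The default axis labels come straight from the column names produced by melt(). As a sketch of one possible refinement, labs() can supply more readable labels; the label texts and the variable name grouped_boxplot below are assumptions made for this illustration.

```r
library(reshape2)
library(ggplot2)

# Rebuild the long dataframe so that this example is self-contained
iris_long <- melt(iris, id.vars = "Species")

grouped_boxplot <- ggplot(iris_long, aes(x = variable, y = value, fill = Species)) +
  geom_boxplot() +
  labs(x = "Measurement", y = "Value (cm)", fill = "Species",
       title = "Iris measurements by species")

grouped_boxplot
```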