Contents
- Introduction
- Installation
- Loading the dplyr package with library()
- Pipes in dplyr
- The five core verbs of dplyr
- filter()
- select()
- select() - dropping one column
- select() - dropping two or more columns
- mutate()
- mutate_if()
- mutate_at()
- summarise()
- arrange()
- Other useful functions in the dplyr package
- group_by()
- left_join()
- right_join()
- full_join()
- inner_join()
- An exercise in data wrangling - how to make a grouped boxplot
- melt()
- Generating the grouped boxplot with ggplot2
Introduction
The dplyr package is the fundamental package of the wider tidyverse in R. Functions of the dplyr package, in this particular package known as "verbs", are used to manipulate data into a suitable format for subsequent data analysis.
Installation
Before using dplyr it is necessary to install it, since it is not a part of base R. The dplyr package is hosted in the CRAN repository. Any package from the CRAN repository can be installed using the function install.packages()
.
In the case of dplyr, we will pass it as an argument for install.packages()
and run it.
{r, eval=FALSE}
install.packages("dplyr")
Alternatively, we can install the tidyverse collection of packages, which will also include dplyr.
install.packages("tidyverse")
Tip:
This is a useful chunk of code to make sure all the packages you want to use are installed if they are not already installed.
list_of_packages <- c("dplyr", "ggplot2")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
library("dplyr")
iris$Sepal.Length %>% mean()
Here we have used the %>%
operator to pipe the Sepal.Length
column into the function mean()
. Writing code in this way provides for a natural and logical flow of operations.
Tip:
In order to instantly type the %>%
operator, press down Ctrl + Shift + M keys simultaneously.
The five core verbs of dplyr
filter()
The filter()
function of dplyr is used to extract rows, based on a specified condition. For example, in the iris dataset, we want to extract only the rows belonging to the "setosa" species.
iris_setosa <- iris %>% filter(Species == "setosa")
head(iris_setosa)
The result is a dataframe with rows belonging only to the "setosa" species.
select()
Just as the filter()
function extracts rows, the select()
function extracts columns from a dataframe based on specified condition. Here we can extract columns based on name, the Sepal.Length
and Petal.Length
columns.
iris_sepal_petal_length <- iris %>% select(Petal.Length, Sepal.Length)
head(iris_sepal_petal_length)
select() - dropping one column
The select()
function can also be used to drop columns from a dataframe. Maybe we would like to have a dataframe with only numerical values. In the case of the iris dataset, the solution would be to drop the species column. We can use the logical NOT operator in R, the !
symbol. The following code can be read as follows: "From the iris dataset, select all columns that are not the species column".
iris_numeric <- iris %>% select (!Species)
head(iris_numeric)
Note that the above result can be achieved like this as well, but it is not as elegant.
iris_numeric <- iris %>% select (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
head(iris_numeric)
select() - dropping two or more columns
Here we use the same logic as with dropping one column, expect we will apply the !
operator to a vector of columns we want dropped. As a reminder, the c()
is a function that returns a vector. In this example we want to drop the sepal lengths and widths columns.
iris_numeric <- iris %>% select (!c(Sepal.Length, Sepal.Width, Species))
head(iris_numeric)
mutate()
The mutate()
function is useful for adding new columns to a dataframe, which will have the results of operations on already existing columns. For example, in the iris_sepal_petal_length
dataframe we have created in the previous example, the lengths are given in centimeters and now we would like to add columns with lengths given in inches.
iris_sepal_petal_length_inch <- iris_sepal_petal_length %>%
mutate(Sepal.Length.inches = Sepal.Length/2.54,
Petal.Length.inches = Petal.Length/2.54)
head(iris_sepal_petal_length_inch)
mutate_if()
The mutate_if()
function checks if a certain condition is met before applying the transforming operation on the column. In the iris dataset numerical values are given as doubles (number with a decimal). Now imagine if we want to convert the iris dataset to integers, lets try to use mutate()
first.
round(iris)
Error in Math.data.frame(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, : non-numeric variable(s) in data frame: Species Traceback:
- Math.data.frame(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, . 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7,
Oh no, we have an error. The round()
function seemed to work fine until it encountered the non-numeric species column. We could drop this column as we showed with select()
, but instead we can use mutate_if()
to check if a column is numeric before trying to change it.
iris_int <- iris %>% mutate_if(is.double, round)
head(iris_int)
iris_int <- iris %>% mutate_at(c("Sepal.Length", "Sepal.Width", "Petal.Length"), round)
head(iris_int)
iris_sepal_petal_length %>%
summarise(mean.Sepal.Length = mean(Sepal.Length),
mean.Petal.Length = mean(Petal.Length))
arranged_iris <- iris_sepal_petal_length %>% arrange(Sepal.Length)
head(arranged_iris)
We could also arrange rows based on values in two or more columns.
arranged_iris2 <- iris_sepal_petal_length %>% arrange(Sepal.Length, Petal.Length)
head(arranged_iris2)
To arrange rows in a descending order we can use the desc()
function from dplyr package.
arranged_iris3 <- iris_sepal_petal_length %>% arrange(desc(Sepal.Length))
head(arranged_iris3)
Other useful functions in the dplyr package
group_by()
Sometimes you want certain operations performed on groups in your dataset. Previously we used the summarise()
to get column means of all our iris data. Now we would like to get the species means. Logically we can first group our data by the species column.
iris %>%
group_by(Species) %>%
summarise(mean.Sepal.Length = mean(Sepal.Length),
mean.Petal.Length = mean(Petal.Length))
Compare this result with the result of the summarise()
function in chapter 4.4 summarise(). Note that grouping data does not change how your data looks, only how it is interpreted by other functions.
left_join()
The left_join()
function is used to join two dataframes based on matches in a common column between them. The function returns all rows from the left dataframe, and all columns from both dataframes. Rows in the left with no match in right will have NA (missing) values in the new columns. We can look at two dataframes, band_members and band_instruments.
band_members
band_instruments
We see that both dataframes have the name column in common, and it is by this column that we will join them.
#left dataframe is given priority
band_members %>% left_join(band_instruments)
Notice that Mick has NA in the instruments column, because he does not have a match in the right dataframe.
right_join()
The right_join()
works simmilarly as 5.2 left_join()
only the right dataframe is given priority, meaning if the rows in the left dataframe do not have a match in right they will have NA values in the new columns.
band_members %>% right_join(band_instruments)
full_join()
The full_join()
function returns all rows and columns from both dataframes. If no matching values are found NAs are placed.
{r}
band_members %>% full_join(band_instruments)
inner_join()
The inner_join()
function return all rows and columns from both dataframes that have a match, dropping all rows that have a mishmatch.
band_members %>% inner_join(band_instruments)
head(iris)
Here we see a dataframe in what is called a wide format, meaning every observation, in this case a individual iris plant has its measurements in its own row, and every variable has its own column. In order to make a grouped boxplot we need to change this dataframe into a long format.
melt()
We can use the melt()
function to convert the iris dataframe into a long format. The long format has for each data point as many rows as the number of variables and each row contains the value of a particular variable for a given data point. The melt()
function is part of the reshape2 package so we will first load it.
library(reshape2)
iris_long <- melt(iris)
head(iris_long)
library(ggplot2)
ggplot(iris_long, aes(x = variable, y = value, fill = Species )) + geom_boxplot()
Related Notebooks
- How To Use Grep In R
- How To Use Python Pip
- How To Use Pandas Correlation Matrix
- How To Use Selenium Webdriver To Crawl Websites
- How To Write DataFrame To CSV In R
- How To Plot Histogram In R
- How To Run Logistic Regression In R
- How To Analyze Yahoo Finance Data With R
- How to Create DataFrame in R Using Examples