Introduction To Ggplot2

Introduction to ggplot2

ggplot2 is one of the most popular and widely used R packages. It is a powerful, customizable data visualization tool that builds a plot from a dataset, a coordinate system, and geoms, the visual marks that represent data points. Good visualization can give you deeper insight into your data. Because a graph is a form of communication, making one that is both informative and pleasing is more of an art than a science, and data visualization is the part of data analysis where you can show your creative skills.

Install ggplot2

Because ggplot2 is not part of base R, it must be installed before use. The package is hosted on CRAN, and any CRAN package can be installed with the function install.packages(). Since ggplot2 is part of the wider tidyverse, you can install either the whole tidyverse or just the ggplot2 package itself.

install.packages("ggplot2")

Alternatively, we can install the tidyverse collection of packages, which will also include ggplot2.

install.packages("tidyverse")

Tip:

The following chunk of code installs any of the packages you want to use that are not already installed.

list_of_packages <- c("dplyr", "ggplot2")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
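The same check can also be wrapped in a small function. The sketch below is one possible variant, not the only way to do it: missing_packages() is a hypothetical helper name, and it relies on requireNamespace(), which loads a package's namespace as a side effect when the package is found.

```r
# Hypothetical helper (not part of base R or ggplot2): return the names of
# the requested packages that cannot be found in the current library paths.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

# Base packages such as "stats" are always present, so nothing is reported.
missing_packages(c("stats", "utils"))

# Typical use (commented out so the sketch has no side effects):
# to_install <- missing_packages(c("dplyr", "ggplot2"))
# if (length(to_install) > 0) install.packages(to_install)
```

Compared with scanning installed.packages(), this approach checks one package at a time, which tends to be quicker when many packages are installed.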

Loading the ggplot2 package with `library()`

Now that we have the ggplot2 package installed, we need to load it into our R session. For this purpose we can use the library() function.

library("ggplot2")

Basic Graph using ggplot2

The essential part of any plot is the data. Let us explore the motor cars dataset, mtcars. The dataset provides various statistics on different car models.

head(mtcars)

Let us try to visualize the relationship between weight (wt) and miles per gallon (mpg). We should expect to see a negative correlation, since heavier cars tend to travel fewer miles per gallon. When building your graphs, always start with the ggplot() function.

The first argument is the data, in our case mtcars.
The second argument in the ggplot function is the aes() function, short for aesthetics. This function describes how variables in the data will be linked to geoms, the visual marks representing our data on the graph.

In our example we map the mpg column to the x axis and the wt column to the y axis of the graph. Lastly, we need to add a geom. Let us make a scatterplot first: our geoms need to be points, so we use the geom_point() function. It adds a new layer to the graph that we initialized with ggplot(). Layers are added with the "+" operator.

ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()

And there we have it, our very first graph! Notice the negative correlation between car weight and mpg. Here the relationship is clear, but with many data points a trend can be hard to see. We can summarize the points with a fitted line using the geom_smooth() function, which supports different smoothing methods. For now let us use linear regression (method = "lm").

ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point() + geom_smooth(method = "lm")

`geom_smooth()` using formula 'y ~ x'

Notice how we added a new layer with the "+" sign to our already existing graph. If we remove the geom_point() layer we still get a graph, albeit without points.

ggplot(mtcars, aes(x = mpg, y = wt)) + geom_smooth(method = "lm")

`geom_smooth()` using formula 'y ~ x'
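To put a number on the trend, the line drawn by geom_smooth(method = "lm") can be reproduced outside the plot. The following is a minimal sketch using only base R; the variable names r, fit and slope are chosen here for illustration.

```r
# Pearson correlation between fuel economy and weight (built-in mtcars data)
r <- cor(mtcars$mpg, mtcars$wt)

# With aes(x = mpg, y = wt) and formula y ~ x, the fitted model is wt ~ mpg
fit <- lm(wt ~ mpg, data = mtcars)
slope <- unname(coef(fit)["mpg"])

round(r, 2)   # about -0.87, a strong negative correlation
slope < 0     # TRUE: higher mpg goes with lower weight
```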

The entire graph can be stored in a variable.

my_first_graph <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point() + geom_smooth(method = "lm")

New layers can be added to a graph that is stored in a variable. For example, we can add a title to our graph with ggtitle().

my_first_graph + ggtitle("Relationship between car weight and mile-per-gallon")

`geom_smooth()` using formula 'y ~ x'
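Note that adding to a stored graph does not change the stored object; the result has to be assigned to keep it. The sketch below illustrates this with hypothetical variable names (base_plot, titled_plot) and inspects the labels element of the plot object, which is a convenient way to look inside a plot rather than something you need in everyday use.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()

# Adding to a stored plot returns a new object ...
titled_plot <- base_plot + ggtitle("Weight vs. miles per gallon")

# ... while the original stays untitled until it is reassigned.
is.null(base_plot$labels$title)
titled_plot$labels$title
```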

By default, the x and y axis labels are taken from the column names given in aes(). Our graph should be as informative as possible, so we should change the axis labels to something more descriptive. Axis labels can be changed using the xlab() and ylab() functions.

my_first_graph + ggtitle("Relationship between car weight and mile-per-gallon") +
  xlab("Miles-per-gallon") +
  ylab("Weight")

`geom_smooth()` using formula 'y ~ x'
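The title and both axis labels can also be set in a single call with labs(). This is a sketch of an equivalent way to label the graph; the variable name labelled_plot is made up for the example, and the unit in the y label comes from the mtcars documentation, which records wt in units of 1000 lbs.

```r
library(ggplot2)

labelled_plot <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(
    title = "Relationship between car weight and miles per gallon",
    x = "Miles per gallon",
    y = "Weight (1000 lbs)"
  )

labelled_plot
```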

ggplot themes

We can change the theme of our graph, which controls its background and overall look. You can choose between:

black and white - theme_bw()
classic theme - theme_classic()
the default grey theme - theme_grey()
the minimalistic theme - theme_minimal()

As an exercise, visualize your graph using each of the four themes.

my_first_graph <- my_first_graph + ggtitle("Relationship between car weight and mile-per-gallon") +
  xlab("Miles-per-gallon") +
  ylab("Weight") +
  theme_bw()

my_first_graph

`geom_smooth()` using formula 'y ~ x'
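One way to approach the exercise is to keep the themes in a named list and build one graph per theme. The sketch below assumes the same mtcars scatterplot as before; base_plot, themes and themed_plots are illustrative names.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()

# The four themes listed above, keyed by a short name
themes <- list(
  bw      = theme_bw(),
  classic = theme_classic(),
  grey    = theme_grey(),
  minimal = theme_minimal()
)

# One plot per theme, titled with the theme's name so they are easy to compare
themed_plots <- Map(
  function(theme_obj, theme_name) base_plot + theme_obj + ggtitle(theme_name),
  themes, names(themes)
)

themed_plots$minimal   # print any single plot by name
```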

Saving your graph as a file using ggsave()

We can save a graph manually by right-clicking it and choosing "Save image as...", or we can save it from code using the ggsave() function.

ggsave("my_first_graph.jpeg", # name of the file
       my_first_graph,        # the graph you want to save
       device = "jpeg")       # file format

Saving 6.67 x 6.67 in image

`geom_smooth()` using formula 'y ~ x'

This will save your graph in a file in your working directory.
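The "Saving 6.67 x 6.67 in image" message appears because no size was given, so ggsave() fell back to the size of the current graphics device. A sketch with explicit dimensions follows. It writes a PDF into a temporary directory purely to keep the example free of clutter, and plot_to_save and out_file are illustrative names; in practice you would pass your own file name.

```r
library(ggplot2)

plot_to_save <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()

# tempdir() is used only so the sketch does not write into the working directory
out_file <- file.path(tempdir(), "my_first_graph.pdf")

# The file format is inferred from the extension; the size is given in inches
ggsave(out_file, plot = plot_to_save, width = 6, height = 4, units = "in")

file.exists(out_file)
```

For raster formats such as JPEG or PNG, ggsave() also accepts a dpi argument to control the resolution.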

Plotting a Single variable

Continuous variable

To plot a single continuous variable we can use geom_area(), geom_density(), and geom_histogram().

single_continous_variable <- ggplot(mtcars, aes(wt))
single_continous_variable + geom_area(stat = "bin")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The stat = "bin" option groups the values into a number of bins and plots how many observations fall into each one. The message above reports the default that was used: 30 bins. Let us try setting a lower number of bins to draw a smoother, more continuous plot.

single_continous_variable + geom_area(stat = "bin", bins = 10)
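As the message suggests, the bins can also be controlled by their width rather than their number. The following sketch uses a histogram with a bin width of 0.5, which is an arbitrary value picked for illustration, and then reads back the computed counts with layer_data(); wt_hist and bin_counts are hypothetical names.

```r
library(ggplot2)

wt_hist <- ggplot(mtcars, aes(wt)) +
  geom_histogram(binwidth = 0.5)   # each bar spans 0.5 units of wt (500 lbs)

# layer_data() returns the values ggplot2 computed for a layer
bin_counts <- layer_data(wt_hist)$count

# Every car falls into exactly one bin, so the counts add up to the row count
sum(bin_counts) == nrow(mtcars)
```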

A density plot with geom_density(). It draws a smoothed kernel density estimate instead of binned counts, so no bins are needed.

single_continous_variable + geom_density()

A histogram with geom_histogram().

single_continous_variable + geom_histogram(bins = 10)

Plot Discrete or Categorical variables using Ggplot

Discrete variables, such as the categorical Species column of the iris dataset, are best plotted as a bar plot with geom_bar(), which counts the rows in each category.

ggplot(iris, aes(Species)) + geom_bar()
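The counts that geom_bar() computes can be checked with table(). If the counts are already available as a column, geom_col() plots them directly. The sketch below shows both steps; species_counts is a name chosen for the example.

```r
library(ggplot2)

# Count the rows per species; as.data.frame() yields columns Species and Freq
species_counts <- as.data.frame(table(Species = iris$Species))
species_counts

# geom_col() uses the y values as bar heights instead of counting rows
ggplot(species_counts, aes(x = Species, y = Freq)) + geom_col()
```

Each species has 50 rows in iris, so all three bars have the same height in either version.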

Plotting two variables

Both continuous variables

Two continuous variables are best plotted with geom_point(), which makes a scatter plot. We already covered this kind of plot in the "Basic Graph using ggplot2" section, so here we add some more layers to improve our first graph.

ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_smooth(method = "lm", se = FALSE) + # se = FALSE: turn off the confidence band
  geom_point() +
  ggtitle("Relationship between car weight and mile-per-gallon") +
  xlab("Miles-per-gallon") +
  ylab("Weight") +
  # dotted vertical line at the mean mpg (ggplot2 >= 3.4.0 prefers linewidth over size)
  geom_vline(xintercept = mean(mtcars$mpg), linetype = "dotted",
             color = "green", size = 1.5) +
  theme_bw()

`geom_smooth()` using formula 'y ~ x'

To wrap up, we will draw a grouped boxplot using the iris dataset.

Data wrangling - How to make a grouped boxplot

We will use the iris dataset for this example.

head(iris)

Here we see a data frame in wide format: every row holds all the measurements for one plant, and each column represents a variable. In order to make a grouped boxplot we need to convert this data frame into long format.

We can use the melt() function to convert the iris data frame into long format. In long format each plant gets one row per measured variable, and each row holds the value of a single variable for that plant. The melt() function is part of the reshape2 package, so we first load it (install it with install.packages("reshape2") if needed).

library(reshape2)

iris_long <- melt(iris)
head(iris_long)

Using Species as id variables
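That message appears because melt() had to guess which columns identify each row. The sketch below states the choice explicitly, which produces the same long data frame without the message. The column names passed to variable.name and value.name are simply the reshape2 defaults written out, and iris_long_explicit is an illustrative name.

```r
library(reshape2)

iris_long_explicit <- melt(
  iris,
  id.vars = "Species",          # column that identifies each group
  variable.name = "variable",   # new column holding the former column names
  value.name = "value"          # new column holding the measurements
)

# 150 plants x 4 measurements = 600 rows, in 3 columns
dim(iris_long_explicit)
```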

With geom_boxplot() we can create a boxplot, which summarizes the distribution of our data. The horizontal black line represents the median, and the bottom and top borders of the "box" represent the first and third quartiles. The vertical lines (whiskers) extend to the most extreme data points that lie within 1.5 * the interquartile range of the box. Points beyond the whiskers are drawn as dots and are considered outliers.

ggplot(iris_long, aes(x = variable, y = value, fill = Species )) +  geom_boxplot()
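The quantities behind a single box can be worked out by hand. The following base R sketch does so for the sepal width of the setosa species; it is meant to illustrate the 1.5 * interquartile range rule described above, and all variable names are made up for the example.

```r
# Sepal widths of one species, i.e. the data behind one box
setosa_width <- iris$Sepal.Width[iris$Species == "setosa"]

# First quartile, median and third quartile
q <- unname(quantile(setosa_width, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Points outside these fences are treated as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- setosa_width[setosa_width < lower_fence | setosa_width > upper_fence]

# The whiskers end at the most extreme observations inside the fences
whisker_bottom <- min(setosa_width[setosa_width >= lower_fence])
whisker_top <- max(setosa_width[setosa_width <= upper_fence])

c(whisker_bottom, q, whisker_top)
outliers
```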

Output of head(mtcars):

                    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
                  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Mazda RX4          21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
Mazda RX4 Wag      21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
Datsun 710         22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
Hornet 4 Drive     21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
Hornet Sportabout  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
Valiant            18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

Output of head(iris):

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>
1          5.1         3.5          1.4         0.2 setosa
2          4.9         3.0          1.4         0.2 setosa
3          4.7         3.2          1.3         0.2 setosa
4          4.6         3.1          1.5         0.2 setosa
5          5.0         3.6          1.4         0.2 setosa
6          5.4         3.9          1.7         0.4 setosa

Output of head(iris_long):

  Species variable     value
  <fct>   <fct>        <dbl>
1 setosa  Sepal.Length   5.1
2 setosa  Sepal.Length   4.9
3 setosa  Sepal.Length   4.7
4 setosa  Sepal.Length   4.6
5 setosa  Sepal.Length   5.0
6 setosa  Sepal.Length   5.4
