Introduction to ggplot2

In [1]:
knitr::opts_chunk$set(echo = TRUE)

ggplot2 is one of the most popular and widely used R packages. It provides a powerful, customizable toolkit for visualizing a dataset. A plot is built from geoms, the visual marks that represent data points, drawn within a coordinate system. Good visualization gives you deeper insight into your data. Because a graph is a form of communication, making one that is both informative and pleasing is as much an art as a science, and it is the area of data analysis where you can show your creative skills.

Install ggplot2

Before using ggplot2 it is necessary to install it, since it is not a part of base R. The ggplot2 package is hosted in the CRAN repository. Any package from the CRAN repository can be installed using the function install.packages(). Since ggplot2 is part of the wider tidyverse, you can either choose to install tidyverse or just the ggplot2 package itself.

install.packages("ggplot2")

Alternatively, we can install the tidyverse collection of packages, which will also include ggplot2.

install.packages("tidyverse")

Tip:

This chunk of code is a useful way to install any packages from your list that are not already installed.

In [2]:
list_of_packages <- c("dplyr", "ggplot2")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
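
A variant of the same idea, sketched under the assumption that you would rather test whether each package can actually be loaded: requireNamespace() returns FALSE when a package is not available. The helper name missing_packages is hypothetical, and the install call is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the packages from `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

to_install <- missing_packages(c("dplyr", "ggplot2"))
# if (length(to_install) > 0) install.packages(to_install)
```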

Loading the ggplot2 package with library()

Now that the ggplot2 package is installed, we need to load it into our R session. For this purpose we can use the library() function.

In [3]:
library("ggplot2")

Basic Graph using ggplot2

The essential part of any plot is the data. Let us explore the mtcars dataset (Motor Trend Car Road Tests), which provides various statistics on different car models.

In [4]:
head(mtcars)
A data.frame: 6 × 11
                    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
                  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Mazda RX4          21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
Mazda RX4 Wag      21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
Datsun 710         22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
Hornet 4 Drive     21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
Hornet Sportabout  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
Valiant            18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

Let us try to visualize the relationship between weight (wt) and miles per gallon (mpg). We should expect to see a negative correlation: heavier cars tend to travel fewer miles per gallon. When building your graphs, always start with the ggplot() function.

  • The first argument is the data, in our case mtcars.
  • The second argument in the ggplot function is the aes() function, short for aesthetics. This function describes how variables in the data will be linked to geoms, the visual marks representing our data on the graph.

In our example we map the mpg column to the x axis and the wt column to the y axis. Lastly we need to add a geom. Let us make a scatterplot first: our geoms need to be points, so we use the geom_point() function. The ggplot() call initializes the graph, and geom_point() becomes a new layer that we add to it using "+".

In [5]:
ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()

And there we have it, our very first graph! Notice the negative correlation between car weight and mpg. Here the relationship is clear, but with many data points a trend can be hard to see. We can summarize the points with a smoothed line using the geom_smooth() function, which supports different smoothing methods. For now let us use linear regression (method = "lm").

In [6]:
ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

Notice how we added a new layer with the "+" sign to our already existing graph. If we remove the geom_point() layer, we still get a graph, albeit without points.

In [7]:
ggplot(mtcars, aes(x = mpg, y = wt)) + geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
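
The downward trend of the fitted line can also be checked numerically. The following is a minimal sketch using base R only; the variable name fit is an illustrative choice. It computes the correlation between the two columns and fits the same kind of model, an ordinary linear regression of y (wt) on x (mpg).

```r
# Correlation between weight and miles per gallon (negative, about -0.87).
cor(mtcars$wt, mtcars$mpg)

# Ordinary least-squares fit of wt on mpg, matching aes(x = mpg, y = wt).
fit <- lm(wt ~ mpg, data = mtcars)
coef(fit)  # the "mpg" coefficient is the slope of the fitted line
```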

The entire graph can be stored in a variable.

In [8]:
my_first_graph <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point() + geom_smooth(method = "lm")

New layers can be added to a graph that is stored in a variable. For example, we can add a title to our graph with ggtitle().

In [9]:
my_first_graph + ggtitle("Relationship between car weight and mile-per-gallon")
`geom_smooth()` using formula 'y ~ x'
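
Adding to a stored graph returns a new object and leaves the stored one untouched. The sketch below illustrates this with hypothetical variable names (stored_plot, titled_plot). It inspects the labels element of the plot object, which is an internal detail and may behave differently across ggplot2 versions.

```r
library(ggplot2)

stored_plot <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()
titled_plot <- stored_plot + ggtitle("Weight vs. miles per gallon")

# The stored plot is unchanged: it still has no title.
is.null(stored_plot$labels$title)

# The new object carries the title.
titled_plot$labels$title
```

To keep the change, assign the result back to a variable.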

The x and y axis names are inherited from column names specified in aes() unless overwritten. Our graph should be as informative as possible, so we should change our axis labels to something more descriptive. Axis labels can be changed using the xlab() and ylab() functions.

In [10]:
my_first_graph + ggtitle("Relationship between car weight and mile-per-gallon") +
  xlab("Miles-per-gallon") +
  ylab("Weight")
`geom_smooth()` using formula 'y ~ x'
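
As an alternative, ggplot2 also provides labs(), which sets the title and both axis labels in a single call. A minimal sketch, with labelled_plot as an illustrative variable name:

```r
library(ggplot2)

labelled_plot <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(
    title = "Relationship between car weight and miles per gallon",
    x = "Miles per gallon",
    y = "Weight"
  )

labelled_plot
```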

ggplot themes

We can change the background theme of our graph. You can choose between:

  1. black and white - theme_bw()
  2. classic theme - theme_classic()
  3. the default grey theme - theme_grey()
  4. the minimalistic theme - theme_minimal()

As an exercise, visualize your graph using each of the four themes.

In [11]:
my_first_graph <- my_first_graph + ggtitle("Relationship between car weight and mile-per-gallon") +
  xlab("Miles-per-gallon") +
  ylab("Weight") +
  theme_bw()

my_first_graph
`geom_smooth()` using formula 'y ~ x'
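
One possible way to approach the theme exercise is to keep the themes in a named list and apply each one to the same base graph. This is only a sketch; the names base_plot, themes and themed_plots are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()

# The four themes discussed in this section.
themes <- list(
  bw      = theme_bw(),
  classic = theme_classic(),
  grey    = theme_grey(),
  minimal = theme_minimal()
)

# Build one themed copy of the plot per theme.
themed_plots <- lapply(themes, function(th) base_plot + th)

# Display one of them, for example the classic theme.
themed_plots$classic
```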

Saving your graph as a file using ggsave()

We can save a graph manually by right-clicking on it and then pressing "Save image as...", or we can save it from code using the ggsave() function.

In [12]:
ggsave("my_first_graph.jpeg", # name of the file
       my_first_graph,        # the graph you want to save
       device = "jpeg")       # file format
Saving 6.67 x 6.67 in image

`geom_smooth()` using formula 'y ~ x'

This will save your graph in a file in your working directory.
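
The "Saving 6.67 x 6.67 in image" message appears because no size was given, so ggsave() used the size of the current graphics device. The sketch below fixes the size with the width, height and units arguments. Writing a PDF to a temporary file is an assumption made here to keep the example self-contained; plot_to_save and out_file are illustrative names.

```r
library(ggplot2)

plot_to_save <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()

# Temporary file so the example does not clutter the working directory.
out_file <- tempfile(fileext = ".pdf")

ggsave(out_file,
       plot = plot_to_save,
       width = 6, height = 4, units = "in") # explicit size in inches

file.exists(out_file)
```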

Plotting a Single variable

Continuous variable

To plot a single continuous variable we can use geom_area(), geom_density(), or geom_histogram().

In [13]:
single_continous_variable <- ggplot(mtcars, aes(wt))
single_continous_variable + geom_area(stat = "bin")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The stat = "bin" option groups the values into bins and plots the frequency of each bin. The message above shows the default: stat_bin() uses bins = 30 and suggests picking a better value with binwidth. Let us try a lower number of bins to draw a smoother, more continuous plot.

In [14]:
single_continous_variable + geom_area(bins=10,stat = "bin" )

A density plot with geom_density().

In [15]:
single_continous_variable + geom_density() # default stat = "density" draws a kernel density estimate

A histogram with geom_histogram().

In [16]:
single_continous_variable + geom_histogram(bins=10,stat = "bin" )
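
The stat_bin() message suggests binwidth as an alternative to bins. A minimal sketch, assuming a bin width of 0.5 (an illustrative choice; wt is recorded in units of 1000 lbs), with layer_data() used to look at the computed counts:

```r
library(ggplot2)

# Each bin spans 0.5 units of wt instead of using a fixed number of bins.
wt_histogram <- ggplot(mtcars, aes(wt)) +
  geom_histogram(binwidth = 0.5)

wt_histogram

# Every car falls into exactly one bin, so the counts sum to nrow(mtcars).
sum(layer_data(wt_histogram)$count)
```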

Plotting discrete or categorical variables using ggplot2

Discrete variables, such as the categorical Species column in the iris dataset, are best plotted as counts using a bar plot with geom_bar().

In [17]:
ggplot(iris, aes(Species)) + geom_bar()
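
By default geom_bar() counts the rows in each category and uses the count as the bar height. The same counts can be checked in base R; the variable name species_counts is illustrative.

```r
# Number of rows per species. geom_bar() draws one bar per level,
# with these counts as the bar heights.
species_counts <- table(iris$Species)
species_counts
```

The iris dataset has 50 rows for each of its three species, so all three bars have the same height.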

Plotting two variables

Both continuous variables

Plotting two continuous variables is best accomplished using geom_point() in order to make a scatter plot. We already covered making this kind of plot in our "Basic Graph using ggplot2" section. So here we can try to add some more layers and improve our first graph.

In [18]:
ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_smooth(method = "lm", se = FALSE) + # se = FALSE: turn off the confidence interval
  geom_point() +
  ggtitle("Relationship between car weight and mile-per-gallon") +
  xlab("Miles-per-gallon") +
  ylab("Weight") +
  geom_vline(xintercept = mean(mtcars$mpg), linetype = "dotted",
             color = "green", size = 1.5) + # add a vertical line at the mean mpg
  theme_bw()
`geom_smooth()` using formula 'y ~ x'

To wrap up, we will draw a grouped boxplot using the iris dataset.

Data wrangling - How to make a grouped boxplot

We will use the iris dataset for this example.

In [19]:
head(iris)
A data.frame: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>
1          5.1         3.5          1.4         0.2 setosa
2          4.9         3.0          1.4         0.2 setosa
3          4.7         3.2          1.3         0.2 setosa
4          4.6         3.1          1.5         0.2 setosa
5          5.0         3.6          1.4         0.2 setosa
6          5.4         3.9          1.7         0.4 setosa

Here we see a dataframe in a wide format: every row holds all the measurements for one plant, and each column represents a variable. In order to make a grouped boxplot we need to change this dataframe into a long format.

We can use the melt() function to convert the iris dataframe into a long format. In the long format each plant contributes one row per measured variable, and each row holds the name of the variable and its value for that plant. The melt() function is part of the reshape2 package, so we will first load it.

In [20]:
library(reshape2)

In [21]:
iris_long <- melt(iris)
head(iris_long)
Using Species as id variables

A data.frame: 6 × 3
  Species variable     value
  <fct>   <fct>        <dbl>
1 setosa  Sepal.Length   5.1
2 setosa  Sepal.Length   4.9
3 setosa  Sepal.Length   4.7
4 setosa  Sepal.Length   4.6
5 setosa  Sepal.Length   5.0
6 setosa  Sepal.Length   5.4
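
The message "Using Species as id variables" means melt() guessed which column identifies each row. A small sketch that names the id column explicitly through the id.vars argument and compares the shapes of the two formats; the variable name iris_long_explicit is illustrative.

```r
library(reshape2)

# Name the id column explicitly instead of letting melt() guess it.
iris_long_explicit <- melt(iris, id.vars = "Species")

dim(iris)                # 150 rows, 5 columns (wide)
dim(iris_long_explicit)  # 600 rows, 3 columns (long): 150 plants x 4 measurements
names(iris_long_explicit)
```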

With geom_boxplot() we can create a boxplot. A boxplot summarizes the distribution of our data. The horizontal black line represents the median, and the bottom and top borders of the "box" represent the first and third quartiles. The vertical lines (whiskers) extend to the most extreme data point within 1.5 * the interquartile range of the box. Dots beyond the whiskers are considered outliers.

In [22]:
ggplot(iris_long, aes(x = variable, y = value, fill = Species )) +  geom_boxplot()
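
The quantities a boxplot draws can also be computed by hand. The sketch below does so in base R for a single group, the sepal length of the setosa plants. It assumes the default quantile method and the usual 1.5 * IQR rule; all variable names are illustrative.

```r
# Sepal length of the setosa plants, as an example group.
x <- iris$Sepal.Length[iris$Species == "setosa"]

quartiles <- quantile(x, c(0.25, 0.5, 0.75))  # box bottom, median line, box top
iqr <- IQR(x)                                 # height of the box

# Fences: 1.5 * IQR beyond the box edges.
lower_fence <- quartiles[[1]] - 1.5 * iqr
upper_fence <- quartiles[[3]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Observations outside the fences are drawn as individual dots.
outliers <- x[x < lower_fence | x > upper_fence]
```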