Tidy Data In R

Contents

Introduction
What is tidy data?
What is messy data?
Install tidyr
Loading the tidyr package with library()
How to tidy data in R - Example

Introduction

Many guides on tidying data open with these two quotes:

“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

Tidy data is a specific way of organizing data that makes it easy to use and manipulate for data analysis. As the introductory quotes suggest, tidy data is organized uniformly regardless of its origin and size. In contrast, messy data is always messy in its own specific way, which makes tidying it a unique process in every case. For this reason, tidying data cannot be learned from a single guide and is best learned from real-world experience: the more messy datasets you work with, the better you will become at tidying them. Nevertheless, a solid grasp of the basics will help you get better at this process. Luckily, R has a wide array of packages for the job, namely the tidyverse and, more specifically, tidyr, which provides powerful tools for tidying data.

What is tidy data?

Any dataset can be described as a collection of values. Each value belongs to a variable and an observation. A variable contains all values that measure the same attribute, for example the sepal widths in the iris dataset. An observation contains all values measured on a single entity across the different attributes (i.e. variables); for example, one row in the iris dataset holds all values derived from measuring a single plant.

head(iris)

The iris dataset illustrates the basic rules of tidy data: every column is a variable, every row is an observation, and every cell of the dataframe holds a single value.
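A quick way to see these rules in practice is to inspect the shape of iris. The sketch below uses only base R, and the choice of which cell to look at is arbitrary.

```r
# iris ships with base R (package "datasets"), so nothing needs to be installed
dim(iris)    # number of rows (observations) and columns (variables)
names(iris)  # one name per variable
str(iris)    # type of each variable

# Every cell holds exactly one value: observation 1, variable Sepal.Width
first_sepal_width <- iris[1, "Sepal.Width"]
first_sepal_width
```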

What is messy data?

As mentioned before, every messy dataset is messy in its own way. However, some themes are common in messy datasets and are worth mentioning. These are:

Column headers are values, instead of being variable names
Two or more variables stored in a single column
Variables are stored in both rows and columns

These issues will be addressed individually in subsequent chapters of this guide.
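To make the three themes concrete, here is a sketch with small data frames. All names and numbers in it are invented purely for illustration and do not come from any real dataset.

```r
# Hypothetical toy data, invented for illustration only

# 1. Column headers are values: 2019 and 2020 are values of a "year"
#    variable, not variable names
headers_are_values <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE  # keep the numeric-looking column names as they are
)

# 2. Two variables in one column: sex and age group are fused together
two_vars_one_column <- data.frame(
  group = c("m014", "f014", "m1524"),
  cases = c(5, 7, 9)
)

# 3. Variables in both rows and columns: "measure" holds variable names
#    (tmin, tmax), while d1 and d2 are values of a "day" variable
vars_in_rows_and_columns <- data.frame(
  station = c("S1", "S1"),
  measure = c("tmin", "tmax"),
  d1 = c(2.1, 9.8),
  d2 = c(1.5, 8.9)
)

headers_are_values
two_vars_one_column
vars_in_rows_and_columns
```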

Install tidyr

Before using the tidyr package, it is necessary to install it, since it is not part of the base R installation. The tidyr package is hosted in the CRAN repository. Any package from the CRAN repository can be installed using the function install.packages().

To install it, use the following command:

install.packages("tidyr")

Alternatively, we can install the tidyverse, a collection of packages that includes tidyr.

install.packages("tidyverse")

Tip:

This chunk of code is useful for installing only those packages from your list that are not installed yet.

list_of_packages <- c("tidyr", "tidyverse")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

Loading the tidyr package with `library()`

Now that we have the tidyr package installed, we need to load it (or the whole tidyverse) into the R workspace. For this purpose, we can use the library() function.

library("tidyverse")
#or 
library("tidyr")

How to tidy data in R - Example

Consider the following dataset, relig_income, which ships with tidyr. Would you consider it tidy? At first glance, the dataset (see below) looks clean and easy to manipulate. But if you look closely, the column headers are not variable names: apart from religion, each header is an income bracket, that is, a value of an income variable. Let us convert this wide format to a long format using the pivot_longer() function.

head(relig_income)

long_relig_income <- relig_income %>% 
  pivot_longer(-religion, names_to = "income", values_to = "frequency")

head(long_relig_income)

With -religion, we say: pivot (convert to long) all columns except religion. The names_to argument takes a string that names the new column holding the old column headers. The values_to argument takes a string that names the new column holding the cell values. This dataframe is now tidy, since each column represents a variable and each row represents an observation. The opposite of pivot_longer() is pivot_wider(), which converts a long dataframe to wide format.
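As a sketch of how pivot_wider() reverses the step above, the following round trip reshapes relig_income to long format and back. The name wide_again is an assumption made for this example.

```r
library(tidyr)

# Reshape to long format, as in the text
long_relig_income <- pivot_longer(
  relig_income,
  cols = -religion,
  names_to = "income",
  values_to = "frequency"
)

# pivot_wider() reverses the operation: new column names come from
# "income", cell values come from "frequency"
wide_again <- pivot_wider(
  long_relig_income,
  names_from = income,
  values_from = frequency
)

# The round trip should give back the original shape and column names
dim(relig_income)
dim(wide_again)
identical(names(relig_income), names(wide_again))
```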

The same two functions, pivot_longer() and pivot_wider(), can be used to tidy datasets that have multiple variables in a single column. Let us take a look at the World Health Organization data on TB infections. This table has a row for each country and year: the first four columns (country, iso2, iso3 and year) identify the observation. The remaining columns state the number of cases for a given demographic. For example, new_sp_m1524 corresponds to males between 15 and 24 years old, and new_sp_f1524 to females between 15 and 24. As you can see, each of these column names encodes two variables: sex and age.

library(dplyr) #provides select()
my_who <- who
#for the purpose of this guide we will simplify this dataset
my_who <- my_who %>% select(country, iso2, iso3, year, starts_with("new_sp"))
head(my_who)

To tidy this dataset, we need to perform two tasks. First, convert all the columns holding the number of infections to long format:

#gather() is the older, superseded equivalent of pivot_longer()
long_who <- gather(data = my_who, key = "sex_and_age", value = "number_of_infected", -country, -iso2, -iso3, -year)
head(long_who)
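For comparison, here is a sketch of the same reshaping step written with pivot_longer(). The name long_who_pivot is an assumption made for this example. Note that pivot_longer() orders the result row by row, so the first rows will not match those produced by gather(), although the content is the same.

```r
library(tidyr)
library(dplyr)

# Same simplification as in the text: keep identifiers and new_sp columns
my_who <- who %>%
  select(country, iso2, iso3, year, starts_with("new_sp"))

# Pivot every new_sp column into a name/value pair
long_who_pivot <- my_who %>%
  pivot_longer(
    cols = starts_with("new_sp"),
    names_to = "sex_and_age",
    values_to = "number_of_infected"
  )

# One output row per original row and pivoted column
nrow(long_who_pivot) == nrow(my_who) * (ncol(my_who) - 4)
head(long_who_pivot)
```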

Second, separate sex and age into different columns. Column names such as new_sp_m2534 carry the prefix new_sp_ followed by the sex (m or f) and the age group, so let us first strip the prefix.

#again we will simplify the data a little bit
library(stringr)
#remove the "new_sp_" prefix, keeping the sex letter and the age group
long_who$sex_and_age <- str_remove(long_who$sex_and_age, "^new_sp_")

head(long_who)

Now we should break the sex_and_age column into separate columns. It holds both the sex and the age group of each observation, and in tidy data these belong in separate columns. Conveniently, we can use the separate() function from tidyr; sep = 1 splits each string after its first character.

tidy_who <- separate(long_who, sex_and_age, c("sex", "age"), sep = 1)
head(tidy_who)
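The reshaping and splitting steps can also be combined. The sketch below is one possible alternative, not the approach used in this guide: it passes two names to names_to and a regular expression to names_pattern, so pivot_longer() splits the column names while pivoting. The name tidy_who_one_step and the exact pattern are assumptions made for this example.

```r
library(tidyr)
library(dplyr)

tidy_who_one_step <- who %>%
  select(country, iso2, iso3, year, starts_with("new_sp")) %>%
  pivot_longer(
    cols = starts_with("new_sp"),
    names_to = c("sex", "age"),
    # first group: a single character (sex); second group: the rest (age)
    names_pattern = "^new_sp_(.)(.*)$",
    values_to = "number_of_infected"
  )

head(tidy_who_one_step)
unique(tidy_who_one_step$sex)
```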

Now we have a dataset that we can call tidy. However, some rows still contain NAs. To drop them, we can use the drop_na() function, which removes every row that has an NA in any column of the dataframe.

tidy_who_no_NAs <- drop_na(tidy_who)
head(tidy_who_no_NAs)
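drop_na() can also be restricted to particular columns. The sketch below uses a small invented data frame (toy, with made-up values) to show the difference between the two calls.

```r
library(tidyr)

# Hypothetical toy data with gaps in two different columns
toy <- data.frame(
  id = 1:4,
  cases = c(10, NA, 30, 40),
  note = c("a", "b", NA, "d")
)

# No columns given: a row is dropped if it has an NA anywhere
all_complete <- drop_na(toy)

# Column given: only NAs in "cases" cause a row to be dropped
cases_complete <- drop_na(toy, cases)

all_complete
cases_complete
```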

A good way to check whether your dataframe has NAs is the complete.cases() function.

sum(!complete.cases(tidy_who_no_NAs)) == 0
#counts the rows that contain at least one NA; returns TRUE if there are none
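When the check fails, it helps to know where the NAs are. The following base R sketch, again on an invented data frame named toy, shows a few complementary checks.

```r
# Hypothetical toy data with gaps in two different columns
toy <- data.frame(
  id = 1:4,
  cases = c(10, NA, 30, 40),
  note = c("a", "b", NA, "d")
)

has_any_na <- anyNA(toy)                 # is there at least one NA anywhere?
na_per_column <- colSums(is.na(toy))     # number of NAs in each column
incomplete_rows <- sum(!complete.cases(toy))  # rows with at least one NA

has_any_na
na_per_column
incomplete_rows
```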

Output of head(iris):
	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
	<dbl>	<dbl>	<dbl>	<dbl>	<fct>
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa

Output of head(relig_income):
religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k	Don't know/refused
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
Agnostic	27	34	60	81	76	137	122	109	84	96
Atheist	12	27	37	52	35	70	73	59	74	76
Buddhist	27	21	30	34	33	58	62	39	53	54
Catholic	418	617	732	670	638	1116	949	792	633	1489
Don’t know/refused	15	14	15	11	10	35	21	17	18	116
Evangelical Prot	575	869	1064	982	881	1486	949	723	414	1529

Output of head(long_relig_income):
religion	income	frequency
<chr>	<chr>	<dbl>
Agnostic	<$10k	27
Agnostic	$10-20k	34
Agnostic	$20-30k	60
Agnostic	$30-40k	81
Agnostic	$40-50k	76
Agnostic	$50-75k	137

Output of head(long_who):
country	iso2	iso3	year	sex_and_age	number_of_infected
<chr>	<chr>	<chr>	<int>	<chr>	<int>
Afghanistan	AF	AFG	1980	new_sp_m014	NA
Afghanistan	AF	AFG	1981	new_sp_m014	NA
Afghanistan	AF	AFG	1982	new_sp_m014	NA
Afghanistan	AF	AFG	1983	new_sp_m014	NA
Afghanistan	AF	AFG	1984	new_sp_m014	NA
Afghanistan	AF	AFG	1985	new_sp_m014	NA

Output of head(long_who) after shortening sex_and_age:
country	iso2	iso3	year	sex_and_age	number_of_infected
<chr>	<chr>	<chr>	<int>	<chr>	<int>
Afghanistan	AF	AFG	1980	m014	NA
Afghanistan	AF	AFG	1981	m014	NA
Afghanistan	AF	AFG	1982	m014	NA
Afghanistan	AF	AFG	1983	m014	NA
Afghanistan	AF	AFG	1984	m014	NA
Afghanistan	AF	AFG	1985	m014	NA