Tidy Data In R

Contents

  1. Introduction
  2. What is tidy data
  3. What is messy data
  4. Install tidyr
    1. Loading the tidyr package with library()
  5. How to tidy data in R - Example

Introduction

All manuals on tidying data start off with these two quotes:

“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

Tidy data is a specific organization of data that allows it to be used and manipulated easily for the purpose of data analysis. As we have learned from the introductory quotes, tidy data is uniform in its organization no matter its origin and size. And in contrast, data is always messy in a specific and unique way, making tidying messy data is a unique process in every case. For this reason, tidying data cannot be learned in a single guide, but is best learned from real world experience. The messier the datasets you have, you will become better at tidying data. Nevertheless, having a solid grasp on the basics of data tidying will help you becoming better at this process. Luckily, R has a wide array of packages, the wider tidyverse, and more specifically tidyr that provides powerful tools for tidying data.

What is tidy data?

Any dataset can be described as essentially a collection of values. Each value has a variable and an observation. Variables contain values that measure the same attribute, for example the sepal widths in the iris dataset. All values that are a result of measuring different attributes (i.e. variables) of a single entity are grouped into observations, for example one row in the iris dataset corresponds to all values derived from measuring a single plant.

In [1]:
head(iris)
A data.frame: 6 × 5
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct>
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa

Here on the iris dataset we can see the basic rules of tidy data. Namely, every column is a variable, every row an observation and every cell of the dataframe a single value.

What is messy data?

As mentioned before, every messy dataset is unique in its own way. However there are some most common themes in messy datasets that should be mentioned. These are:

  1. Column headers are values, instead of being variable names
  2. Two or more variables stored in a single column
  3. Variables are stored in both rows and columns

These issues will be addressed individually in subsequent chapters of this guide.

Install tidyr

Before using the tidyr package it is necessary to install it, since it is not a part of basic R installation. The tidyr package is hosted in the CRAN repository. Any package from the CRAN repository can be installed using the function install.packages().

To install using the following command...

install.packages("tidyr")

Alternatively, we can install the tidyverse which is a collection of packages (which includes tidyr).

install.packages("tidyverse")

Tip:

This is a useful chunk of code to make sure all the packages you want to use are installed if they are not already installed.

In [ ]:
list_of_packages <- c("tidyr", "tidyverse")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

Loading the tidyr package with library()

Now that we have the tidyr package installed, we need to load it or the tidyverse in to R workspace. For this purpose we can use the library() function.

In [3]:
library("tidyverse")
#or 
library("tidyr")

How to tidy data in R - Example

Consider the following dataset, relig_income. Would you consider it tidy? At the first look, the dataset (see below) looks clean and it should be easy to manipulate it? But if you look closely, the header row is not right. However looks like the first column in the dataset has the headers. Let us try to convert this wide format to a long format using the pivot_longer function.

In [4]:
head(relig_income)
A tibble: 6 × 11
religion<$10k$10-20k$20-30k$30-40k$40-50k$50-75k$75-100k$100-150k>150kDon't know/refused
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
Agnostic 27 34 60 81 76 137122109 84 96
Atheist 12 27 37 52 35 70 73 59 74 76
Buddhist 27 21 30 34 33 58 62 39 53 54
Catholic 418617 73267063811169497926331489
Don’t know/refused 15 14 15 11 10 35 21 17 18 116
Evangelical Prot 575869106498288114869497234141529
In [5]:
long_relig_income <- relig_income %>% 
  pivot_longer(-religion, names_to = "income", values_to = "frequency")

head(long_relig_income)
A tibble: 6 × 3
religionincomefrequency
<chr><chr><dbl>
Agnostic<$10k 27
Agnostic$10-20k 34
Agnostic$20-30k 60
Agnostic$30-40k 81
Agnostic$40-50k 76
Agnostic$50-75k137

With -religion, we imply: Pivot (convert to long) all columns except religion. The names_to argument takes a string as the name of a new column. The values_to argument takes a string as the name of a new column that will contain values. This dataframe is now tidy since it has each column represents a variable and each row represents an observation. The opposite function of pivot_longer() is pivot_wider(), converting long dataframe format to wide.

The same two functions pivot_longer() and pivot_wider() can be used to tidy datasets that have multiple variables in a single column, let us take a look at the World Health Organization data on TB infections. This table has a row for each year and strain of tuberculosis (in 2nd and 3rd column). The remaining columns state the number of cases for a given demographic. For example, m1524 corresponds to males between 15 and 24 years old, and f1524 are females between age 15 and 24. As you can see each of these columns has two variables: sex and age.

In [6]:
my_who <- who
#for the purpose of this guide we will simplify this dataset
my_who <- my_who %>% select(country, iso2, iso3, year,grep("new_sp", colnames(my_who), value = T))
head(my_who)
A tibble: 6 × 18
countryiso2iso3yearnew_sp_m014new_sp_m1524new_sp_m2534new_sp_m3544new_sp_m4554new_sp_m5564new_sp_m65new_sp_f014new_sp_f1524new_sp_f2534new_sp_f3544new_sp_f4554new_sp_f5564new_sp_f65
<chr><chr><chr><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int>
AfghanistanAFAFG1980NANANANANANANANANANANANANANA
AfghanistanAFAFG1981NANANANANANANANANANANANANANA
AfghanistanAFAFG1982NANANANANANANANANANANANANANA
AfghanistanAFAFG1983NANANANANANANANANANANANANANA
AfghanistanAFAFG1984NANANANANANANANANANANANANANA
AfghanistanAFAFG1985NANANANANANANANANANANANANANA

So in order to tidy this dataset, we need to perform two tasks. Change all the columns with the number of infections to long format

In [8]:
long_who <- gather(data = my_who, key = sex_and_age, number_of_infected, -country, -iso2, -iso3, -year)
head(long_who)
A tibble: 6 × 6
countryiso2iso3yearsex_and_agenumber_of_infected
<chr><chr><chr><int><chr><int>
AfghanistanAFAFG1980new_sp_m014NA
AfghanistanAFAFG1981new_sp_m014NA
AfghanistanAFAFG1982new_sp_m014NA
AfghanistanAFAFG1983new_sp_m014NA
AfghanistanAFAFG1984new_sp_m014NA
AfghanistanAFAFG1985new_sp_m014NA

Let us separate the male and female from the rest of the column name such as new_sp_m_2534. This way the sex and age will be separated into different columns.

In [9]:
#again we will simplify the data a little bit
library(stringr)
long_who$sex_and_age <- str_sub(long_who$sex_and_age, -4, -1)


head(long_who)
A tibble: 6 × 6
countryiso2iso3yearsex_and_agenumber_of_infected
<chr><chr><chr><int><chr><int>
AfghanistanAFAFG1980m014NA
AfghanistanAFAFG1981m014NA
AfghanistanAFAFG1982m014NA
AfghanistanAFAFG1983m014NA
AfghanistanAFAFG1984m014NA
AfghanistanAFAFG1985m014NA

Now we should break the sex_and_age column into separate columns. The column sex_and_age holds both the values of the sex and age of each observation. In tidy data they should be in separate columns. Conveniently we can use the separate() function from tidyr.

In [10]:
tidy_who <- separate(long_who, sex_and_age, c("sex", "age"), sep = 1)
head(tidy_who)
A tibble: 6 × 7
countryiso2iso3yearsexagenumber_of_infected
<chr><chr><chr><int><chr><chr><int>
AfghanistanAFAFG1980m014NA
AfghanistanAFAFG1981m014NA
AfghanistanAFAFG1982m014NA
AfghanistanAFAFG1983m014NA
AfghanistanAFAFG1984m014NA
AfghanistanAFAFG1985m014NA

Now we have a dataset that we can call tidy. However there are still some rows that have NAs. To drop rows with 'NA' in our dataset, we can use the drop_na() function. It will simply drop all rows that have NAs in from our dataframe.

In [11]:
tidy_who_no_NAs <- drop_na(tidy_who)
head(tidy_who_no_NAs)
A tibble: 6 × 7
countryiso2iso3yearsexagenumber_of_infected
<chr><chr><chr><int><chr><chr><int>
AfghanistanAFAFG1997m014 0
AfghanistanAFAFG1998m014 30
AfghanistanAFAFG1999m014 8
AfghanistanAFAFG2000m014 52
AfghanistanAFAFG2001m014129
AfghanistanAFAFG2002m014 90

A good way to check if your dataframe has NAs, is with the complete.cases function.

In [12]:
sum(!complete.cases(tidy_who_no_NAs)) == 0
#sums the number of NAs, if zero returns true
TRUE