Tidy Data In R
Contents
- Introduction
- What is tidy data
- What is messy data
- Install tidyr
- Loading the tidyr package with library()
- How to tidy data in R - Example
Introduction
All manuals on tidying data start off with these two quotes:
“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham
Tidy data is a specific organization of data that allows it to be used and manipulated easily for the purpose of data analysis. As we have learned from the introductory quotes, tidy data is uniform in its organization no matter its origin and size. And in contrast, data is always messy in a specific and unique way, making tidying messy data is a unique process in every case. For this reason, tidying data cannot be learned in a single guide, but is best learned from real world experience. The messier the datasets you have, you will become better at tidying data. Nevertheless, having a solid grasp on the basics of data tidying will help you becoming better at this process. Luckily, R has a wide array of packages, the wider tidyverse, and more specifically tidyr that provides powerful tools for tidying data.
What is tidy data?
Any dataset can be described as essentially a collection of values. Each value has a variable and an observation. Variables contain values that measure the same attribute, for example the sepal widths in the iris dataset. All values that are a result of measuring different attributes (i.e. variables) of a single entity are grouped into observations, for example one row in the iris dataset corresponds to all values derived from measuring a single plant.
head(iris)
Here on the iris dataset we can see the basic rules of tidy data. Namely, every column is a variable, every row an observation and every cell of the dataframe a single value.
What is messy data?
As mentioned before, every messy dataset is unique in its own way. However there are some most common themes in messy datasets that should be mentioned. These are:
- Column headers are values, instead of being variable names
- Two or more variables stored in a single column
- Variables are stored in both rows and columns
These issues will be addressed individually in subsequent chapters of this guide.
Install tidyr
Before using the tidyr package it is necessary to install it, since it is not a part of basic R installation. The tidyr package is hosted in the CRAN repository. Any package from the CRAN repository can be installed using the function install.packages()
.
To install using the following command...
install.packages("tidyr")
Alternatively, we can install the tidyverse which is a collection of packages (which includes tidyr).
install.packages("tidyverse")
Tip:
This is a useful chunk of code to make sure all the packages you want to use are installed if they are not already installed.
list_of_packages <- c("tidyr", "tidyverse")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
library("tidyverse")
#or
library("tidyr")
How to tidy data in R - Example
Consider the following dataset, relig_income. Would you consider it tidy? At the first look, the dataset (see below) looks clean and it should be easy to manipulate it? But if you look closely, the header row is not right. However looks like the first column in the dataset has the headers. Let us try to convert this wide format to a long format using the pivot_longer
function.
head(relig_income)
long_relig_income <- relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "frequency")
head(long_relig_income)
With -religion
, we imply: Pivot (convert to long) all columns except religion. The names_to
argument takes a string as the name of a new column. The values_to
argument takes a string as the name of a new column that will contain values. This dataframe is now tidy since it has each column represents a variable and each row represents an observation. The opposite function of pivot_longer()
is pivot_wider()
, converting long dataframe format to wide.
The same two functions pivot_longer()
and pivot_wider()
can be used to tidy datasets that have multiple variables in a single column, let us take a look at the World Health Organization data on TB infections. This table has a row for each year and strain of tuberculosis (in 2nd and 3rd column). The remaining columns state the number of cases for a given demographic. For example, m1524 corresponds to males between 15 and 24 years old, and f1524 are females between age 15 and 24. As you can see each of these columns has two variables: sex and age.
my_who <- who
#for the purpose of this guide we will simplify this dataset
my_who <- my_who %>% select(country, iso2, iso3, year,grep("new_sp", colnames(my_who), value = T))
head(my_who)
So in order to tidy this dataset, we need to perform two tasks. Change all the columns with the number of infections to long format
long_who <- gather(data = my_who, key = sex_and_age, number_of_infected, -country, -iso2, -iso3, -year)
head(long_who)
Let us separate the male and female from the rest of the column name such as new_sp_m_2534. This way the sex and age will be separated into different columns.
#again we will simplify the data a little bit
library(stringr)
long_who$sex_and_age <- str_sub(long_who$sex_and_age, -4, -1)
head(long_who)
Now we should break the sex_and_age column into separate columns. The column sex_and_age holds both the values of the sex and age of each observation. In tidy data they should be in separate columns. Conveniently we can use the separate()
function from tidyr.
tidy_who <- separate(long_who, sex_and_age, c("sex", "age"), sep = 1)
head(tidy_who)
Now we have a dataset that we can call tidy. However there are still some rows that have NAs. To drop rows with 'NA' in our dataset, we can use the drop_na()
function. It will simply drop all rows that have NAs in from our dataframe.
tidy_who_no_NAs <- drop_na(tidy_who)
head(tidy_who_no_NAs)
A good way to check if your dataframe has NAs, is with the complete.cases
function.
sum(!complete.cases(tidy_who_no_NAs)) == 0
#sums the number of NAs, if zero returns true
Related Notebooks
- How To Analyze Data Using Pyspark RDD
- How To Analyze Yahoo Finance Data With R
- How To Read JSON Data Using Python Pandas
- How to Analyze the CSV data in Pandas
- Summarising Aggregating and Grouping data in Python Pandas
- Data Cleaning With Python Pdpipe
- Data Analysis With Pyspark Dataframe
- How to Visualize Data Using Python - Matplotlib
- Boxplots In R