Introduction To R DataFrames

Introduction

In R, data frames are used to store tabular data. There is a chance that you have encountered matrices in R, which are similar to data frames but differ in one important aspect. A matrix in R holds values of a single type only (for example, all numeric or all character), while a data frame can hold numerical, string, Boolean or any other type of data, one type per column. Data frames are essentially a list of equal-length vectors, and rows and columns of a data frame can be accessed as such.
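
As a small illustration of this difference (a sketch only; the names `m` and `mixed_df` are made up for this example), combining numbers and strings in a matrix coerces everything to one type, while a data frame keeps a separate type per column:

```r
# A matrix holds a single type: mixing numbers and strings coerces all to character
m <- cbind(c(1, 2), c("a", "b"))
typeof(m)                 # "character"

# A data frame keeps one type per column
# (stringsAsFactors = FALSE keeps strings as character in older R versions too)
mixed_df <- data.frame(num = c(1, 2), chr = c("a", "b"), stringsAsFactors = FALSE)
sapply(mixed_df, class)   # class of each column

# Under the hood a data frame is a list of equal-length vectors
is.list(mixed_df)         # TRUE
length(mixed_df)          # 2, the number of columns
```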

Installation

Data frames can be loaded into R from files, and data frames in R can be saved as files. R has built-in functions to read and write csv files, but reading and writing MS Excel spreadsheets such as .xlsx files requires specialized packages to be installed and loaded. In this tutorial we will use the readxl and xlsx packages.

To install a package, we pass its name as a string to install.packages() and run it, once for readxl and once for xlsx.

In [ ]:
install.packages("readxl")
install.packages("xlsx")

Tip:

This is a useful chunk of code that installs only those packages from your list that are not already installed.

In [ ]:
list_of_packages <- c("readxl", "xlsx")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

Loading the readxl and xlsx packages with library()

Now that we have the readxl and xlsx packages installed, we need to load them into our R session. For this purpose we can use the library() function.

In [ ]:
library("readxl")
library("xlsx") 

Examining a data frame

Small data frames can easily be printed out as a whole, but in practice you may encounter huge data frames that are impractical to print out. Still, there are ways you can peek into the data frame to get a feel for its structure and content. For this purpose you can use the head() and tail() functions. Let's try them on the iris data frame.

head() returns the first n rows, 6 by default.

In [ ]:
head(iris)
In [ ]:
head(iris, 3) # you can also specify the number of top rows you want to look at

With tail() you can view the last n rows in a data frame. This can be useful since tools usually append data to the end of tables, so with tail() you can see the most recently added data.

In [ ]:
tail(iris)
In [ ]:
tail(iris, 1) # this will return the very last row 

A great way to get an overview of a data frame is the str() function. It prints the number of rows (observations) and columns (variables), as well as the type and first few values of each column.

In [ ]:
str(iris)
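
If only one of these facts is needed, a few other base R functions report them individually. The following is a short sketch on the same built-in iris data:

```r
# iris ships with base R (datasets package), so nothing needs to be loaded
dim(iris)             # number of rows and columns: 150 5
nrow(iris)            # number of rows only
ncol(iris)            # number of columns only
sapply(iris, class)   # class of each column
summary(iris)         # per-column summary statistics
```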

Making an R data frame from scratch

Remember that R data frames are made of vectors, which need to be of equal length.
We will construct a toy data frame to showcase how a data frame can store variables of different types.

In [ ]:
names <- c("Pam", "Michael", "Jim", "Dwight") # a string vector
salary <- c(300.20, 900.95, 500.75, 500.1) # a double vector
vacation_days <- c(10, 4, 5, 30) # a double vector
is_manager <- c("No", "Yes", "No", "No") # a string vector
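
The equal-length requirement mentioned above can be checked directly. In this sketch the vectors `short_names` and `short_salary` are hypothetical and deliberately mismatched; tryCatch() is used only so that the error message is captured instead of stopping the notebook:

```r
# Hypothetical vectors for illustration: 4 names but only 3 salaries
short_names  <- c("Pam", "Michael", "Jim", "Dwight")
short_salary <- c(300.20, 900.95, 500.75)

# data.frame() stops with an error when the lengths cannot be reconciled
result <- tryCatch(
  data.frame(short_names, short_salary),
  error = function(e) conditionMessage(e)
)
result  # the captured error message, not a data frame
```

Note that a shorter vector whose length divides the longest one evenly (for example a single value) is recycled rather than rejected, so it is safest to check lengths yourself.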

To create a data frame, we can combine the vectors we made using the data.frame() function. Note that with the row.names argument we specify that the first vector passed (names) will be used as row names in our data frame and not as a variable (column).

In [ ]:
my_dataframe <- data.frame(names, salary, vacation_days, is_manager, row.names = 1)
my_dataframe

Let us examine our data frame using str()

In [ ]:
str(my_dataframe) 

Notice how the row names are not a variable and only 3 columns are present. Row names are not part of the data in the data frame, but can be obtained using the row.names() function.

In [ ]:
row.names(my_dataframe)

Likewise, column names can be retrieved using the colnames() function.

In [ ]:
colnames(my_dataframe)

These are the names of the vectors that we used to create the data frame. We can change the column names in the following way. We create a vector of new column names and then assign them to the column names of a data frame.

In [ ]:
new_column_names <- c("Salary", "Vacation days", "Is manager?")
colnames(my_dataframe) <- new_column_names
my_dataframe
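
When only one column needs a new name, replacing the whole vector is not necessary. A minimal sketch, using a hypothetical data frame `toy` so that my_dataframe is left untouched:

```r
# Hypothetical toy data frame, separate from my_dataframe
toy <- data.frame(salary = c(300.20, 900.95), vacation_days = c(10, 4))

# Rename only the column that is currently called "salary"
colnames(toy)[colnames(toy) == "salary"] <- "Salary"
colnames(toy)   # "Salary" "vacation_days"
```

Names containing spaces or symbols, such as "Is manager?", are allowed but must be wrapped in backticks when used with $.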

Selecting data from a data frame

You can select elements from a data frame using square brackets []. By using a comma, you can indicate what rows and columns to select respectively. For example:

In [ ]:
my_dataframe[3,] # returns third row from data frame
In [ ]:
my_dataframe[,2] # returns second column
In [ ]:
my_dataframe[2,3] # returns the value stored in the second row and third column

Note that individual values can be changed by assigning a new value to the selected element, like so:

In [ ]:
my_dataframe[2,3] <- "No"
my_dataframe[2,3] 

Columns can also be accessed by name using the $ symbol:

In [ ]:
my_dataframe$`Is manager?`
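
Square brackets also accept row and column names and logical conditions, which is often more readable than numeric positions. The sketch below uses a hypothetical data frame `staff` with values similar to the ones above:

```r
# Hypothetical data frame for illustration
staff <- data.frame(
  salary = c(300.20, 900.95, 500.75, 500.1),
  vacation_days = c(10, 4, 5, 30),
  row.names = c("Pam", "Michael", "Jim", "Dwight")
)

staff["Jim", "salary"]                  # one value, by row name and column name
staff[, c("salary", "vacation_days")]   # several columns by name
staff[staff$salary > 500, ]             # only rows where salary exceeds 500

# Row names of the rows that satisfy the condition
high_paid <- rownames(staff[staff$salary > 500, ])
high_paid
```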

Adding rows and columns in R Dataframe

Rows and columns can be added to an existing data frame using the rbind() and cbind() functions. First, let's try to add a column.

In [ ]:
Height <- c(160, 170, 165, 185) # must be same length as other columns

my_dataframe <- cbind(my_dataframe, Height)
my_dataframe
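
As an alternative to cbind(), a column can also be added by assigning to a name that does not exist yet. A small sketch with a hypothetical data frame `team`:

```r
# Hypothetical data frame for illustration
team <- data.frame(salary = c(300.20, 900.95), row.names = c("Pam", "Michael"))

team$Height <- c(160, 170)               # new column via $
team[["bonus"]] <- team$salary * 0.1     # new column computed from an existing one
colnames(team)
```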

Now we can try to add a row using rbind(). The new row needs to have the same length as the other rows, which is now four.

An easy way to check the dimensions of your data frame is with the dim() function.

In [ ]:
dim(my_dataframe) # 4 rows by 4 columns
In [ ]:
Toby <- list(400, 14, "No", 175) # a list keeps each value's type; c() would turn every value into a string
my_dataframe <- rbind(my_dataframe, Toby)
my_dataframe

Notice how the row name is set to 5, as in the fifth row. We can change it in the following way:

rownames() returns the row names; with [5] we select the fifth row name and assign a new value to it in the usual manner.

In [ ]:
rownames(my_dataframe)[5] <- "Toby"
my_dataframe
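
One thing to watch when adding rows is the type of the new row. c() coerces mixed values to a single type, so a row built with c() that contains a string makes rbind() turn numeric columns into character columns, whereas a list keeps each value's own type. A sketch with a hypothetical data frame `base_df`:

```r
# Hypothetical data frame for illustration
base_df <- data.frame(x = c(1, 2), y = c("a", "b"), stringsAsFactors = FALSE)

# c(3, "c") is a character vector, so column x becomes character
from_vector <- rbind(base_df, c(3, "c"))
class(from_vector$x)   # "character"

# list(3, "c") keeps the number as a number, so column x stays numeric
from_list <- rbind(base_df, list(3, "c"))
class(from_list$x)     # "numeric"
```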

Dropping rows and columns

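The most basic way to drop a row or column from a data frame is to put a minus sign in front of the index of the row or column we want to drop. This returns a new data frame and leaves the original unchanged unless the result is assigned back.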

In [ ]:
my_dataframe[-2,-4]

Note that you can pass a vector of negative integers to remove multiple rows or columns. Here we drop the first two rows.

In [ ]:
my_dataframe[c(-1,-2),]
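
Negative indices only work with positions. To drop by name instead, one option is a logical test on the names, sketched here with a hypothetical data frame `crew`:

```r
# Hypothetical data frame for illustration
crew <- data.frame(salary = c(300.20, 900.95), Height = c(160, 170),
                   vacation_days = c(10, 4), row.names = c("Pam", "Michael"))

# Keep every column whose name is not "Height"; the result is a new data frame
no_height <- crew[, !(colnames(crew) %in% "Height")]

# Keep every row whose name is not "Pam"
no_pam <- crew[rownames(crew) != "Pam", ]

# Assigning NULL removes a column from the data frame itself
crew$Height <- NULL
colnames(crew)
```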

Reading dataframe from file and writing dataframe to file

Saving your data frame as a csv file or Excel file.

First, set the working directory where you want the file to be placed:

In [ ]:
setwd("path/to/work/dir")

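To save a data frame as a csv (comma-separated values) file, we can use the write.csv2() function. Note that write.csv2() separates fields with semicolons and writes decimal points as commas, a convention common in many European locales; write.csv() is the counterpart that uses commas as separators. The first argument is the R variable you want to save as a csv file, while the second is the name of the file.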

In [ ]:
write.csv2(my_dataframe, "my_dataframe.csv")

To load the file back into R we can use the matching read.csv2() function. With row.names = 1 the first column of the file is used as row names.

In [ ]:
my_dataframe_from_csv <- read.csv2("my_dataframe.csv", row.names = 1) 
my_dataframe_from_csv
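
To see what the two csv flavours actually write, the following sketch saves a small hypothetical data frame `payroll` with both functions and prints the raw lines. It uses tempfile() so that nothing is written to the working directory:

```r
# Hypothetical data frame for illustration
payroll <- data.frame(salary = c(300.20, 900.95), row.names = c("Pam", "Michael"))

csv_path  <- tempfile(fileext = ".csv")
csv2_path <- tempfile(fileext = ".csv")

write.csv(payroll, csv_path)     # "," between fields, "." as decimal mark
write.csv2(payroll, csv2_path)   # ";" between fields, "," as decimal mark

readLines(csv_path)
readLines(csv2_path)

# Reading back with the matching function restores the numbers and row names
round_trip <- read.csv2(csv2_path, row.names = 1)
all.equal(payroll$salary, round_trip$salary)
```

A file written with write.csv2() should be read with read.csv2(), and one written with write.csv() with read.csv(); mixing them typically puts everything into a single column.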

We can write our data frame to an MS Excel spreadsheet using the write.xlsx() function of the xlsx package.

In [ ]:
write.xlsx(my_dataframe, "my_dataframe.xlsx")

To load an xlsx file we can use the read_excel() function from the readxl package. Note that read_excel() returns a tibble rather than a plain data frame.

In [ ]:
my_dataframe_from_excel <- read_excel("my_dataframe.xlsx")
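
Unlike read.csv2(), read_excel() was not given a row.names argument above, so if the row names were saved as the first column of the spreadsheet (an assumption about how the file was written), they come back as an ordinary column. The sketch below defines a hypothetical helper, `first_column_to_row_names()`, that converts the result to a plain data frame and moves the first column into the row names. The data frame `imported` is a stand-in for the imported data so that the example runs without the Excel file:

```r
# Hypothetical helper: use the first column as row names, then drop that column
first_column_to_row_names <- function(x) {
  x <- as.data.frame(x)       # also turns a tibble into a plain data frame
  rownames(x) <- x[[1]]       # first column becomes the row names
  x[, -1, drop = FALSE]       # drop = FALSE keeps a data frame even with one column left
}

# Stand-in for imported spreadsheet data (made up for this example)
imported <- data.frame(name = c("Pam", "Michael"), Salary = c(300.20, 900.95),
                       stringsAsFactors = FALSE)

restored <- first_column_to_row_names(imported)
restored
```

Under the same assumption, it could be applied as first_column_to_row_names(my_dataframe_from_excel).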