Introduction To R DataFrames
Introduction
In R, data frames are used to store tabular data. There is a chance that you have encountered matrices in R, which are similar to data frames but differ in one important aspect. A matrix holds values of a single type only (for example all numeric or all character), while a data frame can hold a different type in each column: numeric, string, Boolean or any other type of data. Data frames are essentially a list of vectors of equal length, and the columns of a data frame can be accessed as such.
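As a minimal sketch of the list-of-vectors idea, the following builds a small data frame and inspects it with base R functions. The name df and its columns are made up for this illustration and are not used elsewhere in the tutorial.

```r
# Illustrative data frame; the column names are assumptions for this sketch.
df <- data.frame(
  id     = c(1L, 2L, 3L),          # integer column
  label  = c("a", "b", "c"),       # character column
  passed = c(TRUE, FALSE, TRUE),   # logical column
  stringsAsFactors = FALSE         # keep strings as character in older R versions
)

is.list(df)                      # a data frame is a list under the hood
length(df)                       # number of list elements, i.e. columns
col_types <- sapply(df, class)   # one type per column
col_types
df[["label"]]                    # extract one column as a plain vector
```

Each column keeps its own type, which is what allows mixed data in one table.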
Installation
Data frames can be loaded into R from files, and data frames in R can be saved as files. R has built-in functions to read and write CSV files, but reading and writing MS Excel spreadsheets such as .xlsx files requires specialized packages to be installed and loaded. In this tutorial we will use the readxl and xlsx packages.
To install a package such as readxl, we pass its name as a string to install.packages() and run it.
install.packages("readxl")
install.packages("xlsx")
Tip:
This chunk of code is a handy way to install only those packages from your list that are not installed yet. Afterwards, the packages are loaded with library().
list_of_packages <- c("readxl", "xlsx")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
library("readxl")
library("xlsx")
Examining a data frame
Small data frames can easily be printed out as a whole, but in practice you may encounter huge data frames that are impractical to print. There are still ways to peek into a data frame to get a feel for its structure and content. For this purpose you can use the head() and tail() functions. Let's try them on the built-in iris data frame.
head() returns the first n rows, 6 by default.
head(iris)
head(iris, 3) # you can also specify the number of top rows you want to look at
With tail() you can view the last n rows in a data frame. This can be useful since tools usually append data to the end of tables, so with tail() you can see the most recently added data.
tail(iris)
tail(iris, 1) # this will return the very last row
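Both functions also accept a negative n in base R, which means "everything except" that many rows. A short sketch using iris, which has 150 rows (the variable names first_three and last_three are only illustrative):

```r
first_three <- head(iris, -147)  # everything except the last 147 rows
last_three  <- tail(iris, -147)  # everything except the first 147 rows

nrow(first_three)      # number of rows that remain
rownames(last_three)   # original row numbers are kept as row names
```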
A great way to get an overview of a data frame is the str() function. It prints the number of observations (rows) and variables (columns), as well as the type of each column and its first few values.
str(iris)
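When the same information is needed as values rather than as printed output, base R has a separate function for each piece. A small sketch, with illustrative variable names:

```r
n_rows <- nrow(iris)                 # number of observations
n_cols <- ncol(iris)                 # number of variables
col_classes <- sapply(iris, class)   # class of each column, as a named vector

n_rows
n_cols
col_classes
```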
To build a data frame of our own, we first create a few vectors of equal length, one per column:
names <- c("Pam", "Michael", "Jim", "Dwight") # a string vector
salary <- c(300.20, 900.95, 500.75, 500.1) # a double vector
vacation_days <- c(10, 4, 5, 30)
is_manager <- c("No", "Yes", "No", "No") # a string vector
To create a data frame, we combine these vectors using the data.frame() function. Note that with row.names = 1 we specify that the first vector passed (names) is used for the row names of the data frame and not as a regular column.
my_dataframe <- data.frame(names, salary, vacation_days, is_manager, row.names = 1)
my_dataframe
Let us examine our data frame using str()
str(my_dataframe)
Notice how the row names are not a variable and only 3 columns are present. Row names are not part of the data in the data frame, but can be obtained using the row.names() function.
row.names(my_dataframe)
Likewise, column names can be retrieved using the colnames() function.
colnames(my_dataframe)
These are the names of the vectors that we used to create the data frame. We can change the column names in the following way. We create a vector of new column names and then assign them to the column names of a data frame.
new_column_names <- c("Salary", "Vacation days", "Is manager?")
colnames(my_dataframe) <- new_column_names
my_dataframe
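The same assignment also works on a single position, which is handy when only one column needs a new name. A minimal sketch on a throwaway data frame (example_df and its column names are assumptions for illustration):

```r
# Throwaway data frame used only for this sketch.
example_df <- data.frame(a = 1:2, b = c("x", "y"))

colnames(example_df)[2] <- "letter"   # rename only the second column
colnames(example_df)
```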
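Rows, columns and single values can be selected with square brackets, in the form [row, column]:
my_dataframe[3,] # returns the third row of the data frame
my_dataframe[,2] # returns the second column
my_dataframe[2,3] # returns the value stored in the second row, third column
Note that individual values can be changed by assigning a new value like so: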
my_dataframe[2,3] <- "No"
my_dataframe[2,3]
Columns can also be accessed by name using the $ symbol. Names that contain spaces or special characters must be wrapped in backticks:
my_dataframe$`Is manager?`
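Names can be used inside square brackets as well. The sketch below uses a small stand-in data frame called staff (an assumption for illustration, so the tutorial's my_dataframe is left untouched) to show a few equivalent ways of selecting by name.

```r
# Stand-in data frame for this sketch only.
staff <- data.frame(
  Salary = c(300.20, 900.95),
  `Is manager?` = c("No", "Yes"),
  row.names = c("Pam", "Michael"),
  check.names = FALSE,        # keep the column name with a space and "?" as-is
  stringsAsFactors = FALSE
)

staff[["Is manager?"]]                # same column as staff$`Is manager?`
staff["Michael", "Salary"]            # row and column selected by name
staff[, c("Salary", "Is manager?")]   # several columns by name, still a data frame
```

Selecting by name tends to be more robust than selecting by position, since it keeps working when columns are reordered.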
A new column can be added to a data frame with cbind():
Height <- c(160, 170, 165, 185) # must be the same length as the other columns
my_dataframe <- cbind(my_dataframe, Height)
my_dataframe
Now we can try to add a row using rbind(). The new row needs to have the same length as the existing rows, which is now four (one value per column).
An easy way to check the dimensions of your data frame is the dim() function.
dim(my_dataframe) # 4 rows by 4 columns
Toby <- list(400, 14, "No", 175) # a list keeps each value's type; c() would turn all four values into strings
my_dataframe <- rbind(my_dataframe, Toby)
my_dataframe
Notice that the new row is named 5, since it is the fifth row. We can change it in the following way:
rownames() returns the row names, with [5] we select the fifth row name, and we assign a new value to it in the usual manner.
rownames(my_dataframe)[5] <- "Toby"
my_dataframe
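An alternative is to add the new row as a one-row data frame, which keeps the column types intact and sets the row name in one step. This matters because c() can hold only one type: mixing numbers and strings produces a character vector. The sketch below uses an illustrative data frame called team, not the tutorial's my_dataframe.

```r
# Illustrative data frame for this sketch only.
team <- data.frame(
  Salary  = c(300.20, 900.95),
  Manager = c("No", "Yes"),
  row.names = c("Pam", "Michael"),
  stringsAsFactors = FALSE
)

# c() coerces mixed values to a single type, here character.
mixed_class <- class(c(400, "No"))
mixed_class

# A one-row data frame keeps each value's own type and carries the row name.
new_row <- data.frame(Salary = 400, Manager = "No",
                      row.names = "Toby", stringsAsFactors = FALSE)
team <- rbind(team, new_row)

sapply(team, class)   # check the column types after binding
rownames(team)
```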
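Rows and columns can be removed by indexing with negative integers. The following returns the data frame without the second row and the fourth column:
my_dataframe[-2,-4]
Note that you can pass a vector of negative integers to remove multiple rows or columns at once. The result is a new data frame; my_dataframe itself is unchanged unless you assign the result back to it.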
my_dataframe[c(-1,-2),]
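Negative indices only work with positions. To drop a column by name, one common approach is to select every name except the unwanted one. A minimal sketch on an illustrative data frame (pets and its columns are made up for this example):

```r
# Illustrative data frame for this sketch only.
pets <- data.frame(
  name = c("Rex", "Tom", "Tweety"),
  legs = c(4, 4, 2),
  age  = c(3, 5, 1),
  stringsAsFactors = FALSE
)

remaining <- pets[-c(1, 2), ]   # drop rows 1 and 2 by position
remaining

# Drop a column by name: keep all column names except "age".
no_age <- pets[, setdiff(colnames(pets), "age")]
no_age

dim(pets)   # the original data frame is unchanged
```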
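Before saving files, set the working directory to the folder where the files should be written:
setwd("path/to/work/dir")
To save a data frame as a CSV file, we can use the write.csv2() function. Note that write.csv2() separates fields with semicolons and uses a comma as the decimal mark, a convention common in many European locales; write.csv() writes the comma-separated variant with a decimal point. The first argument is the R object you want to save, and the second is the name of the file.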
write.csv2(my_dataframe, "my_dataframe.csv")
To load the file back into R we can use the read.csv2() function. With row.names = 1 the first column of the file is used as row names again.
my_dataframe_from_csv <- read.csv2("my_dataframe.csv", row.names = 1)
my_dataframe_from_csv
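To see what the file actually contains, a round trip through a temporary file can be sketched as follows. The data frame scores and the other variable names are assumptions for illustration; tempfile() is used so that no file in the working directory is overwritten.

```r
# Illustrative data frame for this sketch only.
scores <- data.frame(score = c(1.5, 2.25),
                     row.names = c("a", "b"))

tmp <- tempfile(fileext = ".csv")   # temporary path outside the working directory

write.csv2(scores, tmp)             # fields separated by ";", decimals written with ","
raw_lines <- readLines(tmp)         # the raw text of the file
raw_lines

restored <- read.csv2(tmp, row.names = 1)   # read it back, first column as row names
restored

unlink(tmp)                         # remove the temporary file
```

Reading a file written by write.csv2() with read.csv() instead would not split the fields correctly, so the two function families are best kept in matching pairs.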
We can write our data frame to an MS Excel spreadsheet using the write.xlsx() function of the xlsx package.
write.xlsx(my_dataframe, "my_dataframe.xlsx")
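To load an .xlsx file we can use the read_excel() function from the readxl package. It returns a tibble, which has no row names, so the row names written by write.xlsx() come back as an ordinary first column.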
my_dataframe_from_excel <- read_excel("my_dataframe.xlsx")