grep is a utility for searching text for lines that match a regular expression. The name stands for "globally search for a regular expression and print matching lines". The grep utility is available from the command line on Unix-based operating systems, and has been ported to Windows as well. On the command line the utility is used as follows:
grep [regex pattern] [input file]
This command returns all lines from the input file that match the regular expression. In R, the functionality of grep is provided by the following functions:
- grep()
- grepl()
- sub()
- gsub()
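As a rough illustration of how the command-line form maps onto R, the sketch below writes a few lines to a temporary file and filters them with grep(). The file contents and the pattern are made up for this example, and value = TRUE (covered later) is used so that the matching lines themselves are returned.

```r
# Hypothetical input: three lines written to a temporary file
input_file <- tempfile(fileext = ".txt")
writeLines(c("error: disk full", "all good", "error: timeout"), input_file)

# Roughly what `grep "^error" input_file` does on the command line
file_lines <- readLines(input_file)
matching_lines <- grep("^error", file_lines, value = TRUE)
print(matching_lines)

unlink(input_file)  # remove the temporary file
```

The main difference from the shell utility is that R works on a character vector already in memory, so reading the file is a separate step.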
2. Basics of Regular Expressions in R
To use any grep implementation well, you need a working knowledge of regular expressions, or regex for short. A regex pattern is a sequence of characters that specifies a search pattern.
Tip: You can use sites like regexr.com to test and build regex patterns live.
To test regex patterns, we will use the base grep() function on a custom vector of strings. In the next chapter we will cover a broader selection of grep functions in R.
Let us run some examples using different regex patterns in grep(). First, let us try to match only digits in the vector below.
Note: grep() returns the index position of each match, not the match itself.
Grep Digits in R
items <- c("Hospital", "SCHOOL", "house", "%%%%", 100, 200, 300,
"Steet 1", "Buiding_2", " ", "\t", "\n")
grep("\\d", items) #matches any item that contains a digit
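To make the return value concrete, here is a small sketch using the same vector. It also shows one pitfall that is easy to hit, assuming you use the indices for negative subsetting: when nothing matches, grep() returns an empty integer vector. The pattern "xyz" is chosen only because it matches nothing here.

```r
items <- c("Hospital", "SCHOOL", "house", "%%%%", 100, 200, 300,
           "Steet 1", "Buiding_2", " ", "\t", "\n")

# c() coerces the numbers to character, so 100 becomes "100"
digit_idx <- grep("\\d", items)
print(digit_idx)  # positions 5 to 9

# A pattern with no match gives an empty integer vector ...
no_match <- grep("xyz", items)
print(length(no_match))  # 0

# ... so "drop all matches" by negative index drops everything instead
kept <- items[-no_match]
print(length(kept))  # 0, not 12
```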
With the command below, we access the items from the vector using the indices returned by grep().
items[grep("\\d", items)]
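items[grep("\\D", items)] #matches any item that contains a non-digit character
items[grep("[a-z]", items)] #matches any item with a lowercase letter
items[grep("[A-Z]", items)] #matches any item with an uppercase letter
items[grep("[A-Za-z]", items)] #matches any item with a letter; [A-z] would also match characters such as _ and ^
items[grep("[A-Za-z0-9]", items)] #matches any item with a letter or a digit
items[grep("\\w", items)] #matches any item with a word character
#alternatively [A-Za-z0-9_], meaning match any alphanumeric character or underscore.
items[grep("\\W", items)] #matches any item with a non-word character
items[grep(" ", items)] #matches any item with a literal space
items[grep("\\S", items)] #matches any item with a non-whitespace character
items[grep("[[:cntrl:]]", items)] #matches any item with a control character, such as tab or newline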
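A range such as [A-z] is wider than it looks, and it is worth checking for yourself. In ASCII order several punctuation characters, including the underscore and the caret, sit between "Z" and "a". The probe strings below are chosen only for illustration, and the R documentation notes that range interpretation can depend on the locale, so treat this as a sketch rather than a guarantee.

```r
probe <- c("_", "^", "a", "Z", "5")

# [A-z] spans every character from "A" to "z", punctuation included
loose <- grepl("[A-z]", probe)

# [A-Za-z] lists the two letter ranges explicitly
strict <- grepl("[A-Za-z]", probe)

print(data.frame(probe, loose, strict))
```

Writing both ranges explicitly, or using a class such as [[:alpha:]], avoids the surprise.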
3. Grep functions in R
In this chapter we will cover the main grep functions implemented in R: grep(), grepl(), sub(), and gsub().
grep()
Even though we covered the grep() function in the previous chapter, we will go over some more practical uses.
For example, you can set value = TRUE to return the values of the matches instead of their indexes.
grep("\\d", items, value = TRUE) #Get values directly.
Let's try to grep all column names in iris that start with "Sepal". To make it more interesting, we first add a couple of extra strings.
new_colnames <- c("sepal.Length", "Color_of_Sepal", colnames(iris))
new_colnames
grep("Sepal", new_colnames, value = TRUE)
From the results above you can see that the pattern is case sensitive and matches anywhere in the string, not only at the start.
grep("^Sepal", new_colnames, value = TRUE)
With the "^" symbol we match only at the beginning of the string, whereas the "$" symbol matches only at the end of the string. But what about case sensitivity? The character class [Ss] below handles that.
grep("^[Ss]epal", new_colnames, value = TRUE)
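The same ideas can be tried on a self-contained vector. The sketch below spells out the column names by hand (they mirror new_colnames above), anchors a match at the end with "$", and uses the ignore.case argument of grep() as an alternative to writing [Ss].

```r
new_colnames <- c("sepal.Length", "Color_of_Sepal", "Sepal.Length",
                  "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# "$" anchors the match at the end of the string
ends_with_width <- grep("Width$", new_colnames, value = TRUE)
print(ends_with_width)

# ignore.case = TRUE makes the whole pattern case-insensitive
starts_with_sepal <- grep("^sepal", new_colnames,
                          ignore.case = TRUE, value = TRUE)
print(starts_with_sepal)
```

The [Ss] form relaxes the case of one character only, while ignore.case = TRUE relaxes the whole pattern, so the choice depends on how strict the match needs to be.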
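grepl()
The grepl() function takes the same pattern and vector as grep(), but returns a logical vector with one TRUE or FALSE per element.
items <- c("Hospital", "SCHOOL", "house", "%%%%", 100, 200, 300,
"Steet 1", "Buiding_2", " ", "\\t", "\\n")
grepl("\\d", items) #TRUE for every item that contains a digit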
Note that we can extract the matches with the logical vector instead of indices.
items[grepl("\\d", items)]
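Because grepl() returns one logical value per element, its results can be counted, negated, and combined. The sketch below is one way to do that on the same vector; the variable names are introduced here only for illustration.

```r
items <- c("Hospital", "SCHOOL", "house", "%%%%", 100, 200, 300,
           "Steet 1", "Buiding_2", " ", "\t", "\n")

has_digit  <- grepl("\\d", items)
has_letter <- grepl("[A-Za-z]", items)

# TRUE counts as 1, so sum() gives the number of matching items
n_with_digit <- sum(has_digit)
print(n_with_digit)

# Combine conditions: items with a digit but no letter
digits_only <- items[has_digit & !has_letter]
print(digits_only)

# Negation stays safe when nothing matches: every item is kept
kept <- items[!grepl("xyz", items)]
print(length(kept))
```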
sub() and gsub()
colnames(iris)
Let us change the dots to underscores here. The sub() and gsub() functions take three main arguments: the pattern, the replacement, and the vector to search. So we try to replace "." with "_" using the statement below.
sub(".", "_", colnames(iris))
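Not the result we had in mind? This is because "." is a metacharacter that matches any single character, so sub() replaced the first character of each name. Metacharacters can be turned into literal characters by escaping them with a backslash, which must itself be doubled inside an R string ("\\.").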
sub("\\.", "_", colnames(iris))
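Escaping with a backslash is not the only way to match a literal dot. The sketch below compares the unescaped pattern with three alternatives on a short vector written out by hand: the escaped dot, a dot inside a character class, and the fixed = TRUE argument, which tells sub() to treat the pattern as plain text.

```r
x <- c("Sepal.Length", "Petal.Width")

unescaped <- sub(".", "_", x)                # "." matches the first character
escaped   <- sub("\\.", "_", x)              # backslash-escaped dot
bracketed <- sub("[.]", "_", x)              # dot inside a character class
literal   <- sub(".", "_", x, fixed = TRUE)  # pattern treated as plain text

print(unescaped)
print(escaped)
```

The last three calls give the same result here; fixed = TRUE is often the simplest choice when the pattern contains no regex syntax at all.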
For the next example, I have added ".1" to the end of iris column names.
iris_colnames.1 <- paste(colnames(iris), ".1", sep = "")
iris_colnames.1
Now let's try replacing dots with underscores again.
sub("\\.", "_", iris_colnames.1)
And right here you can see the difference between sub() and gsub(). The sub() function substitutes only the first match in each string, while gsub() (global substitution) substitutes all matches.
gsub("\\.", "_", iris_colnames.1)
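For a side-by-side view that does not depend on the iris data, the sketch below runs both functions on two hand-written names. A string with a single dot comes out the same either way, which is why the difference only shows up once there are two or more matches.

```r
x <- c("Sepal.Length.1", "Species.1")

first_only <- sub("\\.", "_", x)   # replaces the first dot in each string
all_dots   <- gsub("\\.", "_", x)  # replaces every dot in each string

print(first_only)
print(all_dots)
```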