How To Use Grep In R

1. Introduction

grep is a utility for searching text for lines that match a regular expression. The name stands for "globally search for a regular expression and print matching lines". The grep utility is available from the command line on Unix-based operating systems, and can be installed on Windows as well. On the command line the utility is used as follows:

grep [regex pattern] [input file]

This command prints all lines from the input file that match the regular expression. In R, the functionality of grep is provided by the following functions:

  1. grep()
  2. grepl()
  3. sub()
  4. gsub()
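
As a quick orientation before the details, the sketch below runs all four functions on a small made-up vector so that their return types can be compared side by side. The vector `fruit` and the result names are illustrative only and are not part of the examples that follow.

```r
# Illustrative vector; not part of the tutorial's data.
fruit <- c("apple pie", "banana", "cherry pie")

idx   <- grep("pie", fruit)     # integer indices of the matching elements
hits  <- grepl("pie", fruit)    # logical vector, one value per element
first <- sub("e", "E", fruit)   # replaces the first "e" in each element
all_e <- gsub("e", "E", fruit)  # replaces every "e" in each element

print(idx)    # 1 3
print(hits)   # TRUE FALSE TRUE
print(first)  # "applE pie" "banana" "chErry pie"
print(all_e)  # "applE piE" "banana" "chErry piE"
```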

2. Basics of Regular Expressions in R

To use any grep implementation effectively, you need a working knowledge of regular expressions, or regex for short. A regex pattern is a sequence of characters that specifies a search pattern.

Tip: You can use sites like regexr.com to test and build regex patterns live.

To test regex patterns, we will use the base grep() function on a custom vector of strings. In the next chapter we will cover a broader selection of grep functions in R.

Let us run some examples using different regex patterns in grep(). First let us try to match only digits in the vector below.

Note: grep() returns the index position of each match, not the match itself.

Grep Digits in R

In [1]:
items <- c("Hospital", "SCHOOL", "house", "%%%%", 100, 200, 300, 
           "Steet 1", "Buiding_2", "   ", "\t", "\n") 
grep("\\d", items) # matches any item that contains a digit
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9

With the command below, we access the items of the vector using the indices returned by grep().

In [2]:
items[grep("\\d", items)]
  1. '100'
  2. '200'
  3. '300'
  4. 'Steet 1'
  5. 'Buiding_2'
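
Indexing with the result of grep() works as long as there is at least one match. One pitfall worth knowing is negative indexing when nothing matches: grep() then returns an empty integer vector, and dropping "nothing" selects nothing. The sketch below illustrates this on a made-up vector `x` and shows the `invert = TRUE` argument of grep() as one way to select the non-matching elements safely.

```r
x <- c("a1", "b", "c3")   # illustrative data

no_match <- grep("z", x)  # no element contains "z", so this is integer(0)
dropped  <- x[-no_match]  # indexing with an empty vector selects nothing
kept     <- grep("z", x, invert = TRUE, value = TRUE)  # elements that do not match

print(dropped)  # character(0)
print(kept)     # "a1" "b" "c3"
```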

Grep Non-digits in R

In [3]:
items[grep("\\D", items)] # matches any item that contains a non-digit character
  1. 'Hospital'
  2. 'SCHOOL'
  3. 'house'
  4. '%%%%'
  5. 'Steet 1'
  6. 'Buiding_2'
  7. '   '
  8. '\t'
  9. '\n'
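
Note that "\\D" matches an item as soon as it contains one non-digit character, which is why 'Steet 1' and 'Buiding_2' show up in both the digit and the non-digit results. To select items that consist only of digits (or only of non-digits), the pattern can be anchored to the whole string with "^" and "$" plus the "+" quantifier. A minimal sketch on a made-up vector `x`:

```r
x <- c("100", "Steet 1", "house")   # illustrative data

contains_non_digit <- grep("\\D", x, value = TRUE)     # at least one non-digit
only_digits        <- grep("^\\d+$", x, value = TRUE)  # digits from start to end
no_digits          <- grep("^\\D+$", x, value = TRUE)  # no digit anywhere

print(contains_non_digit)  # "Steet 1" "house"
print(only_digits)         # "100"
print(no_digits)           # "house"
```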

Grep words with Lower-case Letters in R

In [4]:
items[grep("[a-z]", items)] #matches any item with a lowercase letter
  1. 'Hospital'
  2. 'house'
  3. 'Steet 1'
  4. 'Buiding_2'

Grep Words with Upper-case Letters in R

In [5]:
items[grep("[A-Z]", items)] #matches any item with an uppercase letter
  1. 'Hospital'
  2. 'SCHOOL'
  3. 'Steet 1'
  4. 'Buiding_2'

Grep any alphabetic Character in R

In [6]:
items[grep("[A-Za-z]", items)]
# use [A-Za-z]: the range [A-z] also spans the ASCII punctuation between Z and a, such as _
  1. 'Hospital'
  2. 'SCHOOL'
  3. 'house'
  4. 'Steet 1'
  5. 'Buiding_2'
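
A range written as [A-z] is a tempting shortcut, but in ASCII order it also covers the characters that sit between "Z" and "a" (such as "_" and "^"), and the interpretation of ranges can depend on the locale. The sketch below compares it with the explicit [A-Za-z] and the POSIX class [[:alpha:]] on a made-up vector `symbols`; the result of the [A-z] line is deliberately not asserted because it may vary between systems.

```r
symbols <- c("abc", "_", "^", "42")   # illustrative data

alpha_only <- grepl("[[:alpha:]]", symbols)  # POSIX class for letters
explicit   <- grepl("[A-Za-z]", symbols)     # explicit ASCII letter ranges
wide_range <- grepl("[A-z]", symbols)        # may also flag "_" and "^"

print(alpha_only)  # TRUE FALSE FALSE FALSE
print(explicit)    # TRUE FALSE FALSE FALSE
print(wide_range)
```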

Grep any alphanumeric Character in R

In [7]:
items[grep("[A-Za-z0-9]", items)]
  1. 'Hospital'
  2. 'SCHOOL'
  3. 'house'
  4. '100'
  5. '200'
  6. '300'
  7. 'Steet 1'
  8. 'Buiding_2'

Grep Word characters in R

In [8]:
items[grep("\\w", items) ]
# roughly equivalent to [A-Za-z0-9_]: any alphanumeric character or underscore
  1. 'Hospital'
  2. 'SCHOOL'
  3. 'house'
  4. '100'
  5. '200'
  6. '300'
  7. 'Steet 1'
  8. 'Buiding_2'

Grep Non-word characters in R

In [9]:
items[grep("\\W", items) ]
  1. '%%%%'
  2. 'Steet 1'
  3. '   '
  4. '\t'
  5. '\n'

Note that "Steet 1" appears in both the word and the non-word results: its letters and the digit match as word characters, while the space matches as a non-word character.

Grep a Space in R

In [10]:
items[grep(" ", items) ]
  1. 'Steet 1'
  2. '   '
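
The pattern " " matches only a literal space, so the tab and newline items are not returned. To match other whitespace as well, a character class such as "[ \t]" (space or tab) or the shorthand "\\s" (any whitespace) can be used. A minimal sketch on a made-up vector `ws`:

```r
ws <- c("a b", "   ", "\t", "\n", "ab")   # illustrative data

space_only   <- grep(" ", ws, value = TRUE)      # literal space only
space_or_tab <- grep("[ \t]", ws, value = TRUE)  # space or tab
any_space    <- grep("\\s", ws, value = TRUE)    # any whitespace, including newline

print(length(space_only))    # 2
print(length(space_or_tab))  # 3
print(length(any_space))     # 4
```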

Grep Non-space characters in R

In [11]:
items[grep("\\S", items) ]
  1. 'Hospital'
  2. 'SCHOOL'
  3. 'house'
  4. '%%%%'
  5. '100'
  6. '200'
  7. '300'
  8. 'Steet 1'
  9. 'Buiding_2'

Control characters

Control characters perform special formatting operations on text, such as starting a new line or inserting a tab.

In [12]:
items[grep("[[:cntrl:]]", items)]
  1. '\t'
  2. '\n'

3. Grep functions in R

In this chapter we will cover the main grep functions implemented in R: grep(), grepl(), sub(), gsub().

grep()

Even though we covered the grep() function in the previous chapter, we will go over some more practical uses. For example, you can set value = TRUE to return the matching values instead of their indices.

In [13]:
grep("\\d", items, value = TRUE) #Get values directly.
  1. '100'
  2. '200'
  3. '300'
  4. 'Steet 1'
  5. 'Buiding_2'

Let's try to grep all column names in iris that start with "sepal". To make it more challenging, we will add some extra strings.

In [14]:
new_colnames <- c("sepal.Length", "Color_of_Sepal", colnames(iris))
new_colnames
  1. 'sepal.Length'
  2. 'Color_of_Sepal'
  3. 'Sepal.Length'
  4. 'Sepal.Width'
  5. 'Petal.Length'
  6. 'Petal.Width'
  7. 'Species'

In [15]:
grep("Sepal", new_colnames, value = TRUE)
  1. 'Color_of_Sepal'
  2. 'Sepal.Length'
  3. 'Sepal.Width'

The results above show that the pattern is case-sensitive ("sepal.Length" is skipped) but can match anywhere in the string ("Color_of_Sepal" is included).

In [16]:
grep("^Sepal", new_colnames, value = TRUE)
  1. 'Sepal.Length'
  2. 'Sepal.Width'

The "^" symbol anchors the match to the beginning of the string, whereas the "$" symbol anchors it to the end. But what about case sensitivity? A character class such as [Ss] accepts either case for that letter, as shown below.

In [17]:
grep("^[Ss]epal", new_colnames, value = TRUE)
  1. 'sepal.Length'
  2. 'Sepal.Length'
  3. 'Sepal.Width'
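
Instead of listing both cases by hand, grep() also accepts the argument `ignore.case = TRUE`, which makes the whole pattern case-insensitive. The sketch below uses it together with both anchors; the vector `cols` is a shortened, illustrative copy of the names above.

```r
cols <- c("sepal.Length", "Color_of_Sepal", "Sepal.Length",
          "Sepal.Width", "Petal.Length")   # illustrative subset of the names

starts <- grep("^sepal", cols, ignore.case = TRUE, value = TRUE)  # begins with "sepal"
ends   <- grep("sepal$", cols, ignore.case = TRUE, value = TRUE)  # ends with "sepal"

print(starts)  # "sepal.Length" "Sepal.Length" "Sepal.Width"
print(ends)    # "Color_of_Sepal"
```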

grepl()

The grepl() function works like the grep() function described previously, except that instead of the indices of matches it returns a logical vector: TRUE for each element that matches and FALSE otherwise. We can re-run a previous grep() example with grepl() instead.

In [18]:
items <- c("Hospital", "SCHOOL", "house", "%%%%", 100, 200, 300, 
           "Steet 1", "Buiding_2", "   ", "\t", "\n") 
grepl("\\d", items) # TRUE for every item that contains a digit
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. TRUE
  6. TRUE
  7. TRUE
  8. TRUE
  9. TRUE
  10. FALSE
  11. FALSE
  12. FALSE

Note that we can extract the matches with this logical vector instead of indices.

In [19]:
items[grepl("\\d", items)]
  1. '100'
  2. '200'
  3. '300'
  4. 'Steet 1'
  5. 'Buiding_2'
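
Because grepl() always returns one logical value per element, its results can be combined with `&`, `|` and `!`, or summed to count matches. A minimal sketch on a made-up vector `x` (a shortened version of the items above):

```r
x <- c("100", "Steet 1", "house", "Buiding_2")   # illustrative data

has_digit  <- grepl("\\d", x)
has_letter <- grepl("[[:alpha:]]", x)

both    <- x[has_digit & has_letter]  # items with a digit and a letter
none    <- x[!has_digit]              # items without any digit
n_digit <- sum(has_digit)             # TRUE counts as 1, so this counts matches

print(both)     # "Steet 1" "Buiding_2"
print(none)     # "house"
print(n_digit)  # 3
```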

sub() and gsub()

Sometimes we want to edit the strings in a vector, for example to change underscores to dots or vice versa. The sub() and gsub() functions are powerful tools in such cases; "sub" is short for substitution.

In [20]:
colnames(iris)
  1. 'Sepal.Length'
  2. 'Sepal.Width'
  3. 'Petal.Length'
  4. 'Petal.Width'
  5. 'Species'

Let us change the dots to underscores here. The sub() and gsub() functions take three main arguments: the pattern, the replacement, and the character vector to modify. A first attempt is to replace "." with "_" using the statement below.

In [21]:
sub(".", "_", colnames(iris))
  1. '_epal.Length'
  2. '_epal.Width'
  3. '_etal.Length'
  4. '_etal.Width'
  5. '_pecies'

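Not the result we had in mind? This is because "." is a metacharacter that matches any single character, so sub() replaced the first character of each name. A metacharacter can be turned into a literal character by escaping it with a backslash; inside an R string the backslash itself must be doubled, which gives "\\.".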

In [22]:
sub("\\.", "_", colnames(iris))
  1. 'Sepal_Length'
  2. 'Sepal_Width'
  3. 'Petal_Length'
  4. 'Petal_Width'
  5. 'Species'
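
Escaping is not the only way to match a literal dot. Two common alternatives are putting the dot inside a character class, "[.]", or passing `fixed = TRUE` so that the pattern is treated as plain text rather than as a regex. The sketch below compares the three on a shortened, illustrative vector `cols`.

```r
cols <- c("Sepal.Length", "Species")   # illustrative subset of colnames(iris)

escaped <- sub("\\.", "_", cols)              # backslash-escaped dot
bracket <- sub("[.]", "_", cols)              # dot inside a character class
literal <- sub(".", "_", cols, fixed = TRUE)  # pattern taken literally, no regex

print(escaped)  # "Sepal_Length" "Species"
print(identical(escaped, bracket) && identical(escaped, literal))  # TRUE
```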

For the next example, I have added ".1" to the end of iris column names.

In [23]:
iris_colnames.1 <- paste(colnames(iris), ".1", sep = "")
iris_colnames.1
  1. 'Sepal.Length.1'
  2. 'Sepal.Width.1'
  3. 'Petal.Length.1'
  4. 'Petal.Width.1'
  5. 'Species.1'

Now let's try replacing the dots with underscores again.

In [24]:
sub("\\.", "_", iris_colnames.1)
  1. 'Sepal_Length.1'
  2. 'Sepal_Width.1'
  3. 'Petal_Length.1'
  4. 'Petal_Width.1'
  5. 'Species_1'

And right here you can see the difference between sub() and gsub(): sub() substitutes only the first match in each string, while gsub() (global substitution) substitutes all matches.

In [25]:
gsub("\\.", "_", iris_colnames.1)
  1. 'Sepal_Length_1'
  2. 'Sepal_Width_1'
  3. 'Petal_Length_1'
  4. 'Petal_Width_1'
  5. 'Species_1'
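
In practice, the result of gsub() is usually assigned back to the names of a data frame. The sketch below does this on a small copy of iris so that the built-in data set itself is left untouched; the names `df` and `new_names` are illustrative.

```r
df <- head(iris)                          # small copy; iris itself is not modified
names(df) <- gsub("\\.", "_", names(df))  # replace every dot in every column name
new_names <- names(df)

print(new_names)
# "Sepal_Length" "Sepal_Width" "Petal_Length" "Petal_Width" "Species"
```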