Introduction to Pattern Search and Replace
Regular expressions
A regular expression is an effective tool for finding and replacing text.
Regular expression functions in R:
grep, grepl, regexpr, gregexpr, sub, gsub
- grep, grepl, regexpr and gregexpr search for matches to the argument
pattern within each element of a character vector.
- sub replaces only the first match in each element; gsub replaces all matches.
Rupak Roy
Regular expressions: grep(pattern, x)
grep(pattern, x)
- Searches for the pattern in each element of a character vector x
- Returns the indices of the elements that contain a match.
> grep("[au]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
[1] 1 2
A character class such as [au] is a list of characters enclosed between [ and ];
it matches any single character in that list. Here it looks for an a or a u.
The result gives the positions of the matching elements:
1, 2 = Harry Potter, Game of Thrones
> grep("[Harry potter]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
[1] 1 2 3
#every title contains at least one of the characters H, a, r, y, space, p, o, t, e
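The index output of grep can be turned into the matching strings themselves. The following is a minimal sketch, assuming the three titles are stored in a vector called titles (a name introduced here only for illustration); value = TRUE is a standard argument of grep.

```r
# Hypothetical vector holding the same three titles used on this slide
titles <- c("Harry Potter", "Game of Thrones", "Lord of Rings")

# Indices of the elements that contain an "a" or a "u"
idx <- grep("[au]", titles)

# value = TRUE returns the matching strings instead of their indices
hits <- grep("[au]", titles, value = TRUE)

print(idx)
print(hits)
```

Indexing with titles[idx] gives the same strings as value = TRUE; the argument just saves a step.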
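Regular expressions: grep(pattern, x)
> grep("[^Harry Potter]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
#^ as the first character inside [ ] negates the class:
#it matches any character NOT in the list
[1] 2 3
#"Harry Potter" contains only listed characters, so it does not match.
#Matching is case-sensitive: with a lower-case p in the class, the P would
#count as unlisted and element 1 would match as well.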
> grep("[letters]", c("Harry Potter", "1234", "Lord of Rings"))
[1] 1 3
#[letters] is the class of the characters l, e, t, r, s, not "any letter"
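#POSIX classes such as [:lower:] go inside a bracket expression: [[:lower:]]
> grep("[[:lower:]]", c("harry potter", "1234", "LORD of RINGS"))
[1] 1 3
#element 3 matches because of the lower-case "of"
> grep("[[:punct:]]", c("harry;; potter$", "abc123", "Lordof"))
[1] 1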
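POSIX character classes only work when written inside a bracket expression, that is with double brackets. A short sketch, assuming a hypothetical vector x that mixes punctuation, digits and an upper-case letter:

```r
# Hypothetical test strings: one with punctuation, one with digits,
# one with an upper-case letter
x <- c("harry;; potter$", "abc123", "Lordof")

# [[:punct:]] matches any punctuation character
punct_idx <- grep("[[:punct:]]", x)

# [[:digit:]] matches any digit
digit_idx <- grep("[[:digit:]]", x)

# [[:upper:]] matches any upper-case letter
upper_idx <- grep("[[:upper:]]", x)

print(c(punct = punct_idx, digit = digit_idx, upper = upper_idx))
```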
Regular expressions: grep(pattern, x)
#a period represents any single character
> grep("t.e", c("Harry Potter", "Game of Thrones", "Lord of the rings"))
[1] 1 3 #t_e matches "tte" in Potter and "the"
> grep("L..d", c("Harry Potter", "Game of Thrones", "Lord of the rings"))
[1] 3
> name <- c("a.txt", "pqr", "p.txt")
> grep(".txt", name) #here . acts as a metacharacter: it means any character
> grep(".", c("abc", "de", "f.e"))
[1] 1 2 3 #because . means any character
> grep("\\.", c("abc", "de", "f.g"))
[1] 3
#To match a literal period, escape it with a backslash. Inside an R string
#the backslash itself must be escaped with a second backslash, hence "\\.".
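The difference between the metacharacter and a literal period shows up as soon as a string contains "txt" without a dot. The sketch below assumes a hypothetical vector files; the element "atxt" is added only to expose the difference. fixed = TRUE is a standard grep argument that switches off regular expression syntax altogether.

```r
# Hypothetical file names; "atxt" has no period at all
files <- c("a.txt", "pqr", "p.txt", "atxt")

# Unescaped: "." matches any character, so "atxt" matches too
any_char <- grep(".txt", files)

# Escaped: "\\." matches only a literal period; "$" anchors at the end
escaped <- grep("\\.txt$", files)

# fixed = TRUE treats the whole pattern as plain text
literal <- grep(".txt", files, fixed = TRUE)

print(any_char)
print(escaped)
print(literal)
```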
Regular expressions: grepl(pattern, x)
grepl(pattern, x)
- Similar to grep, but returns a logical value for each element
> grepl("[au]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
[1]  TRUE  TRUE FALSE
> grepl("[b]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
[1] FALSE FALSE FALSE
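Because grepl returns one logical value per element, its result can be used directly for subsetting and counting. A minimal sketch, assuming the titles are stored in a hypothetical vector titles:

```r
# Hypothetical vector holding the same three titles used on this slide
titles <- c("Harry Potter", "Game of Thrones", "Lord of Rings")

# One TRUE/FALSE per element
has_au <- grepl("[au]", titles)

# Logical subsetting: keep the matches, or negate to keep the rest
matched   <- titles[has_au]
unmatched <- titles[!has_au]

# TRUE counts as 1, so sum() gives the number of matching elements
n_matched <- sum(has_au)

print(matched)
print(unmatched)
print(n_matched)
```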
Regular expressions: regexpr(pattern, x)
regexpr(pattern, x)
- Finds the character position of the first match of pattern within each
element of x; -1 means no match.
(R also prints a match.length attribute, omitted here.)
> regexpr("#", c("Harry#Potter", "#Game of thrones", "Lord of the rings"))
[1]  6  1 -1
> regexpr("(Harry+)", c("Harry Potter Harry", "Game of thrones"))
[1]  1 -1 #only the 1st instance of Harry
#position of the first "." in each string (the period must be escaped)
> regexpr("\\.", c("abc", "de", "f.g"))
[1] -1 -1  2
#position of the first punctuation character
> regexpr("[[:punct:]]", c("harry;;Potter$", ">=<", "1234", "lof"))
[1]  6  1 -1 -1
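The object returned by regexpr carries more than the start positions. The sketch below, which assumes a hypothetical vector x, reads the positions and the match.length attribute and then uses regmatches (base R) to pull out the matched text.

```r
# Hypothetical input strings
x <- c("Harry#Potter", "#Game of thrones", "Lord of the rings")

# First match of "#" in each element
m <- regexpr("#", x)

# Start positions as a plain integer vector; -1 means no match
starts <- as.integer(m)

# Length of each match, stored as an attribute; -1 again means no match
lens <- attr(m, "match.length")

# regmatches() returns the matched text for the elements that matched
found <- regmatches(x, m)

print(starts)
print(lens)
print(found)
```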
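Regular expressions: gregexpr(pattern, x)
- Finds the character positions of all matches of pattern within each element;
returns a list with one entry per element of x (-1 means no match)
> gregexpr("#", c("#Harry#Potter", "GameofThrones", "Lordofthe#Rings"))
#positions: [[1]] 1 7, [[2]] -1, [[3]] 10
> gregexpr("Harry+", c("Harry Potter Harry ", "GameofThrones", "Lordofthe#Rings"))
#positions: [[1]] 1 14, [[2]] -1, [[3]] -1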
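Since gregexpr returns a list, its result is usually processed per element. A small sketch, assuming a hypothetical vector x, that collects the start positions, extracts every match with regmatches and counts the matches per string:

```r
# Hypothetical input strings
x <- c("Harry Potter Harry", "Game of Thrones")

# All matches of "Harry" in each element (a list, one entry per element)
m <- gregexpr("Harry", x)

# Start positions per element, stripped of attributes
starts <- lapply(m, as.integer)

# Matched substrings per element; character(0) where nothing matched
all_hits <- regmatches(x, m)

# Number of matches in each element
counts <- lengths(all_hits)

print(starts)
print(counts)
```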
Regular expressions: sub
sub replaces a given pattern with another string, but it only
replaces the first match in each string element.
> sub(pattern, replacement, x)
> sub("(th+)", "e", c("the mountain the", " the hill hill",
  "the city without pollution is the peaceful city", "the the"), perl = TRUE)
#the first th in each element is replaced by e
[1] "ee mountain the" " ee hill hill" "ee city without pollution is the
peaceful city" "ee the"
> sub("(th+)", "\\1e", c("the mountain the", " the hill hill",
  "the city without pollution is the peaceful city", "the the"), perl = TRUE)
#\\1 inserts the text matched by the group (th+), so th becomes the
[1] "thee mountain the" " thee hill hill" "thee city without pollution is the
peaceful city" "thee the" #only the first instance
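Backreferences such as "\\1" reuse whatever the parenthesised groups matched. The following sketch, with a hypothetical vector x, shows one group being extended and two groups being swapped; in both calls sub touches only the first match.

```r
# Hypothetical input strings
x <- c("the mountain the", "the the")

# "\\1" stands for the text matched by the first group, here "th"
first_only <- sub("(th)", "\\1e", x)

# Two groups can be reordered in the replacement
swapped <- sub("([A-Za-z]+) ([A-Za-z]+)", "\\2 \\1", "Harry Potter")

print(first_only)
print(swapped)
```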
Regular expressions: gsub
gsub also replaces a given pattern with another string; unlike sub,
it replaces all the matches in each string element.
> gsub("(Th+)", "e", c("The mountain The", " The hill hill",
  "The city without pollution is the peaceful city", "the the"), perl = TRUE)
[1] "ee mountain ee" " ee hill hill" "ee city without pollution is the
peaceful city" "the the"
#matching is case-sensitive: the lower-case "the" is left unchanged
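Matching in sub and gsub is case-sensitive unless ignore.case = TRUE is set. A brief sketch, assuming a hypothetical vector x, that contrasts the two functions with and without that argument:

```r
# Hypothetical input strings
x <- c("The mountain The", "the the")

# Case-sensitive: only the capitalised "The" is replaced
case_sensitive <- gsub("The", "A", x)

# ignore.case = TRUE matches "The" and "the" alike; gsub replaces all
all_matches <- gsub("the", "a", x, ignore.case = TRUE)

# sub with the same arguments replaces only the first match per element
first_match <- sub("the", "a", x, ignore.case = TRUE)

print(case_sensitive)
print(all_matches)
print(first_match)
```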
Regular expressions: EXAMPLE
> reviews <- read.csv("…", stringsAsFactors = FALSE)
> reviews <- data.frame(reviews = reviews$review_title, stringsAsFactors = FALSE)
> names(reviews)
> dim(reviews)
#checking which review titles contain "star"
#trying to understand the rating
> p <- reviews[grep(" *star", reviews$reviews), "reviews"]
#replace "star" with the word "rating"
> sub("(star)", "rating", p, perl = TRUE)
#position of star
> regexpr("star", p)
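The same workflow can be tried without the CSV file. The sketch below replaces the file with a small made-up data frame; the four review titles are invented purely for illustration and are not taken from the original data.

```r
# Made-up review titles standing in for the CSV file (illustrative only)
reviews <- data.frame(
  reviews = c("5 star product", "Not worth it", "Only 2 stars", "star quality"),
  stringsAsFactors = FALSE
)

# Keep only the titles that mention "star"
p <- reviews[grep("star", reviews$reviews), "reviews"]

# Replace the first "star" in each title with "rating"
renamed <- sub("star", "rating", p)

# Character position of the first "star" in each title
pos <- as.integer(regexpr("star", p))

print(p)
print(renamed)
print(pos)
```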
