Text Mining using Regular Expressions

Introduction to Pattern Search and Replace

Regular expressions
A regular expression is an effective tool for finding and replacing text.
Regular expressions in R:
grep, grepl, regexpr, gregexpr, sub, gsub
- grep, grepl, regexpr and gregexpr search for matches to the argument
pattern within each element of a character vector.
- sub replaces the first match in each element; gsub replaces all matches.
Rupak Roy

Regular expressions: grep(pattern, x)
- Searches for the specified pattern in each element of a character vector x
- Returns the positions (indices) of the elements that contain a match.
> grep("[au]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
A character class such as [au] is a list of characters enclosed between [ and ],
which matches any single character in that list. Here it looks for a or u.
[1] 1 2
These are the positions of the matching elements in the vector:
1, 2 = "Harry Potter", "Game of Thrones"
> grep("[Harry potter]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
# [Harry potter] is also a character class: it matches any one of
# H, a, r, y, space, p, o, t, e, so every element matches
[1] 1 2 3
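As a small sketch building on the grep() calls above: grep() can also return the matching strings themselves with value = TRUE, or ignore case with ignore.case = TRUE. The variable names (titles, idx, hits, lord) are made up for illustration.

```r
# Illustrative vector, same titles as in the slide example
titles <- c("Harry Potter", "Game of Thrones", "Lord of Rings")

# Default: indices of the elements that contain "a" or "u"
idx <- grep("[au]", titles)

# value = TRUE returns the matching elements instead of their indices
hits <- grep("[au]", titles, value = TRUE)

# ignore.case = TRUE makes the match case-insensitive
lord <- grep("lord", titles, ignore.case = TRUE)

print(idx)
print(hits)
print(lord)
```

Indices are handy for subsetting a data frame by row, while value = TRUE is convenient for a quick look at what matched.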

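> grep("[^Harry Potter]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
# ^ as the first character inside [ ] matches any character NOT in the list,
# basically a NOT condition. Character classes are case-sensitive, so the
# list is written with a capital P to cover every character of "Harry Potter".
[1] 2 3
> grep("[letters]", c("Harry Potter", "1234", "Lord of Rings"))
# [letters] matches any one of l, e, t, r, s; it is not a class for all letters
[1] 1 3
> grep("[[:lower:]]", c("harry potter", "1234", "LORD of RINGS"))
# POSIX classes such as [:lower:] must be written inside a bracket expression;
# element 3 matches because "of" is lowercase
[1] 1 3
> grep("[[:punct:]]", c("harry;; potter$", "abc123", "Lordof"))
# without the outer brackets, "[:punct:]" is read as the plain list : p u n c t
[1] 1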
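The POSIX classes used above can be sketched on one small vector. This is only an illustration; the variable names are made up, and the last pattern shows one way of combining classes inside a single negated bracket expression.

```r
# Illustrative vector, same strings as in the punctuation example
x <- c("harry;; potter$", "abc123", "Lordof")

# Elements containing at least one punctuation character
punct_idx <- grep("[[:punct:]]", x)

# Elements containing at least one digit
digit_idx <- grep("[[:digit:]]", x)

# Elements containing at least one uppercase letter
upper_idx <- grep("[[:upper:]]", x)

# Classes can be combined and negated inside one bracket expression:
# anything that is neither a letter nor whitespace
other_idx <- grep("[^[:alpha:][:space:]]", x)

print(list(punct = punct_idx, digit = digit_idx,
           upper = upper_idx, other = other_idx))
```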

# a period represents any single character
> grep("t.e", c("Harry Potter", "Game of Thrones", "Lord of the rings"))
[1] 1 3    # t_e matches in "Potter" and "the"
> grep("L..d", c("Harry Potter", "Game of Thrones", "Lord of the rings"))
[1] 3
> name <- c("a.txt", "pqr", "p.txt")
> grep(".txt", name)    # here . acts as a metacharacter: it means any character
[1] 1 3
> grep(".", c("abc", "de", "f.e"))
[1] 1 2 3    # because . means any character
> grep("\\.", c("abc", "de", "f.g"))
[1] 3
# To match a literal period, escape it with a backslash. Inside an R string the
# backslash itself must also be escaped, which is accomplished with a second backslash.
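To make the difference between the metacharacter and the literal period concrete, here is a minimal sketch. The extra element "atxt" and the variable names are assumptions added for illustration; fixed = TRUE is an alternative to escaping that treats the whole pattern as plain text.

```r
# "atxt" is added to show where the unescaped period over-matches
name <- c("a.txt", "pqr", "p.txt", "atxt")

# Unescaped: "." matches any character, so "atxt" also matches
any_char <- grep(".txt", name)

# Escaped: only a literal period followed by "txt"
literal <- grep("\\.txt", name)

# fixed = TRUE disables regular expression syntax altogether
fixed_match <- grep(".txt", name, fixed = TRUE)

# Anchoring with $ requires ".txt" to be at the end of the string
ends_with <- grep("\\.txt$", name)

print(any_char)
print(literal)
print(fixed_match)
print(ends_with)
```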

Regular expressions: grepl(pattern, x)
- Similar to grep; however, it returns a logical value for each element
> grepl("[au]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
[1]  TRUE  TRUE FALSE
> grepl("[b]", c("Harry Potter", "Game of Thrones", "Lord of Rings"))
[1] FALSE FALSE FALSE
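Because grepl() returns one logical value per element, its result can be used directly for subsetting and counting. A minimal sketch, with illustrative variable names:

```r
titles <- c("Harry Potter", "Game of Thrones", "Lord of Rings")

# One TRUE/FALSE per element
has_au <- grepl("[au]", titles)

# Keep only the matching titles
matching <- titles[has_au]

# Keep only the non-matching titles
non_matching <- titles[!has_au]

# TRUE counts as 1, so sum() gives the number of matches
n_match <- sum(has_au)

print(matching)
print(non_matching)
print(n_match)
```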

Regular expressions: regexpr(pattern, x)
- Finds the character position of the first instance of pattern within each
string; -1 means no match. (The match.length attributes that R also prints
are omitted below.)
> regexpr("#", c("Harry#Potter", "#Game of thrones", "Lord of the rings"))
[1]  6  1 -1
> regexpr("(Harry+)", c("Harry Potter Harry", "Game of thrones"))
[1]  1 -1    # only the 1st instance of Harry is reported
# position of the first instance of "." in the strings
> regexpr("\\.", c("abc", "de", "f.g"))
[1] -1 -1  2
# position of the first instance of punctuation
> regexpr("[[:punct:]]", c("harry;;Potter$", ">=<", "1234", "lof"))
[1]  6  1 -1 -1
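The result of regexpr() carries the match length as an attribute, and base R's regmatches() can use it to pull out the matched text. The following is a sketch with illustrative variable names; the second pattern, "[[:alpha:]]+", is an added example that extracts the first run of letters.

```r
x <- c("Harry#Potter", "#Game of thrones", "Lord of the rings")

# Position of the first "#" in each string (-1 = no match)
m <- regexpr("#", x)
first_pos <- as.integer(m)

# Length of each match, stored as an attribute of the result
match_len <- as.integer(attr(m, "match.length"))

# First run of letters in each string, extracted with regmatches()
w <- regexpr("[[:alpha:]]+", x)
first_word <- regmatches(x, w)

print(first_pos)
print(match_len)
print(first_word)
```

regmatches() returns only the elements that matched, so its result can be shorter than the input vector.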

Regular expressions: gregexpr(pattern, x)
- Finds the character positions of all instances of pattern within each string
and returns them as a list with one entry per string (attributes omitted below).
> gregexpr("#", c("#Hary#Potter", "GameofThones", "Lordofthe#Rings"))
[[1]] 1 6    [[2]] -1    [[3]] 10
> gregexpr("Harry+", c("Harry Potter Harry ", "GameofThones", "Lordofthe#Rings"))
[[1]] 1 14    [[2]] -1    [[3]] -1
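Since gregexpr() returns a list, a common follow-up is to count how many matches each string contains. One way to sketch this, using regmatches() and lengths() from base R (variable names are illustrative):

```r
x <- c("#Hary#Potter", "GameofThones", "Lordofthe#Rings")

# All positions of "#" in each string, as a list
m <- gregexpr("#", x)

# Strip the attributes to see just the positions
positions <- lapply(m, as.integer)

# regmatches() gives the matched text per string (empty if no match),
# so lengths() yields the number of matches per string
counts <- lengths(regmatches(x, m))

print(positions)
print(counts)
```

Counting via regmatches() avoids having to treat the -1 "no match" marker as a special case.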

Regular expressions: sub
sub replaces the text matched by a pattern with another string, but it only
replaces the first match in each string element.
Usage: sub(pattern, replacement, x)
> sub("(th+)", "e", c("the mountain the", " the hill hill",
      "the city without pollution is the peaceful is the peaceful city",
      "the the"), perl = TRUE)
The first "th" in each element is replaced by "e"; for example,
"the mountain the" becomes "ee mountain the".
> sub("(th+)", "\\1e", c("the mountain the", " the hill hill",
      "the city without pollution is the peaceful is the peaceful city",
      "the the"), perl = TRUE)
# \\1 refers to the text captured by the parentheses, so "th" becomes "the"
[1] "thee mountain the" " thee hill hill"
[3] "thee city without pollution is the peaceful is the peaceful city"
[4] "thee the"    # only the first instance in each element is changed
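The backreference used in the replacement string can be sketched on a shorter input. The second call is an added illustration (not from the slides) that swaps two captured words; all variable names are made up.

```r
x <- c("the mountain the", "the the")

# \\1 re-inserts the captured "th", followed by an extra "e";
# only the first match in each element is replaced
first_only <- sub("(th+)", "\\1e", x, perl = TRUE)

# Two capture groups can be reordered in the replacement
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world", perl = TRUE)

print(first_only)
print(swapped)
```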

Regular expressions: gsub
gsub also replaces the text matched by a pattern with another string; however,
unlike sub, all the matches in each string element are replaced.
> gsub("(Th+)", "e", c("The mountain The", " The hill hill",
       "The city without pollution is the peaceful city", "the the"),
       perl = TRUE)
[1] "ee mountain ee" " ee hill hill"
[3] "ee city without pollution is the peaceful city" "the the"
# the pattern is case-sensitive, so lowercase "the" is left unchanged
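Because the pattern above is case-sensitive, lowercase matches are skipped. As a sketch, ignore.case = TRUE replaces both forms; the whitespace clean-up at the end is an added example of a typical gsub() use in text mining. Variable names are illustrative.

```r
x <- c("The mountain The", "the the")

# Case-sensitive: only "Th" is replaced, every occurrence of it
case_sensitive <- gsub("Th", "e", x)

# Case-insensitive: "Th" and "th" are both replaced
case_insensitive <- gsub("th", "e", x, ignore.case = TRUE)

# Collapse runs of whitespace into a single space
squeezed <- gsub("\\s+", " ", "too    many   spaces")

print(case_sensitive)
print(case_insensitive)
print(squeezed)
```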

Regular expressions: EXAMPLE
> reviews <- read.csv("…", stringsAsFactors = FALSE)
> reviews <- data.frame(reviews = reviews$review_title, stringsAsFactors = FALSE)
> names(reviews)
> dim(reviews)
# check which review titles contain "star",
# to get a sense of the ratings
> p <- reviews[grep(" *star", reviews$reviews), "reviews"]
# replace the first "star" in each title with the word "rating"
> sub("(star)", "rating", p, perl = TRUE)
# position of "star" in each title
> regexpr("star", p)
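The CSV file in the example is not specified, so the following sketch runs the same steps on a small hand-made data frame. The review titles are hypothetical stand-ins, not data from the original file.

```r
# Hypothetical review titles standing in for the unspecified CSV file
reviews <- data.frame(
  reviews = c("5 star product", "Not worth it",
              "One star only", "Great, 4 stars"),
  stringsAsFactors = FALSE
)

# Titles that mention "star"
p <- reviews[grep(" *star", reviews$reviews), "reviews"]

# Replace the first "star" in each title with "rating"
renamed <- sub("star", "rating", p)

# Character position of the first "star" in each title
pos <- as.integer(regexpr("star", p))

print(p)
print(renamed)
print(pos)
```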
