Introducing NLP with R 10/6/14, 19:37 
Introducing NLP with R 
Charlie Redmon | SupStat Analytics 
Copyright Supstat Inc. All Rights Reserved
Introduction to NLP 
Foundational Frameworks 
Working with text in R 
Regular Expressions 
As pattern matching device 
Theoretical connection with finite state automaton 
Application in morphological analysis 
N-gram models 
Recognizing language 
Generating language 
Further reading 
Natural Language Processing 
Briefly: Building models to facilitate human-computer interaction through language 
We say natural language here to distinguish languages like English, Hungarian, and Bengali 
from computer languages and other invented communication systems (e.g. Morse code) 
Major sub-disciplines: 
Speech Recognition/Synthesis 
Computational Morphology (word structure) 
Lexical Semantics (word meaning) 
Computational Syntax (phrase/sentence structure) 
Compositional Semantics (phrase/sentence meaning) 
Information Retrieval 
R has powerful text processing capabilities 
Many useful NLP-related packages 
Many of the more sophisticated procedures in NLP generalize to statistical models, which is 
where R really excels 
- Automata theory (building on the Turing machine): Finite State Automaton, Finite State Transducer
- Regular Expressions 
- Regular Languages and their relation to natural languages 
N-gram models 
Information Theory 
Noisy Channel, Entropy models 
1. Import and manipulate text in R 
2. Create data structures facilitating NLP operations 
3. Model implementation: 
Morphological parsing 
N-gram parsing 
N-gram language generation 
· Primary importing functions: scan(), readLines() 
monty_text = scan('data/grail.txt', what="character", sep="", quote="") 
[1] "SCENE" "1:" "[wind]" "[clop" "clop" "clop]" 
malayalam_text = scan('data/mathrubhumi_2014-10_full.txt', 
what="character", sep="", quote="") 
[1] "#Date:" "01-10-2014" 
[3] "#----------------------------------------" "അേമരിkയിെലtിയ" 
[5] "+പധാനമ+nി" "നേര+nേമാദി" 
· Why might this data structure be a problem for many natural language structures? 
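The difference between the two importing functions can be sketched on a throwaway file. The two sample lines below are made up for illustration and are not the contents of data/grail.txt.

```r
# Hypothetical two-line sample written to a temp file
tmp <- tempfile(fileext = ".txt")
writeLines(c("KING ARTHUR: Whoa there!",
             "SOLDIER #1: Halt! Who goes there?"), tmp)

# scan() with sep="" splits on any whitespace: one element per token
tokens <- scan(tmp, what = "character", sep = "", quote = "", quiet = TRUE)

# readLines() keeps each line of the file as one element
lines <- readLines(tmp)

length(tokens)  # 10
length(lines)   # 2
tokens[1:3]     # "KING" "ARTHUR:" "Whoa"
```

In both cases the element boundaries come from the file layout (whitespace or line breaks), not from linguistic units such as sentences or phrases, which can span several elements.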
monty_text = paste(monty_text, collapse=" ") 
malayalam_text = paste(malayalam_text, collapse=" ") 
length(monty_text); length(malayalam_text) 
[1] 1 
[1] 1 
substr(monty_text, 1, 70) 
[1] "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop clop c" 
substr(malayalam_text, 304, 400) 
[1] "െത4ായി ഉcരിc് അേdഹെt അനാദരിcുെവn് െക.പി.സി.സി. +പസിഡn് വി.എം. സുധീരD. േമാഹDദ" 
[] Disjunction (set) / [Gg]oogle / = Google, google
? 0 or 1 of the preceding item / savou?r / = savor, savour
* 0 or more of the preceding item / hey!* / = hey, hey!, hey!!, ...
\ Escape character / hey\? / = hey?
+ 1 or more of the preceding item / a+h / = ah, aah, aaah, ...
{n,m} n to m repetitions / a{1,4}h{1,3} / = aahh, ahhh, ...
. Wildcard (any character) / #.* / = #rstats, #uofl, ...
() Grouping / (ha)+ / = ha, haha, hahaha, ...
[^ ] NOT (negates bracketed chars) / [^#.*] / = any one character except #, ., or *
[x-y] Match characters from 'x' to 'y' / [A-Z][1-9] / = A1, Q8, X5, ...
\w Word character (alphanumeric or underscore) / \w+'s / = that's, Jerry's, ...
\W Non-word character
\d Digit character (0-9) / \d{3} / = 137, 254, ...
\D Non-digit character
\s Whitespace / \w+\s+\w+ / = I am, I  am, ...
\S Non-whitespace
\b Word boundary / \bthe\b / = the, not then
\B Non-word boundary
^ Beginning of line / ^[a-z] / = lowercase letter at the beginning of a line
$ End of line / #.*$ / = hashtags at end of line
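A few of the patterns in these tables can be tried directly in R. This is a small sketch on made-up strings; note that inside an R string literal each regex backslash has to be written twice.

```r
words <- c("Google", "google", "savor", "savour", "hey!!", "then", "the")

is_google <- grepl("^[Gg]oogle$", words)   # TRUE for "Google" and "google"
is_savor  <- grepl("^savou?r$", words)     # TRUE for "savor" and "savour"
is_hey    <- grepl("^hey!*$", words)       # TRUE for "hey!!"
# \b needs perl = TRUE here; matches "the" but not "then"
is_the    <- grepl("\\bthe\\b", words, perl = TRUE)

# regmatches() + gregexpr() extract every match instead of only testing for one
tweet <- "Learning regex at #uofl with #rstats"
hashtags <- regmatches(tweet, gregexpr("#\\w+", tweet, perl = TRUE))[[1]]
hashtags  # "#uofl" "#rstats"
```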
The advantage of having all the text in a single element is we can now split the text into different-sized 
segments for different kinds of natural language tasks. 
#sentence level 
pattern = "(?<=[.?!])\\s+"
monty_sentences = strsplit(monty_text, split=pattern, perl=TRUE)
monty_sentences = unlist(monty_sentences) 
[1] "King of the Britons, defeator of the Saxons, sovereign of all England!" 
[2] "SOLDIER #1: Pull the other one!" 
[3] "ARTHUR: I am, ..." 
[4] "and this is my trusty servant Patsy." 
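The lookbehind split can be checked on a short made-up string before running it on the full script. This is only a sketch; the sample text is an assumption.

```r
toy_text <- "Pull the other one! I am Arthur. Who goes there? It is I."

# (?<=[.?!]) is a lookbehind: the whitespace is consumed as the separator
# only when it directly follows ., ? or !, so the punctuation stays attached
toy_sentences <- unlist(strsplit(toy_text, split = "(?<=[.?!])\\s+", perl = TRUE))
toy_sentences
# "Pull the other one!" "I am Arthur." "Who goes there?" "It is I."
```

A pattern this simple will also split after abbreviations such as "Mr." or "Dr.", so real corpora usually need extra rules.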
Of course, depending on the language you're working with you might have different definitions of 
sentence boundaries. For example, Hindi uses what's called a danda marker, । , in place of a period. 
hindi_text = scan('data/hindustan_full.txt', what="character", sep="") 
hindi_text = paste(hindi_text, collapse=" ") 
pattern = "(?<=[।?!])\\s+"
hindi_sentences = strsplit(hindi_text, split=pattern, perl=TRUE)
hindi_sentences = unlist(hindi_sentences) 
[1] "व"# मन# को लोकसभा चuनाव . करारी हार का सामना करना पड़ा था और उसका खाता भी नह9 खuल पाया था।" 
[2] "लोकसभा चuनाव . भाजपा और िशव#ना > कuछ छोA दलo D साथ िमलकर 48 . # 42 सीAE जीत9।" 
[3] "महाराFG . िशव#ना अब तक भाजपा D बड़e भाई की भLिमका iनभाती रही थी।" 
[4] "इन दोनo D बीच उस वOत अलगाव Qआ S जब भाजपा TU . नVU मोदी D >तWXव . पLणZ बQमत D साथ स[ासीन S।" 
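One way to avoid rewriting the pattern for each language is to pass the sentence terminators in as a parameter. The helper name `split_sentences` and the transliterated sample text are hypothetical, introduced here only for illustration.

```r
# Hypothetical helper: build the splitting pattern from a set of terminators.
# The terminators are placed inside a character class, so characters that are
# special there (such as ] or ^) would need escaping by the caller.
split_sentences <- function(text, terminators = ".?!") {
  pattern <- paste0("(?<=[", terminators, "])\\s+")
  unlist(strsplit(text, split = pattern, perl = TRUE))
}

# "\u0964" is the danda; the transliterated words are made up
toy_hindi <- "pahla vakya\u0964 doosra vakya\u0964"
hindi_toy_sentences <- split_sentences(toy_hindi, terminators = "\u0964?!")
length(hindi_toy_sentences)  # 2

english_toy_sentences <- split_sentences("Halt! Who goes there?")
english_toy_sentences        # "Halt!" "Who goes there?"
```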
We can also split the original text according to word boundaries. 
#word level 
pattern = "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
monty_words = strsplit(monty_text, split=pattern, perl=TRUE)
monty_words = unlist(monty_words) 
[1] "clop" "clop" "KING" "ARTHUR" "Whoa" "there" "clop" "clop" 
[9] "clop" "SOLDIER" "#1" "Halt" "Who" "goes" "there" "ARTHUR" 
[17] "It" "is" "I" "Arthur" "son" "of" "Uther" "Pendragon" 
[25] "from" "the" 
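The word-level pattern treats a run of whitespace, plus any punctuation touching it on either side, as the separator. A sketch on a made-up line shows what it strips and what it leaves behind:

```r
toy_text <- "ARTHUR: It is I, Arthur, son of Uther Pendragon!"

# optional punctuation, then whitespace, then optional punctuation
pattern <- "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
toy_words <- unlist(strsplit(toy_text, split = pattern, perl = TRUE))
toy_words
# "ARTHUR" "It" "is" "I" "Arthur" "son" "of" "Uther" "Pendragon!"
```

Because the pattern requires at least one whitespace character, punctuation at the very end of the text (the final "!") is not removed and would need a separate clean-up step.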
For many NLP tasks it is useful to have a dictionary, or lexicon, of the language you're working with. 
Other researchers may have already built a text-formatted lexicon of the language you're using, but 
nevertheless it's useful to see how we might build one. 
#convert all words to lowercase 
monty_words = tolower(monty_words) 
[1] "scene" "1" "wind" "clop" "clop" "clop" "king" "arthur" "whoa" 
#convert vector of tokens to set of unique words 
monty_lexicon = unique(monty_words) 
[1] "scene" "1" "wind" "clop" "king" "arthur" "whoa" "there" 
length(monty_words)
[1] 11213
length(monty_lexicon)
[1] 1889
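The distinction between tokens (every occurrence) and types (unique words) can be sketched on a small made-up vector:

```r
# Made-up tokens for illustration
toy_words <- tolower(c("Clop", "clop", "clop", "King", "Arthur",
                       "Whoa", "there", "Arthur"))
toy_lexicon <- unique(toy_words)

length(toy_words)    # 8 tokens
length(toy_lexicon)  # 5 types

# table() counts how often each type occurs; sort from most to least frequent
type_freq <- sort(table(toy_words), decreasing = TRUE)
type_freq
```

A frequency table like `type_freq` is often more useful than the bare lexicon, since later models such as n-grams are built from counts.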
Now that we have our lexicon we can start to model the internal structure of the words in our corpus. 
Formally, morphological rules can be modeled as an FSA. Here's a simple example from Jurafsky 
and Martin (2000) 
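To make the idea concrete, a finite state automaton can be written in R as a transition table plus a loop. The sketch below is loosely in the spirit of a noun-inflection automaton; the states, the morpheme classes, and the function name `fsa_accepts` are assumptions for illustration, not a reproduction of the figure.

```r
# Transition table: rows are states, columns are input symbols (morpheme classes).
# NA means "no transition from this state on this symbol".
states  <- c("q0", "q1", "q2")
symbols <- c("reg-noun", "irreg-sg-noun", "irreg-pl-noun", "plural-s")
delta <- matrix(NA_character_, nrow = length(states), ncol = length(symbols),
                dimnames = list(states, symbols))
delta["q0", "reg-noun"]      <- "q1"
delta["q0", "irreg-sg-noun"] <- "q2"
delta["q0", "irreg-pl-noun"] <- "q2"
delta["q1", "plural-s"]      <- "q2"
accepting <- c("q1", "q2")

# Run the automaton over a sequence of morpheme classes
fsa_accepts <- function(input, start = "q0") {
  state <- start
  for (sym in input) {
    if (!sym %in% symbols) return(FALSE)   # unknown symbol
    state <- delta[state, sym]
    if (is.na(state)) return(FALSE)        # no transition defined
  }
  state %in% accepting
}

fsa_accepts(c("reg-noun", "plural-s"))       # TRUE  (e.g. knight + s)
fsa_accepts(c("irreg-pl-noun", "plural-s"))  # FALSE (e.g. *geese + s)
```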
Regular expressions and FSAs are provably equivalent (Kleene's theorem): every regular expression can be
modeled as an FSA, and vice versa. We can therefore use the grep utilities in R to handle this process.
First let's see if we can extract all the agentive nouns (e.g. builder, worker, shopper, etc.).
monty_agents = grep('.+er$', monty_lexicon, perl=T, value=T) 
[1] "soldier" "uther" "other" "master" "together" "winter" 
[7] "plover" "warmer" "matter" "order" "creeper" "under" 
[13] "cart-master" "customer" "better" "over" "bother" "ever" 
[19] "officer" "her" "water" "power" "mer" "villager" 
[25] "whether" "cider" "e'er" "prisoner" "shelter" "wiper" 
· This isn't exactly what we want. How can we improve our results? 
Take advantage of the lexicon. 
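monty_agents = grep('.+er$', monty_lexicon, perl=TRUE, value=TRUE)
new_monty_agents = character(0)
for (i in seq_along(monty_agents)) {
  word = monty_agents[i]
  stem_end = nchar(word) - 2   # drop the final "er"
  stem = substr(word, 1, stem_end)
  # keep the word only if its stem is itself a word in the lexicon
  if (is.element(stem, monty_lexicon)) {
    new_monty_agents[i] = word
  }
}
# positions that were never assigned are NA; drop them
new_monty_agents = new_monty_agents[!is.na(new_monty_agents)]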
[1] "warmer" "creeper" "longer" "nearer" "higher" "killer" "bleeder" "keeper" 
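The same stem check can be written without a loop. This is a sketch on a made-up lexicon (not the Monty Python one), using `sub()` to strip the suffix and `%in%` to test membership:

```r
# Hypothetical toy lexicon
toy_lexicon <- c("kill", "killer", "keep", "keeper", "warm", "warmer",
                 "water", "other", "bake", "baker")

candidates <- grep(".+er$", toy_lexicon, value = TRUE)
stems <- sub("er$", "", candidates)            # strip the final "er"
agents <- candidates[stems %in% toy_lexicon]   # keep words whose stem is attested
agents  # "killer" "keeper" "warmer"
```

The toy run shows two limits of the heuristic: "warmer" passes even though it is a comparative rather than an agent noun, and "baker" is missed because stripping "er" leaves "bak" rather than "bake".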
Based on Markov model 
At their heart, n-grams answer the question: "What is the likelihood of one word (or character, 
phrase, sentence...) following another word or sequence of words?" 
The kernel equation: 
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$
where $N$ is the N in N-gram (i.e. the number of words used to build the grammar)
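For example, if we have the string, "We are the Knights who say, 'Ni!'", in the bigram model we're
moving along the string asking: P(are|We), P(the|are), P(Knights|the), ... In the trigram model each word
is conditioned on the two words before it: P(Knights|are the), P(who|the Knights), ...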
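Bigram probabilities can be estimated from counts in base R. The sketch below uses the maximum-likelihood estimate, P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}), on a made-up token sequence; it is an illustration of the idea and not the method used by any particular package.

```r
# Made-up token sequence
tokens <- c("we", "are", "the", "knights", "who", "say", "ni",
            "we", "are", "the", "keepers")

# pair every token with the token that follows it
prev_word <- head(tokens, -1)
next_word <- tail(tokens, -1)
bigram_counts <- table(prev_word, next_word)

# divide each row by its total: count of the pair / count of the first word
bigram_probs <- prop.table(bigram_counts, margin = 1)

p_knights_given_the <- as.numeric(bigram_probs["the", "knights"])  # 0.5
p_are_given_we      <- as.numeric(bigram_probs["we", "are"])       # 1
```

Here "the" is followed once by "knights" and once by "keepers", so each continuation gets probability 0.5, while "we" is always followed by "are".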
library(ngram)  # provides ngram() and babble()
monty_bigram = ngram(monty_text, n=2)
[1] "cannot tell," "away. Just" "not 'is'." "bowels unplugged," 
[5] "well, Arthur," "[twang] Wayy!" "HERBERT: B--" "no. Until" 
[9] "trade. I" "down, fell" 
monty_trigram = ngram(monty_text, n=3) 
[1] "a good spanking!" "Oooh! GALAHAD: My" "is the capital" "to you no" 
[5] "Who's that then?" "you get back." "no arms left." "want... a shrubbery!" 
[9] "Shut up! Um," "to a successful" 
print(monty_bigram, full=TRUE) 
cannot tell, 
suffice {1} | 
away. Just 
ignore {1} | 
not 'is'. 
HEAD {1} | You {2} | Not {1} | 
bowels unplugged, 
And {1} | 
well, Arthur, 
for {1} | 
[twang] Wayy! 
[twang] {1} | 
print(monty_trigram, full=TRUE) 
a good spanking! 
GIRLS: {1} | 
Oooh! GALAHAD: My 
God! {1} | 
is the capital 
of {1} | 
to you no 
more, {1} | 
Who's that then? 
you get back. 
GUARD {1} | 
babble(monty_bigram, 8) 
[1] "must go too. OFFICER #1: Back. Right away. " 
babble(monty_bigram, 8) 
[1] "I'll do you up a treat mate! GALAHAD: " 
babble(monty_bigram, 8) 
[1] "from just stop him entering the room. GUARD " 
babble(monty_trigram, 8) 
[1] "were still no nearer the Grail. Meanwhile, King " 
babble(monty_trigram, 8) 
[1] "the Britons. BEDEVERE: My liege! I would be " 
babble(monty_trigram, 8) 
[1] "Shh! VILLAGER #2: Wood! BEDEVERE: So, why do " 
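The idea behind this kind of generation can be sketched in base R as a random walk over observed successors. This is a simplified one-word-context Markov chain on made-up tokens; the function name `babble_sketch` is hypothetical, and this is not how the ngram package implements babble().

```r
set.seed(42)  # only to make the sketch repeatable

# Made-up token sequence
tokens <- c("we", "are", "the", "knights", "who", "say", "ni",
            "we", "are", "the", "keepers", "of", "the", "sacred", "words")

# For every word, collect the words observed right after it.
# Repeated successors stay in the list, so sampling follows observed frequencies.
successors <- split(tail(tokens, -1), head(tokens, -1))

babble_sketch <- function(successors, start, n_words) {
  out <- start
  current <- start
  while (length(out) < n_words) {
    candidates <- successors[[current]]
    if (is.null(candidates)) break   # dead end: word never seen with a successor
    # sample an index so that a single candidate is returned as-is
    current <- candidates[sample.int(length(candidates), 1)]
    out <- c(out, current)
  }
  paste(out, collapse = " ")
}

generated <- babble_sketch(successors, start = "we", n_words = 6)
generated
```

Every generated string starts "we are the", because those transitions are the only ones observed; after "the" the walk branches among "knights", "keepers", and "sacred".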
Jurafsky and Martin (2008), Speech and Language Processing 
Manning, Raghavan, and Schütze (2008), Introduction to Information Retrieval
Gries (2009), Quantitative Corpus Linguistics with R 

Introducing natural language processing (NLP) with R

  • 1. Introducing NLP with R 10/6/14, 19:37 Introducing NLP with R Charlie Redmon | SupStat Analytics Copyright Supstat Inc. All Rights Reserved Page 1 of 26
  • 2. Outline
      · Introduction to NLP
      · Foundational Frameworks
      · Working with text in R
      · Regular Expressions
          - As pattern matching device
          - Theoretical connection with finite state automaton
          - Application in morphological analysis
      · N-gram models
          - Recognizing language
          - Generating language
      · Further reading
  • 3. What is NLP?
      · Natural Language Processing
          - Briefly: Building models to facilitate human-computer interaction through language
          - We say natural language here to distinguish languages like English, Hungarian, and Bengali from computer languages and other invented communication systems (e.g. Morse code)
      · Major sub-disciplines:
          - Speech Recognition/Synthesis
          - Computational Morphology (word structure)
          - Lexical Semantics (word meaning)
          - Computational Syntax (phrase/sentence structure)
          - Compositional Semantics (phrase/sentence meaning)
          - Information Retrieval
  • 4. Why R?
      · R has powerful text processing capabilities
      · Many useful NLP-related packages
      · Many of the more sophisticated procedures in NLP generalize to statistical models, which is where R really excels
  • 5. Foundational NLP Frameworks
      · Turing
          - Turing Machine: Finite State Automaton, Finite State Transducer
      · Kleene
          - Regular Expressions
      · Chomsky
          - Regular Languages and their relation to natural languages
      · Markov
          - N-gram models
          - HMMs
      · Shannon
          - Information Theory
          - Noisy Channel, Entropy models
  • 6. The Workflow
      1. Import and manipulate text in R
      2. Create data structures facilitating NLP operations
      3. Model implementation:
          · Morphological parsing
          · N-gram parsing
          · N-gram language generation
          · ...
  • 7. Importing text into R
      · Primary importing functions: scan(), readLines()

          monty_text = scan('data/grail.txt', what="character", sep="", quote="")
          monty_text[1:6]
          [1] "SCENE" "1:" "[wind]" "[clop" "clop" "clop]"

          malayalam_text = scan('data/mathrubhumi_2014-10_full.txt', what="character", sep="", quote="")
          malayalam_text[15:20]
          [1] "#Date:" "01-10-2014"
          [3] "#----------------------------------------" "അേമരിkയിെലtിയ"
          [5] "+പധാനമ+nി" "നേര+nേമാദി"

      · Why might this data structure be a problem for many natural language structures?
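The difference between the two import functions can be sketched on a small temporary file. The file contents below are made up for illustration (they are not `data/grail.txt`), and the variable names are assumptions.

```r
# Write a small sample file; its contents are invented for this sketch.
tmp <- tempfile(fileext = ".txt")
writeLines(c("SCENE 1: [wind]", "KING ARTHUR: Whoa there!"), tmp)

# scan() with sep="" splits on any whitespace: one element per token.
tokens <- scan(tmp, what = "character", sep = "", quote = "", quiet = TRUE)

# readLines() keeps every line of the file as a single element.
lines <- readLines(tmp)

length(tokens)  # number of whitespace-delimited tokens
length(lines)   # number of lines in the file
```

Neither unit lines up with sentences or phrases, which is one way to read the question on the slide: token-level elements split apart anything that spans a space.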
  • 8. Condensing to single text stream

          monty_text = paste(monty_text, collapse=" ")
          malayalam_text = paste(malayalam_text, collapse=" ")
          length(monty_text); length(malayalam_text)
          [1] 1
          [1] 1

          substr(monty_text, 1, 70)
          [1] "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop clop c"

          substr(malayalam_text, 304, 400)
          [1] "െത4ായി ഉcരിc് അേdഹെt അനാദരിcുെവn് െക.പി.സി.സി. +പസിഡn് വി.എം. സുധീരD. േമാഹDദ"
  • 9. Regular Expressions

          SYMBOL   MEANING                          EXAMPLE
          []       Disjunction (set)                / [Gg]oogle / = Google, google
          ?        0 or 1 characters                / savou?r / = savor, savour
          *        0 or more characters             / hey!* / = hey, hey!, hey!!, ...
          \        Escape character                 / hey\? / = hey?
          +        1 or more characters             / a+h / = ah, aah, aaah, ...
          {n,m}    n to m repetitions               / a{1,4}h{1,3} / = aahh, ahhh, ...
          .        Wildcard (any character)         / #.* / = #rstats, #uofl, ...
          ()       Conjunction                      / (ha)+ / = ha, haha, hahaha, ...
          [^ ]     NOT (negates bracketed chars)    / [^ #.*] / = everything but #...
  • 10. Regular Expressions

          SYMBOL   MEANING                          EXAMPLE
          [x-y]    Match characters from 'x' to 'y' / [A-Z][1-9] / = A1, Q8, X5, ...
          \w       Word character (alphanumeric)    / \w's / = that's, Jerry's, ...
          \W       Non-word character
          \d       Digit character (0-9)            / \d{3} / = 137, 254, ...
          \D       Non-digit character
          \s       Whitespace                       / \w+\s+\w+ / = I am, I  am, ...
          \S       Non-whitespace
          \b       Word boundary                    / \bthe\b / = the, not then
          \B       Non-word boundary
          ^        Beginning of line                / ^[a-z] / = non-capitalized beg.
          $        End of line                      / #.*$ / = hashtags at end of line
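A few rows of the tables can be tried out directly with `grepl()`. This is a minimal sketch on an invented test vector; note that inside an R string the backslash must itself be escaped, so the regex `\b` is written `"\\b"`.

```r
x <- c("Google", "google", "savour", "savor", "hey!!", "the", "then", "A1", "137")

# Character set: either capitalisation of the first letter.
is_google <- grepl("^[Gg]oogle$", x, perl = TRUE)

# Optional character: the 'u' may occur zero or one times.
is_savor <- grepl("^savou?r$", x, perl = TRUE)

# Word boundaries: matches "the" as a whole word, not inside "then".
is_the <- grepl("\\bthe\\b", x, perl = TRUE)

# Bounded repetition uses a comma or a single count: exactly three digits.
is_3digits <- grepl("^\\d{3}$", x, perl = TRUE)

x[is_google]
x[is_savor]
x[is_the]
x[is_3digits]
```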
  • 11. Manual segmentation
      The advantage of having all the text in a single element is that we can now split the text into different-sized segments for different kinds of natural language tasks.

          # sentence level
          pattern = "(?<=[.?!])\\s+"
          monty_sentences = strsplit(monty_text, split=pattern, perl=TRUE)
          monty_sentences = unlist(monty_sentences)
          monty_sentences[5:8]
          [1] "King of the Britons, defeator of the Saxons, sovereign of all England!"
          [2] "SOLDIER #1: Pull the other one!"
          [3] "ARTHUR: I am, ..."
          [4] "and this is my trusty servant Patsy."
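The sentence-splitting pattern can be checked on a short string. This is a minimal sketch with an invented sample text; the lookbehind `(?<=[.?!])` requires `perl = TRUE`.

```r
text <- "Halt! Who goes there? It is I, Arthur. King of the Britons!"

# Split on whitespace only when it follows sentence-final punctuation.
# The lookbehind consumes nothing, so the punctuation stays with its sentence.
sentences <- unlist(strsplit(text, split = "(?<=[.?!])\\s+", perl = TRUE))

sentences
length(sentences)
```

A pattern this simple also splits after abbreviations such as "Mr.", and on ellipses, as the "ARTHUR: I am, ..." fragment in the slide output shows.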
  • 12. Manual segmentation
      Of course, depending on the language you're working with, you might have different definitions of sentence boundaries. For example, Hindi uses what's called a danda marker, । , in place of a period.

          hindi_text = scan('data/hindustan_full.txt', what="character", sep="")
          hindi_text = paste(hindi_text, collapse=" ")
          pattern = "(?<=[।?!])\\s+"
          hindi_sentences = strsplit(hindi_text, split=pattern, perl=TRUE)
          hindi_sentences = unlist(hindi_sentences)
          hindi_sentences[5:8]
          [1] "व"# मन# को लोकसभा चuनाव . करारी हार का सामना करना पड़ा था और उसका खाता भी नह9 खuल पाया था।"
          [2] "लोकसभा चuनाव . भाजपा और िशव#ना > कuछ छोA दलo D साथ िमलकर 48 . # 42 सीAE जीत9।"
          [3] "महाराFG . िशव#ना अब तक भाजपा D बड़e भाई की भLिमका iनभाती रही थी।"
          [4] "इन दोनo D बीच उस वOत अलगाव Qआ S जब भाजपा TU . नVU मोदी D >तWXव . पLणZ बQमत D साथ स[ासीन S।"
  • 13. Manual segmentation
      We can also split the original text according to word boundaries.

          # word level
          pattern = "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
          monty_words = strsplit(monty_text, split=pattern, perl=TRUE)
          monty_words = unlist(monty_words)
          monty_words[5:30]
           [1] "clop" "clop" "KING" "ARTHUR" "Whoa" "there" "clop" "clop"
           [9] "clop" "SOLDIER" "#1" "Halt" "Who" "goes" "there" "ARTHUR"
          [17] "It" "is" "I" "Arthur" "son" "of" "Uther" "Pendragon"
          [25] "from" "the"
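A minimal sketch of the word-level split on an invented sample string. The pattern is written with the brackets and double quote escaped for an R string, which is an assumption about the slide's original (the escapes are missing from the transcript).

```r
text <- "KING ARTHUR: Whoa there! [clop clop clop] SOLDIER #1: Halt"

# Whitespace, plus any punctuation directly before or after it, is the separator.
# "#" is not in the punctuation set, so "#1" survives as a token.
pattern <- "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
words <- unlist(strsplit(text, split = pattern, perl = TRUE))

words
```

Because the pattern requires whitespace, punctuation at the very start or end of the whole text is not stripped; the sample above avoids that case on purpose.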
  • 14. Building a Lexicon
      For many NLP tasks it is useful to have a dictionary, or lexicon, of the language you're working with. Other researchers may have already built a text-formatted lexicon of the language you're using, but it's nevertheless useful to see how we might build one.

          # convert all words to lowercase
          monty_words = tolower(monty_words)
          monty_words[1:9]
          [1] "scene" "1" "wind" "clop" "clop" "clop" "king" "arthur" "whoa"

          # convert vector of tokens to set of unique words
          monty_lexicon = unique(monty_words)
          monty_lexicon[1:8]
          [1] "scene" "1" "wind" "clop" "king" "arthur" "whoa" "there"
  • 15. Building a Lexicon

          length(monty_words)
          [1] 11213
          length(monty_lexicon)
          [1] 1889
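The two lengths above are the token count and the type count of the corpus; their ratio (1889 / 11213, about 0.17) is the type/token ratio. A minimal sketch on an invented token vector, which also adds a frequency table as one possible next step (the frequency table is not on the slides):

```r
words <- tolower(c("clop", "clop", "KING", "Arthur", "Whoa", "there", "clop", "Arthur"))

# Types are the distinct word forms; tokens are all occurrences.
lexicon <- unique(words)
n_tokens <- length(words)
n_types <- length(lexicon)

# Type/token ratio: a rough measure of lexical variety.
ttr <- n_types / n_tokens

# Frequency of each type, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
freq
```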
  • 16. Morphological Analysis
      Now that we have our lexicon, we can start to model the internal structure of the words in our corpus. Formally, morphological rules can be modeled as an FSA. Here's a simple example from Jurafsky and Martin (2000) (figure omitted).
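To make the FSA idea concrete, here is a toy deterministic automaton written as a transition table in R. It is not the figure from the slide: the language it accepts (`b`, two or more `a`s, then `!`) and all names are chosen only for illustration. The last lines compare it with the equivalent regular expression.

```r
# Transition table: state -> (input symbol -> next state). State 4 accepts.
transitions <- list(
  "0" = c(b = 1),
  "1" = c(a = 2),
  "2" = c(a = 3),
  "3" = c(a = 3, "!" = 4)
)

accepts <- function(string) {
  state <- 0
  for (ch in strsplit(string, "")[[1]]) {
    row <- transitions[[as.character(state)]]
    # No outgoing arc for this symbol: reject immediately.
    if (is.null(row) || !(ch %in% names(row))) return(FALSE)
    state <- row[[ch]]
  }
  state == 4
}

inputs <- c("baa!", "baaaa!", "ba!", "baa")
by_fsa <- vapply(inputs, accepts, logical(1), USE.NAMES = FALSE)
by_regex <- grepl("^baa+!$", inputs)
identical(by_fsa, by_regex)
```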
  • 17. Morphological Analysis
      Since every regular expression can be modeled as an FSA, and vice versa, we can use the grep utilities in R to handle this process. First let's see if we can extract all the agentive nouns (e.g. builder, worker, shopper, etc.).

          monty_agents = grep('.+er$', monty_lexicon, perl=TRUE, value=TRUE)
          monty_agents[1:30]
           [1] "soldier" "uther" "other" "master" "together" "winter"
           [7] "plover" "warmer" "matter" "order" "creeper" "under"
          [13] "cart-master" "customer" "better" "over" "bother" "ever"
          [19] "officer" "her" "water" "power" "mer" "villager"
          [25] "whether" "cider" "e'er" "prisoner" "shelter" "wiper"

      · This isn't exactly what we want. How can we improve our results?
  • 18. Morphological Analysis
      Take advantage of the lexicon: keep an -er word only if the stem left after removing "er" is itself in the lexicon.

          monty_agents = grep('.+er$', monty_lexicon, perl=TRUE, value=TRUE)
          new_monty_agents = character(0)
          for (i in 1:length(monty_agents)) {
            word = monty_agents[i]
            # drop the final "er" to get the candidate stem
            stem_end = nchar(word) - 2
            stem = substr(word, 1, stem_end)
            if (is.element(stem, monty_lexicon)) {
              new_monty_agents[i] = word
            }
          }
          # positions skipped in the loop are NA; remove them
          new_monty_agents = new_monty_agents[!is.na(new_monty_agents)]
          new_monty_agents
          [1] "warmer" "creeper" "longer" "nearer" "higher" "killer" "bleeder" "keeper"
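The same stem lookup can be written without a loop. This is a sketch on an invented toy lexicon, not the Monty Python data; `sub()` strips the suffix and `%in%` does the lexicon check for all candidates at once.

```r
lexicon <- c("warm", "warmer", "keep", "keeper", "soldier",
             "water", "kill", "killer", "other")

# Candidates: every word ending in -er.
candidates <- grep(".+er$", lexicon, value = TRUE)

# Strip the suffix and keep only words whose stem is itself in the lexicon.
stems <- sub("er$", "", candidates)
agents <- candidates[stems %in% lexicon]

agents
```

The filter cannot tell agentive -er from comparative -er ("warmer" passes, as it does in the slide output), and it misses stems that change spelling before the suffix, such as "shopper" from "shop".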
  • 19. Malayalam FSA
  • 20. N-gram Models
      · Based on the Markov model
      · At their heart, n-grams answer the question: "What is the likelihood of one word (or character, phrase, sentence...) following another word or sequence of words?"
      · The kernel equation:

          $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$

        where N is the N in N-gram (i.e. the number of words used to build the grammar)
      · For example, if we have the string "We are the Knights who say, 'Ni!'", a model that conditions on the two preceding words (N = 3 in the equation above) moves along the string asking: P(Knights | are the), P(who | the Knights), ...
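For the simplest case, N = 2, the conditional probability can be estimated from counts as P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}). The following is a minimal base-R sketch of that maximum-likelihood estimate on an invented token sequence; it does not use the ngram package.

```r
tokens <- strsplit(
  "the knights who say ni and the knights who say nu and the king", " "
)[[1]]

# Pair every word with its successor.
context <- head(tokens, -1)
next_word <- tail(tokens, -1)

# Rows: preceding word; columns: following word; cells: bigram counts.
bigram_counts <- table(context, next_word)

# Divide each row by its total to get P(next word | preceding word).
cond_prob <- prop.table(bigram_counts, margin = 1)

cond_prob["the", "knights"]
```

In this sample "the" occurs three times as a context and is followed by "knights" twice, so the estimate is 2/3.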
  • 21. N-gram Models

          library(ngram)
          monty_bigram = ngram(monty_text, n=2)
          get.ngrams(monty_bigram)[1:10]
          [1] "cannot tell," "away. Just" "not 'is'." "bowels unplugged,"
          [5] "well, Arthur," "[twang] Wayy!" "HERBERT: B--" "no. Until"
          [9] "trade. I" "down, fell"

          monty_trigram = ngram(monty_text, n=3)
          get.ngrams(monty_trigram)[1:10]
          [1] "a good spanking!" "Oooh! GALAHAD: My" "is the capital" "to you no"
          [5] "Who's that then?" "you get back." "no arms left." "want... a shrubbery!"
          [9] "Shut up! Um," "to a successful"
  • 22. N-gram Models

          print(monty_bigram, full=TRUE)

          cannot tell,
          suffice {1} |

          away. Just
          ignore {1} |

          not 'is'.
          HEAD {1} | You {2} | Not {1} |

          bowels unplugged,
          And {1} |

          well, Arthur,
          for {1} |

          [twang] Wayy!
          [twang] {1} |
  • 23. N-gram Models

          print(monty_trigram, full=TRUE)

          a good spanking!
          GIRLS: {1} |

          Oooh! GALAHAD: My
          God! {1} |

          is the capital
          of {1} |

          to you no
          more, {1} |

          Who's that then?
          CART-MASTER: {1} |

          you get back.
          GUARD {1} |
  • 24. N-gram Models

          babble(monty_bigram, 8)
          [1] "must go too. OFFICER #1: Back. Right away. "
          babble(monty_bigram, 8)
          [1] "I'll do you up a treat mate! GALAHAD: "
          babble(monty_bigram, 8)
          [1] "from just stop him entering the room. GUARD "
  • 25. N-gram Models

          babble(monty_trigram, 8)
          [1] "were still no nearer the Grail. Meanwhile, King "
          babble(monty_trigram, 8)
          [1] "the Britons. BEDEVERE: My liege! I would be "
          babble(monty_trigram, 8)
          [1] "Shh! VILLAGER #2: Wood! BEDEVERE: So, why do "
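The idea behind this kind of generation can be sketched in base R: start from a word and repeatedly sample a successor that was observed after the current word. This is a simplified stand-in with one word of context, not the implementation of `babble()` in the ngram package; the token sequence and the `generate()` helper are invented for illustration.

```r
set.seed(42)  # fixed seed so repeated runs give the same text

tokens <- strsplit(
  "the knights who say ni and the knights who say nu and the king", " "
)[[1]]

# For each word, the vector of words observed right after it (with repeats,
# so sampling from it is proportional to the bigram counts).
successors <- split(tail(tokens, -1), head(tokens, -1))

generate <- function(start, n) {
  out <- start
  for (i in seq_len(n - 1)) {
    options <- successors[[out[length(out)]]]
    if (is.null(options)) break  # dead end: word never seen as a context
    out <- c(out, options[sample.int(length(options), 1)])
  }
  paste(out, collapse = " ")
}

generate("the", 8)
```

Every adjacent word pair in the output was seen in the input, which is why the slide's bigram babble looks locally plausible and the trigram babble, with a longer context, reads more coherently.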
  • 26. Further Reading
      · Jurafsky and Martin (2008), Speech and Language Processing
      · Manning, Raghavan, and Schütze (2008), Introduction to Information Retrieval
      · Gries (2009), Quantitative Corpus Linguistics with R