Introducing NLP with R 10/6/14, 19:37 
Introducing NLP with R 
Charlie Redmon | SupStat Analytics 
Copyright Supstat Inc. All Rights Reserved 
http://docs.supstat.com/NLPwithR/#1 Page 1 of 26
Outline 
· Introduction to NLP
· Foundational Frameworks
· Working with text in R
· Regular Expressions
  - As a pattern-matching device
  - Theoretical connection with finite-state automata
  - Application in morphological analysis
· N-gram models
  - Recognizing language
  - Generating language
· Further reading

What is NLP?
· Natural Language Processing
  - Briefly: Building models to facilitate human-computer interaction through language
  - We say natural language here to distinguish languages like English, Hungarian, and Bengali
    from computer languages and other invented communication systems (e.g. Morse code)
· Major sub-disciplines:
  - Speech Recognition/Synthesis
  - Computational Morphology (word structure)
  - Lexical Semantics (word meaning)
  - Computational Syntax (phrase/sentence structure)
  - Compositional Semantics (phrase/sentence meaning)
  - Information Retrieval

Why R?
· R has powerful text processing capabilities
· Many useful NLP-related packages
· Many of the more sophisticated procedures in NLP generalize to statistical models, which is
  where R really excels

Foundational NLP Frameworks
· Turing
  - Turing Machine: Finite State Automaton, Finite State Transducer
· Kleene
  - Regular Expressions
· Chomsky
  - Regular Languages and their relation to natural languages
· Markov
  - N-gram models
  - Hidden Markov Models (HMMs)
· Shannon
  - Information Theory
  - Noisy Channel, Entropy models

The Workflow
1. Import and manipulate text in R
2. Create data structures facilitating NLP operations
3. Model implementation:
   · Morphological parsing
   · N-gram parsing
   · N-gram language generation
   · ...

Importing text into R
· Primary importing functions: scan(), readLines() 
monty_text = scan('data/grail.txt', what="character", sep="", quote="") 
monty_text[1:6] 
[1] "SCENE" "1:" "[wind]" "[clop" "clop" "clop]" 
malayalam_text = scan('data/mathrubhumi_2014-10_full.txt',
                      what="character", sep="", quote="")
malayalam_text[15:20]
[1] "#Date:" "01-10-2014"
[3] "#----------------------------------------" "അമേരിക്കയിലെത്തിയ"
[5] "പ്രധാനമന്ത്രി" "നരേന്ദ്രമോദി"
· Why might this data structure be a problem for many natural language structures? 
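To make the question above concrete, here is a minimal sketch contrasting the two importing functions on a small throwaway file. The file contents and the names `tmp`, `tokens`, and `file_lines` are made up for illustration and are not part of the original data.

```r
# Write a two-line sample to a temporary file (hypothetical content).
tmp <- tempfile(fileext = ".txt")
writeLines(c("KING ARTHUR: Whoa there!", "SOLDIER #1: Halt! Who goes there?"), tmp)

# scan() with sep="" splits on any whitespace: one element per token.
tokens <- scan(tmp, what = "character", sep = "", quote = "", quiet = TRUE)

# readLines() keeps each line of the file as one element.
file_lines <- readLines(tmp)

length(tokens)      # number of whitespace-separated tokens
length(file_lines)  # number of lines in the file
```

Because scan() returns isolated tokens, units larger than a word (phrases, sentences, lines) are no longer directly visible in the resulting vector.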

Condensing to a single text stream
monty_text = paste(monty_text, collapse=" ") 
malayalam_text = paste(malayalam_text, collapse=" ") 
length(monty_text); length(malayalam_text) 
[1] 1 
[1] 1 
substr(monty_text, 1, 70) 
[1] "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop clop c" 
substr(malayalam_text, 304, 400) 
[1] "തെറ്റായി ഉച്ചരിച്ച് അദ്ദേഹത്തെ അനാദരിച്ചുവെന്ന് കെ.പി.സി.സി. പ്രസിഡന്റ് വി.എം. സുധീരൻ. മോഹൻദ"

Regular Expressions
SYMBOL   MEANING                              EXAMPLE
[]       Disjunction (set)                    /[Gg]oogle/ = Google, google
?        0 or 1 of the preceding item         /savou?r/ = savor, savour
*        0 or more of the preceding item      /hey!*/ = hey, hey!, hey!!, ...
\        Escape character                     /hey\?/ = hey?
+        1 or more of the preceding item      /a+h/ = ah, aah, aaah, ...
{n,m}    n to m repetitions                   /a{1,4}h{1,3}/ = aahh, ahhh, ...
.        Wildcard (any character)             /#.*/ = #rstats, #uofl, ...
()       Conjunction (grouping)               /(ha)+/ = ha, haha, hahaha, ...
[^ ]     NOT (negates bracketed chars)        /[^#]/ = any single character except #
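As a quick check of the table, the patterns can be tried with grepl() and regmatches(). This is only a sketch: the test strings are invented, and the `^` and `$` anchors are added here so that each pattern has to match the whole string.

```r
# Hypothetical test strings; each pattern is taken from the table.
x <- c("Google", "google", "savour", "hey!!", "hey?", "aaah", "hahaha")

is_google <- grepl("^[Gg]oogle$", x)       # set: "G" or "g"
is_savour <- grepl("^savou?r$", x)         # optional "u"
is_hey    <- grepl("^hey!*$", x)           # zero or more "!"
is_hey_q  <- grepl("^hey\\?$", x)          # escaped "?" (backslash doubled in an R string)
is_ah     <- grepl("^a{1,4}h{1,3}$", x)    # bounded repetition
is_ha     <- grepl("^(ha)+$", x)           # repetition of a group

# Negated set: "#" followed by everything up to the next space.
tag <- regmatches("I use #rstats daily", regexpr("#[^ ]*", "I use #rstats daily"))

x[is_google]
x[is_ha]
tag
```

Note that a backslash in a pattern has to be written twice inside an R string literal, which is why the escape example reads "hey\\?".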

Regular Expressions
SYMBOL   MEANING                              EXAMPLE
[x-y]    Match characters from 'x' to 'y'     /[A-Z][1-9]/ = A1, Q8, X5, ...
\w       Word character (alphanumeric or _)   /\w+'s/ = that's, Jerry's, ...
\W       Non-word character
\d       Digit character (0-9)                /\d{3}/ = 137, 254, ...
\D       Non-digit character
\s       Whitespace                           /\w+\s+\w+/ = I am, I   am, ...
\S       Non-whitespace
\b       Word boundary                        /\bthe\b/ = the, not then
\B       Non-word boundary
^        Beginning of line                    /^[a-z]/ = non-capitalized beginning of line
$        End of line                          /#.*$/ = hashtags at end of line
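The character classes and anchors in this table can be exercised the same way. The strings below are invented for illustration; `perl = TRUE` is used so the backslash classes are interpreted by the PCRE engine.

```r
# Hypothetical test strings.
s <- c("that's", "call 137 now", "I   am", "the end", "then what", "love #rstats")

has_poss     <- grepl("\\w+'s", s, perl = TRUE)           # word characters followed by 's
three_digits <- regmatches(s[2], regexpr("\\d{3}", s[2], perl = TRUE))
two_words    <- grepl("^\\w+\\s+\\w+$", s, perl = TRUE)   # exactly two words, any spacing
has_the      <- grepl("\\bthe\\b", s, perl = TRUE)        # "the" as a whole word
starts_lower <- grepl("^[a-z]", s)                        # lowercase first character
ends_tag     <- grepl("#.*$", s)                          # hashtag running to end of line

s[has_the]
s[two_words]
three_digits
```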

Manual segmentation
The advantage of having all the text in a single element is that we can now split the text into different-sized
segments for different kinds of natural language tasks.
#sentence level: split on whitespace that follows ., ? or !
pattern = "(?<=[.?!])\\s+"
monty_sentences = strsplit(monty_text, split=pattern, perl=TRUE)
monty_sentences = unlist(monty_sentences)
monty_sentences[5:8] 
[1] "King of the Britons, defeator of the Saxons, sovereign of all England!" 
[2] "SOLDIER #1: Pull the other one!" 
[3] "ARTHUR: I am, ..." 
[4] "and this is my trusty servant Patsy." 
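The lookbehind `(?<=[.?!])` keeps the punctuation attached to the sentence and consumes only the whitespace after it. A minimal sketch on an invented string (`txt` and `parts` are illustrative names):

```r
# Hypothetical input with irregular spacing between sentences.
txt <- "Halt! Who goes there? It is I, Arthur.  King of the Britons."

# Split on runs of whitespace that are preceded by ., ? or !
parts <- unlist(strsplit(txt, "(?<=[.?!])\\s+", perl = TRUE))
parts
length(parts)
```

A rule this simple also splits after abbreviations such as "Dr." or "e.g.", so real corpora usually need extra handling.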

Manual segmentation
Of course, depending on the language you're working with, you might have different definitions of
sentence boundaries. For example, Hindi uses what's called a danda marker, ।, in place of a period.
hindi_text = scan('data/hindustan_full.txt', what="character", sep="")
hindi_text = paste(hindi_text, collapse=" ")
pattern = "(?<=[।?!])\\s+"
hindi_sentences = strsplit(hindi_text, split=pattern, perl=TRUE)
hindi_sentences = unlist(hindi_sentences)
hindi_sentences[5:8] 
[1] "व"# मन# को लोकसभा चuनाव . करारी हार का सामना करना पड़ा था और उसका खाता भी नह9 खuल पाया था।" 
[2] "लोकसभा चuनाव . भाजपा और िशव#ना > कuछ छोA दलo D साथ िमलकर 48 . # 42 सीAE जीत9।" 
[3] "महाराFG . िशव#ना अब तक भाजपा D बड़e भाई की भLिमका iनभाती रही थी।" 
[4] "इन दोनo D बीच उस वOत अलगाव Qआ S जब भाजपा TU . नVU मोदी D >तWXव . पLणZ बQमत D साथ स[ासीन S।" 
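The same split can be tried on a tiny sample. In this sketch the danda is written as the Unicode escape "\u0964", and the words are romanized placeholder text rather than real corpus data; `sample_text` and `danda_pattern` are illustrative names.

```r
# "\u0964" is the danda; the romanized words are placeholders.
sample_text <- "yah pahla vakya hai\u0964 yah dusra vakya hai\u0964"

# Same lookbehind idea as for English, with the danda in the character set.
danda_pattern <- "(?<=[\u0964?!])\\s+"
sample_sentences <- unlist(strsplit(sample_text, danda_pattern, perl = TRUE))

length(sample_sentences)
```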

Manual segmentation
We can also split the original text according to word boundaries.
#word level: split on whitespace plus any adjacent punctuation
pattern = "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
monty_words = strsplit(monty_text, split=pattern, perl=TRUE)
monty_words = unlist(monty_words)
monty_words[5:30] 
[1] "clop" "clop" "KING" "ARTHUR" "Whoa" "there" "clop" "clop" 
[9] "clop" "SOLDIER" "#1" "Halt" "Who" "goes" "there" "ARTHUR" 
[17] "It" "is" "I" "Arthur" "son" "of" "Uther" "Pendragon" 
[25] "from" "the" 
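To see what the word-level pattern does, here is a sketch on the opening of the script as shown earlier in the deck. The pattern is the reconstructed one with its escapes written out; `txt`, `word_pattern`, and `words` are illustrative names.

```r
txt <- "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there!"

# Optional punctuation, then whitespace, then optional punctuation.
word_pattern <- "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
words <- unlist(strsplit(txt, word_pattern, perl = TRUE))
words
```

Because the pattern requires whitespace, punctuation at the very end of the string (the final "!" here) stays attached to the last token.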

Building a Lexicon
For many NLP tasks it is useful to have a dictionary, or lexicon, of the language you're working with. 
Other researchers may have already built a text-formatted lexicon of the language you're using, but 
nevertheless it's useful to see how we might build one. 
#convert all words to lowercase 
monty_words = tolower(monty_words) 
monty_words[1:9] 
[1] "scene" "1" "wind" "clop" "clop" "clop" "king" "arthur" "whoa" 
#convert vector of tokens to set of unique words 
monty_lexicon = unique(monty_words) 
monty_lexicon[1:8] 
[1] "scene" "1" "wind" "clop" "king" "arthur" "whoa" "there" 
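Once the tokens are lowercased, word frequencies come almost for free with table(). A small sketch on invented tokens (`toy_words`, `toy_lexicon`, and `toy_freq` are illustrative names, not objects from the slides):

```r
# Hypothetical tokens with mixed case.
toy_words <- tolower(c("Clop", "clop", "KING", "Arthur", "king", "clop"))

# unique() keeps the first occurrence of each word, in order of appearance.
toy_lexicon <- unique(toy_words)

# table() counts each word; sort() puts the most frequent first.
toy_freq <- sort(table(toy_words), decreasing = TRUE)

toy_lexicon
toy_freq
```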

Building a Lexicon
length(monty_words) 
[1] 11213 
length(monty_lexicon) 
[1] 1889 
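One way to read these two counts together is as a ratio of distinct words to total words, often called a type-token ratio. The slides do not compute it; this is just the arithmetic on the numbers reported above.

```r
n_tokens <- 11213  # length(monty_words) as reported above
n_types  <- 1889   # length(monty_lexicon) as reported above

# Share of tokens that are distinct word forms.
type_token_ratio <- n_types / n_tokens
round(type_token_ratio, 3)
```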

Morphological Analysis
Now that we have our lexicon we can start to model the internal structure of the words in our corpus.
Formally, morphological rules can be modeled as a finite-state automaton (FSA). Here's a simple example from Jurafsky
and Martin (2000)
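The automaton figure is not reproduced in this text, so as a rough stand-in here is a small FSA for English noun inflection written as a transition table. The states, morpheme classes, and the helper `accepts()` are illustrative assumptions in the spirit of textbook examples, not necessarily the automaton shown on the slide.

```r
# Transition table: rows are states, columns are morpheme classes.
# NA means "no transition". All names here are hypothetical.
transitions <- matrix(
  c("q1", "q2", NA,    # from q0: regular noun stem -> q1, irregular noun -> q2
    NA,   NA,   "q2",  # from q1: plural -s -> q2
    NA,   NA,   NA),   # from q2: no outgoing transitions
  nrow = 3, byrow = TRUE,
  dimnames = list(c("q0", "q1", "q2"), c("reg_noun", "irreg_noun", "plural_s"))
)
accepting <- c("q1", "q2")

# Run the FSA over a sequence of morpheme classes;
# TRUE if every step has a transition and the run ends in an accepting state.
accepts <- function(classes) {
  state <- "q0"
  for (cl in classes) {
    state <- transitions[state, cl]
    if (is.na(state)) return(FALSE)
  }
  state %in% accepting
}

accepts(c("reg_noun", "plural_s"))    # e.g. fox + -s
accepts(c("irreg_noun", "plural_s"))  # e.g. geese + -s
```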

Morphological Analysis
But since it has already been proven that every regular expression can be modeled as an FSA, and vice
versa, we can use the grep utilities in R to handle this process. First let's see if we can extract all
the agentive nouns (e.g. builder, worker, shopper, etc.).
monty_agents = grep('.+er$', monty_lexicon, perl=TRUE, value=TRUE)
monty_agents[1:30] 
[1] "soldier" "uther" "other" "master" "together" "winter" 
[7] "plover" "warmer" "matter" "order" "creeper" "under" 
[13] "cart-master" "customer" "better" "over" "bother" "ever" 
[19] "officer" "her" "water" "power" "mer" "villager" 
[25] "whether" "cider" "e'er" "prisoner" "shelter" "wiper" 
· This isn't exactly what we want. How can we improve our results? 

Morphological Analysis
Take advantage of the lexicon.
monty_agents = grep('.+er$', monty_lexicon, perl=TRUE, value=TRUE)
new_monty_agents = character(0)
for (i in seq_along(monty_agents)) {
  word = monty_agents[i]
  #strip the final "er" to get the candidate stem
  stem_end = nchar(word) - 2
  stem = substr(word, 1, stem_end)
  #keep the word only if its stem is itself in the lexicon
  if (is.element(stem, monty_lexicon)) {
    new_monty_agents[i] = word
  }
}
#skipped positions were filled with NA; drop them
new_monty_agents = new_monty_agents[!is.na(new_monty_agents)]
new_monty_agents
[1] "warmer" "creeper" "longer" "nearer" "higher" "killer" "bleeder" "keeper" 
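The same filter can be written without a loop. This is a vectorized sketch on an invented lexicon; `toy_lexicon`, `candidates`, `stems`, and `toy_agents` are illustrative names.

```r
# Hypothetical mini-lexicon.
toy_lexicon <- c("kill", "killer", "keep", "keeper", "other", "winter", "warm", "warmer")

# Words ending in "er", and what is left after removing that suffix.
candidates <- grep(".+er$", toy_lexicon, value = TRUE)
stems <- sub("er$", "", candidates)

# Keep only candidates whose stem is itself a word in the lexicon.
toy_agents <- candidates[stems %in% toy_lexicon]
toy_agents
```

As in the loop version, comparatives such as "warmer" pass the test, since the check only looks at whether the stem exists, not at its part of speech.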

Malayalam FSA

N-gram Models
· Based on Markov model
· At their heart, n-grams answer the question: "What is the likelihood of one word (or character,
  phrase, sentence...) following another word or sequence of words?"
· The kernel equation:
  $$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$
  where N is the N in N-gram (i.e. the number of words used to build the grammar)
· For example, if we have the string, "We are the Knights who say, 'Ni!'", in a model with two words of
  context (a trigram model, N = 3, by the equation above) we're moving along the string asking:
  P(Knights|are the), P(who|the Knights), ...
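For the simplest case, N = 2, the right-hand side of the equation is P(w_n | w_{n-1}). A common way to estimate it, assumed here since the slides do not say how the probabilities are obtained, is by relative frequency: count(previous, next) / count(previous). The token sequence below is invented, extending the example string.

```r
# Hypothetical token sequence.
words <- c("we", "are", "the", "knights", "who", "say", "ni",
           "we", "are", "the", "knights", "who", "laugh")

# Pair each word with the word that follows it.
prev_word <- head(words, -1)
next_word <- tail(words, -1)
bigram_counts <- table(prev_word, next_word)

# Divide each row by its total: P(next | prev) = count(prev, next) / count(prev).
bigram_probs <- prop.table(bigram_counts, margin = 1)

bigram_probs["the", "knights"]  # "the" is always followed by "knights" here
bigram_probs["who", "say"]      # "who" is followed by "say" half of the time
```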

N-gram Models
library(ngram) 
monty_bigram = ngram(monty_text, n=2) 
get.ngrams(monty_bigram)[1:10] 
[1] "cannot tell," "away. Just" "not 'is'." "bowels unplugged," 
[5] "well, Arthur," "[twang] Wayy!" "HERBERT: B--" "no. Until" 
[9] "trade. I" "down, fell" 
monty_trigram = ngram(monty_text, n=3) 
get.ngrams(monty_trigram)[1:10] 
[1] "a good spanking!" "Oooh! GALAHAD: My" "is the capital" "to you no" 
[5] "Who's that then?" "you get back." "no arms left." "want... a shrubbery!" 
[9] "Shut up! Um," "to a successful" 
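To see what an n-gram is independently of the package, here is a base-R sketch that slides a window of n tokens over a token vector. `make_ngrams()` is a hypothetical helper written for this illustration; unlike the package output above, it returns the n-grams in text order and keeps duplicates.

```r
# Return all runs of n consecutive tokens, each joined by a space.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical token vector.
tokens <- strsplit("we are the knights who say ni", " ")[[1]]
toy_bigrams  <- make_ngrams(tokens, 2)
toy_trigrams <- make_ngrams(tokens, 3)

toy_bigrams
toy_trigrams
```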

N-gram Models
print(monty_bigram, full=TRUE) 
cannot tell, 
suffice {1} | 
away. Just 
ignore {1} | 
not 'is'. 
HEAD {1} | You {2} | Not {1} | 
bowels unplugged, 
And {1} | 
well, Arthur, 
for {1} | 
[twang] Wayy! 
[twang] {1} | 

N-gram Models
print(monty_trigram, full=TRUE) 
a good spanking! 
GIRLS: {1} | 
Oooh! GALAHAD: My 
God! {1} | 
is the capital 
of {1} | 
to you no 
more, {1} | 
Who's that then? 
CART-MASTER: {1} | 
you get back. 
GUARD {1} | 

N-gram Models
babble(monty_bigram, 8) 
[1] "must go too. OFFICER #1: Back. Right away. " 
babble(monty_bigram, 8) 
[1] "I'll do you up a treat mate! GALAHAD: " 
babble(monty_bigram, 8) 
[1] "from just stop him entering the room. GUARD " 

N-gram Models
babble(monty_trigram, 8) 
[1] "were still no nearer the Grail. Meanwhile, King " 
babble(monty_trigram, 8) 
[1] "the Britons. BEDEVERE: My liege! I would be " 
babble(monty_trigram, 8) 
[1] "Shh! VILLAGER #2: Wood! BEDEVERE: So, why do " 
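The idea behind this kind of generation can be sketched in base R as a random walk: look up the words seen after the current word and sample one of them. This is an illustration with one word of context on invented tokens, not the `ngram` package's own implementation; `followers` and `toy_babble()` are hypothetical names.

```r
set.seed(1)  # only to make the sketch repeatable

# Hypothetical token sequence.
tokens <- strsplit("we are the knights who say ni we are the knights who laugh", " ")[[1]]

# Map each word to the vector of words observed right after it.
followers <- split(tail(tokens, -1), head(tokens, -1))

# Generate up to genlen words, starting from a given word.
toy_babble <- function(followers, start, genlen) {
  out <- start
  current <- start
  while (length(out) < genlen) {
    nxt <- followers[[current]]
    if (is.null(nxt)) break  # dead end: this word was never seen with a successor
    current <- nxt[sample.int(length(nxt), 1)]
    out <- c(out, current)
  }
  paste(out, collapse = " ")
}

generated <- toy_babble(followers, "we", 8)
generated
```

Words that follow a given word more often appear more often in its follower vector, so sampling uniformly from that vector reproduces the observed frequencies.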

Further Reading
· Jurafsky and Martin (2008), Speech and Language Processing
· Manning, Raghavan, and Schütze (2008), Introduction to Information Retrieval
· Gries (2009), Quantitative Corpus Linguistics with R
26/26 
http://docs.supstat.com/NLPwithR/#1 Page 26 of 26

More Related Content

What's hot

Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
Pierre de Lacaze
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
Yanchang Zhao
 
TextMining with R
TextMining with RTextMining with R
TextMining with R
Aleksei Beloshytski
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
Yanchang Zhao
 
Logic Programming and ILP
Logic Programming and ILPLogic Programming and ILP
Logic Programming and ILP
Pierre de Lacaze
 
Introduction to the basics of Python programming (part 1)
Introduction to the basics of Python programming (part 1)Introduction to the basics of Python programming (part 1)
Introduction to the basics of Python programming (part 1)
Pedro Rodrigues
 
Introduction to the basics of Python programming (part 3)
Introduction to the basics of Python programming (part 3)Introduction to the basics of Python programming (part 3)
Introduction to the basics of Python programming (part 3)
Pedro Rodrigues
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard Way
Utkarsh Sengar
 
2016 02 23_biological_databases_part2
2016 02 23_biological_databases_part22016 02 23_biological_databases_part2
2016 02 23_biological_databases_part2
Prof. Wim Van Criekinge
 
Data mining techniques
Data mining techniquesData mining techniques
Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1
Johan Blomme
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
Dan Sullivan, Ph.D.
 
1. python programming
1. python programming1. python programming
1. python programming
sreeLekha51
 
Python basic
Python basicPython basic
Python basic
Saifuddin Kaijar
 
Erlang/OTP for Rubyists
Erlang/OTP for RubyistsErlang/OTP for Rubyists
Erlang/OTP for Rubyists
Sean Cribbs
 
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevImage Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Databricks
 
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked DataDedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Vrije Universiteit Amsterdam
 
PYTHON -Chapter 2 - Functions, Exception, Modules and Files -MAULIK BOR...
PYTHON -Chapter 2 - Functions,   Exception, Modules  and    Files -MAULIK BOR...PYTHON -Chapter 2 - Functions,   Exception, Modules  and    Files -MAULIK BOR...
PYTHON -Chapter 2 - Functions, Exception, Modules and Files -MAULIK BOR...
Maulik Borsaniya
 
Python-FileIO
Python-FileIOPython-FileIO
Python-FileIO
Colin Su
 

What's hot (20)

Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
TextMining with R
TextMining with RTextMining with R
TextMining with R
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
Logic Programming and ILP
Logic Programming and ILPLogic Programming and ILP
Logic Programming and ILP
 
Introduction to the basics of Python programming (part 1)
Introduction to the basics of Python programming (part 1)Introduction to the basics of Python programming (part 1)
Introduction to the basics of Python programming (part 1)
 
Introduction to the basics of Python programming (part 3)
Introduction to the basics of Python programming (part 3)Introduction to the basics of Python programming (part 3)
Introduction to the basics of Python programming (part 3)
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard Way
 
2016 02 23_biological_databases_part2
2016 02 23_biological_databases_part22016 02 23_biological_databases_part2
2016 02 23_biological_databases_part2
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniques
 
defense
defensedefense
defense
 
Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
1. python programming
1. python programming1. python programming
1. python programming
 
Python basic
Python basicPython basic
Python basic
 
Erlang/OTP for Rubyists
Erlang/OTP for RubyistsErlang/OTP for Rubyists
Erlang/OTP for Rubyists
 
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevImage Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
 
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked DataDedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
 
PYTHON -Chapter 2 - Functions, Exception, Modules and Files -MAULIK BOR...
PYTHON -Chapter 2 - Functions,   Exception, Modules  and    Files -MAULIK BOR...PYTHON -Chapter 2 - Functions,   Exception, Modules  and    Files -MAULIK BOR...
PYTHON -Chapter 2 - Functions, Exception, Modules and Files -MAULIK BOR...
 
Python-FileIO
Python-FileIOPython-FileIO
Python-FileIO
 

Viewers also liked

Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
Vivian S. Zhang
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Vivian S. Zhang
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
Vivian S. Zhang
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
Vivian S. Zhang
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
Vivian S. Zhang
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Vivian S. Zhang
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
Vivian S. Zhang
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
Vivian S. Zhang
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Vivian S. Zhang
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
Vivian S. Zhang
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Vivian S. Zhang
 
Xgboost
XgboostXgboost
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
Owen Zhang
 

Viewers also liked (14)

Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Xgboost
XgboostXgboost
Xgboost
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 

Similar to Introducing natural language processing(NLP) with r

Architecting Scalable Platforms in Erlang/OTP | Hamidreza Soleimani | Diginex...
Architecting Scalable Platforms in Erlang/OTP | Hamidreza Soleimani | Diginex...Architecting Scalable Platforms in Erlang/OTP | Hamidreza Soleimani | Diginex...
Architecting Scalable Platforms in Erlang/OTP | Hamidreza Soleimani | Diginex...
Hamidreza Soleimani
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
telss09
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easyGopi Krishnan Nambiar
 
NLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyNLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easy
outsider2
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
Vsevolod Dyomkin
 
Computation Chapter 4
Computation Chapter 4Computation Chapter 4
Computation Chapter 4
Inocentshuja Ahmad
 
Declarative Language Definition
Declarative Language DefinitionDeclarative Language Definition
Declarative Language Definition
Eelco Visser
 
Python quickstart for programmers: Python Kung Fu
Python quickstart for programmers: Python Kung FuPython quickstart for programmers: Python Kung Fu
Python quickstart for programmers: Python Kung Fu
climatewarrior
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
Ashraf Uddin
 
ppt7
ppt7ppt7
ppt7
callroom
 
ppt2
ppt2ppt2
ppt2
callroom
 
name name2 n
name name2 nname name2 n
name name2 n
callroom
 
ppt9
ppt9ppt9
ppt9
callroom
 
Ruby for Perl Programmers
Ruby for Perl ProgrammersRuby for Perl Programmers
Ruby for Perl Programmers
amiable_indian
 
name name2 n2
name name2 n2name name2 n2
name name2 n2
callroom
 
test ppt
test ppttest ppt
test ppt
callroom
 
name name2 n
name name2 nname name2 n
name name2 n
callroom
 
ppt21
ppt21ppt21
ppt21
callroom
 
name name2 n
name name2 nname name2 n
name name2 n
callroom
 
ppt17
ppt17ppt17
ppt17
callroom
 

Similar to Introducing natural language processing(NLP) with r (20)

Architecting Scalable Platforms in Erlang/OTP | Hamidreza Soleimani | Diginex...
Architecting Scalable Platforms in Erlang/OTP | Hamidreza Soleimani | Diginex...Architecting Scalable Platforms in Erlang/OTP | Hamidreza Soleimani | Diginex...
Architecting Scalable Platforms in Erlang/OTP | Hamidreza Soleimani | Diginex...
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easy
 
NLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyNLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easy
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
 
Computation Chapter 4
Computation Chapter 4Computation Chapter 4
Computation Chapter 4
 
Declarative Language Definition
Declarative Language DefinitionDeclarative Language Definition
Declarative Language Definition
 
Python quickstart for programmers: Python Kung Fu
Python quickstart for programmers: Python Kung FuPython quickstart for programmers: Python Kung Fu
Python quickstart for programmers: Python Kung Fu
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
ppt7
ppt7ppt7
ppt7
 
ppt2
ppt2ppt2
ppt2
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt9
ppt9ppt9
ppt9
 
Ruby for Perl Programmers
Ruby for Perl ProgrammersRuby for Perl Programmers
Ruby for Perl Programmers
 
name name2 n2
name name2 n2name name2 n2
name name2 n2
 
test ppt
test ppttest ppt
test ppt
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt21
ppt21ppt21
ppt21
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt17
ppt17ppt17
ppt17
 

More from Vivian S. Zhang

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
Vivian S. Zhang
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
Vivian S. Zhang
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
Vivian S. Zhang
 
Xgboost
XgboostXgboost
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
Vivian S. Zhang
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
Vivian S. Zhang
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data


Introducing natural language processing (NLP) with R

  • 1. Introducing NLP with R 10/6/14, 19:37 Introducing NLP with R Charlie Redmon | SupStat Analytics Copyright Supstat Inc. All Rights Reserved http://docs.supstat.com/NLPwithR/#1 Page 1 of 26
  • 2. Outline
      · Introduction to NLP
      · Foundational Frameworks
      · Working with text in R
      · Regular Expressions
          - As pattern matching device
          - Theoretical connection with finite state automaton
          - Application in morphological analysis
      · N-gram models
          - Recognizing language
          - Generating language
      · Further reading
  • 3. What is NLP?
      · Natural Language Processing
          - Briefly: Building models to facilitate human-computer interaction through language
          - We say natural language here to distinguish languages like English, Hungarian, and Bengali from computer languages and other invented communication systems (e.g. Morse code)
      · Major sub-disciplines:
          - Speech Recognition/Synthesis
          - Computational Morphology (word structure)
          - Lexical Semantics (word meaning)
          - Computational Syntax (phrase/sentence structure)
          - Compositional Semantics (phrase/sentence meaning)
          - Information Retrieval
  • 4. Why R?
      · R has powerful text processing capabilities
      · Many useful NLP-related packages
      · Many of the more sophisticated procedures in NLP generalize to statistical models, which is where R really excels
  • 5. Foundational NLP Frameworks
      · Turing
          - Turing Machine: Finite State Automaton, Finite State Transducer
      · Kleene
          - Regular Expressions
      · Chomsky
          - Regular Languages and their relation to natural languages
      · Markov
          - N-gram models
          - HMMs
      · Shannon
          - Information Theory
          - Noisy Channel, Entropy models
  • 6. The Workflow
      1. Import and manipulate text in R
      2. Create data structures facilitating NLP operations
      3. Model implementation:
          · Morphological parsing
          · N-gram parsing
          · N-gram language generation
          · ...
  • 7. Importing text into R
      · Primary importing functions: scan(), readLines()

        monty_text = scan('data/grail.txt', what="character", sep="", quote="")
        monty_text[1:6]
        [1] "SCENE" "1:" "[wind]" "[clop" "clop" "clop]"

        malayalam_text = scan('data/mathrubhumi_2014-10_full.txt', what="character", sep="", quote="")
        malayalam_text[15:20]
        [1] "#Date:" "01-10-2014"
        [3] "#----------------------------------------" "അേമരിkയിെലtിയ"
        [5] "+പധാനമ+nി" "നേര+nേമാദി"

      · Why might this data structure be a problem for many natural language structures?
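To make the difference between the two importing functions concrete, here is a minimal sketch. It writes a two-line stand-in file to a temporary path (the file contents and the names `tmp`, `tokens`, and `file_lines` are illustrative assumptions, not the slide's data) and reads it back both ways.

```r
# Write a tiny stand-in file so the example is self-contained (hypothetical content).
tmp <- tempfile(fileext = ".txt")
writeLines(c("SCENE 1: [wind]", "KING ARTHUR: Whoa there!"), tmp)

# scan() with sep="" splits on any whitespace: one element per token.
tokens <- scan(tmp, what = "character", sep = "", quote = "", quiet = TRUE)

# readLines() keeps each line of the file as one element.
file_lines <- readLines(tmp)

length(tokens)      # 7 whitespace-delimited tokens
length(file_lines)  # 2 lines
```

Either way the units are set by the file layout (whitespace or line breaks), not by linguistic units such as sentences.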
  • 8. Condensing to single text stream

        monty_text = paste(monty_text, collapse=" ")
        malayalam_text = paste(malayalam_text, collapse=" ")
        length(monty_text); length(malayalam_text)
        [1] 1
        [1] 1

        substr(monty_text, 1, 70)
        [1] "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop clop c"

        substr(malayalam_text, 304, 400)
        [1] "െത4ായി ഉcരിc് അേdഹെt അനാദരിcുെവn് െക.പി.സി.സി. +പസിഡn് വി.എം. സുധീരD. േമാഹDദ"
  • 9. Regular Expressions

        SYMBOL   MEANING                            EXAMPLE
        []       Disjunction (set)                  /[Gg]oogle/ = Google, google
        ?        0 or 1 of the preceding item       /savou?r/ = savor, savour
        *        0 or more of the preceding item    /hey!*/ = hey, hey!, hey!!, ...
        \        Escape character                   /hey\?/ = hey?
        +        1 or more of the preceding item    /a+h/ = ah, aah, aaah, ...
        {n,m}    n to m repetitions                 /a{1,4}h{1,3}/ = aahh, ahhh, ...
        .        Wildcard (any character)           /#.*/ = #rstats, #uofl, ...
        ()       Grouping                           /(ha)+/ = ha, haha, hahaha, ...
        [^ ]     NOT (negates bracketed chars)      /[^#]/ = any single character except #
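A few of the patterns in the table can be tried directly in base R. This is a small sketch on a made-up vector (`words` is an illustrative assumption); the anchors `^` and `$` are added so that each pattern must match the whole string, and the escape character is doubled because it sits inside an R string literal.

```r
words <- c("savor", "savour", "hey", "hey!!", "hey?", "haha", "aahh")

words[grepl("^savou?r$", words)]        # "savor" "savour"
words[grepl("^hey!*$", words)]          # "hey" "hey!!"
words[grepl("^hey\\?$", words)]         # "hey?"  (escaped ?, a literal question mark)
words[grepl("^(ha)+$", words)]          # "haha"
words[grepl("^a{1,4}h{1,3}$", words)]   # "aahh"
```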
  • 10. Regular Expressions

        SYMBOL   MEANING                            EXAMPLE
        [x-y]    Match characters from 'x' to 'y'   /[A-Z][1-9]/ = A1, Q8, X5, ...
        \w       Word character (alphanumeric)      /\w+'s/ = that's, Jerry's, ...
        \W       Non-word character
        \d       Digit character (0-9)              /\d{3}/ = 137, 254, ...
        \D       Non-digit character
        \s       Whitespace                         /\w+\s+\w+/ = I am, I  am, ...
        \S       Non-whitespace
        \b       Word boundary                      /\bthe\b/ = the, not then
        \B       Non-word boundary
        ^        Beginning of line                  /^[a-z]/ = non-capitalized beg.
        $        End of line                        /#.*$/ = hashtags at end of line
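In R source code each backslash in these character classes has to be doubled, because the pattern is written as a string literal. The following sketch uses a made-up vector `x` (an assumption for illustration) and `perl = TRUE`, as the later slides do.

```r
x <- c("the", "then", "Jerry's", "room 137", "I  am")

# \b: "the" as a whole word, so "then" is not matched.
x[grepl("\\bthe\\b", x, perl = TRUE)]                # "the"

# \d{3}: extract the first run of three digits from elements that have one.
regmatches(x, regexpr("\\d{3}", x, perl = TRUE))     # "137"

# \w+\s+\w+: two words separated by any amount of whitespace.
x[grepl("^\\w+\\s+\\w+$", x, perl = TRUE)]           # "room 137" "I  am"

# \w+'s: a possessive or contracted form.
x[grepl("\\w+'s", x, perl = TRUE)]                   # "Jerry's"
```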
  • 11. Manual segmentation
      The advantage of having all the text in a single element is that we can now split the text into different-sized segments for different kinds of natural language tasks.

        #sentence level
        pattern = "(?<=[.?!])\\s+"
        monty_sentences = strsplit(monty_text, split=pattern, perl=TRUE)
        monty_sentences = unlist(monty_sentences)
        monty_sentences[5:8]
        [1] "King of the Britons, defeator of the Saxons, sovereign of all England!"
        [2] "SOLDIER #1: Pull the other one!"
        [3] "ARTHUR: I am, ..."
        [4] "and this is my trusty servant Patsy."
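The sentence-level pattern relies on a lookbehind, `(?<=[.?!])`, so that the punctuation mark stays attached to its sentence and only the following whitespace is consumed by the split. A minimal sketch on a made-up string (the value of `text` is an assumption for illustration):

```r
text <- "Halt! Who goes there? It is I, Arthur."

# Split on whitespace that is preceded by ., ? or ! (lookbehind needs perl = TRUE).
pattern <- "(?<=[.?!])\\s+"
sentences <- unlist(strsplit(text, split = pattern, perl = TRUE))

sentences
# [1] "Halt!"            "Who goes there?"  "It is I, Arthur."
```

The slide output above shows the limits of this heuristic: the ellipsis in "I am, ..." ends in a period, so it is treated as a sentence boundary.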
  • 12. Manual segmentation
      Of course, depending on the language you're working with, you might have different definitions of sentence boundaries. For example, Hindi uses what's called a danda marker, । , in place of a period.

        hindi_text = scan('data/hindustan_full.txt', what="character", sep="")
        hindi_text = paste(hindi_text, collapse=" ")
        pattern = "(?<=[।?!])\\s+"
        hindi_sentences = strsplit(hindi_text, split=pattern, perl=TRUE)
        hindi_sentences = unlist(hindi_sentences)
        hindi_sentences[5:8]
        [1] "व"# मन# को लोकसभा चuनाव . करारी हार का सामना करना पड़ा था और उसका खाता भी नह9 खuल पाया था।"
        [2] "लोकसभा चuनाव . भाजपा और िशव#ना > कuछ छोA दलo D साथ िमलकर 48 . # 42 सीAE जीत9।"
        [3] "महाराFG . िशव#ना अब तक भाजपा D बड़e भाई की भLिमका iनभाती रही थी।"
        [4] "इन दोनo D बीच उस वOत अलगाव Qआ S जब भाजपा TU . नVU मोदी D >तWXव . पLणZ बQमत D साथ स[ासीन S।"
  • 13. Manual segmentation
      We can also split the original text according to word boundaries.

        #word level
        pattern = "[()[\\]\":;,.?!-]*\\s+[()[\\]\":;,.?!-]*"
        monty_words = strsplit(monty_text, split=pattern, perl=TRUE)
        monty_words = unlist(monty_words)
        monty_words[5:30]
        [1] "clop" "clop" "KING" "ARTHUR" "Whoa" "there" "clop" "clop"
        [9] "clop" "SOLDIER" "#1" "Halt" "Who" "goes" "there" "ARTHUR"
        [17] "It" "is" "I" "Arthur" "son" "of" "Uther" "Pendragon"
        [25] "from" "the"
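The word-level pattern treats a run of whitespace, together with any punctuation touching it on either side, as a single separator. A small sketch on a made-up string (`text` is an assumption; the character class is the same as on the slide, with both square brackets escaped here for readability):

```r
text <- "KING ARTHUR: Whoa there! [clop clop clop] SOLDIER #1: Halt!"

# punctuation* whitespace+ punctuation*  is consumed as one separator
pattern <- "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
words <- unlist(strsplit(text, split = pattern, perl = TRUE))

words
# [1] "KING"    "ARTHUR"  "Whoa"    "there"   "clop"    "clop"    "clop"
# [8] "SOLDIER" "#1"      "Halt!"
```

Note that the final "Halt!" keeps its exclamation mark: punctuation is only removed when it is adjacent to whitespace, so the very last token of a text may need separate cleanup.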
  • 14. Building a Lexicon
      For many NLP tasks it is useful to have a dictionary, or lexicon, of the language you're working with. Other researchers may have already built a text-formatted lexicon of the language you're using, but nevertheless it's useful to see how we might build one.

        #convert all words to lowercase
        monty_words = tolower(monty_words)
        monty_words[1:9]
        [1] "scene" "1" "wind" "clop" "clop" "clop" "king" "arthur" "whoa"

        #convert vector of tokens to set of unique words
        monty_lexicon = unique(monty_words)
        monty_lexicon[1:8]
        [1] "scene" "1" "wind" "clop" "king" "arthur" "whoa" "there"
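As a small illustration of the same two steps, the sketch below lowercases a made-up token vector, reduces it to its unique words, and additionally counts how often each word occurs with `table()`. The frequency count is an addition for illustration and is not part of the slide; `tokens`, `lexicon`, and `freq` are assumed names.

```r
tokens <- c("Clop", "clop", "KING", "Arthur", "king", "clop")

# Normalise case so that "KING" and "king" count as the same word.
tokens <- tolower(tokens)

# The lexicon is the set of distinct words, in order of first appearance.
lexicon <- unique(tokens)
lexicon   # "clop" "king" "arthur"

# Optional: frequency of each word, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)
freq[["clop"]]   # 3
```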
  • 15. Building a Lexicon

        length(monty_words)
        [1] 11213
        length(monty_lexicon)
        [1] 1889
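One way to read these two numbers together is as a ratio of distinct words to total words (often called the type-token ratio; the term and the variable names below are additions here, not from the slides). Using the counts reported above:

```r
n_tokens <- 11213   # length(monty_words), as reported on the slide
n_types  <- 1889    # length(monty_lexicon), as reported on the slide

# Share of the running text that consists of distinct word forms.
ratio <- n_types / n_tokens
round(ratio, 3)   # 0.168
```

So roughly one token in six introduces a word form not seen before, which is why the lexicon is so much smaller than the text.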
  • 16. Morphological Analysis
      Now that we have our lexicon we can start to model the internal structure of the words in our corpus. Formally, morphological rules can be modeled as an FSA. Here's a simple example from Jurafsky and Martin (2000) [FSA diagram not reproduced in this transcript].
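To give a feel for what "morphological rules as an FSA" means in code, here is a minimal sketch of a three-state automaton that accepts a bare stem or a stem followed by the suffix -er. It is a made-up toy, not the automaton from Jurafsky and Martin; the state names, the symbol classes `stem` and `er`, and the function `accepts` are all assumptions for illustration.

```r
# Transition table: rows are states, columns are input symbol classes.
# NA means "no transition", i.e. the input is rejected.
transitions <- matrix(
  c("q1", NA,     # from q0: a stem moves to q1; a suffix is not allowed yet
    NA,   "q2",   # from q1: the suffix -er moves to q2
    NA,   NA),    # from q2: no further input is accepted
  nrow = 3, byrow = TRUE,
  dimnames = list(c("q0", "q1", "q2"), c("stem", "er"))
)
accepting <- c("q1", "q2")

# Run the automaton over a sequence of symbol classes.
accepts <- function(symbols) {
  state <- "q0"
  for (s in symbols) {
    state <- transitions[state, s]
    if (is.na(state)) return(FALSE)
  }
  state %in% accepting
}

accepts(c("stem"))              # TRUE  (e.g. "build")
accepts(c("stem", "er"))        # TRUE  (e.g. "build" + "er")
accepts(c("er"))                # FALSE (suffix without a stem)
accepts(c("stem", "er", "er"))  # FALSE (no transition out of q2)
```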
  • 17. Morphological Analysis
      But since it has already been proven that all regular expressions can be modeled as FSAs, and vice versa, we can utilize the grep utilities in R to handle this process. First let's see if we can extract all the agentive nouns (e.g. builder, worker, shopper, etc.).

        monty_agents = grep('.+er$', monty_lexicon, perl=TRUE, value=TRUE)
        monty_agents[1:30]
        [1] "soldier" "uther" "other" "master" "together" "winter"
        [7] "plover" "warmer" "matter" "order" "creeper" "under"
        [13] "cart-master" "customer" "better" "over" "bother" "ever"
        [19] "officer" "her" "water" "power" "mer" "villager"
        [25] "whether" "cider" "e'er" "prisoner" "shelter" "wiper"

      · This isn't exactly what we want. How can we improve our results?
  • 18. Morphological Analysis
      Take advantage of the lexicon.

        monty_agents = grep('.+er$', monty_lexicon, perl=TRUE, value=TRUE)
        new_monty_agents = character(0)
        for (i in seq_along(monty_agents)) {
          word = monty_agents[i]
          # drop the final "er" to get the candidate stem
          stem_end = nchar(word) - 2
          stem = substr(word, 1, stem_end)
          # keep the word only if its stem is itself in the lexicon
          if (is.element(stem, monty_lexicon)) {
            new_monty_agents[i] = word
          }
        }
        # positions skipped by the loop are NA; remove them
        new_monty_agents = new_monty_agents[!is.na(new_monty_agents)]
        new_monty_agents
        [1] "warmer" "creeper" "longer" "nearer" "higher" "killer" "bleeder" "keeper"
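The same lexicon check can be written without an explicit loop. The sketch below is an equivalent vectorized formulation on a made-up lexicon (the contents of `lexicon` and the names `candidates`, `stems`, and `agents` are assumptions for illustration).

```r
lexicon <- c("warm", "warmer", "keep", "keeper", "water",
             "other", "kill", "killer", "her")

# Step 1: every word ending in -er is a candidate.
candidates <- grep(".+er$", lexicon, value = TRUE)

# Step 2: strip the final two characters to get each candidate's stem.
stems <- substr(candidates, 1, nchar(candidates) - 2)

# Step 3: keep candidates whose stem is itself a word in the lexicon.
agents <- candidates[stems %in% lexicon]
agents   # "warmer" "keeper" "killer"
```

Two limits of the heuristic are worth keeping in mind: comparatives such as "warmer" pass the test just as agentive nouns do, and words whose stem changes before -er (a doubled consonant or a dropped final e) are missed because the stripped form is not in the lexicon.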
  • 19. Malayalam FSA [diagram not reproduced in this transcript]
  • 20. N-gram Models
      · Based on Markov model
      · At their heart, n-grams answer the question: "What is the likelihood of one word (or character, phrase, sentence...) following another word or sequence of words?"
      · The kernel equation:

        $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$

        where N is the N in N-gram (i.e. the number of words used to build the grammar)
      · For example, if we have the string, "We are the Knights who say, 'Ni!'", in the trigram model we're moving along the string asking: P(Knights|are the), P(who|the Knights), ...
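For the simplest case, N = 2, the right-hand side of the kernel equation is P(w_n | w_{n-1}). One common way to estimate it, assumed here since the slides do not say how the probabilities are obtained, is by relative frequency: count how often a word follows a given context and divide by how often that context occurs. A minimal base R sketch on a made-up token sequence:

```r
tokens <- c("the", "knights", "who", "say", "ni",
            "the", "knights", "say", "ni", "the", "grail")

# Pair each word with the word that follows it.
prev <- head(tokens, -1)
nxt  <- tail(tokens, -1)

# Rows: context word; columns: following word; cells: bigram counts.
bigram_counts <- table(prev, nxt)

# Divide each row by its total to get P(next | prev).
bigram_probs <- prop.table(bigram_counts, margin = 1)

bigram_probs["the", "knights"]   # 2 of the 3 occurrences of "the" as context
bigram_probs["say", "ni"]        # "say" is always followed by "ni"
```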
  • 21. N-gram Models

        library(ngram)
        monty_bigram = ngram(monty_text, n=2)
        get.ngrams(monty_bigram)[1:10]
        [1] "cannot tell," "away. Just" "not 'is'." "bowels unplugged,"
        [5] "well, Arthur," "[twang] Wayy!" "HERBERT: B--" "no. Until"
        [9] "trade. I" "down, fell"

        monty_trigram = ngram(monty_text, n=3)
        get.ngrams(monty_trigram)[1:10]
        [1] "a good spanking!" "Oooh! GALAHAD: My" "is the capital" "to you no"
        [5] "Who's that then?" "you get back." "no arms left." "want... a shrubbery!"
        [9] "Shut up! Um," "to a successful"
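To see what an n-gram list contains without the package, the sliding-window idea can be sketched in a few lines of base R. This is an illustration only and says nothing about how the ngram package is implemented; `make_ngrams` is a hypothetical helper name.

```r
# Build all n-grams from a token vector by sliding a window of width n.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("We", "are", "the", "Knights")
make_ngrams(tokens, 2)   # "We are" "are the" "the Knights"
make_ngrams(tokens, 3)   # "We are the" "are the Knights"
```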
  • 22. N-gram Models

        print(monty_bigram, full=TRUE)

        cannot tell,
        suffice {1} |

        away. Just
        ignore {1} |

        not 'is'.
        HEAD {1} | You {2} | Not {1} |

        bowels unplugged,
        And {1} |

        well, Arthur,
        for {1} |

        [twang] Wayy!
        [twang] {1} |
  • 23. N-gram Models

        print(monty_trigram, full=TRUE)

        a good spanking!
        GIRLS: {1} |

        Oooh! GALAHAD: My
        God! {1} |

        is the capital
        of {1} |

        to you no
        more, {1} |

        Who's that then?
        CART-MASTER: {1} |

        you get back.
        GUARD {1} |
  • 24. N-gram Models

        babble(monty_bigram, 8)
        [1] "must go too. OFFICER #1: Back. Right away. "
        babble(monty_bigram, 8)
        [1] "I'll do you up a treat mate! GALAHAD: "
        babble(monty_bigram, 8)
        [1] "from just stop him entering the room. GUARD "
  • 25. N-gram Models

        babble(monty_trigram, 8)
        [1] "were still no nearer the Grail. Meanwhile, King "
        babble(monty_trigram, 8)
        [1] "the Britons. BEDEVERE: My liege! I would be "
        babble(monty_trigram, 8)
        [1] "Shh! VILLAGER #2: Wood! BEDEVERE: So, why do "
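The idea behind generating text from an n-gram model can be sketched in base R: start from a word, then repeatedly sample the next word from those that followed the current one in the training text. The function below is a simplified illustration under that assumption, not the ngram package's `babble()`; the name `babble_bigram`, its arguments, and the toy `tokens` vector are hypothetical.

```r
babble_bigram <- function(tokens, n_words, seed = 1) {
  set.seed(seed)
  # Map each word to the vector of words observed right after it.
  # Repeated followers stay repeated, so sampling respects bigram frequency.
  followers <- split(tail(tokens, -1), head(tokens, -1))

  out <- character(n_words)
  out[1] <- tokens[sample.int(length(tokens), 1)]
  for (i in seq_len(n_words)[-1]) {
    nxt <- followers[[out[i - 1]]]
    # Dead end (word only seen at the very end of the text): restart anywhere.
    if (is.null(nxt)) nxt <- tokens
    out[i] <- nxt[sample.int(length(nxt), 1)]
  }
  paste(out, collapse = " ")
}

tokens <- c("the", "knights", "who", "say", "ni",
            "the", "knights", "say", "ni", "the", "grail")
babble_bigram(tokens, 8)
```

With a larger N the context is longer, which is why the trigram output on the slides reads more like the original script than the bigram output does.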
  • 26. Further Reading
      · Jurafsky and Martin (2008), Speech and Language Processing
      · Manning, Raghavan, and Schütze (2008), Introduction to Information Retrieval
      · Gries (2009), Quantitative Corpus Linguistics with R