SlideShare a Scribd company logo
1 of 16
Download to read offline
Introduction to Natural Language Processing
Lecture 2. Tokenization and word counts
Ekaterina Chernyak, Dmitry Ilvovsky
echernyak@hse.ru, dilvovsky@hse.ru
National Research University
Higher School of Economics (Moscow)
July 13, 2015
How many words?
“The rain in Spain stays mainly in the plain.”
9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain
7 (or 8) types: T / the rain, in, Spain, stays, mainly, plain
Type and token
Type is an element of the vocabulary.
Token is an instance of that type in the text.
N = number of tokens
V - vocabulary (i.e. all types)
|V | = size of vocabulary (i.e. number of types)
How are N and |V | related?
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 2 / 16
Zipf’s law
Zipf’s law [Gelbukh, Sidorov, 2001])
In any large enough text, the frequency ranks (starting from the highest)
of types are inversely proportional to the corresponding frequencies:
f =
1
r
f – frequency of a type
r – rank of a type (its position in the list of all types in order of their
frequency of occurrence)
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 3 / 16
Zipf’s law: example
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 4 / 16
Heaps’ law
Heaps’ law [Gelbukh, Sidorov, 2001])
The number of different types in a text is roughly proportional to an
exponent of its size:
|V | = K ∗ Nb
N = number of tokens
|V | = size of vocabulary (i.e. number of types)
K, b – free parameters, K ∈ [10, 100], b ∈ [0.4, 0.6]
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 5 / 16
Why tokenization is difficult?
Easy example: “Good muffins cost $3.88 in New York. Please buy me
two of them. Thanks.”
is “.” a token?
is $3.88 a single token?
is “New York” a single token?
Real data may contain noise in it: code, markup, URLs, faulty
punctuation
Real data contains misspellings: “an dthen she aksed”
Period “.” does not always mean the end of sentence: m.p.h., PhD.
Nevertheless tokenization is important for all other text processing steps.
There are rule-based and machine learning-based approaches to
development of tokenizers.
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 6 / 16
Rule-based tokenization
For example, define a token as a sequence of upper and lower case letters:
A-Za-z. Reqular expression is a nice tool for programming such rules.
RE in Python
In[1]: import re
In[2]: prog = re.compile(’[A-Za-z]+’)
In[3]: prog.findall("Words, words, words.")
Out[1]: [’Words’, ’words’, ’words’]
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 7 / 16
Sentence segmentation (1)
What are the sentence boundaries?
?, ! are usually unambiguous
Period “.” is an issue
Direct speech is also an issue: She said, ”What time will you be
home?” and I said, ”I don’t know! ”. Even worse in Russian!
Let us learn a classifier for sentence segmentation.
Binary classifier
A binary classifier f : X =⇒ 0, 1 takes input data X (a set of sentences)
and decides EndOfSentence (0) or NotEndOfSentence (1).
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 8 / 16
Sentence segmentation (2)
What can be the features for classification? I am a period, am I
EndOfSentence?
Lots of blanks after me?
Lots of lower case letters and ? or ! after me?
Do I belong to abbreviation?
etc.
We need a lot of hand-markup.
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 9 / 16
Do we need to program this?
No! There is Natural Language Toolkit (NLTK) for everything.
NLTK tokenizers
In[1]: from nltk.tokenize import RegexpTokenizer,
wordpunct tokenize
In[2]: s = ’Good muffins cost $3.88 in New York. Please
buy me two of them. Thanks.’
In[3]: tokenizer = RegexpTokenizer(’w+| $ [d .]+ | S
+’)
In[4]: tokenizer.tokenize(s)
In[5]: wordpunct tokenize(s)
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 10 / 16
Learning to tokenize
nltk.tokenize.punkt is a tool for learning to tokenize from your data.
It includes pre-trained Punkt tokenizer for English.
Punkt tokenizer
In[1]: import nltk.data
In[2]: sent detector =
nltk.data.load(’tokenizers/punkt/english.pickle’)
In[3]: sent detector.tokenize(s)
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 11 / 16
Exercise 1.1 Word counts
Now it is time for some programming!
Exercise 1.1
Input: Alice in Wonderland (alice.txt) or your text
Output 1: number of tokens
Output 2: number of types
Use nltk.FreqDist() to count types. nltk.FreqDist() is a frequency
dictionary: [key, frequency(key)].
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 12 / 16
Lemmatization (Normalization)
Each word has a base form:
has, had, have =⇒ have
cats, cat, cat’s =⇒ cat
Windows =⇒ window or Windows?
Lemmatization [Jurafsky, Martin, 1999]
Lemmatization (or normalization) is used to reduce inflections or variant
forms to base forms. A dictionary with headwords is required.
Lemmatization
In[1]: from nltk.stem import WordNetLemmatizer
In[2]: lemmatizer = WordNetLemmatizer()
In[3]: lemmatizer.lemmatize(t)
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 13 / 16
Stemming
A word is built with morphems: word = stem + affixes. Sometimes we do
not need affixes.
translate, translation, translator =⇒ translat
Stemming [Jurafsky, Martin, 1999]
Reduce terms to their stems in information retrieval and text classification.
Porter’s algorithm is the most common English stemmer.
Stemming
In[1]: from nltk.stem.porter import PorterStemmer
In[2]: stemmer = PorterStemmer()
In[3]: stemmer.stem(t)
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 14 / 16
Exercise 1.2 Word counts (continued)
Exercise 1.2
Input: Alice in Wonderland (alice.txt) or your text
Output 1: 20 most common lemmata
Output 2: 20 most common stems
Use FreqDist() to count lemmata and stems. Use
FreqDist().most common() to find most common lemmata and stems.
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 15 / 16
Exercise 1.3 Do we need all words?
Stopword is a not meaningful word: prepositions, adjunctions, pronouns,
articles, etc.
Stopwords
In[1]: from nltk.corpus import stopwords
In[2]: print stopwords.words(’english’)
Exercise 1.3
Input: Alice in Wonderland (alice.txt) or your text
Output 1: 20 most common lemmata without stop words
Use not in operator to exclude not stopwords in a cycle.
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 16 / 16

More Related Content

What's hot

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingSaurabh Kaushik
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Edureka!
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games ResearchJose Zagal
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingYasir Khan
 
Natural language-processing
Natural language-processingNatural language-processing
Natural language-processingHareem Naz
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processingMinh Pham
 
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf EremyanDataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyanrudolf eremyan
 
Natural language processing
Natural language processingNatural language processing
Natural language processingYogendra Tamang
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMariana Soffer
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyMarina Santini
 
Natural Language Processing
Natural Language Processing Natural Language Processing
Natural Language Processing Adarsh Saxena
 

What's hot (20)

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
8 issues in pos tagging
8 issues in pos tagging8 issues in pos tagging
8 issues in pos tagging
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Nlp
NlpNlp
Nlp
 
Nlp
NlpNlp
Nlp
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games Research
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural language-processing
Natural language-processingNatural language-processing
Natural language-processing
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf EremyanDataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
Introduction to NLTK
Introduction to NLTKIntroduction to NLTK
Introduction to NLTK
 
OpenNLP demo
OpenNLP demoOpenNLP demo
OpenNLP demo
 
Natural Language Processing
Natural Language Processing Natural Language Processing
Natural Language Processing
 

Similar to Intro to NLP. Lecture 2

13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for TranslationRIILP
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMassimo Schenone
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxAlyaaMachi
 
Artificial Intelligence_NLP
Artificial Intelligence_NLPArtificial Intelligence_NLP
Artificial Intelligence_NLPThenmozhiK5
 
AI UNIT-3 FINAL (1).pptx
AI UNIT-3 FINAL (1).pptxAI UNIT-3 FINAL (1).pptx
AI UNIT-3 FINAL (1).pptxprakashvs7
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processingpunedevscom
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlpeSAT Journals
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.pptbutest
 
A Simple Explanation of XLNet
A Simple Explanation of XLNetA Simple Explanation of XLNet
A Simple Explanation of XLNetDomyoung Lee
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkViet-Trung TRAN
 
G2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageG2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageijnlc
 

Similar to Intro to NLP. Lecture 2 (20)

L3 v2
L3 v2L3 v2
L3 v2
 
Nltk
NltkNltk
Nltk
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
REPORT.doc
REPORT.docREPORT.doc
REPORT.doc
 
AI_08_NLP.pptx
AI_08_NLP.pptxAI_08_NLP.pptx
AI_08_NLP.pptx
 
Document similarity
Document similarityDocument similarity
Document similarity
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
FinalReport
FinalReportFinalReport
FinalReport
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptx
 
Artificial Intelligence_NLP
Artificial Intelligence_NLPArtificial Intelligence_NLP
Artificial Intelligence_NLP
 
AI UNIT-3 FINAL (1).pptx
AI UNIT-3 FINAL (1).pptxAI UNIT-3 FINAL (1).pptx
AI UNIT-3 FINAL (1).pptx
 
NLP and Deep Learning
NLP and Deep LearningNLP and Deep Learning
NLP and Deep Learning
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlp
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.ppt
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlp
 
A Simple Explanation of XLNet
A Simple Explanation of XLNetA Simple Explanation of XLNet
A Simple Explanation of XLNet
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural Network
 
G2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageG2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian language
 

More from Ekaterina Chernyak (16)

Chernyak_defense
Chernyak_defenseChernyak_defense
Chernyak_defense
 
Backgammon
BackgammonBackgammon
Backgammon
 
Desktop game agar.io
Desktop game agar.ioDesktop game agar.io
Desktop game agar.io
 
Gayazov
GayazovGayazov
Gayazov
 
Koptsov.web.introduction
Koptsov.web.introductionKoptsov.web.introduction
Koptsov.web.introduction
 
Intro to NLP (RU)
Intro to NLP (RU)Intro to NLP (RU)
Intro to NLP (RU)
 
редактор параллельных разметок
редактор параллельных разметокредактор параллельных разметок
редактор параллельных разметок
 
Hse project introduction_22012015
Hse project introduction_22012015Hse project introduction_22012015
Hse project introduction_22012015
 
зачем нужен чистый корпус
зачем нужен чистый корпусзачем нужен чистый корпус
зачем нужен чистый корпус
 
Yacovlev
YacovlevYacovlev
Yacovlev
 
Suhoroslov
SuhoroslovSuhoroslov
Suhoroslov
 
Koptsov
KoptsovKoptsov
Koptsov
 
Gusakov
GusakovGusakov
Gusakov
 
Hse.projects 17.01.2015
Hse.projects 17.01.2015Hse.projects 17.01.2015
Hse.projects 17.01.2015
 
Ignat vita artur
Ignat vita arturIgnat vita artur
Ignat vita artur
 
Ivan p
Ivan pIvan p
Ivan p
 

Recently uploaded

Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxAleenaTreesaSaji
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 

Recently uploaded (20)

Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptx
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 

Intro to NLP. Lecture 2

  • 1. Introduction to Natural Language Processing Lecture 2. Tokenization and word counts Ekaterina Chernyak, Dmitry Ilvovsky echernyak@hse.ru, dilvovsky@hse.ru National Research University Higher School of Economics (Moscow) July 13, 2015
  • 2. How many words? “The rain in Spain stays mainly in the plain.” 9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain 7 (or 8) types: T / the rain, in, Spain, stays, mainly, plain Type and token Type is an element of the vocabulary. Token is an instance of that type in the text. N = number of tokens V - vocabulary (i.e. all types) |V | = size of vocabulary (i.e. number of types) How are N and |V | related? Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 2 / 16
  • 3. Zipf’s law Zipf’s law [Gelbukh, Sidorov, 2001]) In any large enough text, the frequency ranks (starting from the highest) of types are inversely proportional to the corresponding frequencies: f = 1 r f – frequency of a type r – rank of a type (its position in the list of all types in order of their frequency of occurrence) Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 3 / 16
  • 4. Zipf’s law: example Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 4 / 16
  • 5. Heaps’ law Heaps’ law [Gelbukh, Sidorov, 2001]) The number of different types in a text is roughly proportional to an exponent of its size: |V | = K ∗ Nb N = number of tokens |V | = size of vocabulary (i.e. number of types) K, b – free parameters, K ∈ [10, 100], b ∈ [0.4, 0.6] Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 5 / 16
  • 6. Why tokenization is difficult? Easy example: “Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.” is “.” a token? is $3.88 a single token? is “New York” a single token? Real data may contain noise in it: code, markup, URLs, faulty punctuation Real data contains misspellings: “an dthen she aksed” Period “.” does not always mean the end of sentence: m.p.h., PhD. Nevertheless tokenization is important for all other text processing steps. There are rule-based and machine learning-based approaches to development of tokenizers. Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 6 / 16
  • 7. Rule-based tokenization For example, define a token as a sequence of upper and lower case letters: A-Za-z. Reqular expression is a nice tool for programming such rules. RE in Python In[1]: import re In[2]: prog = re.compile(’[A-Za-z]+’) In[3]: prog.findall("Words, words, words.") Out[1]: [’Words’, ’words’, ’words’] Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 7 / 16
  • 8. Sentence segmentation (1) What are the sentence boundaries? ?, ! are usually unambiguous Period “.” is an issue Direct speech is also an issue: She said, ”What time will you be home?” and I said, ”I don’t know! ”. Even worse in Russian! Let us learn a classifier for sentence segmentation. Binary classifier A binary classifier f : X =⇒ 0, 1 takes input data X (a set of sentences) and decides EndOfSentence (0) or NotEndOfSentence (1). Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 8 / 16
  • 9. Sentence segmentation (2) What can be the features for classification? I am a period, am I EndOfSentence? Lots of blanks after me? Lots of lower case letters and ? or ! after me? Do I belong to abbreviation? etc. We need a lot of hand-markup. Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 9 / 16
  • 10. Do we need to program this? No! There is Natural Language Toolkit (NLTK) for everything. NLTK tokenizers In[1]: from nltk.tokenize import RegexpTokenizer, wordpunct tokenize In[2]: s = ’Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.’ In[3]: tokenizer = RegexpTokenizer(’w+| $ [d .]+ | S +’) In[4]: tokenizer.tokenize(s) In[5]: wordpunct tokenize(s) Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 10 / 16
  • 11. Learning to tokenize nltk.tokenize.punkt is a tool for learning to tokenize from your data. It includes pre-trained Punkt tokenizer for English. Punkt tokenizer In[1]: import nltk.data In[2]: sent detector = nltk.data.load(’tokenizers/punkt/english.pickle’) In[3]: sent detector.tokenize(s) Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 11 / 16
  • 12. Exercise 1.1 Word counts Now it is time for some programming! Exercise 1.1 Input: Alice in Wonderland (alice.txt) or your text Output 1: number of tokens Output 2: number of types Use nltk.FreqDist() to count types. nltk.FreqDist() is a frequency dictionary: [key, frequency(key)]. Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 12 / 16
  • 13. Lemmatization (Normalization) Each word has a base form: has, had, have =⇒ have cats, cat, cat’s =⇒ cat Windows =⇒ window or Windows? Lemmatization [Jurafsky, Martin, 1999] Lemmatization (or normalization) is used to reduce inflections or variant forms to base forms. A dictionary with headwords is required. Lemmatization In[1]: from nltk.stem import WordNetLemmatizer In[2]: lemmatizer = WordNetLemmatizer() In[3]: lemmatizer.lemmatize(t) Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 13 / 16
  • 14. Stemming A word is built with morphems: word = stem + affixes. Sometimes we do not need affixes. translate, translation, translator =⇒ translat Stemming [Jurafsky, Martin, 1999] Reduce terms to their stems in information retrieval and text classification. Porter’s algorithm is the most common English stemmer. Stemming In[1]: from nltk.stem.porter import PorterStemmer In[2]: stemmer = PorterStemmer() In[3]: stemmer.stem(t) Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 14 / 16
  • 15. Exercise 1.2 Word counts (continued) Exercise 1.2 Input: Alice in Wonderland (alice.txt) or your text Output 1: 20 most common lemmata Output 2: 20 most common stems Use FreqDist() to count lemmata and stems. Use FreqDist().most common() to find most common lemmata and stems. Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 15 / 16
  • 16. Exercise 1.3 Do we need all words? Stopword is a not meaningful word: prepositions, adjunctions, pronouns, articles, etc. Stopwords In[1]: from nltk.corpus import stopwords In[2]: print stopwords.words(’english’) Exercise 1.3 Input: Alice in Wonderland (alice.txt) or your text Output 1: 20 most common lemmata without stop words Use not in operator to exclude not stopwords in a cycle. Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 16 / 16