SlideShare a Scribd company logo
1 of 26
Download to read offline
Natural language 
processing with Python
Hi all! 
Iván Compañy 
@pyivanc
Index 
1. Introduction 
1. NLP 
2. MyTaste 
2. Theory 
1. Data processing 
1. Normalizing 
2. Tokenizing 
3. Tagging 
2. Data post-processing 
1. Classification 
2. Word count 
3. Practice
WTF Natural language processing (NLP) is a field of 
computer science, artificial intelligence, and linguistics 
concerned with the interactions between computers and 
human (natural) languages 
Wikipedia 
1.1. NLP 
Natural language processing (NLP) is all about human 
text processing 
Me
1.2. MyTaste 
http://www.mytaste.com/
1.2. MyTaste
1.2. MyTaste 
1. How do I know where is the important information? 
2. How do I know if the information is a cooking recipe?
2. Theory
2.1. Data processing 
yo soy programador python 
DET VRB N N 
Yo soy programador Python 
1. Normalize: Adapt the text depending of your needs 
2. Tokenize: Split text in identifiable linguistic units 
3. Tagging: Classify words into their parts-of-speech
Recipes 
I love the chocolate 
Everybody loves the chocolate 
I still love chocolate! 
Why you don’t love chocolate? 
My GF is chocolate 
Non Recipes 
Once upon a time.. 
Remember, remember… 
print(“hello Python”) 
Fly, you fools! 
Let there be rock 
2.2. Data post-processing 
Content selection 
1. Classification
2.2. Data post-processing 1. Classification 
Feature selection 
• “A recipe must contain the word ‘chocolate’” 
• “A recipe must contain at least one verb” 
What! 
Training 
MACHINE LEARNING ALGORITHM
2.2. Data post-processing 1. Classification 
Machine 
learning 
algorithm 
This is my recipe of chocolate 
This is the beginning of anything 
I love chocolate 
Why you don’t love chocolate? 
I’m becoming insane 
recipe 
non recipe 
recipe 
recipe 
non recipe 
This is my recipe of chocolate 
This is the beginning of anything 
I love chocolate 
Why you don’t love chocolate? 
I’m becoming insane 
… 
Test input
2.2. Data post-processing 2. Word count 
Frequency 
distribution 
algotihm 
This is my recipe of chocolate 
This is the beginning of anything 
I love chocolate 
Why you don’t love chocolate? 
I’m becoming insane 
… 
Input
NLTK 
http://www.nltk.org/
DownloadingIn csotarpllionrga:: pythpoinp -icn s“itmaplolr tn lntlktk;nltk.download()”
Data processing 
import nltk 
#Sample text 
text = “I am a Python programmer” 
#Normalize (the ‘I’ character in english must be in uppercase) 
text = text.capitalize() 
#Tokenize 
text_tokenized = nltk.word_tokenize(text) 
#We create the tagger 
from nltk.corpus import brown 
default_tag = nltk.DefaultTagger('UNK') 
tagger = nltk.UnigramTagger(brown.tagged_sents(), backoff=default_tag) 
#Tagging 
text_tagged = tagger.tag(text_tokenized) 
#Output 
#[('I', u'PPSS'), ('am', u'BEM'), ('a', u'AT'), ('python', u'NN'), ('programmer', u'NN')]
Data post-processing: Classification 
from nltk.corpus import brown 
with open(‘tagged_recipes.txt’, ‘rb’) as f: 
recipes = [] 
for recipe in f.readlines(): 
recipes.append(recipe, ‘recipe’) 
non_recipes = [] 
for line in brown.tagged_sents()[:1000]: 
non_recipes.append((line, ‘non_recipe’)) 
all_texts = (recipes + non_recipes) 
random.shuffle(all_texts)
Data post-processing: Classification 
def feature(text): 
has_chocolate = any(['chocolate' in word.lower() for (word, pos) in text]) 
is_american = any(['america' in word.lower() for (word, pos) in text]) 
return {'has_chocolate': has_chocolate, 'is_american': is_american} 
featuresets = [(feature(text), t) for (text, t) in all_sents] 
train_set, test_set = featuresets[:18000], featuresets[18000:] 
classifier = nltk.NaiveBayesClassifier.train(train_set)
Data post-processing: Classification 
print classifier.show_most_informative_features(5) 
“”” 
Most Informative Features 
has_chocolate = True recipe : non_re = 86.6 : 1.0 
is_american = True non_re : recipe = 70.4 : 1.0 
is_american = False recipe : non_re = 1.0 : 1.0 
has_chocolate = False non_re : recipe = 1.0 : 1.0 
“”” 
print nltk.classify.accuracy(classifier, test_set) 
“”” 
0.514 
“””
Data post-processing: Word count 
import nltk 
with open(“recipes.txt”) as f: 
tokenized_text = nltk.word_tokenize(f.read()) 
fdist = nltk.FreqDist(tokenized_text) 
fdist.plot(50)
Data post-processing: Classification
Examples of usage: 
• Text recognition 
• Tag detection in websites 
• Market analysis 
• Ads system 
• Translators 
• …
More things I didn’t explain: 
• Analyze group of texts (Chunk tagging) 
• Semantic analysis 
• How to create a corpus 
• …
More info:
Thank you! 
Iván Compañy 
@pyivanc

More Related Content

What's hot

Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
Fasihul Kabir
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
Vsevolod Dyomkin
 

What's hot (20)

Natural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - IntroductionNatural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - Introduction
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Using Stanza NLP and TensorFlow to create a summary of a book
Using Stanza NLP and TensorFlow to create a summary of a bookUsing Stanza NLP and TensorFlow to create a summary of a book
Using Stanza NLP and TensorFlow to create a summary of a book
 
Python_in_Detail
Python_in_DetailPython_in_Detail
Python_in_Detail
 
Natural Language Processing: Comparing NLTK and OpenNLP
Natural Language Processing: Comparing NLTK and OpenNLPNatural Language Processing: Comparing NLTK and OpenNLP
Natural Language Processing: Comparing NLTK and OpenNLP
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
 
OpenNLP demo
OpenNLP demoOpenNLP demo
OpenNLP demo
 
Let’s Learn Python An introduction to Python
Let’s Learn Python An introduction to Python Let’s Learn Python An introduction to Python
Let’s Learn Python An introduction to Python
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
You too can nlp - PyBay 2018 lightning talk
You too can nlp - PyBay 2018 lightning talkYou too can nlp - PyBay 2018 lightning talk
You too can nlp - PyBay 2018 lightning talk
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
Python made easy
Python made easy Python made easy
Python made easy
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Chapter 5 - THREADING & REGULAR exp - MAULIK BORSANIYA
Chapter 5 - THREADING & REGULAR exp - MAULIK BORSANIYAChapter 5 - THREADING & REGULAR exp - MAULIK BORSANIYA
Chapter 5 - THREADING & REGULAR exp - MAULIK BORSANIYA
 

Viewers also liked

Conspiracy Profile
Conspiracy ProfileConspiracy Profile
Conspiracy Profile
charlyheus
 
แรงดันในของเหลว
แรงดันในของเหลวแรงดันในของเหลว
แรงดันในของเหลว
tewin2553
 
Edge Amsterdam profile
Edge Amsterdam profileEdge Amsterdam profile
Edge Amsterdam profile
charlyheus
 
Neue aufregende Web Technologien, HTML5 + CSS3 anhand von Praxisbeispielen le...
Neue aufregende Web Technologien, HTML5 + CSS3 anhand von Praxisbeispielen le...Neue aufregende Web Technologien, HTML5 + CSS3 anhand von Praxisbeispielen le...
Neue aufregende Web Technologien, HTML5 + CSS3 anhand von Praxisbeispielen le...
Eric Eggert
 
Conspiracy Profile
Conspiracy ProfileConspiracy Profile
Conspiracy Profile
charlyheus
 
SharePoint Exchange Forum - 10 Worst Mistakes in SharePoint Branding
SharePoint Exchange Forum - 10 Worst Mistakes in SharePoint BrandingSharePoint Exchange Forum - 10 Worst Mistakes in SharePoint Branding
SharePoint Exchange Forum - 10 Worst Mistakes in SharePoint Branding
Marcy Kellar
 
поездка на родину
поездка на родинупоездка на родину
поездка на родину
zizari
 
Edge Amsterdam Profile
Edge Amsterdam ProfileEdge Amsterdam Profile
Edge Amsterdam Profile
charlyheus
 

Viewers also liked (20)

Conspiracy Profile
Conspiracy ProfileConspiracy Profile
Conspiracy Profile
 
แรงดันในของเหลว
แรงดันในของเหลวแรงดันในของเหลว
แรงดันในของเหลว
 
Edge Amsterdam profile
Edge Amsterdam profileEdge Amsterdam profile
Edge Amsterdam profile
 
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...
 
Verb to be
Verb to beVerb to be
Verb to be
 
Neue aufregende Web Technologien, HTML5 + CSS3 anhand von Praxisbeispielen le...
Neue aufregende Web Technologien, HTML5 + CSS3 anhand von Praxisbeispielen le...Neue aufregende Web Technologien, HTML5 + CSS3 anhand von Praxisbeispielen le...
Neue aufregende Web Technologien, HTML5 + CSS3 anhand von Praxisbeispielen le...
 
Conspiracy Profile
Conspiracy ProfileConspiracy Profile
Conspiracy Profile
 
subrat
 subrat subrat
subrat
 
SharePoint Exchange Forum - 10 Worst Mistakes in SharePoint Branding
SharePoint Exchange Forum - 10 Worst Mistakes in SharePoint BrandingSharePoint Exchange Forum - 10 Worst Mistakes in SharePoint Branding
SharePoint Exchange Forum - 10 Worst Mistakes in SharePoint Branding
 
Automatic Language Identification
Automatic Language IdentificationAutomatic Language Identification
Automatic Language Identification
 
Aids to creativity
Aids to creativityAids to creativity
Aids to creativity
 
Trabajo grupal
Trabajo grupalTrabajo grupal
Trabajo grupal
 
Django & Drupal: A Tale of Two Cities
Django & Drupal: A Tale of Two CitiesDjango & Drupal: A Tale of Two Cities
Django & Drupal: A Tale of Two Cities
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
SharePointFest Konferenz 2016 - Creating a Great User Experience in SharePoint
SharePointFest Konferenz 2016 - Creating a Great User Experience in SharePointSharePointFest Konferenz 2016 - Creating a Great User Experience in SharePoint
SharePointFest Konferenz 2016 - Creating a Great User Experience in SharePoint
 
TRABAJO GRUPAL
TRABAJO GRUPALTRABAJO GRUPAL
TRABAJO GRUPAL
 
поездка на родину
поездка на родинупоездка на родину
поездка на родину
 
Examples of BT using SharePoint 2010
Examples of BT using SharePoint 2010Examples of BT using SharePoint 2010
Examples of BT using SharePoint 2010
 
Edge Amsterdam Profile
Edge Amsterdam ProfileEdge Amsterdam Profile
Edge Amsterdam Profile
 
Conversational Internet - Creating a natural language interface for web pages
Conversational Internet - Creating a natural language interface for web pagesConversational Internet - Creating a natural language interface for web pages
Conversational Internet - Creating a natural language interface for web pages
 

Similar to Natural language processing

Mining the social web ch8 - 1
Mining the social web ch8 - 1Mining the social web ch8 - 1
Mining the social web ch8 - 1
scor7910
 
Debugging and Testing ES Systems
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES Systems
Chris Birchall
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
kalai75
 
Don't RTFM, WTFM - Open Source Documentation - German Perl Workshop 2010
Don't RTFM, WTFM - Open Source Documentation - German Perl Workshop 2010Don't RTFM, WTFM - Open Source Documentation - German Perl Workshop 2010
Don't RTFM, WTFM - Open Source Documentation - German Perl Workshop 2010
singingfish
 

Similar to Natural language processing (20)

Mining the social web ch8 - 1
Mining the social web ch8 - 1Mining the social web ch8 - 1
Mining the social web ch8 - 1
 
Conf orm - explain
Conf orm - explainConf orm - explain
Conf orm - explain
 
Natural Language Processing sample code by Aiden
Natural Language Processing sample code by AidenNatural Language Processing sample code by Aiden
Natural Language Processing sample code by Aiden
 
1.Online Billing System 4-2022.pdf
1.Online Billing System 4-2022.pdf1.Online Billing System 4-2022.pdf
1.Online Billing System 4-2022.pdf
 
Automating Tinder w/ Eigenfaces and StanfordNLP
Automating Tinder w/ Eigenfaces and StanfordNLPAutomating Tinder w/ Eigenfaces and StanfordNLP
Automating Tinder w/ Eigenfaces and StanfordNLP
 
Python: The Dynamic!
Python: The Dynamic!Python: The Dynamic!
Python: The Dynamic!
 
Debugging and Testing ES Systems
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES Systems
 
Data Science
Data Science Data Science
Data Science
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
 
Data science programming .ppt
Data science programming .pptData science programming .ppt
Data science programming .ppt
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Algorithms overview
Algorithms overviewAlgorithms overview
Algorithms overview
 
ARCHIVED: new version available. 2016 - METASPACE Training Course
ARCHIVED: new version available. 2016 - METASPACE Training CourseARCHIVED: new version available. 2016 - METASPACE Training Course
ARCHIVED: new version available. 2016 - METASPACE Training Course
 
Don't RTFM, WTFM - Open Source Documentation - German Perl Workshop 2010
Don't RTFM, WTFM - Open Source Documentation - German Perl Workshop 2010Don't RTFM, WTFM - Open Source Documentation - German Perl Workshop 2010
Don't RTFM, WTFM - Open Source Documentation - German Perl Workshop 2010
 
The Ring programming language version 1.5.1 book - Part 10 of 180
The Ring programming language version 1.5.1 book - Part 10 of 180The Ring programming language version 1.5.1 book - Part 10 of 180
The Ring programming language version 1.5.1 book - Part 10 of 180
 
Magento code audit
Magento code auditMagento code audit
Magento code audit
 
Logic programming in python
Logic programming in pythonLogic programming in python
Logic programming in python
 
CoreData - there is an ORM you can like!
CoreData - there is an ORM you can like!CoreData - there is an ORM you can like!
CoreData - there is an ORM you can like!
 

Recently uploaded

1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 

Recently uploaded (20)

Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 

Natural language processing

  • 2. Hi all! Iván Compañy @pyivanc
  • 3. Index 1. Introduction 1. NLP 2. MyTaste 2. Theory 1. Data processing 1. Normalizing 2. Tokenizing 3. Tagging 2. Data post-processing 1. Classification 2. Word count 3. Practice
  • 4. WTF Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages Wikipedia 1.1. NLP Natural language processing (NLP) is all about human text processing Me
  • 7. 1.2. MyTaste 1. How do I know where is the important information? 2. How do I know if the information is a cooking recipe?
  • 9. 2.1. Data processing yo soy programador python DET VRB N N Yo soy programador Python 1. Normalize: Adapt the text depending of your needs 2. Tokenize: Split text in identifiable linguistic units 3. Tagging: Classify words into their parts-of-speech
  • 10. Recipes I love the chocolate Everybody loves the chocolate I still love chocolate! Why you don’t love chocolate? My GF is chocolate Non Recipes Once upon a time.. Remember, remember… print(“hello Python”) Fly, you fools! Let there be rock 2.2. Data post-processing Content selection 1. Classification
  • 11. 2.2. Data post-processing 1. Classification Feature selection • “A recipe must contain the word ‘chocolate’” • “A recipe must contain at least one verb” What! Training MACHINE LEARNING ALGORITHM
  • 12. 2.2. Data post-processing 1. Classification Machine learning algorithm This is my recipe of chocolate This is the beginning of anything I love chocolate Why you don’t love chocolate? I’m becoming insane recipe non recipe recipe recipe non recipe This is my recipe of chocolate This is the beginning of anything I love chocolate Why you don’t love chocolate? I’m becoming insane … Test input
  • 13. 2.2. Data post-processing 2. Word count Frequency distribution algotihm This is my recipe of chocolate This is the beginning of anything I love chocolate Why you don’t love chocolate? I’m becoming insane … Input
  • 14.
  • 16. DownloadingIn csotarpllionrga:: pythpoinp -icn s“itmaplolr tn lntlktk;nltk.download()”
  • 17. Data processing import nltk #Sample text text = “I am a Python programmer” #Normalize (the ‘I’ character in english must be in uppercase) text = text.capitalize() #Tokenize text_tokenized = nltk.word_tokenize(text) #We create the tagger from nltk.corpus import brown default_tag = nltk.DefaultTagger('UNK') tagger = nltk.UnigramTagger(brown.tagged_sents(), backoff=default_tag) #Tagging text_tagged = tagger.tag(text_tokenized) #Output #[('I', u'PPSS'), ('am', u'BEM'), ('a', u'AT'), ('python', u'NN'), ('programmer', u'NN')]
  • 18. Data post-processing: Classification from nltk.corpus import brown with open(‘tagged_recipes.txt’, ‘rb’) as f: recipes = [] for recipe in f.readlines(): recipes.append(recipe, ‘recipe’) non_recipes = [] for line in brown.tagged_sents()[:1000]: non_recipes.append((line, ‘non_recipe’)) all_texts = (recipes + non_recipes) random.shuffle(all_texts)
  • 19. Data post-processing: Classification def feature(text): has_chocolate = any(['chocolate' in word.lower() for (word, pos) in text]) is_american = any(['america' in word.lower() for (word, pos) in text]) return {'has_chocolate': has_chocolate, 'is_american': is_american} featuresets = [(feature(text), t) for (text, t) in all_sents] train_set, test_set = featuresets[:18000], featuresets[18000:] classifier = nltk.NaiveBayesClassifier.train(train_set)
  • 20. Data post-processing: Classification print classifier.show_most_informative_features(5) “”” Most Informative Features has_chocolate = True recipe : non_re = 86.6 : 1.0 is_american = True non_re : recipe = 70.4 : 1.0 is_american = False recipe : non_re = 1.0 : 1.0 has_chocolate = False non_re : recipe = 1.0 : 1.0 “”” print nltk.classify.accuracy(classifier, test_set) “”” 0.514 “””
  • 21. Data post-processing: Word count import nltk with open(“recipes.txt”) as f: tokenized_text = nltk.word_tokenize(f.read()) fdist = nltk.FreqDist(tokenized_text) fdist.plot(50)
  • 23. Examples of usage: • Text recognition • Tag detection in websites • Market analysis • Ads system • Translators • …
  • 24. More things I didn’t explain: • Analyze group of texts (Chunk tagging) • Semantic analysis • How to create a corpus • …
  • 26. Thank you! Iván Compañy @pyivanc