SlideShare a Scribd company logo
Week 8
The Natural Language Toolkit
(NLTK)
Except where otherwise noted, this work is licensed under:
http://creativecommons.org/licenses/by-nc-sa/3.0
2
List methods
• Getting information about a list
– list.index(item)
– list.count(item)
• These modify the list in-place, unlike str operations
– list.append(item)
– list.insert(index, item)
– list.remove(item)
– list.extend(list2)
• same as list += list2
– list.sort()
– list.reverse()
3
List exercise
• Write a script to print the most frequent token in a text file.
4
And now for something completely different
5
• So far, we've studied programming syntax and techniques
• What about tasks for programming?
– Homework
– Mathematics, statistics
– Biology
– Animation
– Website development
– Game development
– Natural language processing
Programming tasks?
(Sage)
(Biopython)
(Blender)
(Django)
(PyGame)
(NLTK)
6
Natural Language Processing (NLP)
• How can we make a computer understand language?
– Can a human write/talk to the computer?
• Or can the computer guess/predict the input?
– Can the computer talk back?
– Based on language rules, patterns, or statistics
• For now, statistics are more accurate and popular
7
Some areas of NLP
• shallow processing – the surface level
– tokenization
– part-of-speech tagging
– forms of words
• deep processing – the underlying structures of language
– word order (syntax)
– meaning
– translation
• natural language generation
8
The NLTK
• A collection of:
– Python functions and objects for accomplishing NLP tasks
– sample texts (corpora)
• Available at: http://nltk.sourceforge.net
– Requires Python 2.4 or higher
– Click 'Download' and follow instructions for your OS
9
Tokenization
• Say we want to know the words in Marty's vocabulary
– "You know what I hate? Anybody who drives an S.U.V. I'd really
like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him
square in the teeth. Booyah. Be like, I'm Marty Stepp, the best
ever. Booyah!"
• How do we split his speech into tokens?
10
Tokenization (cont.)
• How do we split his speech into tokens?
>>> martysSpeech.split()
['You', 'know', 'what', 'I', 'hate?', 'Anybody',
'who', 'drives', 'an', 'S.U.V.', "I'd", 'really',
'like', 'to', 'find', 'Mr.', 'It-Costs-Me-100-
Dollars-To-Gas-Up', 'and', 'kick', 'him',
'square', 'in', 'the', 'teeth.', 'Booyah.', 'Be',
'like,', "I'm", 'Marty', 'Stepp,', 'the', 'best',
'ever.', 'Booyah!']
• Now, how often does he use the word "booyah"?
>>> martysSpeech.split().count("booyah")
0
>>> # What the!
11
Tokenization (cont.)
• We could lowercase the speech
• We could write our own method to split on "." split on ",",
split on "-", etc.
• The NLTK already has several tokenizer options
• Try:
• nltk.tokenize.WordPunctTokenizer
– tokenizes on all punctuation
• nltk.tokenize.PunktWordTokenizer
– trained algorithm to statistically split on words
12
Part-of-speech (POS) tagging
• If you know a token's POS you know:
– is it the subject?
– is it the verb?
– is it introducing a grammatical structure?
– is it a proper name?
13
Part-of-speech (POS) tagging
• Exercise: most frequent proper noun in the Penn Treebank?
– Try:
• nltk.corpus.treebank
• Python's dir() to list attributes of an object
– Example:
>>> dir("hello world!")
[..., 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', ...]
14
Tuples
• tagged_words() gives us a list of tuples
– tuple: the same thing as a list, but you can't change it
– in this case, the tuples are a (word, tag) pairs
>>> # Get the (word, tag) pair at list index 0
...
>>> pair = nltk.corpus.treebank.tagged_words()[0]
>>> pair
('Pierre', 'NNP')
>>> word = pair[0]
>>> tag = pair[1]
>>> print word, tag
Pierre NNP
>>> word, tag = pair # or unpack in 1 line!
>>> print word, tag
Pierre NNP
15
POS tagging (cont.)
• How do we tag plain sentences?
– A NLTK tagger needs a list of tagged sentences to train on
• We'll use nltk.corpus.treebank.tagged_sents()
– Then it is ready to tag any input! (but how well?)
– Try these tagger objects:
• nltk.UnigramTagger(tagged_sentences)
• nltk.TrigramTagger(tagged_sentences)
– Call the tagger's tag(tokens) method
>>> tagger = nltk.UnigramTagger(tagged_sentences)
>>> result = tagger.tag(tokens)
>>> result
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'),
('I', 'PRP'), ('hate', None), ('?', '.'), ...]
16
POS tagging (cont.)
• Exercise: Mad Libs
– I have a passage I want filled with the right parts of speech
– Let's use random picks from our own data!
– This code will print it out:
print properNoun1, "has always been a", adjective1, 
singularNoun, "unlike the", adjective2, 
properNoun2, "who I", pastVerb, "as he was", 
ingVerb, "yesterday."
17
Eliza (NLG)
• Eliza simulates a Rogerian psychotherapist
• With while loops and tokenization, you can make a chat bot!
– Try:
• nltk.chat.eliza.eliza_chat()
18
Parsing
• Syntax is as important for a compiler as it is for natural
language
• Realizing the hidden structure of a sentence is useful for:
– translation
– meaning analysis
– relationship analysis
– a cool demo!
• Try:
– nltk.draw.rdparser.demo()
19
Conclusion
• NLTK: NLP made easy with Python
– Functions and objects for:
• tokenization, tagging, generation, parsing, ...
• and much more!
– Even armed with these tools, NLP has a lot of difficult problems!
• Also saw:
– List methods
– dir()
– Tuples

More Related Content

Similar to NLTK Python Basic Natural Language Processing.ppt

Baabtra.com little coder chapter - 3
Baabtra.com little coder   chapter - 3Baabtra.com little coder   chapter - 3
Baabtra.com little coder chapter - 3
baabtra.com - No. 1 supplier of quality freshers
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 Minutes
Matt Harrison
 
Deep Learning with Audio Signals: Prepare, Process, Design, Expect
Deep Learning with Audio Signals: Prepare, Process, Design, ExpectDeep Learning with Audio Signals: Prepare, Process, Design, Expect
Deep Learning with Audio Signals: Prepare, Process, Design, Expect
Keunwoo Choi
 
Pa1 session 3_slides
Pa1 session 3_slidesPa1 session 3_slides
Pa1 session 3_slides
aiclub_slides
 
Class 5: If, while & lists
Class 5: If, while & listsClass 5: If, while & lists
Class 5: If, while & lists
Marc Gouw
 
ELUTE
ELUTEELUTE
Python Workshop
Python  Workshop Python  Workshop
Python Workshop
Assem CHELLI
 
Well Grounded Python Coding - Revision 1 (Day 1 Slides)
Well Grounded Python Coding - Revision 1 (Day 1 Slides)Well Grounded Python Coding - Revision 1 (Day 1 Slides)
Well Grounded Python Coding - Revision 1 (Day 1 Slides)
Worajedt Sitthidumrong
 
Well Grounded Python Coding - Revision 1 (Day 1 Handouts)
Well Grounded Python Coding - Revision 1 (Day 1 Handouts)Well Grounded Python Coding - Revision 1 (Day 1 Handouts)
Well Grounded Python Coding - Revision 1 (Day 1 Handouts)
Worajedt Sitthidumrong
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
ankit_ppt
 
Categorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonCategorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk python
Janu Jahnavi
 
Class 6: Lists & dictionaries
Class 6: Lists & dictionariesClass 6: Lists & dictionaries
Class 6: Lists & dictionaries
Marc Gouw
 
Python language data types
Python language data typesPython language data types
Python language data types
James Wong
 
Python language data types
Python language data typesPython language data types
Python language data types
Harry Potter
 
Python language data types
Python language data typesPython language data types
Python language data types
Hoang Nguyen
 
Python language data types
Python language data typesPython language data types
Python language data types
Young Alista
 
Python language data types
Python language data typesPython language data types
Python language data types
Luis Goldster
 
Python language data types
Python language data typesPython language data types
Python language data types
Tony Nguyen
 
Python language data types
Python language data typesPython language data types
Python language data types
Fraboni Ec
 
Music as data
Music as dataMusic as data
Music as data
John Vlachoyiannis
 

Similar to NLTK Python Basic Natural Language Processing.ppt (20)

Baabtra.com little coder chapter - 3
Baabtra.com little coder   chapter - 3Baabtra.com little coder   chapter - 3
Baabtra.com little coder chapter - 3
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 Minutes
 
Deep Learning with Audio Signals: Prepare, Process, Design, Expect
Deep Learning with Audio Signals: Prepare, Process, Design, ExpectDeep Learning with Audio Signals: Prepare, Process, Design, Expect
Deep Learning with Audio Signals: Prepare, Process, Design, Expect
 
Pa1 session 3_slides
Pa1 session 3_slidesPa1 session 3_slides
Pa1 session 3_slides
 
Class 5: If, while & lists
Class 5: If, while & listsClass 5: If, while & lists
Class 5: If, while & lists
 
ELUTE
ELUTEELUTE
ELUTE
 
Python Workshop
Python  Workshop Python  Workshop
Python Workshop
 
Well Grounded Python Coding - Revision 1 (Day 1 Slides)
Well Grounded Python Coding - Revision 1 (Day 1 Slides)Well Grounded Python Coding - Revision 1 (Day 1 Slides)
Well Grounded Python Coding - Revision 1 (Day 1 Slides)
 
Well Grounded Python Coding - Revision 1 (Day 1 Handouts)
Well Grounded Python Coding - Revision 1 (Day 1 Handouts)Well Grounded Python Coding - Revision 1 (Day 1 Handouts)
Well Grounded Python Coding - Revision 1 (Day 1 Handouts)
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
Categorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonCategorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk python
 
Class 6: Lists & dictionaries
Class 6: Lists & dictionariesClass 6: Lists & dictionaries
Class 6: Lists & dictionaries
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Music as data
Music as dataMusic as data
Music as data
 

Recently uploaded

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Jio cinema Retention & Engagement Strategy.pdf
Jio cinema Retention & Engagement Strategy.pdfJio cinema Retention & Engagement Strategy.pdf
Jio cinema Retention & Engagement Strategy.pdf
inaya7568
 

Recently uploaded (20)

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Jio cinema Retention & Engagement Strategy.pdf
Jio cinema Retention & Engagement Strategy.pdfJio cinema Retention & Engagement Strategy.pdf
Jio cinema Retention & Engagement Strategy.pdf
 

NLTK Python Basic Natural Language Processing.ppt

  • 1. Week 8 The Natural Language Toolkit (NLTK) Except where otherwise noted, this work is licensed under: http://creativecommons.org/licenses/by-nc-sa/3.0
  • 2. 2 List methods • Getting information about a list – list.index(item) – list.count(item) • These modify the list in-place, unlike str operations – list.append(item) – list.insert(index, item) – list.remove(item) – list.extend(list2) • same as list += list2 – list.sort() – list.reverse()
  • 3. 3 List exercise • Write a script to print the most frequent token in a text file.
  • 4. 4 And now for something completely different
  • 5. 5 • So far, we've studied programming syntax and techniques • What about tasks for programming? – Homework – Mathematics, statistics – Biology – Animation – Website development – Game development – Natural language processing Programming tasks? (Sage) (Biopython) (Blender) (Django) (PyGame) (NLTK)
  • 6. 6 Natural Language Processing (NLP) • How can we make a computer understand language? – Can a human write/talk to the computer? • Or can the computer guess/predict the input? – Can the computer talk back? – Based on language rules, patterns, or statistics • For now, statistics are more accurate and popular
  • 7. 7 Some areas of NLP • shallow processing – the surface level – tokenization – part-of-speech tagging – forms of words • deep processing – the underlying structures of language – word order (syntax) – meaning – translation • natural language generation
  • 8. 8 The NLTK • A collection of: – Python functions and objects for accomplishing NLP tasks – sample texts (corpora) • Available at: http://nltk.sourceforge.net – Requires Python 2.4 or higher – Click 'Download' and follow instructions for your OS
  • 9. 9 Tokenization • Say we want to know the words in Marty's vocabulary – "You know what I hate? Anybody who drives an S.U.V. I'd really like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him square in the teeth. Booyah. Be like, I'm Marty Stepp, the best ever. Booyah!" • How do we split his speech into tokens?
  • 10. 10 Tokenization (cont.) • How do we split his speech into tokens? >>> martysSpeech.split() ['You', 'know', 'what', 'I', 'hate?', 'Anybody', 'who', 'drives', 'an', 'S.U.V.', "I'd", 'really', 'like', 'to', 'find', 'Mr.', 'It-Costs-Me-100- Dollars-To-Gas-Up', 'and', 'kick', 'him', 'square', 'in', 'the', 'teeth.', 'Booyah.', 'Be', 'like,', "I'm", 'Marty', 'Stepp,', 'the', 'best', 'ever.', 'Booyah!'] • Now, how often does he use the word "booyah"? >>> martysSpeech.split().count("booyah") 0 >>> # What the!
  • 11. 11 Tokenization (cont.) • We could lowercase the speech • We could write our own method to split on "." split on ",", split on "-", etc. • The NLTK already has several tokenizer options • Try: • nltk.tokenize.WordPunctTokenizer – tokenizes on all punctuation • nltk.tokenize.PunktWordTokenizer – trained algorithm to statistically split on words
  • 12. 12 Part-of-speech (POS) tagging • If you know a token's POS you know: – is it the subject? – is it the verb? – is it introducing a grammatical structure? – is it a proper name?
  • 13. 13 Part-of-speech (POS) tagging • Exercise: most frequent proper noun in the Penn Treebank? – Try: • nltk.corpus.treebank • Python's dir() to list attributes of an object – Example: >>> dir("hello world!") [..., 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', ...]
  • 14. 14 Tuples • tagged_words() gives us a list of tuples – tuple: the same thing as a list, but you can't change it – in this case, the tuples are a (word, tag) pairs >>> # Get the (word, tag) pair at list index 0 ... >>> pair = nltk.corpus.treebank.tagged_words()[0] >>> pair ('Pierre', 'NNP') >>> word = pair[0] >>> tag = pair[1] >>> print word, tag Pierre NNP >>> word, tag = pair # or unpack in 1 line! >>> print word, tag Pierre NNP
  • 15. 15 POS tagging (cont.) • How do we tag plain sentences? – A NLTK tagger needs a list of tagged sentences to train on • We'll use nltk.corpus.treebank.tagged_sents() – Then it is ready to tag any input! (but how well?) – Try these tagger objects: • nltk.UnigramTagger(tagged_sentences) • nltk.TrigramTagger(tagged_sentences) – Call the tagger's tag(tokens) method >>> tagger = nltk.UnigramTagger(tagged_sentences) >>> result = tagger.tag(tokens) >>> result [('You', 'PRP'), ('know', 'VB'), ('what', 'WP'), ('I', 'PRP'), ('hate', None), ('?', '.'), ...]
  • 16. 16 POS tagging (cont.) • Exercise: Mad Libs – I have a passage I want filled with the right parts of speech – Let's use random picks from our own data! – This code will print it out: print properNoun1, "has always been a", adjective1, singularNoun, "unlike the", adjective2, properNoun2, "who I", pastVerb, "as he was", ingVerb, "yesterday."
  • 17. 17 Eliza (NLG) • Eliza simulates a Rogerian psychotherapist • With while loops and tokenization, you can make a chat bot! – Try: • nltk.chat.eliza.eliza_chat()
  • 18. 18 Parsing • Syntax is as important for a compiler as it is for natural language • Realizing the hidden structure of a sentence is useful for: – translation – meaning analysis – relationship analysis – a cool demo! • Try: – nltk.draw.rdparser.demo()
  • 19. 19 Conclusion • NLTK: NLP made easy with Python – Functions and objects for: • tokenization, tagging, generation, parsing, ... • and much more! – Even armed with these tools, NLP has a lot of difficult problems! • Also saw: – List methods – dir() – Tuples