SlideShare a Scribd company logo
1 of 60
Download to read offline
WWW.HACKYALE.COM
NLP IN PYTHON
MANIPULATING TEXT
WEEK 0
NLP IN PYTHON
COURSE
INTRODUCTION
WHO AM I?
Nick Hathaway, BR ‘17
CS major, former English major (*gasp*)
Digital Humanities, Linguistics, Journalism
Teeth Slam Poets
NATURAL LANGUAGE PROCESSING
Programs that:
“Understand” human speech or text
Mainly statistical
Natural vs. artificial languages
Programming languages are unambiguous
Human speech, not so much
YOU’LL LEARN
Text processing
Tokenization, stemming, built-in text corpora
Brown, Reuters, WordNet, make your own!
Text classification
Naive Bayes inference, n-gram language models, sentence
parsing, etc.
Classify by topic (Reuters) or by sentiment (imdb)
YOU’LL LEARN
Evaluating your models
F-scores, MaxEnt, selecting appropriate training and testing
sets
Applications, MOOCs, and books for further study
SOURCES
Natural Language Processing with Python, by
Steven Bird, Ewan Klein, and Edward Loper.
O'Reilly Media, 978-0-596-51649-9.
MOOC: Natural Language Processing with Dan
Jurafsky and Christopher Manning (Stanford,
Coursera)
APPLICATIONS
SPELL CHECKING
CREDIT: GROUPMAIL
SPAM DETECTION
CREDIT: HUTCH CARPENTER, I’M NOT ACTUALLY A GEEK BLOG
VIA SENTIMENT140 & SENTIMENT ANALYZER
SENTIMENT ANALYSIS
MACHINE TRANSLATION
INFORMATION EXTRACTION
IBM WATSON
CREDIT: CNME ONLINE
DIALOG SYSTEMS
DIALOG SYSTEMS
CREDIT: ABOIET COUNSELING
& THERAPY GROUP
ELIMINATE ALL HUMANS
CREDIT: TERMINATOR FRANCHISE &
OUR FUTURE ROBOT OVERLORDS
GETTING
STARTED
WHY PYTHON?
String processing methods
Excellent statistical libraries
Pandas, numpy, etc.
Natural Language Toolkit
Created in 2002
> 50 text resources
ALTERNATIVES
Java
StanfordNLP, mallet, OpenNLP
Ruby
Treat, open-nlp
STRING MANIPULATION
string	
  =	
  “This	
  is	
  a	
  great	
  sentence.”	
  
string	
  =	
  string.split()	
  
string[:4]	
  
=>	
  [“This”,	
  “is”,	
  “a”,	
  “great”]	
  
string[-­‐1:]	
  
=>	
  “sentence.”	
  
string[::-­‐1]	
  
=>	
  [“sentence.”,	
  “great”,	
  “a”,	
  “is”,	
  “This”]	
  
FILE I/O
f	
  =	
  open(‘sample.txt’,	
  ‘r’)	
  
for	
  line	
  in	
  f:

	
   line	
  =	
  line.split()	
  
	
   for	
  word	
  in	
  line:

	
   	
   if	
  word	
  [-­‐2:]	
  ==	
  ‘ly’:	
  
	
   	
   	
   print	
  word
SET-UP
NLTK:	
  http://www.nltk.org/install.html	
  
NLTK	
  DATA:	
  
import	
  nltk	
  
nltk.download()	
  
Matplotlib:

$	
  pip	
  install	
  matplotlib	
  
NLTK.DOWNLOAD()
WEEK 0
nltk.corpus
nltk.tokenize
nltk.chat
WHAT IS A
CORPUS?
TEXT CORPORA
Large body of text
Some are general (Brown), others are domain specific
(nps_chat)
Some are tagged (Brown), others are raw
Some have very specific uses (movie_reviews - sentiment
analysis)
NLTK BOOK: CORPUS CHART
CREDIT: NLTK BOOK
BUILT-IN CORPORA
Sentences:

[“I”, “hate”, “cats”, “.”]
Paragraphs:
[[“I”, “hate”, “cats”, “.”], [“Dogs”, “are”, “worse”, “.”]]
Entire file:
List of paragraphs, which are lists of sentences
MANIPULATING
CORPORA
BROWN CORPUS
1960’s at Brown University
> 1,000,000 words
15 categories
News, mystery, editorial, humor, etc.
Part-of-speech tagged
Widely cited
NAVIGATING CORPORA
from	
  nltk.corpus	
  import	
  brown	
  
brown.categories()	
  
len(brown.categories())	
  #	
  num	
  categories	
  
brown.words(categories=[‘news’,	
  ‘editorial’])	
  
brown.sents()	
  
brown.paras()	
  
brown.fileids()	
  
brown.words(fileids=[‘cr01’])	
  
TOKENS VS.TYPES
Tokens
All of the words (or whatever you choose as your smallest
particle of meaning)
I really love this song. It’s the best song I have ever heard.
I: 2
song: 2
heard: 1
TOKENS VS.TYPES
Types
The vocabulary of your text
Everything gets counted exactly once
I really love this song. It’s the best song I have ever heard.
I: 1
song: 1
heard: 1
LEXICAL DIVERSITY
from	
  nltk.corpus	
  import	
  brown	
  
from	
  __future__	
  import	
  division	
  
tokens	
  =	
  brown.words()	
  
types	
  =	
  set(brown.words())	
  
#	
  lexical	
  diversity	
  
len(tokens)/len(types)	
  
=>	
  20.714	
  
	
  
BY GENRE?
#	
  pick	
  categories	
  to	
  compare	
  from:	
  
brown.categories()	
  
#	
  calculate	
  lexical	
  diversity	
  of	
  news	
  articles	
  
news	
  =	
  brown.words(categories=‘news’)	
  
len(news)	
  /	
  len(set(news))	
  
#	
  versus	
  fiction	
  
fiction	
  =	
  brown.words(categories=‘fiction’)	
  
len(fiction)	
  /	
  len(set(fiction))	
  
GUTENBERG CORPUS
Curated by the NLTK team
Just 18 books/poetry collections
Also look at nltk.corpus.genesis
BY AUTHOR?
from	
  nltk.corpus	
  import	
  gutenberg	
  
gutenberg.fileids()	
  
shakespeare	
  =	
  gutenberg.words(fileids=‘blah’)	
  
len(shakespeare)	
  /	
  len(set(shakespeare))	
  
#	
  versus	
  	
  
len(austen)	
  /	
  len(set(austen))	
  
WHAT ARE WE MISSING?
What does the Brown corpus mean by “fiction”?
Is this representative of the question we want to ask?
Is 68,488 words enough?
Can the different categories be compared?
Similarly sized data?
Same genres of writing?
INVESTIGATE YOUR CORPORA!
from	
  nltk.corpus	
  import	
  brown	
  
brown.fileids(categories=‘fiction’)	
  
brown.abspath(‘ck01’)	
  
#	
  poke	
  around	
  the	
  corpora	
  in	
  /…/nltk_data/corpora	
  
#	
  check	
  out	
  the	
  README,	
  category	
  list,	
  actual	
  files
TEXT OBJECTS
VS. CORPORA
OBJECTS
TEXT OBJECT VS. CORPUS OBJECT
You have to convert some_corpus.words() to a text
object using the Text method
Ex:
from nltk.text import Text
text = Text(brown.words())
TEXT OBJECT VS. CORPUS OBJECT
#	
  you	
  have	
  to	
  convert	
  a	
  corpus	
  object	
  to	
  a	
  text	
  	
  
#	
  object	
  to	
  access	
  additional	
  methods	
  
from	
  nltk.text	
  import	
  Text	
  
text	
  =	
  Text(brown.words())	
  
TEXT METHODS
from	
  nltk.corpus	
  import	
  some_corpus	
  
from	
  nltk.text	
  import	
  Text	
  
text	
  =	
  Text(some_corpus.words())	
  
text.concordance(“word”)	
  
text.similar(“word”)	
  
text.common_contexts([“word1”,	
  “word2”])	
  
text.count(“word”)	
  
text.collocations()
WEBTEXT CORPUS
from	
  nltk.corpus	
  import	
  webtext	
  
from	
  nltk.text	
  import	
  Text	
  
dating	
  =	
  Text(webtext.words(fileids=‘singles.txt’))	
  
dating.collocations()
NPS_CHAT CORPUS
10,567 forum posts by age
Out of 500,000 collected by the naval postgraduate school
Can we identify online sexual predators?
Or categorize aspects of written communication that vary by
age?
INVESTIGATING CORPORA
from	
  nltk.corpus	
  import	
  nps_chat	
  
nps_chat.sents()	
  
AHHH, DOESN’T WORK
#	
  if	
  you	
  ever	
  get	
  stuck,	
  use	
  these	
  techniques	
  
#	
  to	
  find	
  appropriate	
  methods	
  for	
  a	
  corpus	
  
#	
  gives	
  you	
  list	
  of	
  methods	
  
dir(nps_chat)	
  
#	
  gives	
  you	
  documentation	
  
help(nltk.corpus)	
  
help(nltk.corpus.nps_chat)
FREQUENCY
DISTRIBUTIONS
FREQUENCY DISTRIBUTIONS
from	
  nltk	
  import	
  FreqDist	
  
from	
  nltk.text	
  import	
  Text	
  
import	
  matplotlib	
  
text	
  =	
  Text(some_corpus.words())	
  
fdist	
  =	
  FreqDist(text)	
  
=>	
  dict	
  of	
  {‘token1’:count1,	
  ‘token2’:count2,…}	
  
fdist[‘the’]	
  	
  
=>	
  returns	
  the	
  #	
  of	
  times	
  ‘the’	
  appears	
  in	
  text	
  
FREQUENCY DISTRIBUTIONS
#	
  top	
  50	
  words	
  
fdst.most_common(50)	
  
#	
  frequency	
  plot	
  
fdist.plot(25,	
  cumulative=[True	
  or	
  False])	
  
#	
  list	
  of	
  all	
  1-­‐count	
  tokens	
  
fdist.hapaxes()
FREQUENCY DISTRIBUTIONS
#	
  most	
  frequent	
  token	
  
fdist.max()	
  
#	
  decimal	
  frequency	
  of	
  a	
  given	
  token	
  
fdist.freq(‘token’)	
  
#	
  total	
  #	
  of	
  samples	
  
fdist.N()
PROBLEM
FREQ. DIST. OF BROWN CORPUS
PROBLEM
We’re mostly counting words that don’t tell us
anything about the text
“The” is most frequent
Followed by ‘,’
We need stopwords
Commonly used words that you can remove from a corpus
before processing
STOPWORDS
from	
  nltk.corpus	
  import	
  stopwords	
  
sw	
  =	
  stopwords.words(‘english’)	
  
old_brown	
  =	
  brown.words()	
  
new	
  =	
  [w	
  for	
  w	
  in	
  old_brown	
  if	
  w.lower()	
  not	
  in	
  sw]	
  
#	
  better,	
  but	
  we’re	
  still	
  counting	
  punctuation	
  
#	
  sample	
  code	
  -­‐	
  brown.py
PROBLEM
FREQ. DIST. OF BROWN CORPUS
STOPWORDS & PUNCTUATION
from	
  nltk.corpus	
  import	
  stopwords	
  
sw	
  =	
  stopwords.words(‘english’)	
  
old_brown	
  =	
  brown.words()	
  
new	
  =	
  [w	
  for	
  w	
  in	
  old_brown	
  if	
  w.lower()	
  not	
  in	
  sw	
  
	
   	
   	
   	
   	
   	
   	
   	
   	
   	
   	
   and	
  w.isalnum()]	
  
STILL PROBLEMS
FREQ. DIST. OF BROWN’S NEWS CORPUS
HOMEWORK
Play around with new corpora
What are the frequent words of diff. genres?
Look at webtext.fileids() for a diversity of text files
Find one in your area of interest
movie_reviews, udhr, wordnet, etc.
dir(some_corpus) => how do the methods differ?
some_corpus.abspath() => what’s actually in the corpus?

More Related Content

What's hot

Jarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingJarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingMustafa Jarrar
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Edureka!
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processinggulshan kumar
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingIla Group
 
Subword tokenizers
Subword tokenizersSubword tokenizers
Subword tokenizersHa Loc Do
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AISaurav Shrestha
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processingpunedevscom
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Edureka!
 
Chatbots from first principles
Chatbots from first principlesChatbots from first principles
Chatbots from first principlesJonathan Mugan
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 

What's hot (20)

Jarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingJarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language Processing
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
NLP
NLPNLP
NLP
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Intro to NLP. Lecture 2
Intro to NLP.  Lecture 2Intro to NLP.  Lecture 2
Intro to NLP. Lecture 2
 
NLP new words
NLP new wordsNLP new words
NLP new words
 
Subword tokenizers
Subword tokenizersSubword tokenizers
Subword tokenizers
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AI
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
 
Chatbots from first principles
Chatbots from first principlesChatbots from first principles
Chatbots from first principles
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 

Similar to HackYale NLP Week 0

Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analyticsshanbady
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4Glenn De Backer
 
Large scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azureLarge scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azurecloudbeatsch
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Fasihul Kabir
 
Pycon ke word vectors
Pycon ke   word vectorsPycon ke   word vectors
Pycon ke word vectorsOsebe Sammi
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Pythonshanbady
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012Jimmy Lai
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"sandinmyjoints
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMassimo Schenone
 
Tricks in natural language processing
Tricks in natural language processingTricks in natural language processing
Tricks in natural language processingBabu Priyavrat
 
Fluid, Fluent APIs
Fluid, Fluent APIsFluid, Fluent APIs
Fluid, Fluent APIsErik Rose
 
Build a compiler using C#, Irony and RunSharp.
Build a compiler using C#, Irony and RunSharp.Build a compiler using C#, Irony and RunSharp.
Build a compiler using C#, Irony and RunSharp.James Curran
 
Beginning text analysis
Beginning text analysisBeginning text analysis
Beginning text analysisBarry DeCicco
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Rebecca Bilbro
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 

Similar to HackYale NLP Week 0 (20)

Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analytics
 
Nltk
NltkNltk
Nltk
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4
 
Large scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azureLarge scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azure
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 
Pycon ke word vectors
Pycon ke   word vectorsPycon ke   word vectors
Pycon ke word vectors
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
 
ppt
pptppt
ppt
 
ppt
pptppt
ppt
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
Poetic APIs
Poetic APIsPoetic APIs
Poetic APIs
 
Tricks in natural language processing
Tricks in natural language processingTricks in natural language processing
Tricks in natural language processing
 
CPPDS Slide.pdf
CPPDS Slide.pdfCPPDS Slide.pdf
CPPDS Slide.pdf
 
Fluid, Fluent APIs
Fluid, Fluent APIsFluid, Fluent APIs
Fluid, Fluent APIs
 
Build a compiler using C#, Irony and RunSharp.
Build a compiler using C#, Irony and RunSharp.Build a compiler using C#, Irony and RunSharp.
Build a compiler using C#, Irony and RunSharp.
 
Beginning text analysis
Beginning text analysisBeginning text analysis
Beginning text analysis
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 

Recently uploaded

buds n tech IT solutions
buds n  tech IT                solutionsbuds n  tech IT                solutions
buds n tech IT solutionsmonugehlot87
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 

Recently uploaded (20)

buds n tech IT solutions
buds n  tech IT                solutionsbuds n  tech IT                solutions
buds n tech IT solutions
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 

HackYale NLP Week 0

  • 4. WHO AM I? Nick Hathaway, BR ‘17 CS major, former English major (*gasp*) Digital Humanities, Linguistics, Journalism Teeth Slam Poets
  • 5. NATURAL LANGUAGE PROCESSING Programs that: “Understand” human speech or text Mainly statistical Natural vs. artificial languages Programming languages are unambiguous Human speech, not so much
  • 6. YOU’LL LEARN Text processing Tokenization, stemming, built-in text corpora Brown, Reuters, WordNet, make your own! Text classification Naive Bayes inference, n-gram language models, sentence parsing, etc. Classify by topic (Reuters) or by sentiment (imdb)
  • 7. YOU’LL LEARN Evaluating your models F-scores, MaxEnt, selecting appropriate training and testing sets Applications, MOOCs, and books for further study
  • 8. SOURCES Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 978-0-596-51649-9. MOOC: Natural Language Processing with Dan Jurafsky and Christopher Manning (Stanford, Coursera)
  • 12. CREDIT: HUTCH CARPENTER, I’M NOT ACTUALLY A GEEK BLOG VIA SENTIMENT140 & SENTIMENT ANALYZER SENTIMENT ANALYSIS
  • 17. DIALOG SYSTEMS CREDIT: ABOIET COUNSELING & THERAPY GROUP
  • 18. ELIMINATE ALL HUMANS CREDIT: TERMINATOR FRANCHISE & OUR FUTURE ROBOT OVERLORDS
  • 20. WHY PYTHON? String processing methods Excellent statistical libraries Pandas, numpy, etc. Natural Language Toolkit Created in 2002 > 50 text resources
  • 22. STRING MANIPULATION string  =  “This  is  a  great  sentence.”   string  =  string.split()   string[:4]   =>  [“This”,  “is”,  “a”,  “great”]   string[-­‐1:]   =>  “sentence.”   string[::-­‐1]   =>  [“sentence.”,  “great”,  “a”,  “is”,  “This”]  
  • 23. FILE I/O f  =  open(‘sample.txt’,  ‘r’)   for  line  in  f:
   line  =  line.split()     for  word  in  line:
     if  word  [-­‐2:]  ==  ‘ly’:         print  word
  • 24. SET-UP NLTK:  http://www.nltk.org/install.html   NLTK  DATA:   import  nltk   nltk.download()   Matplotlib:
 $  pip  install  matplotlib  
  • 28. TEXT CORPORA Large body of text Some are general (Brown), others are domain specific (nps_chat) Some are tagged (Brown), others are raw Some have very specific uses (movie_reviews - sentiment analysis)
  • 29. NLTK BOOK: CORPUS CHART CREDIT: NLTK BOOK
  • 30. BUILT-IN CORPORA Sentences:
 [“I”, “hate”, “cats”, “.”] Paragraphs: [[“I”, “hate”, “cats”, “.”], [“Dogs”, “are”, “worse”, “.”]] Entire file: List of paragraphs, which are lists of sentences
  • 32. BROWN CORPUS 1960’s at Brown University > 1,000,000 words 15 categories News, mystery, editorial, humor, etc. Part-of-speech tagged Widely cited
  • 33. NAVIGATING CORPORA from  nltk.corpus  import  brown   brown.categories()   len(brown.categories())  #  num  categories   brown.words(categories=[‘news’,  ‘editorial’])   brown.sents()   brown.paras()   brown.fileids()   brown.words(fileids=[‘cr01’])  
  • 34. TOKENS VS.TYPES Tokens All of the words (or whatever you choose as your smallest particle of meaning) I really love this song. It’s the best song I have ever heard. I: 2 song: 2 heard: 1
  • 35. TOKENS VS.TYPES Types The vocabulary of your text Everything gets counted exactly once I really love this song. It’s the best song I have ever heard. I: 1 song: 1 heard: 1
  • 36. LEXICAL DIVERSITY from  nltk.corpus  import  brown   from  __future__  import  division   tokens  =  brown.words()   types  =  set(brown.words())   #  lexical  diversity   len(tokens)/len(types)   =>  20.714    
  • 37. BY GENRE? #  pick  categories  to  compare  from:   brown.categories()   #  calculate  lexical  diversity  of  news  articles   news  =  brown.words(categories=‘news’)   len(news)  /  len(set(news))   #  versus  fiction   fiction  =  brown.words(categories=‘fiction’)   len(fiction)  /  len(set(fiction))  
  • 38. GUTENBERG CORPUS Curated by the NLTK team Just 18 books/poetry collections Also look at nltk.corpus.genesis
  • 39. BY AUTHOR? from  nltk.corpus  import  gutenberg   gutenberg.fileids()   shakespeare  =  gutenberg.words(fileids=‘blah’)   len(shakespeare)  /  len(set(shakespeare))   #  versus     len(austen)  /  len(set(austen))  
  • 40. WHAT ARE WE MISSING? What does the Brown corpus mean by “fiction”? Is this representative of the question we want to ask? Is 68,488 words enough? Can the different categories be compared? Similarly sized data? Same genres of writing?
  • 41. INVESTIGATE YOUR CORPORA! from  nltk.corpus  import  brown   brown.fileids(categories=‘fiction’)   brown.abspath(‘ck01’)   #  poke  around  the  corpora  in  /…/nltk_data/corpora   #  check  out  the  README,  category  list,  actual  files
  • 43. TEXT OBJECT VS. CORPUS OBJECT You have to convert some_corpus.words() to a text object using the Text method Ex: from nltk.text import Text text = Text(brown.words())
  • 44. TEXT OBJECT VS. CORPUS OBJECT #  you  have  to  convert  a  corpus  object  to  a  text     #  object  to  access  additional  methods   from  nltk.text  import  Text   text  =  Text(brown.words())  
  • 45. TEXT METHODS from  nltk.corpus  import  some_corpus   from  nltk.text  import  Text   text  =  Text(some_corpus.words())   text.concordance(“word”)   text.similar(“word”)   text.common_contexts([“word1”,  “word2”])   text.count(“word”)   text.collocations()
  • 46. WEBTEXT CORPUS from  nltk.corpus  import  webtext   from  nltk.text  import  Text   dating  =  Text(webtext.words(fileids=‘singles.txt’))   dating.collocations()
  • 47. NPS_CHAT CORPUS 10,567 forum posts by age Out of 500,000 collected by the naval postgraduate school Can we identify online sexual predators? Or categorize aspects of written communication that vary by age?
  • 48. INVESTIGATING CORPORA from  nltk.corpus  import  nps_chat   nps_chat.sents()  
  • 49. AHHH, DOESN’T WORK #  if  you  ever  get  stuck,  use  these  techniques   #  to  find  appropriate  methods  for  a  corpus   #  gives  you  list  of  methods   dir(nps_chat)   #  gives  you  documentation   help(nltk.corpus)   help(nltk.corpus.nps_chat)
  • 51. FREQUENCY DISTRIBUTIONS from  nltk  import  FreqDist   from  nltk.text  import  Text   import  matplotlib   text  =  Text(some_corpus.words())   fdist  =  FreqDist(text)   =>  dict  of  {‘token1’:count1,  ‘token2’:count2,…}   fdist[‘the’]     =>  returns  the  #  of  times  ‘the’  appears  in  text  
  • 52. FREQUENCY DISTRIBUTIONS #  top  50  words   fdst.most_common(50)   #  frequency  plot   fdist.plot(25,  cumulative=[True  or  False])   #  list  of  all  1-­‐count  tokens   fdist.hapaxes()
  • 53. FREQUENCY DISTRIBUTIONS #  most  frequent  token   fdist.max()   #  decimal  frequency  of  a  given  token   fdist.freq(‘token’)   #  total  #  of  samples   fdist.N()
  • 54. PROBLEM FREQ. DIST. OF BROWN CORPUS
  • 55. PROBLEM We’re mostly counting words that don’t tell us anything about the text “The” is most frequent Followed by ‘,’ We need stopwords Commonly used words that you can remove from a corpus before processing
  • 56. STOPWORDS from  nltk.corpus  import  stopwords   sw  =  stopwords.words(‘english’)   old_brown  =  brown.words()   new  =  [w  for  w  in  old_brown  if  w.lower()  not  in  sw]   #  better,  but  we’re  still  counting  punctuation   #  sample  code  -­‐  brown.py
  • 57. PROBLEM FREQ. DIST. OF BROWN CORPUS
  • 58. STOPWORDS & PUNCTUATION from  nltk.corpus  import  stopwords   sw  =  stopwords.words(‘english’)   old_brown  =  brown.words()   new  =  [w  for  w  in  old_brown  if  w.lower()  not  in  sw                         and  w.isalnum()]  
  • 59. STILL PROBLEMS FREQ. DIST. OF BROWN’S NEWS CORPUS
  • 60. HOMEWORK Play around with new corpora What are the frequent words of diff. genres? Look at webtext.fileids() for a diversity of text files Find one in your area of interest movie_reviews, udhr, wordnet, etc. dir(some_corpus) => how do the methods differ? some_corpus.abspath() => what’s actually in the corpus?