NLTK
A Tool Kit for Natural Language Processing
About Me
•My name is Md. Fasihul Kabir
•Working as a Software Engineer @ Escenic Asia Ltd. (April, 2013 – Present)
•BSc in CSE from AUST (April, 2013).
•MSc in CSE from UIU.
•Research interests are NLP, IR, ML and Compiler Design.
Agenda
• What is NLTK?
• What is NLP?
• Installing NLTK
• NLTK Modules & Functionality
• NLP with NLTK
• Accessing Text Corpora & Lexical Resources
• Tokenization
• Normalizing Text
• POS Tagging
• NER
• Language Model
Natural Language Toolkit (NLTK)
• A collection of Python programs, modules, data sets, and tutorials to support
research and development in Natural Language Processing (NLP)
• Written by Steven Bird, Edward Loper, and Ewan Klein
• NLTK is
• Free and Open source
• Easy to use
• Modular
• Well documented
• Simple and extensible
• http://www.nltk.org/
What is Natural Language Processing?
•Computer-aided text analysis of human language
•The goal is to enable machines to understand human language and
extract meaning from text
•It is a field of study at the intersection of computer science, artificial
intelligence, and computational linguistics
Applications of NLP
•Automatic summarization
•Machine translation
•Natural language generation
•Natural language understanding
•Optical character recognition
•Question answering
•Speech Recognition
•Text-to-Speech
Installing NLTK
•Install PyYAML, Numpy, Matplotlib
•NLTK Source Installation
• Download the NLTK source (http://nltk.googlecode.com/)
• Unzip it and go to the new unzipped folder
• Just do it!
➢ python setup.py install
•To install data
• Start python interpreter
>>> import nltk
>>> nltk.download()
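Before running the source installation, it can help to confirm that the prerequisites listed above are importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the import names `yaml`, `numpy`, and `matplotlib` are assumed to be the names under which PyYAML, NumPy, and Matplotlib are imported.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every prerequisite was found.
print(missing_modules(["yaml", "numpy", "matplotlib", "nltk"]))
```

Only top-level module names should be passed in, since dotted names make `find_spec` import the parent package first.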
NLTK Modules & Functionality
NLTK Modules                   Functionality
nltk.corpus                    Corpus
nltk.tokenize, nltk.stem       Tokenizers, stemmers
nltk.collocations              t-test, chi-squared, mutual information
nltk.tag                       n-gram, backoff, Brill, HMM, TnT
nltk.classify, nltk.cluster    Decision tree, Naive Bayes, k-means
nltk.chunk                     Regex, n-gram, named entity
nltk.parse                     Parsing
nltk.sem, nltk.inference       Semantic interpretation
nltk.metrics                   Evaluation metrics
nltk.probability               Probability & estimation
nltk.app, nltk.chat            Applications
Accessing Text Corpora & Lexical Resources
•NLTK provides over 50 corpora and lexical resources.
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> len(brown.sents())
57340
>>> len(brown.words())
1161192
•http://www.nltk.org/book/ch02.html
Tokenization
• Tokenization is the process of breaking a stream of text up into words, phrases,
symbols, or other meaningful elements called tokens.
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks.'''
• Word Punctuation Tokenization
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
• Sentence Tokenization
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
• Word Tokenization
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of',
'them', '.'], ['Thanks', '.']]
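To make the word-punctuation behaviour above less of a black box, here is a small standard-library sketch. It assumes that splitting text into runs of word characters and runs of punctuation is a fair approximation of what `wordpunct_tokenize` does. The function name `simple_wordpunct_tokenize` is hypothetical, and the pattern is an illustration, not NLTK's code.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation and symbols).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_wordpunct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORDPUNCT_PATTERN.findall(text)


s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
tokens = simple_wordpunct_tokenize(s)
print(tokens)
```

Because the pattern treats `.` as punctuation, `3.88` is split into `3`, `.`, and `88`. This is why the price stays intact only under `word_tokenize`.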
Normalizing Text
• Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.
• Porter Stemming Algorithm
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
• LancasterStemmer Algorithm
>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
• SnowballStemmer Algorithm (supports 15 languages)
>>> from nltk.stem import SnowballStemmer
>>> stemmer = SnowballStemmer('english')
>>> stemmer.stem('cooking')
'cook'
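The idea behind stemming can be illustrated with a deliberately naive suffix stripper. This is a sketch only: it is not the Porter, Lancaster, or Snowball algorithm, and the suffix list, the `min_stem` threshold, and the name `naive_stem` are all assumptions made for illustration.

```python
# Suffixes are tried in order, so longer ones come first.
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ("cooking", "cooked", "believes", "sing"):
    print(w, "->", naive_stem(w))
```

A stem such as `believ` is not a dictionary word. Real stemmers make the same trade-off, applying many more rules.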
Normalizing Text (Cont.)
•Lemmatization involves first determining the part of speech
of a word and then applying different normalization rules for each part of
speech.
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
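The session above shows that the same word can map to different lemmas depending on its part of speech. A toy lookup sketch makes that dependence explicit. The lexicon below is hypothetical and tiny, whereas a real lemmatizer consults WordNet. The default `pos="n"` is chosen to mirror the noun default seen in the session.

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma.
TOY_LEXICON = {
    ("cooking", "v"): "cook",
    ("cooking", "n"): "cooking",
    ("believes", "v"): "believe",
    ("believes", "n"): "belief",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); unknown words are returned unchanged."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("cooking"))           # treated as a noun
print(toy_lemmatize("cooking", pos="v"))  # treated as a verb
```

Unlike a stemmer, the result is always a real word, at the cost of needing a dictionary and the correct part of speech.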
Normalizing Text (Cont.)
•Comparison between stemming and lemmatizing.
>>> stemmer.stem('believes')
'believ'
>>> lemmatizer.lemmatize('believes')
'belief'
Part-of-speech Tagging
•Part-of-speech tagging is the process of marking up each word in a text
(corpus) as corresponding to a particular part of speech
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tag import pos_tag
>>> words = word_tokenize('And now for something completely different')
>>> pos_tag(words)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different',
'JJ')]
•https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
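As a rough picture of how a simple tagger with backoff can work, the sketch below first looks each word up in a table learned from tagged text, then falls back to suffix rules, and finally to a default tag. It is not the model behind `pos_tag`. The function names, the suffix rules, and the single training sentence (taken from the example above) are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each lowercased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


# Backoff rules, tried in order for words the unigram table has not seen.
SUFFIX_RULES = [
    (re.compile(r"^\d+$"), "CD"),    # digits only: cardinal number
    (re.compile(r".*ly$"), "RB"),    # -ly: adverb
    (re.compile(r".*ing$"), "VBG"),  # -ing: gerund
]


def backoff_tag(words, unigram_model, default="NN"):
    """Tag words by lookup, then suffix rules, then the default tag."""
    tagged = []
    for word in words:
        tag = unigram_model.get(word.lower())
        if tag is None:
            tag = next((t for pat, t in SUFFIX_RULES if pat.match(word)), default)
        tagged.append((word, tag))
    return tagged


train = [[("And", "CC"), ("now", "RB"), ("for", "IN"),
          ("something", "NN"), ("completely", "RB"), ("different", "JJ")]]
model = train_unigram(train)
result = backoff_tag("now for something slowly 42 table".split(), model)
print(result)
```

Note that `something` keeps its learned `NN` tag even though it ends in `-ing`, because the lookup runs before the suffix rules.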
Named-entity Recognition
•Named-entity recognition is a subtask of information extraction that
seeks to locate and classify elements in text into pre-defined
categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
>>> from nltk import pos_tag, ne_chunk
>>> from nltk.tokenize import wordpunct_tokenize
>>> sent = 'Jim bought 300 shares of Acme Corp. in 2006.'
>>> ne_chunk(pos_tag(wordpunct_tokenize(sent)))
Tree('S', [Tree('PERSON', [('Jim', 'NNP')]), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'),
('of', 'IN'), Tree('ORGANIZATION', [('Acme', 'NNP'), ('Corp', 'NNP')]), ('.', '.'), ('in', 'IN'),
('2006', 'CD'), ('.', '.')])
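One ingredient of entity chunking is grouping adjacent tokens by their tags. The sketch below, assuming POS-tagged input like that produced above, collects consecutive proper nouns (`NNP`) into candidate entity strings. It does not classify them as PERSON or ORGANIZATION, which `ne_chunk` does. The name `proper_noun_chunks` is hypothetical.

```python
from itertools import groupby


def proper_noun_chunks(tagged_tokens):
    """Join each run of consecutive NNP-tagged tokens into one candidate entity."""
    chunks = []
    for is_nnp, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            chunks.append(" ".join(word for word, _ in group))
    return chunks


tagged = [("Jim", "NNP"), ("bought", "VBD"), ("300", "CD"), ("shares", "NNS"),
          ("of", "IN"), ("Acme", "NNP"), ("Corp", "NNP"), (".", "."),
          ("in", "IN"), ("2006", "CD"), (".", ".")]
entities = proper_noun_chunks(tagged)
print(entities)
```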
Language model
•A statistical language model assigns a probability to a sequence of m
words P(w1, w2, ..., wm) by means of a probability distribution.
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.model import NgramModel  # NLTK 2.x API; nltk.model was removed in NLTK 3
>>> from nltk.probability import LidstoneProbDist
>>> ssw=[w.lower() for w in gutenberg.words('austen-sense.txt')]
>>> ssm=NgramModel(3, ssw, True, False, lambda f,b:LidstoneProbDist(f,0.01,f.B()+1))
>>> ssm.prob('of',('the','name'))
0.907524932004
>>> ssm.prob('if',('the','name'))
0.0124444830775
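The probabilities above come from a trigram model with Lidstone smoothing. The following self-contained sketch shows the underlying arithmetic on a toy word list, using the estimate P(w | context) = (count(context, w) + gamma) / (count(context) + gamma * bins). The class name is hypothetical, and taking `bins` as the vocabulary size plus one is an assumption that differs from the `f.B()+1` used in the session.

```python
from collections import Counter, defaultdict


class LidstoneTrigramModel:
    """Toy trigram model with Lidstone (add-gamma) smoothing."""

    def __init__(self, words, gamma=0.01):
        self.gamma = gamma
        self.vocab = set(words)
        # counts[(w1, w2)][w3] = number of times w3 followed the pair (w1, w2)
        self.counts = defaultdict(Counter)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            self.counts[(w1, w2)][w3] += 1

    def prob(self, word, context):
        """Smoothed probability of word given the two preceding words."""
        following = self.counts.get(tuple(context), Counter())
        total = sum(following.values())
        bins = len(self.vocab) + 1  # one extra bin for any unseen word
        return (following[word] + self.gamma) / (total + self.gamma * bins)


words = "the name of the rose and the name of the game".split()
model = LidstoneTrigramModel(words)
print(model.prob("of", ("the", "name")))  # seen continuation: high probability
print(model.prob("if", ("the", "name")))  # unseen continuation: small but nonzero
```

Smoothing is what keeps the unseen continuation from getting probability zero. A smaller `gamma` shifts more mass back to observed trigrams.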
NLTK: A Tool for NLP - PyCon Dhaka 2014
NLTK: A Tool for NLP - PyCon Dhaka 2014

  • 1. NLTK A Tool Kit for Natural Language Processing
  • 2. About Me
    • My name is Md. Fasihul Kabir
    • Working as a Software Engineer @ Escenic Asia Ltd. (April, 2013 – Present)
    • BSc in CSE from AUST (April, 2013).
    • MSc in CSE from UIU.
    • Research interests are NLP, IR, ML and Compiler Design.
  • 3. Agenda
    • What is NLTK?
    • What is NLP?
    • Installing NLTK
    • NLTK Modules & Functionality
    • NLP with NLTK
      • Accessing Text Corpora & Lexical Resources
      • Tokenization
      • Normalizing Text
      • POS Tagging
      • NER
      • Language Model
  • 4. Natural Language Toolkit (NLTK)
    • A collection of Python programs, modules, data sets and tutorials to support research and development in Natural Language Processing (NLP)
    • Written by Steven Bird, Edward Loper and Ewan Klein
    • NLTK is
      • Free and open source
      • Easy to use
      • Modular
      • Well documented
      • Simple and extensible
    • http://www.nltk.org/
  • 5. What is Natural Language Processing
    • Computer-aided analysis of human language text
    • The goal is to enable machines to understand human language and extract meaning from text
    • It is a field of study that draws on machine learning and, more specifically, computational linguistics
  • 6. Applications of NLP
    • Automatic summarization
    • Machine translation
    • Natural language generation
    • Natural language understanding
    • Optical character recognition
    • Question answering
    • Speech recognition
    • Text-to-speech
  • 7. Installing NLTK
    • Install PyYAML, NumPy, Matplotlib
    • NLTK source installation
      • Download the NLTK source (http://nltk.googlecode.com/)
      • Unzip it and go to the new unzipped folder
      • Run the installer:
        ➢ python setup.py install
    • To install data
      • Start the Python interpreter
        >>> import nltk
        >>> nltk.download()
  • 8. NLTK Modules & Functionality
      NLTK Modules                   Functionality
      nltk.corpus                    Corpus
      nltk.tokenize, nltk.stem       Tokenizers, stemmers
      nltk.collocations              t-test, chi-squared, mutual information
      nltk.tag                       n-gram, backoff, Brill, HMM, TnT
      nltk.classify, nltk.cluster    Decision tree, Naive Bayes, K-means
      nltk.chunk                     Regex, n-gram, named entity
      nltk.parse                     Parsing
      nltk.sem, nltk.inference       Semantic interpretation
      nltk.metrics                   Evaluation metrics
      nltk.probability               Probability & estimation
      nltk.app, nltk.chat            Applications
  • 9. Accessing Text Corpora & Lexical Resources
    • NLTK provides over 50 corpora and lexical resources.
      >>> from nltk.corpus import brown
      >>> brown.categories()
      ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
      >>> len(brown.sents())
      57340
      >>> len(brown.words())
      1161192
    • http://www.nltk.org/book/ch02.html
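The reader interface on this slide (categories, sentences, words) can be imitated without downloading any data. The following is only a toy sketch: `TOY_CORPUS` and the three helper functions are hypothetical stand-ins written for illustration, not part of NLTK, and the two categories are made-up samples rather than Brown corpus text.

```python
from collections import Counter

# Hypothetical toy corpus: category name -> list of tokenized sentences.
TOY_CORPUS = {
    "news": [["The", "jury", "met", "."], ["The", "vote", "passed", "."]],
    "humor": [["It", "was", "a", "joke", "."]],
}

def categories(corpus):
    """Return the category names in sorted order."""
    return sorted(corpus)

def sents(corpus, category=None):
    """Return all sentences, or only those of one category."""
    selected = [category] if category else categories(corpus)
    return [sentence for name in selected for sentence in corpus[name]]

def words(corpus, category=None):
    """Flatten the selected sentences into a single list of tokens."""
    return [token for sentence in sents(corpus, category) for token in sentence]

# Word frequencies for a single category.
news_counts = Counter(words(TOY_CORPUS, "news"))
```

Calling `len(sents(TOY_CORPUS))` and `len(words(TOY_CORPUS))` mirrors the `len(brown.sents())` and `len(brown.words())` calls on the slide, only on a much smaller scale.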
  • 10. Tokenization
    • Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
      >>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
      >>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks.'''
    • Word Punctuation Tokenization
      >>> wordpunct_tokenize(s)
      ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
    • Sentence Tokenization
      >>> sent_tokenize(s)
      ['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
    • Word Tokenization
      >>> [word_tokenize(t) for t in sent_tokenize(s)]
      [['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
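To make the idea of splitting on word and punctuation boundaries concrete, here is a minimal standard-library sketch. The first function splits text into runs of word characters and runs of punctuation, which is the behaviour the word punctuation example on the slide shows. The second, `naive_sentences`, is a hypothetical helper that only splits after `.`, `!` or `?` followed by whitespace; it is far simpler than a trained sentence tokenizer and will mis-split abbreviations such as "Dr. Smith".

```python
import re

# The sample text from the slide, with its newlines written out.
s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."

def wordpunct_tokens(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

def naive_sentences(text):
    """Hypothetical helper: split after ., ! or ? when whitespace follows."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text) if part]

tokens = wordpunct_tokens(s)
sentences = naive_sentences(s)
```

Note that the price is split into `'3'`, `'.'`, `'88'` by the word punctuation rule, while the sentence splitter leaves `$3.88` intact because the period is not followed by whitespace.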
  • 11. Normalizing Text
    • Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form.
    • Porter Stemming Algorithm
      >>> from nltk.stem import PorterStemmer
      >>> stemmer = PorterStemmer()
      >>> stemmer.stem('cooking')
      'cook'
    • Lancaster Stemming Algorithm
      >>> from nltk.stem import LancasterStemmer
      >>> stemmer = LancasterStemmer()
      >>> stemmer.stem('cooking')
      'cook'
    • Snowball Stemming Algorithm (supports 15 languages)
      >>> from nltk.stem import SnowballStemmer
      >>> stemmer = SnowballStemmer('english')
      >>> stemmer.stem('cooking')
      'cook'
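The core idea behind rule-based stemming is stripping known suffixes without consulting a dictionary. The sketch below illustrates only that idea: it is not the Porter, Lancaster or Snowball algorithm, and the suffix list and the `min_stem` threshold are assumptions chosen for the example.

```python
# Assumed suffix list for illustration, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["cooking", "cooked", "believes", "is"]]
```

Because no dictionary is involved, the result need not be a real word: `believes` is cut down to `believ`, and short words such as `is` are left alone by the length check.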
  • 12. Normalizing Text (Cont.)
    • Lemmatization first determines the part of speech of a word and then applies different normalization rules for each part of speech.
      >>> from nltk.stem import WordNetLemmatizer
      >>> lemmatizer = WordNetLemmatizer()
      >>> lemmatizer.lemmatize('cooking')
      'cooking'
      >>> lemmatizer.lemmatize('cooking', pos='v')
      'cook'
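Why the part of speech changes the result can be shown with a dictionary lookup keyed on both the word and its part of speech. This is a minimal sketch: the `LEMMAS` table is a hand-written, hypothetical stand-in for a real lexical resource such as WordNet, and the default of treating words as nouns is an assumption made to match the behaviour shown on the slide.

```python
# Hypothetical lookup table: (word form, part of speech) -> lemma.
LEMMAS = {
    ("cooking", "v"): "cook",
    ("believes", "v"): "believe",
    ("believes", "n"): "belief",
}

def lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

as_noun = lemmatize("cooking")
as_verb = lemmatize("cooking", pos="v")
```

Unlike a stemmer, the lookup always returns a dictionary form, which is why the same surface word can map to different lemmas depending on the part of speech supplied.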
  • 13. Normalizing Text (Cont.)
    • Comparison between stemming and lemmatizing.
      >>> stemmer.stem('believes')
      'believ'
      >>> lemmatizer.lemmatize('believes')
      'belief'
  • 14. Part-of-speech Tagging
    • Part-of-speech tagging is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech.
      >>> from nltk.tokenize import word_tokenize
      >>> from nltk.tag import pos_tag
      >>> words = word_tokenize('And now for something completely different')
      >>> pos_tag(words)
      [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
    • https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
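One of the simplest tagging strategies is a unigram tagger with a backoff: give each known word its most frequent tag from training data and fall back to a default tag for unknown words. The sketch below is an illustration of that idea only, not how `pos_tag` works internally; the tiny `TRAINING` list and the default tag `NN` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data (Penn Treebank tags).
TRAINING = [
    ("And", "CC"), ("now", "RB"), ("for", "IN"), ("something", "NN"),
    ("completely", "RB"), ("different", "JJ"), ("now", "RB"), ("for", "IN"),
]

def train_unigram_tagger(tagged_words):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NN"):
    """Tag known words from the model; back off to the default tag otherwise."""
    return [(word, model.get(word.lower(), default)) for word in words]

model = train_unigram_tagger(TRAINING)
tagged = tag("And now for something new".split(), model)
```

The word `new` never appears in the training data, so it receives the backoff tag; a real tagger would use context and a much larger tagged corpus to do better.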
  • 15. Named-entity Recognition
    • Named-entity recognition is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
      >>> from nltk import pos_tag, ne_chunk
      >>> from nltk.tokenize import wordpunct_tokenize
      >>> sent = 'Jim bought 300 shares of Acme Corp. in 2006.'
      >>> ne_chunk(pos_tag(wordpunct_tokenize(sent)))
      Tree('S', [Tree('PERSON', [('Jim', 'NNP')]), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'), ('of', 'IN'), Tree('ORGANIZATION', [('Acme', 'NNP'), ('Corp', 'NNP')]), ('.', '.'), ('in', 'IN'), ('2006', 'CD'), ('.', '.')])
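A rough first step toward locating entity candidates is to group consecutive proper-noun tokens in POS-tagged text. The sketch below does only that: it is a naive chunker, not the classifier behind `ne_chunk`, and it does not assign categories such as PERSON or ORGANIZATION. The `TAGGED` list copies the tagged tokens from the slide output so the example runs without NLTK.

```python
# POS-tagged tokens taken from the slide's output.
TAGGED = [
    ("Jim", "NNP"), ("bought", "VBD"), ("300", "CD"), ("shares", "NNS"),
    ("of", "IN"), ("Acme", "NNP"), ("Corp", "NNP"), (".", "."),
    ("in", "IN"), ("2006", "CD"), (".", "."),
]

def proper_noun_chunks(tagged):
    """Join each run of consecutive NNP tokens into one candidate entity."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag == "NNP":
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the input
        chunks.append(" ".join(current))
    return chunks

candidates = proper_noun_chunks(TAGGED)
```

Deciding whether each candidate is a person, an organization or a location is the classification half of the task, which this sketch leaves out.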
  • 16. Language Model
    • A statistical language model assigns a probability to a sequence of m words, P(w1, w2, ..., wm), by means of a probability distribution.
      >>> import nltk
      >>> from nltk.corpus import gutenberg
      >>> from nltk.model import NgramModel
      >>> from nltk.probability import LidstoneProbDist
      >>> ssw = [w.lower() for w in gutenberg.words('austen-sense.txt')]
      >>> ssm = NgramModel(3, ssw, True, False, lambda f, b: LidstoneProbDist(f, 0.01, f.B() + 1))
      >>> ssm.prob('of', ('the', 'name'))
      0.907524932004
      >>> ssm.prob('if', ('the', 'name'))
      0.0124444830775
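The estimate behind a Lidstone-smoothed trigram model can be written out directly: P(w | w1, w2) = (count(w1, w2, w) + gamma) / (count(w1, w2) + gamma * B), where B is the number of bins (possible next words). The following is a minimal standard-library sketch of that formula, not a reimplementation of `NgramModel`: it has no backoff to shorter contexts, the toy word list is invented, and passing the vocabulary size as the bin count is an assumption.

```python
from collections import Counter, defaultdict

def train_trigrams(words):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def lidstone_prob(counts, word, context, bins, gamma=0.01):
    """Lidstone estimate of P(word | context) with additive constant gamma."""
    following = counts.get(tuple(context), Counter())
    total = sum(following.values())
    return (following[word] + gamma) / (total + gamma * bins)

# Invented toy text; a real model would be trained on a full corpus.
words = "the name of the rose and the name of the game".split()
vocab = sorted(set(words))
counts = train_trigrams(words)

p_seen = lidstone_prob(counts, "of", ("the", "name"), bins=len(vocab))
p_unseen = lidstone_prob(counts, "rose", ("the", "name"), bins=len(vocab))
```

Smoothing is what keeps `p_unseen` above zero even though "the name rose" never occurs in the training text, while the probabilities over the whole vocabulary for a given context still sum to one.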