SlideShare a Scribd company logo
1 of 18
Download to read offline
NLTK
A Tool Kit for Natural Language Processing
About Me
•My name is Md. Fasihul Kabir
•Working as a Software Engineer @ Escenic Asia Ltd. (April, 2013 –
Present)
•BSc in CSE from AUST (April, 2013).
•MSc in CSE from UIU.
•Research interests are NLP, IR, ML and Compiler Design.
Agenda
• What is NLTK?
• What is NLP?
• Installing NLTK
• NLTK Modules & Functionality
• NLP with NLTK
• Accessing Text Corpora & Lexical Resources
• Tokenization
• Normalizing Text
• POS Tagging
• NER
• Language Model
Natural Language Toolkit (NLTK)
• A collection of Python programs, modules, data set and tutorial to support
research and development in Natural Language Processing (NLP)
• Written by Steven Bird, Edvard Loper and Ewan Klien
• NLTK is
• Free and Open source
• Easy to use
• Modular
• Well documented
• Simple and extensible
• http://www.nltk.org/
What is Natural Language Processing
•Computer aided text analysis of human language
•The goal is to enable machines to understand human language and
extract meaning from text
•It is a field of study which falls under the category of machine
learning and more specifically computational linguistics
Application of NLP
•Automatic summarization
•Machine translation
•Natural language generation
•Natural language understanding
•Optical character recognition
•Question answering
•Speech Recognition
•Text-to-Speech
Installing NLTK
•Install PyYAML, Numpy, Matplotlib
•NLTK Source Installation
• Download NLTK source ( http://nltk.googlecode.com/)
• Unzip it & Go to the new unzipped folder
• Just do it!
➢ python setup.py install
•To install data
• Start python interpreter
>>> import nltk
>>> nltk.download()
NLTK Modules & Functionality
NLTK Modules Functionality
nltk.corpus Corpus
nltk.tokenize, nltk.stem Tokenizers, stemmers
nltk.collocations t-test, chi-squared, mutual-info
nltk.tag n-gram, backoff,Brill, HMM, TnT
nltk.classify, nltk.cluster Decision tree, Naive bayes, K-means
nltk.chunk Regex,n-gram, named entity
nltk.parsing Parsing
nltk.sem, nltk.interence Semantic interpretation
nltk.metrics Evaluation metrics
nltk.probability Probability & Estimation
nltk.app, nltk.chat Applications
Accessing Text Corpora & Lexical Resources
•NLTK provides over 50 corpora and lexical resources.
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> len(brown.sents())
57340
>>> len(brown.words())
1161192
•http://www.nltk.org/book/ch02.html
Tokenization
• Tokenization is the process of breaking a stream of text up into words, phrases,
symbols, or other meaningful elements called tokens.
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
>>> s = '''Good muffins cost $3.88nin New York. Please buy me two of them.nnThanks.'''
• Word Punctuation Tokenization
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
• Sentence Tokenization
>>> sent_tokenize(s)
['Good muffins cost $3.88nin New York.', 'Please buy mentwo of them.', 'Thanks.']
• Word Tokenization
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of',
'them', '.'], ['Thanks', '.']]
Normalizing Text
• Stemming is the process for reducing inected (or sometimes derived) words to their stem, base or root form
, generally a written word form.
• Porter Stemming Algorithm
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
• LancasterStemmer Algorithm
>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
• SnowballStemmer Algorithm (supports 15 languages)
>>> from nltk.stem import SnowballStemmer
>>> stemmer = SnowballStemmer('english')
>>> stemmer.stem('cooking')
'cook'
Normalizing Text (Cont.)
•Lemmatization process involves first determining the part of speech
of a word, and applying different normalization rules for each part of
speech.
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
Normalizing Text (Cont.)
•Comparison between stemming and lemmatizing.
>>> stemmer.stem('believes')
'believ'
>>> lemmatizer.lemmatize('believes')
'belief'
Part-of-speech Tagging
•Part-of-speech Tagging is the process of marking up a word in a text
(corpus) as corresponding to a particular part of speech
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tag import pos_tag
>>> words = word_tokenize('And now for something completely different')
>>> pos_tag(words)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different',
'JJ')]
•https://www.ling.upenn.
edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Named-entity Recognition
•Named-entity recognition is a subtask of information extraction that
seeks to locate and classify elements in text into pre-defined
categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
>>> from nltk import pos_tag, ne_chunk
>>> from nltk.tokenize import wordpunct_tokenize
>>> sent = 'Jim bought 300 shares of Acme Corp. in 2006.'
>>> ne_chunk(pos_tag(wordpunct_tokenize(sent)))
Tree('S', [Tree('PERSON', [('Jim', 'NNP')]), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'),
('of', 'IN'), Tree('ORGANIZATION', [('Acme', 'NNP'), ('Corp', 'NNP')]), ('.', '.'), ('in', 'IN'),
('2006', 'CD'), ('.', '.')])
Language model
•A statistical language model assigns a probability to a sequence of m
words P(w1, w2, …., wm) by means of a probability distribution.
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.model import NgramModel
>>> from nltk.probability import LidstoneProbDist
>>> ssw=[w.lower() for w in gutenberg.words('austen-sense.txt')]
>>> ssm=NgramModel(3, ssw, True, False, lambda f,b:LidstoneProbDist(f,0.01,f.B()+1))
>>> ssm.prob('of',('the','name'))
0.907524932004
>>> ssm.prob('if',('the','name'))
0.0124444830775
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014

More Related Content

What's hot

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Prakash Pimpale
 
Large scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azureLarge scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azurecloudbeatsch
 
Basic NLP with Python and NLTK
Basic NLP with Python and NLTKBasic NLP with Python and NLTK
Basic NLP with Python and NLTKFrancesco Bruni
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 
Presentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStackPresentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStackDavid Sanchez
 
Why Python (for Statisticians)
Why Python (for Statisticians)Why Python (for Statisticians)
Why Python (for Statisticians)Matt Harrison
 
PyCon 2013 : Scripting to PyPi to GitHub and More
PyCon 2013 : Scripting to PyPi to GitHub and MorePyCon 2013 : Scripting to PyPi to GitHub and More
PyCon 2013 : Scripting to PyPi to GitHub and MoreMatt Harrison
 
Introduction to Python for Bioinformatics
Introduction to Python for BioinformaticsIntroduction to Python for Bioinformatics
Introduction to Python for BioinformaticsJosé Héctor Gálvez
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in RAshraf Uddin
 
Screaming fast json parsing on Android
Screaming fast json parsing on AndroidScreaming fast json parsing on Android
Screaming fast json parsing on AndroidKarthik Ramgopal
 
HackYale - Natural Language Processing (Week 0)
HackYale - Natural Language Processing (Week 0)HackYale - Natural Language Processing (Week 0)
HackYale - Natural Language Processing (Week 0)Nick Hathaway
 
Control your Voice like a Bene Gesserit
Control your Voice like a Bene GesseritControl your Voice like a Bene Gesserit
Control your Voice like a Bene GesseritJorge Ortiz
 
Python interview questions
Python interview questionsPython interview questions
Python interview questionsPragati Singh
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learningtrygub
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learnJimmy Lai
 

What's hot (20)

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
 
Large scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azureLarge scale nlp using python's nltk on azure
Large scale nlp using python's nltk on azure
 
Basic NLP with Python and NLTK
Basic NLP with Python and NLTKBasic NLP with Python and NLTK
Basic NLP with Python and NLTK
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
Presentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStackPresentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStack
 
Why Python (for Statisticians)
Why Python (for Statisticians)Why Python (for Statisticians)
Why Python (for Statisticians)
 
PyCon 2013 : Scripting to PyPi to GitHub and More
PyCon 2013 : Scripting to PyPi to GitHub and MorePyCon 2013 : Scripting to PyPi to GitHub and More
PyCon 2013 : Scripting to PyPi to GitHub and More
 
Introduction to Python for Bioinformatics
Introduction to Python for BioinformaticsIntroduction to Python for Bioinformatics
Introduction to Python for Bioinformatics
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Screaming fast json parsing on Android
Screaming fast json parsing on AndroidScreaming fast json parsing on Android
Screaming fast json parsing on Android
 
HackYale - Natural Language Processing (Week 0)
HackYale - Natural Language Processing (Week 0)HackYale - Natural Language Processing (Week 0)
HackYale - Natural Language Processing (Week 0)
 
Control your Voice like a Bene Gesserit
Control your Voice like a Bene GesseritControl your Voice like a Bene Gesserit
Control your Voice like a Bene Gesserit
 
Py jail talk
Py jail talkPy jail talk
Py jail talk
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
Python Presentation
Python PresentationPython Presentation
Python Presentation
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learning
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 

Viewers also liked

Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Pythonshanbady
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language ProcessingJaganadh Gopinadhan
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKJacob Perkins
 
Natural language processing
Natural language processingNatural language processing
Natural language processingYogendra Tamang
 
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...I4MS_eu
 
Jose A. Ramos de Campos | Laser and Sensors- Experience from industry: LASHARE
Jose A. Ramos de Campos | Laser and Sensors- Experience from industry: LASHAREJose A. Ramos de Campos | Laser and Sensors- Experience from industry: LASHARE
Jose A. Ramos de Campos | Laser and Sensors- Experience from industry: LASHAREI4MS_eu
 
디지털미디어특강 과제
디지털미디어특강 과제디지털미디어특강 과제
디지털미디어특강 과제HyeAhn
 
Stefan van der Elst (KE-Works - NL)
Stefan van der Elst (KE-Works - NL)Stefan van der Elst (KE-Works - NL)
Stefan van der Elst (KE-Works - NL)I4MS_eu
 
Food for thought
Food for thoughtFood for thought
Food for thoughtIman Ali
 
Co2 portfolio
Co2 portfolioCo2 portfolio
Co2 portfolioIman Ali
 
Keseimbangan ekosistem
Keseimbangan ekosistemKeseimbangan ekosistem
Keseimbangan ekosistemsantivia
 
Francesca Flamigni | New opportunities under I4MS-Phase 2 and beyond
Francesca Flamigni | New opportunities under I4MS-Phase 2 and beyondFrancesca Flamigni | New opportunities under I4MS-Phase 2 and beyond
Francesca Flamigni | New opportunities under I4MS-Phase 2 and beyondI4MS_eu
 
Ales Ude, Jozef Stefan Institute, SI (Reconcell
Ales Ude, Jozef Stefan Institute, SI (ReconcellAles Ude, Jozef Stefan Institute, SI (Reconcell
Ales Ude, Jozef Stefan Institute, SI (ReconcellI4MS_eu
 
Alessandro Arcidiacono, Enginsoft, IT (Fortissimo)
Alessandro Arcidiacono, Enginsoft, IT (Fortissimo)Alessandro Arcidiacono, Enginsoft, IT (Fortissimo)
Alessandro Arcidiacono, Enginsoft, IT (Fortissimo)I4MS_eu
 

Viewers also liked (20)

Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTK
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...
 
Jose A. Ramos de Campos | Laser and Sensors- Experience from industry: LASHARE
Jose A. Ramos de Campos | Laser and Sensors- Experience from industry: LASHAREJose A. Ramos de Campos | Laser and Sensors- Experience from industry: LASHARE
Jose A. Ramos de Campos | Laser and Sensors- Experience from industry: LASHARE
 
디지털미디어특강 과제
디지털미디어특강 과제디지털미디어특강 과제
디지털미디어특강 과제
 
Desarrollo humanoy capacidades
Desarrollo humanoy capacidadesDesarrollo humanoy capacidades
Desarrollo humanoy capacidades
 
Stefan van der Elst (KE-Works - NL)
Stefan van der Elst (KE-Works - NL)Stefan van der Elst (KE-Works - NL)
Stefan van der Elst (KE-Works - NL)
 
Food for thought
Food for thoughtFood for thought
Food for thought
 
makalah
makalahmakalah
makalah
 
Co2 portfolio
Co2 portfolioCo2 portfolio
Co2 portfolio
 
Keseimbangan ekosistem
Keseimbangan ekosistemKeseimbangan ekosistem
Keseimbangan ekosistem
 
Korupsi
KorupsiKorupsi
Korupsi
 
Francesca Flamigni | New opportunities under I4MS-Phase 2 and beyond
Francesca Flamigni | New opportunities under I4MS-Phase 2 and beyondFrancesca Flamigni | New opportunities under I4MS-Phase 2 and beyond
Francesca Flamigni | New opportunities under I4MS-Phase 2 and beyond
 
Pei salud
Pei   saludPei   salud
Pei salud
 
Ales Ude, Jozef Stefan Institute, SI (Reconcell
Ales Ude, Jozef Stefan Institute, SI (ReconcellAles Ude, Jozef Stefan Institute, SI (Reconcell
Ales Ude, Jozef Stefan Institute, SI (Reconcell
 
Alessandro Arcidiacono, Enginsoft, IT (Fortissimo)
Alessandro Arcidiacono, Enginsoft, IT (Fortissimo)Alessandro Arcidiacono, Enginsoft, IT (Fortissimo)
Alessandro Arcidiacono, Enginsoft, IT (Fortissimo)
 

Similar to Nltk:a tool for_nlp - py_con-dhaka-2014

Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 
Assignment4.pptx
Assignment4.pptxAssignment4.pptx
Assignment4.pptxjatinchand3
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingCloudxLab
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisFabio Benedetti
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text MiningMinha Hwang
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Rebecca Bilbro
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingVeenaSKumar2
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrTrey Grainger
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation EnginesTrey Grainger
 
Beginning text analysis
Beginning text analysisBeginning text analysis
Beginning text analysisBarry DeCicco
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsTrey Grainger
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 

Similar to Nltk:a tool for_nlp - py_con-dhaka-2014 (20)

NLTK
NLTKNLTK
NLTK
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
NLTK
NLTKNLTK
NLTK
 
Assignment4.pptx
Assignment4.pptxAssignment4.pptx
Assignment4.pptx
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment Analysis
 
Taming Text
Taming TextTaming Text
Taming Text
 
HackYale NLP Week 0
HackYale NLP Week 0HackYale NLP Week 0
HackYale NLP Week 0
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 
Beginning text analysis
Beginning text analysisBeginning text analysis
Beginning text analysis
 
Nltk
NltkNltk
Nltk
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 

Recently uploaded

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Recently uploaded (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Nltk:a tool for_nlp - py_con-dhaka-2014

  • 1. NLTK A Tool Kit for Natural Language Processing
  • 2. About Me •My name is Md. Fasihul Kabir •Working as a Software Engineer @ Escenic Asia Ltd. (April, 2013 – Present) •BSc in CSE from AUST (April, 2013). •MSc in CSE from UIU. •Research interests are NLP, IR, ML and Compiler Design.
  • 3. Agenda • What is NLTK? • What is NLP? • Installing NLTK • NLTK Modules & Functionality • NLP with NLTK • Accessing Text Corpora & Lexical Resources • Tokenization • Normalizing Text • POS Tagging • NER • Language Model
  • 4. Natural Language Toolkit (NLTK) • A collection of Python programs, modules, data set and tutorial to support research and development in Natural Language Processing (NLP) • Written by Steven Bird, Edvard Loper and Ewan Klien • NLTK is • Free and Open source • Easy to use • Modular • Well documented • Simple and extensible • http://www.nltk.org/
  • 5. What is Natural Language Processing •Computer aided text analysis of human language •The goal is to enable machines to understand human language and extract meaning from text •It is a field of study which falls under the category of machine learning and more specifically computational linguistics
  • 6. Application of NLP •Automatic summarization •Machine translation •Natural language generation •Natural language understanding •Optical character recognition •Question answering •Speech Recognition •Text-to-Speech
  • 7. Installing NLTK •Install PyYAML, Numpy, Matplotlib •NLTK Source Installation • Download NLTK source ( http://nltk.googlecode.com/) • Unzip it & Go to the new unzipped folder • Just do it! ➢ python setup.py install •To install data • Start python interpreter >>> import nltk >>> nltk.download()
  • 8. NLTK Modules & Functionality NLTK Modules Functionality nltk.corpus Corpus nltk.tokenize, nltk.stem Tokenizers, stemmers nltk.collocations t-test, chi-squared, mutual-info nltk.tag n-gram, backoff,Brill, HMM, TnT nltk.classify, nltk.cluster Decision tree, Naive bayes, K-means nltk.chunk Regex,n-gram, named entity nltk.parsing Parsing nltk.sem, nltk.interence Semantic interpretation nltk.metrics Evaluation metrics nltk.probability Probability & Estimation nltk.app, nltk.chat Applications
  • 9. Accessing Text Corpora & Lexical Resources •NLTK provides over 50 corpora and lexical resources. >>> from nltk.corpus import brown >>> brown.categories() ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] >>> len(brown.sents()) 57340 >>> len(brown.words()) 1161192 •http://www.nltk.org/book/ch02.html
  • 10. Tokenization • Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. >>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize >>> s = '''Good muffins cost $3.88nin New York. Please buy me two of them.nnThanks.''' • Word Punctuation Tokenization >>> wordpunct_tokenize(s) ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.'] • Sentence Tokenization >>> sent_tokenize(s) ['Good muffins cost $3.88nin New York.', 'Please buy mentwo of them.', 'Thanks.'] • Word Tokenization >>> [word_tokenize(t) for t in sent_tokenize(s)] [['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
  • 11. Normalizing Text • Stemming is the process for reducing inected (or sometimes derived) words to their stem, base or root form , generally a written word form. • Porter Stemming Algorithm >>> from nltk.stem import PorterStemmer >>> stemmer = PorterStemmer() >>> stemmer.stem('cooking') 'cook' • LancasterStemmer Algorithm >>> from nltk.stem import LancasterStemmer >>> stemmer = LancasterStemmer() >>> stemmer.stem('cooking') 'cook' • SnowballStemmer Algorithm (supports 15 languages) >>> from nltk.stem import SnowballStemmer >>> stemmer = SnowballStemmer('english') >>> stemmer.stem('cooking') 'cook'
  • 12. Normalizing Text (Cont.) •Lemmatization process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer = WordNetLemmatizer() >>> lemmatizer.lemmatize('cooking') 'cooking' >>> lemmatizer.lemmatize('cooking', pos='v') 'cook'
  • 13. Normalizing Text (Cont.) •Comparison between stemming and lemmatizing. >>> stemmer.stem('believes') 'believ' >>> lemmatizer.lemmatize('believes') 'belief'
  • 14. Part-of-speech Tagging •Part-of-speech Tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech >>> from nltk.tokenize import word_tokenize >>> from nltk.tag import pos_tag >>> words = word_tokenize('And now for something completely different') >>> pos_tag(words) [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')] •https://www.ling.upenn. edu/courses/Fall_2003/ling001/penn_treebank_pos.html
  • 15. Named-entity Recognition •Named-entity recognition is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. >>> from nltk import pos_tag, ne_chunk >>> from nltk.tokenize import wordpunct_tokenize >>> sent = 'Jim bought 300 shares of Acme Corp. in 2006.' >>> ne_chunk(pos_tag(wordpunct_tokenize(sent))) Tree('S', [Tree('PERSON', [('Jim', 'NNP')]), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'), ('of', 'IN'), Tree('ORGANIZATION', [('Acme', 'NNP'), ('Corp', 'NNP')]), ('.', '.'), ('in', 'IN'), ('2006', 'CD'), ('.', '.')])
  • 16. Language model •A statistical language model assigns a probability to a sequence of m words P(w1, w2, …., wm) by means of a probability distribution. >>> import nltk >>> from nltk.corpus import gutenberg >>> from nltk.model import NgramModel >>> from nltk.probability import LidstoneProbDist >>> ssw=[w.lower() for w in gutenberg.words('austen-sense.txt')] >>> ssm=NgramModel(3, ssw, True, False, lambda f,b:LidstoneProbDist(f,0.01,f.B()+1)) >>> ssm.prob('of',('the','name')) 0.907524932004 >>> ssm.prob('if',('the','name')) 0.0124444830775