NLTK
A Toolkit for Natural Language Processing
About Me
•My name is Md. Fasihul Kabir
•Working as a Software Engineer @ Escenic Asia Ltd. (April, 2013 –
Present)
•BSc in CSE from AUST (April, 2013).
•MSc in CSE from UIU.
•Research interests are NLP, IR, ML and Compiler Design.
Agenda
• What is NLTK?
• What is NLP?
• Installing NLTK
• NLTK Modules & Functionality
• NLP with NLTK
• Accessing Text Corpora & Lexical Resources
• Tokenization
• Normalizing Text
• POS Tagging
• NER
• Language Model
Natural Language Toolkit (NLTK)
• A collection of Python programs, modules, data sets, and tutorials to support
research and development in Natural Language Processing (NLP)
• Written by Steven Bird, Edward Loper and Ewan Klein
• NLTK is
• Free and Open source
• Easy to use
• Modular
• Well documented
• Simple and extensible
• http://www.nltk.org/
What is Natural Language Processing?
•Computer-aided text analysis of human language
•The goal is to enable machines to understand human language and
extract meaning from text
•It is a field of study that falls under machine learning and, more
specifically, computational linguistics
Application of NLP
•Automatic summarization
•Machine translation
•Natural language generation
•Natural language understanding
•Optical character recognition
•Question answering
•Speech Recognition
•Text-to-Speech
Installing NLTK
•Install PyYAML, NumPy, Matplotlib
•NLTK Source Installation
• Download the NLTK source (http://nltk.googlecode.com/)
• Unzip it and go to the unzipped folder
• Just do it!
➢ python setup.py install
•To install data
• Start python interpreter
>>> import nltk
>>> nltk.download()
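The interactive downloader is not always convenient (e.g. on a headless server); data packages can also be fetched by name. A small sketch (the package names below are illustrative examples; browse the interactive downloader once for the full catalogue):

```python
import nltk

# Non-interactive alternative to the nltk.download() GUI: fetch named
# data packages directly.
for pkg in ['punkt', 'brown']:
    try:
        nltk.download(pkg, quiet=True)
    except Exception:
        pass  # offline: data can also be unpacked manually into ~/nltk_data
```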
NLTK Modules & Functionality
NLTK Modules Functionality
nltk.corpus Corpus
nltk.tokenize, nltk.stem Tokenizers, stemmers
nltk.collocations t-test, chi-squared, mutual-info
nltk.tag n-gram, backoff, Brill, HMM, TnT
nltk.classify, nltk.cluster Decision tree, Naive Bayes, k-means
nltk.chunk Regex, n-gram, named entity
nltk.parse Parsing
nltk.sem, nltk.inference Semantic interpretation
nltk.metrics Evaluation metrics
nltk.probability Probability & estimation
nltk.app, nltk.chat Applications
Accessing Text Corpora & Lexical Resources
•NLTK provides over 50 corpora and lexical resources.
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> len(brown.sents())
57340
>>> len(brown.words())
1161192
•http://www.nltk.org/book/ch02.html
Tokenization
• Tokenization is the process of breaking a stream of text up into words, phrases,
symbols, or other meaningful elements called tokens.
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks.'''
• Word Punctuation Tokenization
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
• Sentence Tokenization
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
• Word Tokenization
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of',
'them', '.'], ['Thanks', '.']]
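When none of the built-in tokenizers fit, a custom pattern can be supplied with RegexpTokenizer (the pattern below is an illustrative choice that keeps only runs of word characters):

```python
from nltk.tokenize import RegexpTokenizer

# Tokenize on runs of word characters, silently dropping punctuation.
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize('Good muffins cost $3.88 in New York.')
# tokens == ['Good', 'muffins', 'cost', '3', '88', 'in', 'New', 'York']
```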
Normalizing Text
• Stemming is the process of reducing inflected (or sometimes derived) words to their stem,
base or root form, generally a written word form.
• Porter Stemming Algorithm
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
• LancasterStemmer Algorithm
>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
• SnowballStemmer Algorithm (supports 15 languages)
>>> from nltk.stem import SnowballStemmer
>>> stemmer = SnowballStemmer('english')
>>> stemmer.stem('cooking')
'cook'
Normalizing Text (Cont.)
•Lemmatization involves first determining the part of speech of a word,
and then applying different normalization rules for each part of
speech.
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
Normalizing Text (Cont.)
•Comparison between stemming and lemmatization.
>>> stemmer.stem('believes')
'believ'
>>> lemmatizer.lemmatize('believes')
'belief'
Part-of-speech Tagging
•Part-of-speech tagging is the process of marking up a word in a text
(corpus) as corresponding to a particular part of speech.
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tag import pos_tag
>>> words = word_tokenize('And now for something completely different')
>>> pos_tag(words)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different',
'JJ')]
•https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
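Tagger output is just a list of (word, tag) pairs, so tag distributions can be counted with FreqDist. The tagged sample below is hard-coded from the output above, so no tagger model is needed to run it:

```python
from nltk import FreqDist

# Count how often each tag occurs in a tagged sentence.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
tag_counts = FreqDist(tag for _, tag in tagged)
# tag_counts['RB'] == 2
```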
Named-entity Recognition
•Named-entity recognition is a subtask of information extraction that
seeks to locate and classify elements in text into pre-defined
categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
>>> from nltk import pos_tag, ne_chunk
>>> from nltk.tokenize import wordpunct_tokenize
>>> sent = 'Jim bought 300 shares of Acme Corp. in 2006.'
>>> ne_chunk(pos_tag(wordpunct_tokenize(sent)))
Tree('S', [Tree('PERSON', [('Jim', 'NNP')]), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'),
('of', 'IN'), Tree('ORGANIZATION', [('Acme', 'NNP'), ('Corp', 'NNP')]), ('.', '.'), ('in', 'IN'),
('2006', 'CD'), ('.', '.')])
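The ne_chunk result is a Tree, so entities can be pulled out by walking its subtrees. A sketch (the label set is a common subset, not exhaustive; subtree.label() is the NLTK 3 accessor, and very old releases used .node instead):

```python
from nltk import Tree

def extract_entities(tree, labels=('PERSON', 'ORGANIZATION', 'GPE', 'LOCATION')):
    # Collect (label, entity text) pairs from a ne_chunk parse tree.
    return [(st.label(), ' '.join(word for word, tag in st.leaves()))
            for st in tree.subtrees()
            if st.label() in labels]

# A hand-built tree mirroring part of the ne_chunk output above.
tree = Tree('S', [Tree('PERSON', [('Jim', 'NNP')]), ('bought', 'VBD'),
                  Tree('ORGANIZATION', [('Acme', 'NNP'), ('Corp', 'NNP')])])
# extract_entities(tree) == [('PERSON', 'Jim'), ('ORGANIZATION', 'Acme Corp')]
```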
Language model
•A statistical language model assigns a probability P(w1, w2, …, wm) to a
sequence of m words by means of a probability distribution.
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.model import NgramModel
>>> from nltk.probability import LidstoneProbDist
>>> ssw=[w.lower() for w in gutenberg.words('austen-sense.txt')]
>>> ssm=NgramModel(3, ssw, True, False, lambda f,b:LidstoneProbDist(f,0.01,f.B()+1))
>>> ssm.prob('of',('the','name'))
0.907524932004
>>> ssm.prob('if',('the','name'))
0.0124444830775
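NgramModel shipped with NLTK 2.x and was later removed, so on current releases an n-gram model has to be assembled by hand. A minimal maximum-likelihood bigram sketch over a toy corpus (no smoothing, unlike the Lidstone estimate above):

```python
from nltk import ConditionalFreqDist, bigrams

# Count bigram occurrences, then estimate P(word | previous word)
# by maximum likelihood.
words = 'the name of the name of the game'.split()
cfd = ConditionalFreqDist(bigrams(words))

def bigram_prob(context, word):
    return cfd[context].freq(word)  # count(context, word) / count(context)

# bigram_prob('the', 'name') == 2/3
```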
NLTK: A Tool for NLP - PyCon Dhaka 2014