NLTK: A Tool for NLP - PyCon Dhaka 2014


Published in: Technology, Education

  1. NLTK: A Tool Kit for Natural Language Processing
  2. About Me
     • My name is Md. Fasihul Kabir
     • Working as a Software Engineer @ Escenic Asia Ltd. (April 2013 – Present)
     • BSc in CSE from AUST (April 2013)
     • MSc in CSE from UIU
     • Research interests are NLP, IR, ML and Compiler Design
  3. Agenda
     • What is NLTK?
     • What is NLP?
     • Installing NLTK
     • NLTK Modules & Functionality
     • NLP with NLTK
     • Accessing Text Corpora & Lexical Resources
     • Tokenization
     • Normalizing Text
     • POS Tagging
     • NER
     • Language Model
  4. Natural Language Toolkit (NLTK)
     • A collection of Python programs, modules, data sets and tutorials to support research and development in Natural Language Processing (NLP)
     • Written by Steven Bird, Edward Loper and Ewan Klein
     • NLTK is
         • Free and open source
         • Easy to use
         • Modular
         • Well documented
         • Simple and extensible
  5. What is Natural Language Processing?
     • Computer-aided text analysis of human language
     • The goal is to enable machines to understand human language and extract meaning from text
     • It is a field of study closely tied to machine learning and, more specifically, computational linguistics
  6. Applications of NLP
     • Automatic summarization
     • Machine translation
     • Natural language generation
     • Natural language understanding
     • Optical character recognition
     • Question answering
     • Speech recognition
     • Text-to-speech
  7. Installing NLTK
     • Install PyYAML, NumPy, Matplotlib
     • NLTK source installation
         • Download NLTK source (
         • Unzip it and go to the new unzipped folder
         • Just do it!
           ➢ python install
     • To install data
         • Start the Python interpreter
           >>> import nltk
           >>>
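One common way to fetch NLTK data from the interpreter is `nltk.download`. The sketch below is an assumption about how that step can be scripted, not a reconstruction of the slide: the helper name `ensure_nltk_data` is hypothetical, and the resource path `corpora/brown` with package id `brown` is only an example.

```python
def ensure_nltk_data(resource_path, package_name):
    """Return True if the resource is available (downloading it if needed).

    Hypothetical helper: it relies on nltk.data.find raising LookupError
    when a resource is missing, and on nltk.download returning a boolean.
    """
    try:
        import nltk  # imported lazily so the helper degrades gracefully
    except ImportError:
        return False  # NLTK itself is not installed

    try:
        nltk.data.find(resource_path)  # e.g. 'corpora/brown'
        return True
    except LookupError:
        # Needs network access; quiet=True suppresses progress output.
        return bool(nltk.download(package_name, quiet=True))

# Example call (not executed here because it may hit the network):
# ensure_nltk_data('corpora/brown', 'brown')
```

Calling `nltk.download()` with no arguments opens the interactive downloader instead, which is convenient when exploring which packages exist.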
  8. NLTK Modules & Functionality
     NLTK Modules                  Functionality
     nltk.corpus                   Corpus
     nltk.tokenize, nltk.stem      Tokenizers, stemmers
     nltk.collocations             t-test, chi-squared, mutual information
     nltk.tag                      n-gram, backoff, Brill, HMM, TnT
     nltk.classify, nltk.cluster   Decision tree, naive Bayes, k-means
     nltk.chunk                    Regex, n-gram, named entity
     nltk.parse                    Parsing
     nltk.sem, nltk.inference      Semantic interpretation
     nltk.metrics                  Evaluation metrics
     nltk.probability              Probability & estimation, applications
  9. Accessing Text Corpora & Lexical Resources
     • NLTK provides over 50 corpora and lexical resources.
       >>> from nltk.corpus import brown
       >>> brown.categories()
       ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
       >>> len(brown.sents())
       57340
       >>> len(brown.words())
       1161192
  10. Tokenization
      • Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
        >>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
        >>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks.'''
      • Word punctuation tokenization
        >>> wordpunct_tokenize(s)
        ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
      • Sentence tokenization
        >>> sent_tokenize(s)
        ['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
      • Word tokenization
        >>> [word_tokenize(t) for t in sent_tokenize(s)]
        [['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
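To make the idea behind the tokenizers concrete, here is a minimal standard-library sketch. It is not NLTK's implementation: `simple_wordpunct_tokenize` and `naive_sent_tokenize` are hypothetical names, the first uses a regular expression that separates word characters from punctuation, and the second uses a toy "split after sentence punctuation" rule.

```python
import re

# Same sample text as in the tokenization slide.
s = 'Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks.'

def simple_wordpunct_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when whitespace follows (toy rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

tokens = simple_wordpunct_tokenize(s)
sentences = naive_sent_tokenize(s)
print(tokens)
print(sentences)
```

The toy sentence rule would wrongly split after an abbreviation such as "Corp." in "Acme Corp. in 2006", which is the kind of case a trained sentence tokenizer is meant to handle.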
  11. Normalizing Text
      • Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form.
      • Porter stemming algorithm
        >>> from nltk.stem import PorterStemmer
        >>> stemmer = PorterStemmer()
        >>> stemmer.stem('cooking')
        'cook'
      • Lancaster stemming algorithm
        >>> from nltk.stem import LancasterStemmer
        >>> stemmer = LancasterStemmer()
        >>> stemmer.stem('cooking')
        'cook'
      • Snowball stemming algorithm (supports 15 languages)
        >>> from nltk.stem import SnowballStemmer
        >>> stemmer = SnowballStemmer('english')
        >>> stemmer.stem('cooking')
        'cook'
  12. Normalizing Text (Cont.)
      • Lemmatization first determines the part of speech of a word, then applies different normalization rules for each part of speech. Without a pos argument, the WordNet lemmatizer treats the word as a noun.
        >>> from nltk.stem import WordNetLemmatizer
        >>> lemmatizer = WordNetLemmatizer()
        >>> lemmatizer.lemmatize('cooking')
        'cooking'
        >>> lemmatizer.lemmatize('cooking', pos='v')
        'cook'
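  13. Normalizing Text (Cont.)
      • Comparison between stemming and lemmatizing: the stemmer may return a truncated string that is not a word, while the lemmatizer returns a dictionary form.
        >>> stemmer.stem('believes')
        'believ'
        >>> lemmatizer.lemmatize('believes')
        'belief'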
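The contrast between the two normalization strategies can be sketched without NLTK. Everything below is a toy under stated assumptions: `toy_stem` strips a few hard-coded suffixes and is not the Porter, Lancaster or Snowball rule set, and `TOY_LEMMAS` is a hypothetical three-entry dictionary standing in for WordNet.

```python
# Toy suffix list, checked in order; NOT the Porter rule set.
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for WordNet: (word, pos) -> lemma.
TOY_LEMMAS = {
    ("cooking", "v"): "cook",
    ("believes", "n"): "belief",
    ("believes", "v"): "believe",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; fall back to the word."""
    return TOY_LEMMAS.get((word, pos), word)

for w in ("cooking", "believes"):
    print(w, toy_stem(w), toy_lemmatize(w), toy_lemmatize(w, pos="v"))
```

The stemmer needs no vocabulary but can emit non-words such as 'believ'; the lemmatizer always returns a real word but depends on a dictionary and on knowing the part of speech.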
  14. Part-of-speech Tagging
      • Part-of-speech tagging is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech.
        >>> from nltk.tokenize import word_tokenize
        >>> from nltk.tag import pos_tag
        >>> words = word_tokenize('And now for something completely different')
        >>> pos_tag(words)
        [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
      • https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
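As a rough illustration of what a tagger produces, here is a toy lookup tagger with a default-tag fallback. It is only a sketch: `TOY_LEXICON` and `toy_pos_tag` are hypothetical, the lexicon covers just the example sentence, and NLTK's `pos_tag` relies on a trained statistical model rather than a fixed table.

```python
# Hypothetical word -> Penn Treebank tag table for the example sentence only.
TOY_LEXICON = {
    "and": "CC",
    "now": "RB",
    "for": "IN",
    "completely": "RB",
    "different": "JJ",
}

def toy_pos_tag(words, lexicon=TOY_LEXICON, default_tag="NN"):
    """Tag each word from the lexicon, backing off to a default tag."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]

tagged = toy_pos_tag("And now for something completely different".split())
print(tagged)
```

Backing off to a default tag for unknown words is the same idea, in miniature, as the backoff taggers listed under nltk.tag.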
  15. Named-entity Recognition
      • Named-entity recognition is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
        >>> from nltk import pos_tag, ne_chunk
        >>> from nltk.tokenize import wordpunct_tokenize
        >>> sent = 'Jim bought 300 shares of Acme Corp. in 2006.'
        >>> ne_chunk(pos_tag(wordpunct_tokenize(sent)))
        Tree('S', [Tree('PERSON', [('Jim', 'NNP')]), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'), ('of', 'IN'), Tree('ORGANIZATION', [('Acme', 'NNP'), ('Corp', 'NNP')]), ('.', '.'), ('in', 'IN'), ('2006', 'CD'), ('.', '.')])
  16. Language Model
      • A statistical language model assigns a probability P(w1, w2, …, wm) to a sequence of m words by means of a probability distribution.
        >>> import nltk
        >>> from nltk.corpus import gutenberg
        >>> from nltk.model import NgramModel
        >>> from nltk.probability import LidstoneProbDist
        >>> ssw = [w.lower() for w in gutenberg.words('austen-sense.txt')]
        >>> ssm = NgramModel(3, ssw, True, False, lambda f, b: LidstoneProbDist(f, 0.01, f.B() + 1))
        >>> ssm.prob('of', ('the', 'name'))
        0.907524932004
        >>> ssm.prob('if', ('the', 'name'))
        0.0124444830775
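The trigram estimate in the language-model slide can be sketched in plain Python. This is a simplified illustration, not the NgramModel implementation: the function names are hypothetical, the corpus is a made-up sentence, and the number of bins is assumed to be the vocabulary size plus one reserved bin for unseen words. With Lidstone smoothing, P(w | context) = (count(context, w) + gamma) / (count(context) + gamma * bins).

```python
from collections import Counter, defaultdict

def train_trigram_counts(tokens):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def lidstone_prob(counts, bins, word, context, gamma=0.01):
    """Lidstone-smoothed estimate of P(word | context)."""
    following = counts.get(tuple(context), Counter())
    total = sum(following.values())
    return (following[word] + gamma) / (total + gamma * bins)

# Made-up toy corpus; the slide uses Jane Austen's "Sense and Sensibility".
tokens = "the name of the rose and the name of the game".split()
vocab = set(tokens)
bins = len(vocab) + 1  # assumption: one extra bin for any unseen word
counts = train_trigram_counts(tokens)

p_of = lidstone_prob(counts, bins, "of", ("the", "name"))
p_if = lidstone_prob(counts, bins, "if", ("the", "name"))
print(p_of, p_if)
```

Because 'of' follows ('the', 'name') in the toy corpus and 'if' never does, `p_of` is close to 1 while `p_if` stays small but non-zero, which is the effect the smoothing parameter 0.01 has in the slide's model.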