Natural Language Processing
                                   Using Python




                                    Presented by:-
                                    Sumit Kumar Raj
                                    1DS09IS082

ISE,DSCE-2013
Table of Contents




        • Introduction
        • History
        • Methods in NLP
        • Natural Language Toolkit
        • Sample Codes
        • Feeling Lonely?
        • Building a Spam Filter
        • Applications
        • References


What is Natural Language Processing?

    • Computer-aided text analysis of human language.

    • The goal is to enable machines to understand human
      language and extract meaning from text.

    • It is a field of study that draws on machine learning and,
      more specifically, computational linguistics.




History


  • 1948: first NLP application
         – a dictionary look-up system
         – developed at Birkbeck College, London

  • 1949: American interest
         – Warren Weaver, a WWII code breaker
         – He viewed German as English in code.

  • 1966: over-promised, under-delivered
         – Machine translation worked only word by word.
         – NLP provoked the first hostility from research funders.
         – NLP gave AI a bad name before AI had a name.
Natural language processing is heavily used throughout web technologies

    • Search engines
    • Consumer behavior analysis
    • Site recommendations
    • Sentiment analysis
    • Spam filtering
    • Automated customer support systems
    • Knowledge bases and expert systems


Context


   Little sister: What’s your name?

   Me: Uhh….Sumit..?

   Sister: Can you spell it?

   Me: yes. S-U-M-I-T…..
   Sister: WRONG! It’s spelled “I-T”.



Ambiguity

   “I shot the man with ice cream.”
    - A man with ice cream was shot.
    - A man had ice cream shot at him.




Methods :-

       1) POS Tagging :-

      • In corpus linguistics, part-of-speech (POS) tagging is also called
        grammatical tagging or word-category disambiguation.
      • It is the process of marking up each word in a text as
        corresponding to a particular part of speech.
      • POS tagging is harder than just having a list of words
        and their parts of speech, because the same word can take
        different tags in different contexts.
      • Consider the example:
             The sailor dogs the barmaid.
        Here “dogs” is a verb, although it is more often a plural noun.
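
To see why a plain word list is not enough, here is a minimal sketch that does not use NLTK. The lexicon, the tag names and the single context rule are all assumptions made up for this example; a real tagger learns such preferences from a tagged corpus.

```python
# Toy lexicon: each word maps to its possible tags, most frequent first.
# The words, tag names and ordering are illustrative assumptions.
LEXICON = {
    "the": ["DET"],
    "sailor": ["NOUN"],
    "dogs": ["NOUN", "VERB"],
    "barmaid": ["NOUN"],
}

def lookup_tag(words):
    """Tag every word with its most frequent tag, ignoring context."""
    return [(w, LEXICON[w.lower()][0]) for w in words]

def context_tag(words):
    """Like lookup_tag, but re-tag an ambiguous word as VERB when it
    directly follows a noun and no verb has been seen yet."""
    tagged = []
    for w in words:
        options = LEXICON[w.lower()]
        tag = options[0]
        seen_verb = any(t == "VERB" for _, t in tagged)
        follows_noun = bool(tagged) and tagged[-1][1] == "NOUN"
        if "VERB" in options and follows_noun and not seen_verb:
            tag = "VERB"
        tagged.append((w, tag))
    return tagged

sentence = "The sailor dogs the barmaid".split()
print(lookup_tag(sentence))    # tags "dogs" as NOUN
print(context_tag(sentence))   # tags "dogs" as VERB
```

The look-up version gets “dogs” wrong; the one hand-written rule fixes this sentence, but it would not generalize far.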



2) Parsing :-


  • In the context of NLP, parsing may be defined as the process of
    assigning structural descriptions to sequences of words in
    a natural language.
  • Applications of parsing include:
     – simple phrase finding, e.g. for proper name recognition
     – full semantic analysis of text, e.g. information extraction or
       machine translation
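
As a rough illustration of “assigning a structural description”, here is a small CKY-style chart parser in plain Python. The grammar rules, category names and lexicon are toy assumptions written only for this sentence; they are not taken from NLTK.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form; rules and lexicon are illustrative.
BINARY_RULES = {          # (left child, right child) -> parent
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {
    "the": {"Det"},
    "sailor": {"N"},
    "dogs": {"N", "V"},
    "barmaid": {"N"},
}

def cky_parse(words):
    """Return {category: bracketed tree} for analyses covering all the words."""
    n = len(words)
    # chart[(i, j)] maps a category to a bracketed string for words[i:j]
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart[(i, i + 1)][cat] = "(%s %s)" % (cat, word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # try every split point
                for left, ltree in chart[(i, k)].items():
                    for right, rtree in chart[(k, j)].items():
                        parent = BINARY_RULES.get((left, right))
                        if parent:
                            chart[(i, j)][parent] = "(%s %s %s)" % (parent, ltree, rtree)
    return chart[(0, n)]

words = "the sailor dogs the barmaid".split()
print(cky_parse(words).get("S"))
```

The parser only finds a sentence (S) by treating “dogs” as a verb, so the structure resolves the ambiguity that a word list alone cannot.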




3) Speech Recognition:-



  • It is concerned with mapping a continuous speech signal
    into a sequence of recognized words.
  • The main problems are variation in pronunciation and homonyms.
  • In the sentence “the boy eats”, a bigram model is sufficient to
    model the relationship between “boy” and “eats”. It is not sufficient in
          “The boy on the hill by the lake in our town…eats”
    where the two words are far apart.
  • Bigram and trigram models have proven extremely effective for
    such local, obvious dependencies.
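
A bigram model can be sketched in a few lines of plain Python. The four-sentence corpus below is made up for illustration; a usable model would be estimated from a large corpus and would need smoothing.

```python
from collections import Counter

# Tiny made-up corpus; a real model would be estimated from far more text.
corpus = [
    "the boy eats",
    "the boy sleeps",
    "the girl eats",
    "the boy eats",
]

context_counts = Counter()   # how often each word is followed by something
bigram_counts = Counter()    # how often each adjacent pair occurs
for sentence in corpus:
    words = sentence.split()
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev), with no smoothing."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / float(context_counts[prev])

print(bigram_prob("boy", "eats"))    # adjacent pair seen in the corpus
print(bigram_prob("town", "eats"))   # pair never seen: probability 0.0
```

In the long sentence the word just before “eats” is “town”, so the bigram model has no way to connect “eats” back to “boy”.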




4) Machine Translation:-



 • It involves translating text from one natural language to another.
 • Approaches:
        – simple word substitution, with some changes in ordering to
          account for grammatical differences
        – translating the source language into an underlying meaning
          representation, or interlingua
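
The first approach can be sketched as follows. The three-word English-to-Spanish dictionary and the single reordering rule (an adjective moves after the noun it precedes) are assumptions chosen for this example only.

```python
# Toy English -> Spanish dictionary with a coarse word class per entry.
# The vocabulary and the reordering rule are illustrative assumptions.
DICTIONARY = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
}

def translate(sentence):
    """Substitute word by word, then move each adjective after its noun."""
    entries = [DICTIONARY[w] for w in sentence.lower().split()]
    i = 0
    while i < len(entries) - 1:
        if entries[i][1] == "ADJ" and entries[i + 1][1] == "NOUN":
            entries[i], entries[i + 1] = entries[i + 1], entries[i]
            i += 2      # skip past the pair that was just swapped
        else:
            i += 1
    return " ".join(word for word, _ in entries)

print(translate("the red car"))
```

Without the reordering step the output would keep English word order, which is the kind of word-by-word result that earned early machine translation its reputation.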




5) Stemming:-




  • In linguistic morphology and information retrieval, stemming is
    the process of reducing inflected words to their stem.
  • The stem need not be identical to the morphological root of the
    word.
  • Many search engines treat words with the same stem as synonyms
    as a kind of query broadening, a process called conflation.
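
A deliberately naive suffix-stripping stemmer shows the idea. The suffix list and the minimum stem length are assumptions for this sketch; real stemmers such as the Porter stemmer apply many more rules.

```python
def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["fishing", "fished", "fishes", "argues", "argued", "arguing"]:
    print(w + " -> " + naive_stem(w))
```

The three forms of “fish” conflate to “fish”, while the forms of “argue” conflate to “argu”, a stem that is not itself a word.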




Natural Language Toolkit

    • NLTK is a leading platform for building Python programs that
      work with human language data.
    • It provides a suite of text processing libraries for
      classification, tokenization, stemming, tagging, parsing,
      and semantic reasoning.
    • Currently only available for Python 2.5 – 2.6:
      http://www.nltk.org/download
    • Install with: easy_install nltk
    • Prerequisites
       – NumPy
       – SciPy

Let’s dive into some code!




Part of Speech Tagging

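from nltk import pos_tag, word_tokenize

# Adjacent string literals inside parentheses are joined into one string
sentence1 = ('this is a demo that will show you how '
             'to detects parts of speech with little effort '
             'using NLTK!')

tokenized_sent = word_tokenize(sentence1)   # split the text into tokens
print(pos_tag(tokenized_sent))              # list of (token, tag) pairs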


[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'), ('that', 'WDT'),
('will', 'MD'), ('show', 'VB'), ('you', 'PRP'), ('how', 'WRB'), ('to', 'TO'),
('detects', 'NNS'), ('parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('with',
'IN'), ('little', 'JJ'), ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'),('!',
'.')]
Fun things to Try




Feeling lonely?

  Eliza is there to talk to you all day! What human could ever do that
  for you??
    from nltk.chat import eliza
    eliza.eliza_chat()    # starts the chatbot

   Therapist
   ---------
   Talk to the program by typing in plain English, using normal upper-
   and lower-case letters and punctuation. Enter "quit" when done.
   ========================================================================
   Hello. How are you feeling today?


Let’s build something even cooler




Let’s write a spam filter!

   A program that analyzes legitimate emails (“ham”) as well as
   “spam” and learns the features that are associated with
   each.

   Once trained, we should be able to run this program on
   incoming mail and have it reliably label each one with the
   appropriate category.




“Spambot.py” (continued)



  1.   Extract one of the archives from the site into your working directory.

  2.   Create a Python script; let’s call it “spambot.py”.

  3.   Your working directory should contain the “spambot” script and the
       folders “spam” and “ham”.


from nltk import (word_tokenize, WordNetLemmatizer,
                  NaiveBayesClassifier, classify, MaxentClassifier)

from nltk.corpus import stopwords
import random
“Spambot.py” (continued)

Label each item with the appropriate label and store them as a list of tuples:

# spamtexts and hamtexts are assumed to be lists of email bodies
# read from the "spam" and "ham" folders
mixedemails = [(email, 'spam') for email in spamtexts]
mixedemails += [(email, 'ham') for email in hamtexts]

random.shuffle(mixedemails)    # let's give them a nice shuffle

From this list of shuffled, labeled emails, we will define a “feature
extractor” which outputs a feature set that our program can use to statistically
compare spam and ham.




“Spambot.py” (continued)


wordlemmatizer = WordNetLemmatizer()
commonwords = stopwords.words('english')    # English stop-word list

def email_features(sent):
    features = {}
    # Normalize words: lower-case and lemmatize each token
    wordtokens = [wordlemmatizer.lemmatize(word.lower())
                  for word in word_tokenize(sent)]
    for word in wordtokens:
        # If the word is not a stop-word then let's consider it a "feature"
        if word not in commonwords:
            features[word] = True
    return features

featuresets = [(email_features(n), g) for (n, g) in mixedemails]
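
To see what a feature set looks like without installing any NLTK data, here is a simplified stand-in for the extractor. The regular-expression tokenizer and the short hand-written stop-word set are assumptions for this sketch, and there is no lemmatization.

```python
import re

# A tiny hand-written stop-word set (an assumption for this sketch;
# the slides use NLTK's English stop-word list instead).
STOP_WORDS = {"the", "a", "to", "is", "and", "you", "of", "in", "for"}

def simple_features(text):
    """Map each non-stop-word token to True, with no lemmatization."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {token: True for token in tokens if token not in STOP_WORDS}

print(simple_features("Claim the FREE prize, and claim it now!"))
```

Each email becomes a dictionary of “word is present” flags, so repeated words such as “claim” count only once.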

“Spambot.py” (continued)



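# classifier is assumed to be trained already, e.g. with
# NaiveBayesClassifier.train(featuresets)
while True:
    # raw_input is Python 2; use input() on Python 3
    featset = email_features(raw_input("Enter text to classify: "))
    print(classifier.classify(featset))



We can now type in a new email directly and have it classified as either spam or
ham.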
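
What the classifier does with these feature sets can be sketched in plain Python. This is a simplified Naive Bayes with add-one smoothing that scores only the words present in the message; it is not NLTK's implementation, and the four training examples are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count how often each feature occurs with each label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name in features:
            feature_counts[label][name] += 1
    return label_counts, feature_counts

def classify_features(model, features):
    """Pick the label with the highest log-probability for these features."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / float(total))          # prior P(label)
        for name in features:
            # P(feature present | label), smoothed so unseen words are not zero
            score += math.log((feature_counts[label][name] + 1.0) / (count + 2.0))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data in the same (features, label) shape as featuresets.
training = [
    ({"free": True, "prize": True, "claim": True}, "spam"),
    ({"free": True, "offer": True}, "spam"),
    ({"meeting": True, "tomorrow": True}, "ham"),
    ({"lunch": True, "tomorrow": True}, "ham"),
]
model = train_naive_bayes(training)
print(classify_features(model, {"free": True, "prize": True}))
print(classify_features(model, {"meeting": True, "lunch": True}))
```

Words seen mostly in spam pull the score toward “spam”, and words seen mostly in ham pull it the other way.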




Applications :-



  • Conversion from natural language to computer language
    and vice-versa.
  • Translation from one human language to another.
  • Automatic checking for grammar and writing techniques.
  • Spam filtering
  • Sentiment Analysis




Conclusion:-



 NLP plays a very important role in new human-machine interfaces. When we look at
 some of the products built on NLP technologies, we can see that they are very
 advanced and very useful.

 But there are many limitations. For example, the language we speak is highly ambiguous,
 which makes it very difficult to understand and analyze. Also, with so many languages
 spoken all over the world, it is very difficult to design a system that is 100% accurate.

 These problems get more complicated when we think of different people speaking the
 same language in different styles.

 Intelligent systems are being experimented with right now.
 We can expect to see improved applications of NLP in the near future.


References :-


• http://en.wikipedia.org/wiki/Natural_language_processing
• An overview of Empirical Natural Language Processing
     by Eric Brill and Raymond J. Mooney
• Investigating classification for natural language processing tasks
     by Ben W. Medlock, University of Cambridge
• Natural Language Processing and Machine Learning using Python
     by Shankar Ambady
• http://www.slideshare.net
• http://www.doc.ic.ac.uk/~nd/surprise_97/journal/vol1/hks/index.html
• http://googlesystem.blogspot.in/2012/10/google-improves-results-for-natural/
• Code from: https://github.com/shanbady/NLTK-Boston-Python-Meetup




Any Questions?




Thank You...

                Reach me @:
                facebook.com/sumit12dec

                sumit786raj@gmail.com

                9590 285 524
