Samatha  Gagan  Sunil
What is NLP?
Natural language processing (NLP) provides a means of analyzing text. Its goal is to make computers analyze and understand the languages that humans use naturally, enabling interaction between computers and humans.
Why Natural Language Processing?
kJfmmfj mmmvvv nnnffn333 Uj iheale eleee mnster vensi credur Baboi oi cestnitze
Computers “see” English text the same way you have just seen the line above. People have no trouble understanding language, but computers have no common-sense knowledge and no reasoning capacity.
Natural Language Processing (figure): raw (unstructured) text -> part-of-speech tagging -> named entity recognition -> deep syntactic parsing -> annotated (structured) text
Example: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.
Tagged: Secretion/NN of/IN TNF/NN was/VBZ abolished/VBN by/IN BHA/NN in/IN PMA-stimulated/JJ U937/NN cells/NNS ./.
Phrase labels shown in the parse tree: NP, PP, VP, S
Source: personalpages.manchester.ac.uk/staff/Sophia.Ananiadou/ DTCII .ppt
Uses of NLP
Text-based applications
Dialogue-based applications
Information extraction: extract useful information, e.g. from resumes
Automatic summarization: condense one book into one page
What is OpenNLP?
OpenNLP is an open-source, Java-based set of NLP tools that perform sentence detection, tokenization, POS tagging, parsing, and named-entity detection. 1
1 http://opennlp.sourceforge.net/
Use of OpenNLP in our University project
It can be used to search for names using named entity recognition.
OpenNLP is used for:
Sentence splitting
Tokenization
Part-of-speech tagging
Named entity recognition
Chunking
Treebank parsing
Sentence splitting
sentence boundary = period + space(s) + capital letter
Input: Unusually, the gender of crocodiles is determined by temperature. If the eggs are incubated at over 33°C, then the egg hatches into a male or 'bull' crocodile. At lower temperatures only female or 'cow' crocodiles develop.
Output:
Unusually, the gender of crocodiles is determined by temperature.
If the eggs are incubated at over 33°C, then the egg hatches into a male or 'bull' crocodile.
At lower temperatures only female or 'cow' crocodiles develop.
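The boundary rule on this slide can be sketched in a few lines of Python. This is only an illustration of the rule "period + space(s) + capital letter", not how OpenNLP itself detects sentences; the function name split_sentences is a hypothetical helper.

```python
import re

def split_sentences(text):
    """Split text wherever a period is followed by whitespace and a capital letter."""
    # (?<=\.) keeps the period with the sentence it ends;
    # (?=[A-Z]) requires the next sentence to start with a capital letter.
    parts = re.split(r"(?<=\.)\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]

text = (
    "Unusually, the gender of crocodiles is determined by temperature. "
    "If the eggs are incubated at over 33°C, then the egg hatches into a male "
    "or 'bull' crocodile. At lower temperatures only female or 'cow' "
    "crocodiles develop."
)

for sentence in split_sentences(text):
    print(sentence)
```

A plain rule like this also splits after abbreviations such as "Dr. Smith", which is one reason to prefer a trained sentence-detection model over a fixed pattern.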
sentDetect(s, language = "en", model = NULL)
s: A character vector with texts from which sentences should be detected.
language: A character string giving the language of s. This argument is only used if model is NULL, for selecting a default model.
model: A model. If model is NULL, a default model for sentence detection is loaded from the corresponding openNLP models language package.
http://opennlp.sourceforge.net/
Tokenization
Converts a sentence into a sequence of tokens: divides the text into its smallest units (usually words) and separates punctuation from the words.
Rule: use spaces as the boundaries, adding spaces before and after special characters.
tokenize(s, language = "en", model = NULL)
http://opennlp.sourceforge.net/
Tokenization
Input: "A Saudi Arabian woman can get a divorce if her husband doesn't give her coffee."
Output: "A Saudi Arabian woman can get a divorce if her husband does n't give her coffee ."
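The tokenization rule (spaces as boundaries, with extra spaces added around special characters) can be approximated as follows. This is a rough Python sketch, not OpenNLP's tokenizer: simple_tokenize is a hypothetical helper, and the handling of the contraction "n't" is an assumption made to match the example on this slide.

```python
import re

def simple_tokenize(sentence):
    """Tokenize by whitespace after padding special characters with spaces."""
    # Split the contraction "n't" off its verb: "doesn't" -> "does n't".
    spaced = re.sub(r"n't\b", " n't", sentence)
    # Put spaces around every character that is not a letter, digit,
    # underscore, whitespace, or apostrophe.
    spaced = re.sub(r"([^\w\s'])", r" \1 ", spaced)
    # Whitespace is now the only boundary between tokens.
    return spaced.split()

tokens = simple_tokenize(
    "A Saudi Arabian woman can get a divorce if her husband "
    "doesn't give her coffee."
)
print(" ".join(tokens))
```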
Part-of-speech tagging
Assigns a part-of-speech tag to each token in a sentence.
Input: Most lipstick is partially made of fish scales
Output: Most/JJS lipstick/NN is/VBZ partially/RB made/VBN of/IN fish/NN scales/NNS
tagPOS(sentence, language = "en", model = NULL, tagdict = NULL)
http://opennlp.sourceforge.net/
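Tagged output in the word/TAG form shown on this slide is easy to turn into pairs for further processing. A minimal Python sketch, assuming each tag is attached to its word with a slash and no space; parse_tagged is a hypothetical helper and not part of openNLP.

```python
def parse_tagged(tagged):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # rpartition splits on the last slash, so a word that itself
        # contains a slash keeps it.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = "Most/JJS lipstick/NN is/VBZ partially/RB made/VBN of/IN fish/NN scales/NNS"
pairs = parse_tagged(tagged)

# Example use: collect the nouns (tags starting with "NN").
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```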
Part of speech tags 1
CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
PRP - Personal pronoun
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WRB - Wh-adverb
Phrase labels:
NP - Noun phrase
PP - Prepositional phrase
VP - Verb phrase
1 http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
Named-Entity Recognition
Named entity recognition classifies tokens in text into predefined categories such as date, location, person, and time. The name finder can find up to seven different types of entities: date, location, money, organization, percentage, person, and time.
Named-Entity Recognition
Input: Diana Hayden was in Philadelphia city on 3rd october
Output: <namefind/person> Diana Hayden </namefind/person> was in <namefind/location> Philadelphia </namefind/location> city on <namefind/date> 3rd october </namefind/date>
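For a use such as searching for names, the entities can be pulled out of the tagged output with a regular expression. This is a minimal Python sketch that assumes the <namefind/TYPE> ... </namefind/TYPE> markup shown on this slide; extract_entities is a hypothetical helper, not an OpenNLP function.

```python
import re

def extract_entities(tagged_text):
    """Return (entity type, entity text) pairs from namefind-style markup."""
    # \1 forces the closing tag to carry the same type as the opening tag.
    pattern = r"<namefind/(\w+)>\s*(.*?)\s*</namefind/\1>"
    return re.findall(pattern, tagged_text)

tagged_text = (
    "<namefind/person> Diana Hayden </namefind/person> was in"
    "<namefind/location> Philadelphia </namefind/location> city on"
    "<namefind/date> 3rd october </namefind/date>"
)

entities = extract_entities(tagged_text)
# Keep only the person names.
people = [text for kind, text in entities if kind == "person"]
print(entities)
print(people)
```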
Chunking (shallow parsing)
A chunker (shallow parser) segments a sentence into meaningful phrases.
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] .
Source: personalpages.manchester.ac.uk/staff/Sophia.Ananiadou/ DTCII .ppt
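One common way to represent a chunker's result is a per-token label that marks where each phrase begins (B-) and continues (I-), with O for tokens outside any phrase. The Python sketch below groups such labels back into phrases. The B/I/O encoding and the group_chunks helper are assumptions for illustration; the slide does not say which output format the chunker uses.

```python
def group_chunks(tagged):
    """Group (token, B/I/O chunk tag) pairs into (label, phrase) chunks."""
    chunks = []
    current = None  # the chunk currently being extended, if any
    for token, tag in tagged:
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk.
            current = (tag[2:], [token])
            chunks.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I- tag extends the open chunk of the same type.
            current[1].append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the open chunk.
            current = None
    return [(label, " ".join(tokens)) for label, tokens in chunks]

tagged = [
    ("He", "B-NP"), ("reckons", "B-VP"),
    ("the", "B-NP"), ("current", "I-NP"), ("account", "I-NP"), ("deficit", "I-NP"),
    ("will", "B-VP"), ("narrow", "I-VP"), ("to", "B-PP"),
    ("only", "B-NP"), ("#", "I-NP"), ("1.8", "I-NP"), ("billion", "I-NP"),
    ("in", "B-PP"), ("September", "B-NP"), (".", "O"),
]

chunks = group_chunks(tagged)
for label, phrase in chunks:
    print(label, phrase)
```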
Treebank parser
It tags tokens and groups phrases into a tree.
Input: A hospital bed is a parked taxi with the meter running
Output: (TOP (S (NP (DT A) (NN hospital) (NN bed)) (VP (VBZ is) (NP (NP (DT a) (VBN parked) (NN taxi)) (PP (IN with) (NP (DT the) (NN meter) (VBG running)))))))
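The bracketed output is a nested structure: each pair of parentheses holds a label followed by either a word or further subtrees. The Python sketch below reads such a string into nested tuples and recovers the original words. It illustrates the notation only; parse_tree and leaves are hypothetical helpers, not part of OpenNLP.

```python
import re

def parse_tree(text):
    """Parse '(LABEL child child ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested subtree
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()

def leaves(tree):
    """Collect the words of a parsed tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

bracketed = (
    "(TOP (S (NP (DT A) (NN hospital) (NN bed)) (VP (VBZ is) "
    "(NP (NP (DT a) (VBN parked) (NN taxi)) (PP (IN with) "
    "(NP (DT the) (NN meter) (VBG running)))))))"
)
tree = parse_tree(bracketed)
print(" ".join(leaves(tree)))
```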
Visualization of Treebank Parser (figure): parse tree for "a hospital bed is a parked taxi with the meter running", with the root S branching into NP and VP.
 

OpenNLP demo
