OpenNLP demo


This presentation was prepared on Ubuntu, so some formatting may be affected when it is opened on Windows.

Published in: Technology


  1. 1. Samatha Gagan Sunil
  2. 2. What is NLP? <ul><li>NLP provides a means of analyzing text </li></ul><ul><li>The goal of NLP is to make computers analyze and understand the languages that humans use naturally </li></ul><ul><li>Interaction between computers and humans </li></ul>
  3. 3. Why Natural Language Processing? <ul><li>kJfmmfj mmmvvv nnnffn333 </li></ul><ul><li>Uj iheale eleee mnster vensi credur </li></ul><ul><li>Baboi oi cestnitze </li></ul><ul><li>Computers “see” English text the same way you just saw the lines above! </li></ul><ul><li>People have no trouble understanding language </li></ul><ul><li>Computers have </li></ul><ul><ul><li>No common sense knowledge </li></ul></ul><ul><ul><li>No reasoning capacity </li></ul></ul>
  4. 4. Natural Language Processing: raw (unstructured) text is turned into annotated (structured) text by part-of-speech tagging, named entity recognition, and deep syntactic parsing. Example sentence: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells. With part-of-speech tags: Secretion/NN of/IN TNF/NN was/VBZ abolished/VBN by/IN BHA/NN in/IN PMA-stimulated/JJ U937/NN cells/NNS ./. Phrase labels in the parse tree: PP PP NP PP VP VP NP NP S. Source: personalpages.manchester.ac.uk/staff/Sophia.Ananiadou/DTCII.ppt
  5. 5. Uses of NLP <ul><ul><li>Text-based applications </li></ul></ul><ul><ul><li>Dialogue-based applications </li></ul></ul><ul><ul><li>Information extraction </li></ul></ul><ul><li>Extract useful information, e.g. from resumes </li></ul><ul><li>Automatic summarization </li></ul><ul><li>Condense one book into one page </li></ul>
  6. 6. What is OpenNLP? <ul><li>OpenNLP is an open-source, Java-based set of NLP tools that perform </li></ul><ul><li>sentence detection, </li></ul><ul><li>tokenization, </li></ul><ul><li>POS tagging, </li></ul><ul><li>parsing, </li></ul><ul><li>named-entity detection </li></ul><ul><li>using the OpenNLP package. 1 </li></ul>1 http://opennlp.sourceforge.net/
  7. 7. Use of OpenNLP in our university project <ul><li>It can be used to search for names using named entity recognition. </li></ul>
  8. 8. OpenNLP is used for: <ul><li>Sentence splitting </li></ul><ul><li>Tokenization </li></ul><ul><li>Part-of-speech tagging </li></ul><ul><li>Named entity recognition </li></ul><ul><li>Chunking </li></ul><ul><li>Treebank Parser </li></ul>
  9. 9. Sentence splitting sentence boundary = period + space(s) + capital letter. Input: Unusually, the gender of crocodiles is determined by temperature. If the eggs are incubated at over 33c, then the egg hatches into a male or 'bull' crocodile. At lower temperatures only female or 'cow' crocodiles develop. Detected sentences: (1) Unusually, the gender of crocodiles is determined by temperature. (2) If the eggs are incubated at over 33c, then the egg hatches into a male or 'bull' crocodile. (3) At lower temperatures only female or 'cow' crocodiles develop.
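The boundary rule on this slide (period + space(s) + capital letter) can be sketched with a regular expression. This is only a minimal illustration of the stated rule in plain Python; the function name `split_sentences` is hypothetical and this is not how OpenNLP's sentence detector is implemented.

```python
import re

# Naive rule from the slide: a period, then whitespace, then a capital letter.
# Assumption: this regex is an illustration only, not OpenNLP's own detector.
BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")

def split_sentences(text):
    """Split text wherever a period is followed by space(s) and a capital letter."""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]

text = (
    "Unusually, the gender of crocodiles is determined by temperature. "
    "If the eggs are incubated at over 33c, then the egg hatches into a male "
    "or 'bull' crocodile. At lower temperatures only female or 'cow' "
    "crocodiles develop."
)

sentences = split_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)

# The rule is fragile: an abbreviation followed by a name is split as well.
abbreviation_case = split_sentences("Dr. Smith arrived. He sat.")
print(abbreviation_case)
```

The last call shows why a fixed rule is not enough in practice: "Dr." is treated as a sentence of its own, which is the kind of case a trained sentence-detection model is meant to handle.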
  10. 10. sentDetect(s, language = "en", model = NULL). Arguments: s: a character vector with texts from which sentences should be detected. language: a character string giving the language of s; this argument is only used if model is NULL, for selecting a default model. model: a model; if model is NULL, a default model for sentence detection is loaded from the corresponding openNLP models language package. http://opennlp.sourceforge.net/
  11. 11. Tokenization <ul><li>Converts a sentence into a sequence of tokens </li></ul><ul><li>Divides the text into its smallest units (usually words), splitting punctuation off as separate tokens. </li></ul><ul><li>Rule: </li></ul><ul><li>Use spaces as the boundaries </li></ul><ul><li>Add spaces before and after special characters </li></ul>tokenize(s, language = "en", model = NULL) http://opennlp.sourceforge.net/
  12. 12. Tokenization Input: "A Saudi Arabian woman can get a divorce if her husband doesn't give her coffee." Output: "A Saudi Arabian woman can get a divorce if her husband does n't give her coffee ."
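The two tokenization rules (spaces as boundaries, spaces added around special characters) can be sketched in a few lines of Python. The helper name `simple_tokenize` and the explicit handling of "n't" are assumptions made to reproduce the example on this slide; OpenNLP's tokenizer itself is model-based and is not shown here.

```python
import re

def simple_tokenize(sentence):
    """Rule-based tokenizer sketch; not OpenNLP's actual tokenizer."""
    # Assumption: split the contraction "n't" from its verb, as in the
    # slide's example ("doesn't" becomes "does n't").
    spaced = re.sub(r"n't\b", " n't", sentence)
    # Add spaces before and after special characters (anything that is not
    # a word character, whitespace, or an apostrophe).
    spaced = re.sub(r"([^\w\s'])", r" \1 ", spaced)
    # Use whitespace as the token boundary.
    return spaced.split()

tokens = simple_tokenize(
    "A Saudi Arabian woman can get a divorce if her husband "
    "doesn't give her coffee."
)
print(" ".join(tokens))
```

Joining the tokens with single spaces gives a string in the same form as the tokenized sentence on the slide.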
  13. 13. Part-of-speech tagging Assigns a part-of-speech tag to each token in a sentence. Input: Most lipstick is partially made of fish scales Output: Most/JJS lipstick/NN is/VBZ partially/RB made/VBN of/IN fish/NN scales/NNS Usage: tagPOS(sentence, language = "en", model = NULL, tagdict = NULL) http://opennlp.sourceforge.net/
  14. 14. Part of speech tags 1 CC - Coordinating conjunction; CD - Cardinal number; DT - Determiner; EX - Existential there; FW - Foreign word; IN - Preposition or subordinating conjunction; JJ - Adjective; JJR - Adjective, comparative; JJS - Adjective, superlative; NN - Noun, singular or mass; NNS - Noun, plural; NNP - Proper noun, singular; NNPS - Proper noun, plural; PDT - Predeterminer; PRP - Personal pronoun; RB - Adverb; RBR - Adverb, comparative; RBS - Adverb, superlative; RP - Particle; SYM - Symbol; TO - to; UH - Interjection; VB - Verb, base form; VBD - Verb, past tense; VBG - Verb, gerund or present participle; VBN - Verb, past participle; VBP - Verb, non-3rd person singular present; VBZ - Verb, 3rd person singular present; WDT - Wh-determiner; WP - Wh-pronoun; WRB - Wh-adverb. Phrase labels: NP - Noun phrase; PP - Prepositional phrase; VP - Verb phrase. 1 http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
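Tagger output in the word/TAG notation used in the lipstick example can be split back into (word, tag) pairs and looked up in the tag list. The sketch below assumes the output is a single string of word/TAG items; `parse_tagged` and the small `PENN_TAGS` dictionary are hypothetical helpers, with the descriptions copied from the tag list above.

```python
# Subset of the tag list, limited to the tags used in the example sentence.
PENN_TAGS = {
    "JJS": "Adjective, superlative",
    "NN": "Noun, singular or mass",
    "VBZ": "Verb, 3rd person singular present",
    "RB": "Adverb",
    "VBN": "Verb, past participle",
    "IN": "Preposition or subordinating conjunction",
    "NNS": "Noun, plural",
}

def parse_tagged(tagged):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # rpartition splits on the last slash, so a word that itself
        # contains a slash keeps it.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = "Most/JJS lipstick/NN is/VBZ partially/RB made/VBN of/IN fish/NN scales/NNS"
pairs = parse_tagged(tagged)
for word, tag in pairs:
    print(f"{word:10} {tag:4} {PENN_TAGS.get(tag, 'unknown tag')}")
```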
  15. 15. Named-Entity Recognition <ul><li>Named entity recognition classifies tokens in text into predefined categories such as date, location, person, and time. </li></ul><ul><li>The name finder can find up to seven different types of entities: date, location, money, organization, percentage, person, and time. </li></ul>
  16. 16. Named-Entity Recognition Input: Diana Hayden was in Philadelphia city on 3rd october Output: <namefind/person> Diana Hayden </namefind/person> was in <namefind/location> Philadelphia </namefind/location> city on <namefind/date> 3rd october </namefind/date>
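For a use such as searching for names, the entities have to be pulled back out of the marked-up output. A minimal sketch, assuming the output is a string using the `<namefind/type>` markup shown on this slide; the function name `extract_entities` is hypothetical and not part of OpenNLP.

```python
import re

# Matches <namefind/TYPE> ... </namefind/TYPE> and captures TYPE and the text.
ENTITY = re.compile(r"<namefind/(\w+)>\s*(.*?)\s*</namefind/\1>")

def extract_entities(marked_up):
    """Return (entity_type, text) pairs found in name finder markup."""
    return [(m.group(1), m.group(2)) for m in ENTITY.finditer(marked_up)]

marked_up = (
    "<namefind/person> Diana Hayden </namefind/person> was in"
    "<namefind/location> Philadelphia </namefind/location> city on"
    "<namefind/date> 3rd october </namefind/date>"
)

entities = extract_entities(marked_up)
for entity_type, value in entities:
    print(f"{entity_type}: {value}")

# Searching for names then reduces to filtering on the entity type.
people = [value for entity_type, value in entities if entity_type == "person"]
```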
  17. 17. Chunking (shallow parsing) A chunker (shallow parser) segments a sentence into meaningful phrases. Example: [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] . Source: personalpages.manchester.ac.uk/staff/Sophia.Ananiadou/DTCII.ppt
  18. 18. Treebank parser It tags tokens and groups phrases into a tree. Input: A hospital bed is a parked taxi with the meter running Output: (TOP (S (NP (DT A) (NN hospital) (NN bed)) (VP (VBZ is) (NP (NP (DT a) (VBN parked) (NN taxi)) (PP (IN with) (NP (DT the) (NN meter) (VBG running)))))))
  19. 19. Visualization of Treebank Parser: tree diagram for “a hospital bed is a parked taxi with the meter running”. The root S splits into NP (DT NN NN) and VP (VBZ NP); the object NP contains NP (DT VBN NN) and PP (IN NP), whose NP is DT NN VBG.
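The bracketed parser output can be read back into a nested structure and printed as an indented tree, which is a text version of the visualization. The sketch below is a small recursive-descent reader written for this example; `parse_tree`, `leaves`, and `show` are hypothetical helper names, not OpenNLP functions.

```python
import re

def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def node():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError("expected '('")
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(node())             # nested phrase or tag
            else:
                children.append(tokens[position])   # a word (leaf)
                position += 1
        position += 1                               # skip the closing ")"
        return (label, children)

    return node()

def leaves(tree):
    """Collect the words of the tree from left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def show(tree, indent=0):
    """Print one node per line, indented by depth."""
    label, children = tree
    words = [child for child in children if isinstance(child, str)]
    print(" " * indent + " ".join([label] + words))
    for child in children:
        if isinstance(child, tuple):
            show(child, indent + 2)

parse = ("(TOP (S (NP (DT A) (NN hospital) (NN bed)) (VP (VBZ is) "
         "(NP (NP (DT a) (VBN parked) (NN taxi)) (PP (IN with) "
         "(NP (DT the) (NN meter) (VBG running)))))))")
tree = parse_tree(parse)
show(tree)
sentence = " ".join(leaves(tree))
```

Reading the leaves from left to right recovers the original sentence, which is a quick check that the bracketed string was parsed consistently.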