Introduction to Natural Language Processing



Natural Language Processing has matured considerably in recent years. With capable open source tools now available to serve the needs of the Semantic Web, we believe this field should be on the radar of all software engineering professionals.

Published in: Technology


  1. Natural Language Processing: Quick Introduction. Rohit Nayak, Talentica Software
  2. • Part 1: Semantic Web, Uses of NLP, Core Concepts, Intro to GATE
     • Part 2: GATE Detailed Demo
  3. NLP 420
     • Falling Tree Hits, Kills OR Forest Service Worker
     • Time flies like an arrow
     • Choosing a Program to Improve Your Future
     • Monkeys like bananas when they wake up
     • Monkeys like bananas when they are ripe
  4. I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize. – Tim Berners-Lee, 1999
  5. Disaster Type: earthquake
     • location: Afghanistan
     • date: 05/30/1998
     • magnitude: 6.9
     • epicenter: a remote part of the country
     • damage:
        • human-effect:
           • victim: Thousands of people
           • number: Thousands
           • outcome: dead
        • physical-effect:
           • object: entire villages
           • outcome: damaged
     QUAKE IN AFGHANISTAN Thousands of people are feared dead following... (voice-over) ... a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale, centered in a remote part of the country. (on camera) Details now hard to come by, but reports say entire villages were buried by the quake.
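The slot filling on slide 5 can be approximated, for this one report, with a few hand-written regular expressions. The sketch below is only an illustration: the slot names, the patterns, and the lightly cleaned copy of the news text are assumptions made for the example, not the method used in the slides, and a real extraction system would need far more robust rules or a trained model.

```python
import re

# Lightly cleaned copy of the news text from slide 5 (stage directions removed).
REPORT = (
    "QUAKE IN AFGHANISTAN. Thousands of people are feared dead following "
    "a powerful earthquake that hit Afghanistan today. The quake registered "
    "6.9 on the Richter scale, centered in a remote part of the country. "
    "Details now hard to come by, but reports say entire villages were "
    "buried by the quake."
)

# Hypothetical patterns, one per template slot; group 1 holds the slot value.
SLOT_PATTERNS = {
    "disaster_type": r"\b(earthquake|flood|hurricane)\b",
    "location": r"\bhit ([A-Z][a-z]+)",
    "magnitude": r"registered (\d+(?:\.\d+)?) on the Richter scale",
    "epicenter": r"centered in (.+?)\.",
    "victim": r"([A-Z][a-z]+ of people) (?:are|is) feared dead",
}


def fill_template(text):
    """Return a dict mapping each slot to its first match, or None."""
    template = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = re.search(pattern, text)
        template[slot] = match.group(1) if match else None
    return template


if __name__ == "__main__":
    for slot, value in fill_template(REPORT).items():
        print(f"{slot}: {value}")
```

Slots that have no pattern here (such as the date, which the text only gives as "today") would need extra context, for example the broadcast date, to be filled.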
  6. Text Categorization: Is the document about plants? sports? health and fitness? corporate acquisitions? … stock market?
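One common way to answer the question on slide 6 is a Naive Bayes classifier over word counts. The sketch below is a minimal version with add-one smoothing; the three categories are borrowed from the slide, but the tiny training set and the function names `train` and `classify` are invented for illustration. In practice a toolkit such as Weka or NLTK (both listed later in the deck) would be used with real training data.

```python
import math
from collections import Counter, defaultdict

# Toy training set, invented for this example only.
TRAINING = [
    ("sports", "the team won the match after a late goal"),
    ("sports", "the striker scored twice in the final game"),
    ("stock market", "shares fell as investors sold technology stocks"),
    ("stock market", "the index rose and stocks closed higher"),
    ("plants", "the fern needs shade water and rich soil"),
    ("plants", "prune the roses and water the seedlings"),
]


def train(examples):
    """Count words per label, documents per label, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for label, text in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify("investors bought stocks", model))
    print(classify("water the roses", model))
```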
  7. Sentiment Classification: Is the overall sentiment in the document positive? negative? In general, sentiment classification appears to be harder than categorizing by topic.
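A lexicon-based scorer gives a feel for why sentiment is harder than topic: the same word can flip polarity depending on its context. The sketch below is a deliberately naive illustration; the word lists and the one-token negation rule are assumptions made for the example, not a technique taken from the slides.

```python
# Tiny hand-made lexicons, assumed for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "enjoyable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "boring"}
NEGATORS = {"not", "never", "no"}


def sentiment(text):
    """Classify text as 'positive', 'negative' or 'neutral'."""
    score = 0
    negate = False
    for token in text.lower().replace(".", " ").replace(",", " ").split():
        if token in NEGATORS:
            negate = True  # flip the polarity of the next token only
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(sentiment("The plot was great and the acting was excellent."))
    print(sentiment("The film was not good."))
    print(sentiment("The film was not very good."))
```

The last call shows the weakness of the approach: because the negation only reaches the immediately following token, "not very good" is scored as positive. Handling scope, sarcasm, and domain-specific vocabulary is what makes the task difficult.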
  8. Information Extraction: text collection → Information Extraction System → filled templates (Who: _____ What: _____ Where: _____ When: _____ How: _____)
  9. Information Extraction (IE)
     • Recognition, tagging, and extraction into a structured representation of certain key elements of information (e.g. persons, companies, locations, organizations) from large collections of text.
     • These extractions can then be utilized for a range of applications including question-answering, visualization, and data mining.
  10. Question-Answering
     • In contrast to Information Retrieval, which provides a list of potentially relevant documents in response to a user’s query, Question-Answering provides the user with either just the text of the answer itself or answer-providing passages.
  11. Summarization
     • Reduces a larger text into a shorter, yet richly constituted, abbreviated narrative representation of the original document.
  12. Machine Translation
     • Perhaps the oldest of all NLP applications. Various levels of NLP have been utilized in MT systems, ranging from the ‘word-based’ approach to applications that include higher levels of analysis.
  13. Dialogue Systems
     • Perhaps the omnipresent application of the future, in the systems envisioned by large providers of end-user applications.
     • Dialogue systems usually focus on a narrowly defined application (e.g. your refrigerator or home sound system) and currently utilize the phonetic and lexical levels of language.
     • It is believed that utilization of all the levels of language processing offers the potential for truly habitable dialogue systems.
  14. Challenge of Semantic Web
     • Machine processable data to complement hypertext
     • Attach metadata to documents
        • Explicit: title, author, creation date
        • Implicit: deduced information like names of entities and their relation
  15. Ontology
     • Specification of conceptualisation
     • Basis of document “understanding”
     • Creating and populating is very time-consuming, practically impossible
  16. Simple Workflow
     • Classification
     • Tokeniser
     • Gazetteer
     • Sentence Splitter
     • Parts Of Speech Tagging
     • Named Entity Tagging
     • Final Extraction
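The first stages of the workflow on slide 16 can be sketched as a chain of small functions that each add annotations over the text. This is a standard-library illustration, not GATE code: the function names, the gazetteer entries, and the sample sentence are all assumptions, and part-of-speech tagging is left out because it needs a trained model.

```python
import re

# Hypothetical gazetteer: lower-cased entry -> lookup kind.
GAZETTEER = {
    "london": "location",
    "paris": "location",
    "ltd": "companyDesignator",
    "inc": "companyDesignator",
}


def tokenise(text):
    """Return (token, start, end) triples for words and punctuation marks."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def split_sentences(tokens):
    """Group tokens into sentences, ending each at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token[0] in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def lookup(tokens):
    """Return (token, start, end, kind) for tokens found in the gazetteer."""
    return [(tok, start, end, GAZETTEER[tok.lower()])
            for tok, start, end in tokens
            if tok.lower() in GAZETTEER]


def run_pipeline(text):
    """Run tokeniser, sentence splitter and gazetteer lookup in sequence."""
    tokens = tokenise(text)
    return {
        "tokens": tokens,
        "sentences": split_sentences(tokens),
        "lookups": lookup(tokens),
    }


if __name__ == "__main__":
    result = run_pipeline("Acme Inc opened an office in London. Sales rose.")
    print(len(result["sentences"]), "sentences")
    for annotation in result["lookups"]:
        print(annotation)
```

Keeping character offsets with every token mirrors the annotation-over-text idea that later stages (named entity tagging, final extraction) build on. The naive splitter would wrongly break after abbreviations such as "Inc.", which is one reason real pipelines use dedicated components.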
  17. Tools
     • GATE
     • OpenNLP
     • NLTK (python)
     • Stanford Parser
     • Weka for classification
  18. GATE
     • General Architecture for Text Engineering
     • Over 10 years, active development
     • Most popular NLP platform
     • Current version 5.0
     • Built as a framework for both programmers and developers
     • Powerful GUI and well-documented Java API
     • Multilingual
  19. GATE
     • Clean separation of low-level tasks (e.g., data storage) from the NLP components
     • Separation between linguistic data and algorithms that process it
  20. JAPE
     • Java Annotation Patterns Engine (jokingly, “Just A Pleasant Experience”)
     • Pattern-Matching over Annotations
     • Regular Expression like
     • Can use Java in actions
  21. Rule: Company1
      Priority: 25
      (
        ({Token.orthography == upperInitial})+
        {Lookup.kind == companyDesignator}
      ):companyMatch
      -->
      :companyMatch.NamedEntity = {kind = "company", rule = "Company1"}
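To make the rule on slide 21 concrete, here is a rough Python emulation of what it matches: one or more tokens with an initial capital, followed by a token that a gazetteer marks as a company designator. This is not JAPE and not the GATE API; the designator set, the `is_upper_initial` helper, and the whitespace tokenisation are assumptions made for the sketch, and JAPE's own matching and priority semantics are more involved.

```python
# Stand-in for gazetteer Lookup annotations of kind "companyDesignator".
COMPANY_DESIGNATORS = {"Ltd", "Inc", "Corp", "GmbH"}


def is_upper_initial(token):
    """True for tokens like 'Acme': first letter upper case, rest lower case."""
    return token[:1].isupper() and (len(token) == 1 or token[1:].islower())


def find_companies(tokens):
    """Return NamedEntity-like dicts for upperInitial+ designator sequences."""
    matches = []
    next_free = 0  # tokens before this index belong to an earlier match
    for end, token in enumerate(tokens):
        if token not in COMPANY_DESIGNATORS:
            continue
        # Walk backwards over the run of upperInitial tokens before the designator.
        start = end
        while start > next_free and is_upper_initial(tokens[start - 1]):
            start -= 1
        if start < end:  # the rule requires at least one upperInitial token
            matches.append({
                "text": " ".join(tokens[start:end + 1]),
                "kind": "company",
                "rule": "Company1",
            })
            next_free = end + 1
    return matches


if __name__ == "__main__":
    sentence = "We met Acme Widgets Ltd and talked to IBM Corp yesterday"
    for entity in find_companies(sentence.split()):
        print(entity)
```

In this emulation "IBM Corp" is not matched, because "IBM" is all capitals rather than upperInitial; a real grammar would typically add further rules or alternatives to cover such cases.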
  22. CREOLE components
     • GATE plugins use CREOLE
     • Collection of REusable Objects for Language Engineering
     • Modified JavaBeans with XML configuration
     • Minimal component: 10 lines of Java, 10 lines of XML
  23. External Slideshow (27)
  24. GATE Demo
     • Quick look
     • Detailed Demo next SIG