II-SDV 2012 Getting beyond Keywords: Using Conceptual Representations of Text in Search, Analysis and Discovery

527 views

Published on

Published in: Internet
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
527
On SlideShare
0
From Embeds
0
Number of Embeds
128
Actions
Shares
0
Downloads
10
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

II-SDV 2012 Getting beyond Keywords: Using Conceptual Representations of Text in Search, Analysis and Discovery

  1. 1. Getting beyond Keywords: Using Conceptual Representations of Text in Search, Analysis, and Discovery Roger Bradford Agilex Technologies 16 April 2012
  2. 2. 2 What’s Wrong with Keyword Search? 6.9 Million Documents
  3. 3. 3 • Synonomy: Common English Nouns have 6-8 Close Synonyms Common English Verbs have 9-11 • Polysemy: The English Word Strike has >30 Common Meanings • Implicit Context (e.g., John’s Project) • Data Irregularities: Missing Erroneous Contradictory • Cross-document Relationships Complexities of Text
  4. 4. 4 The Problem of Synonomy Xxxxxxxxxxxxxxx Xxxxxxxxxxxxxxx Xxxxxxxxxxxxxxx Xxx Auto xxx Xxxxxxxxxxxxxxx xxxxxxxxxxxxxxx User Terminology Author Terminology Car? Automobile? Motor Vehicle?
  5. 5. 5 Large Collections ⇒⇒⇒⇒ Many Terms to Choose from
  6. 6. 6 The Problem of Polysemy Miss (Baseball)? Hit (Bowling)? Labor Action? Military Operation? … User Terminology Intent Strike
  7. 7. 7 John ↔ Bob Relationship: First Order: Second Order: Third Order: JOHN BOB JOHN TOM TOM BOB JOHN TOM TOM DAVE DAVE BOB 51,474 11,026,553 68,070,600 # of Relations in 5,998 Documents: Cross-document Relationships
  8. 8. 8 How do Semantic Representations Help? 1.Improve Efficiency 2.Overcome Errors 3.Deal with Complex Information Requirements 4.Organize Information 5.Interpret Results 6.Discover New Information 7.Work across Languages 8.Search Multiple Databases Simultaneously 9.Search Multimedia Collections 10.Support Direct Analytics Top 10 Benefits:
  9. 9. 9 Implementation Approaches • Capture Semantics Manually(Labor-intensive): Taxonomies Dictionaries • Approximate Semantics using Local Co- occurrence Statistics (Low Accuracy) • Employ Matrix Decomposition Techniques (Computationally Intensive) : LSI PLSI Ontologies Grammars LRA SDD-LSI
  10. 10. 10 8 of the Top 10 Litigation Support Platforms use it for Analytics 6 of the Top 8 Essay Scoring Packages use it for Holistic Scoring (SAT, GMAT, etc.) Most Spam Filters are Based on it It is used for Survey Analysis for Major Corporations (Marriott, Home Depot, etc.) It is used in the Most Advanced Literature- based Discovery Systems Semantic Processing: Commercial Application Examples
  11. 11. 11 Xxxxxxxxxxxxxxxxx Xxxxxxxxxxxxxxxxx Xxxxxxxxxxxxxxxxx x automobile xx Xxxxxxxxxxxxxxxxx Xxxxxxxxxxxxxxxxx Xxxxxxxxxxxxxxxxx Conceptual Search Car Semantic Space
  12. 12. 12 Xxxxxxxxxxxxxx Xxxxxxxxxxxxxx methods of armed struggle not accepted internationally Xxxxxxxxxxxxxxx Xxxxxxxxxxxxxxx Deep Conceptual Generalization is Possible War Crimes Semantic Space
  13. 13. 13 Muammar Kaddafi Muammar Qadhdhafi Muammar Qadhafi Muammar Ghadafi Muammar Gaddafi Moammar Qaddafi Muammar Gaddaffi Muamar Qaddafi Muammar Qadhaffi Muammar Kadafi Muammer Gadaffi Muamar Ghadaffi Muammar Gadhafi Muammar Qhadhafi Muammar Qathafi Muammar Kadhafi Muammar al Qadaffi Moammar Gadaffi Mouammar Qaddafi Muammar al Qadhaffi Moammar Qadafi Muamar Gadaffi Moammar Qadhafi Muammar Qaddafi Automated Terminology Variant Identification Query = Muammar Qadaffi Variants Found Automatically in 6.1 M News Articles:
  14. 14. 14 X Query Term Transcription Errors Nicknames Synonyms OCR Errors Trade Names Speech Conversion Errors Aliases Transliteration Differences Comprehensive Variant Clustering
  15. 15. 15 Interpreting Results Today MOX is widely used in Europe and is planned to be used in Japan. Currently over 30 reactors in Europe (Belgium, Switzerland, Germany and France) are using MOX and a further 20 have been licensed to do so. Japan also plans to use MOX in around a third of its reactors by 2010. Most reactors use it as about one third of their core, but some will accept up to 50% MOX assemblies. France aims to have all its 900 MWe series of reactors running with at least one third MOX. Japan aims to have one third of its reactors using MOX by 2010, and has
  16. 16. 16 Today MOX is widely used in Europe and is planned to be used in Japan. Currently over 30 reactors in Europe (Belgium, Switzerland, Germany and France) are using MOX and a further 20 have been licensed to do so. Japan also plans to use MOX in around a third of its reactors by 2010. Most reactors use it as about one third of their core, but some will accept up to 50% MOX assemblies. France aims to have all its 900 MWe series of reactors running with at least one third MOX. Japan aims to have one third of its reactors using MOX by 2010, and has uranium-plutonium mixed-oxide bnfl reactor reactor's plutonium-uranium bnfl's nuclear-fuel vver Instant Context Display
  17. 17. 17 Incoming Document Stream Immediate Action Queue for Review No Further Action Conceptual Generalization + Context “Understanding” ⇒⇒⇒⇒ Accurate Categorization
  18. 18. 18 Semantic Processing ⇒⇒⇒⇒ Noise Tolerance 0.0% 20.0% 40.0% 60.0% 80.0% 100.0% 0.0% 20.0% 40.0% 60.0% 80.0% 100.0% Percentage of Words Corrupted CategorizationAccuracy ComparedtoBaseline
  19. 19. 19 Sample Document with 70% Word Error Rate – Still Categorized Correctly EORE MA4EUE uIqTED AT CAJFMARHUILLA Peru's stnte minerals markeing Arm, Minero PFou CVvercial SA (Mi1peco)C ifted a force majeure n zinc iWgotshipmenfwfrom 2he country's biggesD zioc sefintry at CajUmarquivla,Vaspekus an said. The spofesmanBsaid the p oblems affectiKg sulphurxg a iT and roast j plants that Uad hgltedWproduction sinUe May p had bsen resolvFd. D Howvjr,Khe BTid producti6 of znc ingots thisLyearPwas expe ted to fall to aro nd 86,000 tonnes thistyear at CajamruuiwpT, from 9wT000 Tonne7 in 198x becauNe of1the f5oppage O Thm refVuery has an opKimum annuv pjoduotibn capwcity of 100,000 tonn s but its highest qroueio ca 969000 tonnes of refined zRnc zLgots inw1985, the spokebman sa d.
  20. 20. 20 Application: Sentiment Detection Mideast Sentiment 0400 hrs 5 August 2002 Automatic Detection in News Articles
  21. 21. 21 Good, Quality, Standard Nice hotel, Clean Lobby, Small Enjoyed stay Bathrooms, Hair, Plug Showers, Water, Head, Floor Smoking room, Pool area, Cigarettes Light, Poor, Dark, Bedside Work properly, Flushing Excellent staff Property, View, Beautiful hotel Noise, Traffic, Construction, Sleep 5,000 Surveys from Large Hotel Chain Automatic Taxonomy Generation 426 253 410 348 281 260 231 251 250 242 198 192 Internet access, High-speed internet172
  22. 22. 22 Application: Summarization • So I ask you to join me in expanding their access to child care, creating new hiring preferences for military spouses across the federal government, and allowing our troops to transfer their unused education benefits to their spouses or children. • And tonight, I ask Congress to support an innovative proposal to provide food assistance by purchasing crops directly from farmers in the developing world, so we can build up local agriculture and help break the cycle of famine. 45- minute Speech
  23. 23. 23 Represent Complex Information Requirements (I) Job Descriptions E-mail Alert to Hiring Manager Résumé For One Fortune 500 Company ⇒⇒⇒⇒ Savings of >3 Days in Average Applicant Processing Time
  24. 24. 24 Represent Complex Information Requirements (II) Sections of Previous Proposals Draft Proposal InputsRFP Subsection One Fortune 500 Company Estimates Savings of >$10 Million per Year
  25. 25. 25 Zukas, A., GO-Driven Literature-Based Discovery using Semantic Analysis, MS Thesis, George Mason University, 2007. • PubMed Abstracts • Gene – Function Relationships Derived Semantically • 98,074,359 Potential Gene- function Associations. Literature-based Discovery (I)
  26. 26. 26 Latent Gene and Function Relationships from the June 2006 Gene Ontology - Later Documented in the January 2007 Gene Ontology Literature-based Discovery (II)
  27. 27. 27 Social Media – Goldmine of Public Opinion?
  28. 28. 28 Volume – 200 Million Tweets per Day Worldwide Velocity – Minutes to Hours Variety - 17 Languages Variability – Very Low Signal to Noise Ratio 28 Processing Social Media = Big Data Problem
  29. 29. 29 Cross-lingual Retrieval Farsi English Arabic English English Doc Retrieved Documents in Correct Relevance Order Documents in Multiple Languages
  30. 30. 30 Interleaved Results Display
  31. 31. 31 Federation of Queries Database #1 Database #2 Database #3 Merged Result List
  32. 32. 32 Buyer Seller Material Amount Date John Smith Ace Jewelers Diamond Ring $3,000 8/18/11 RDBMS Semantic Representation Space Rows = Mini-documents Text Combined Search of Structured and Unstructured Information
  33. 33. 33 Where is Semantic Analysis Going? 1.Larger Databases (100s of Millions of Documents +) 2.Multimedia Collections Text Relational Data Audio Video 3.Visualization 4.Analytics
  34. 34. 34 0 100 200 300 400 500 0 25 50 75 100 BuildTime(hrs) Millions of Documents 2010 2009 2007 2011 Hardware Advances ⇒⇒⇒⇒ Large Applications
  35. 35. 35 Integrated Analytics Structured Data Images Multi- lingual Text Audio Sensor Data Video Integration of Multimedia Data Buyer Seller Material Amount Date John Smith Ace Jewelers Diamond Ring 3 Carat 8/18/06
  36. 36. 36 Analytics + Visualization
  37. 37. 37 Questions or Comments Roger Bradford Agilex Technologies Inc 703-889-3916 r.bradford@agilex.com

×