
DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight: a configurable annotation tool to support a variety of use cases. Given input text in English, we extract DBpedia Resources and generate annotations according to user-provided configuration parameters. These parameters can include score thresholds, entity types, and even arbitrary "type" definitions through SPARQL queries.

This presentation was given in the best paper award session at I-SEMANTICS 2011.


  1. DBpedia Spotlight: Shedding Light on the Web of Documents
     Pablo N. Mendes, Max Jakob, Andrés Garcia-Silva, Christian Bizer
     pablo.mendes@fu-berlin.de
     I-SEMANTICS, Graz, Austria, September 9th 2011
  2. Agenda
     What is text annotation?
     What can you build with it?
     Why is it difficult?
     How did we approach the challenge?
     How well did it work?
     What are the next steps?
     Mendes, Jakob, Garcia-Silva, Bizer. DBpedia Spotlight: Shedding Light on the Web of Documents
  3. What is it?
  4. Text Annotation
     From plain text:
     (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
     To annotated text, e.g.:
     "New York" -> http://dbpedia.org/resource/New_York_City
     "Apple Corps" -> http://dbpedia.org/resource/Apple_Corps
  5. Challenge: Term Ambiguity
     ...this apple on the palm of my hand...
     ...Apple tried to acquire Palm Inc....
     ...eating an apple sitting by a palm tree...
     What do "apple" and "palm" mean in each case?
     Our objective is to recognize entities and disambiguate their meaning, generating DBpedia annotations in text.
  6. What can you do with annotations?
     Links to complementary information ("More about this")
     Faceted browsing of blog posts (e.g. show only posts with topics related to Sports)
     Rich snippets on Google: search engines start to display info from annotations
     More expressive filtering of information streams: Twarql (entry at the I-SEMANTICS 2010 Challenge)
  7. Rich Snippets
     Search engines already benefit from some kinds of annotations.
     http://www.google.com/webmasters/tools/richsnippets
  8. Twarql Example Use Case
     Which competitors of my product are being mentioned together with my product on Twitter? (comparative opinion!)
     SELECT ?competitor
     WHERE {
       dbpedia:IPad skos:subject ?category .
       ?competitor skos:subject ?category .
       ?tweet moat:taggedWith ?competitor .
       ?tweet moat:taggedWith dbpedia:IPad .
     }
  9. Twarql Example Use Case (2)
     [Diagram: incoming microposts matched against background knowledge (e.g. DBpedia). Competition is modeled as two products in the same category in DBpedia.]
  10. Twarql Example Use Case (3)
      [Diagram: dbpedia:IPad has skos:subject links to category:Wi-Fi and category:Touchscreen. Background knowledge is dynamically "brought into" microposts.]
  11. Twarql Example Use Case (4)
      [Diagram: trigger an action if a micropost matches the constraints.]
  12. DBpedia Spotlight
      DBpedia is a collection of entity descriptions extracted from Wikipedia and shared as linked data.
      DBpedia Spotlight uses data from DBpedia and text from the associated Wikipedia pages.
      It learns how to recognize that a DBpedia resource was mentioned.
      Given plain text as input, it generates annotated text.
  13. Why is it difficult?
  14. Dataset overview
      Volume of Wikipedia: 56.9 GB of raw text data
      Occurrences of ambiguous terms in Wikipedia: 58.8%
      Sparsity: less data for some DBpedia resources
  15. Histogram: URI occurrences (x-axis: log(n(uri)))
      Many "rare" URIs (few links on Wikipedia); few "popular" URIs (lots of links on Wikipedia).
      Most previous work deals with the popular entity types: People, Organization, Location.
  16. Histogram: Surface Form Ambiguity (x-axis: log(n(uri,sf)))
      Many "unambiguous" surface forms; few very "ambiguous" surface forms.
      Max: 1199 (log=7.08), Min: 1, Mean: 1.328949
  17. Ambiguity
      What are the most ambiguous surface forms?
  18. Name Variation
      Which URIs have many surface forms?
  19. How did we approach the challenge?
  20. A 4-stage approach
      1. Spotting
      2. Candidate Mapping
      3. Disambiguation
      4. Linking
  21. Stage 1: Spotting
      Find substrings that seem worthy of annotation.
      Naive implementation (impractical): all n-grams of length 1 to |text|.
      Input: (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
      Output: "Lennon", "McCartney", "New York", "Apple Corps"
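The naive spotting baseline above can be sketched in a few lines. This is an illustrative sketch, not Spotlight's code; `max_len` is capped here only to keep the example small, whereas the slide's impractical variant would go up to the full text length.

```python
def spot_ngrams(text, max_len=4):
    """Naive spotter: emit every n-gram of up to max_len tokens.

    Illustrates the search space a spotter must prune; real spotting
    instead matches only known surface forms.
    """
    tokens = text.split()
    spots = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            spots.append(" ".join(tokens[i:i + n]))
    return spots

print(spot_ngrams("Lennon and McCartney went", max_len=2))
```

Even for this 4-token input, bigrams alone add 3 candidate spots to the 4 unigrams, which is why the next slide replaces this with dictionary-based matching.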
  22. Spotting in DBpedia Spotlight
      Detect that the label (surface form) of a DBpedia resource was mentioned.
      Lexicalized spotting: Aho-Corasick algorithm (LingPipe implementation).
      Name variations from redirects, disambiguation pages, and anchor texts.
      Advantages: simple implementation, well-studied problem, produces a reduced set of spots, relies on user-provided terms.
      Drawback: high memory requirements (~7 GB).
  23. Stage 2: Candidate Mapping
      What are the possible senses of a given surface form (the candidate DBpedia resources)?
      Input: "Lennon", "McCartney", "New York", "Apple Corps"
      Output:
      "Lennon": { Lennon_(album), Lennon,_Michigan, … }
      "McCartney": { McCartney_(surname), Paul_McCartney, … }
      "New York": { New_York_State, New_York_City, … }
      "Apple Corps": { Apple_Corps }
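Candidate mapping is essentially a dictionary lookup from surface form to resource set. A minimal sketch, with a hypothetical hand-filled map standing in for the one Spotlight builds from page titles, redirects, and disambiguation pages:

```python
# Hypothetical candidate map; Spotlight derives this from Wikipedia data.
CANDIDATES = {
    "Lennon": {"Lennon_(album)", "Lennon,_Michigan", "John_Lennon"},
    "New York": {"New_York_State", "New_York_City"},
    "Apple Corps": {"Apple_Corps"},
}

def candidates_for(surface_form):
    # Unknown surface forms simply yield no candidates (nothing to annotate).
    return CANDIDATES.get(surface_form, set())

print(sorted(candidates_for("New York")))
```

Note that "Apple Corps" is unambiguous (one candidate), so the disambiguation stage has work to do only for the other spots.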
  24. Candidate Mapping in DBpedia Spotlight
      Sources of mappings between surface forms and DBpedia resources:
      Page titles offer "chosen names" for resources.
      Redirects offer alternative spellings, aliases, etc.
      Disambiguation pages link a common term to many resources.
  25. Candidate Map: Disambiguation Pages
      Collectively provide a list of ambiguous terms and the meanings of each.
  26. Candidate Map: Redirects
      All of the following redirect to Apple_Inc:
      AAPL; Apple (Company); Apple (Computers); Apple (company); Apple (computer); Apple Company; Apple Computer; Apple Computer Co.; Apple Computer Inc.; Apple Computer Incorporated; Apple Computer, Inc; Apple Computer, Inc.; Apple Computers; Apple Inc; Apple Incorporate; Apple Incorporated; Apple India; Apple comp; Apple compputer; Apple computer; Apple computer Inc; Apple computers; Apple inc; Apple inc.; Apple incoporated; Apple incorporated; Apple pc; Apple's; Apple, Inc; Apple, Inc.; Apple,inc.; Apple.com; AppleComputer; Bowman Bank; Cripple Inc.; Inc. Apple Computer; Jobs and Wozniak; Option-Shift-K; Inc.
  27. Stage 3: Disambiguation
      Select the correct candidate DBpedia resource for a given surface form.
      The decision is made based on the context in which the surface form was mentioned.
      con·text, n.
      1. The parts of a discourse that surround a word or passage and can throw light on its meaning.
      2. The circumstances in which an event occurs; a setting.
      http://mw1.merriam-webster.com/dictionary/context
  28. Learning the Context for a Resource
      Collect context for DBpedia resources from Wikipedia.
      Types of context: Wikipedia pages, definitions from disambiguation pages, paragraphs that link to the resource.
      Example: (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
  29. Disambiguation in DBpedia Spotlight
      Model DBpedia resources as vectors of terms found in Wikipedia text.
      Define functions for term scoring and vector similarity (e.g. frequency and cosine).
      Rank candidate resource vectors by their similarity to the vector of the input text.
      Choose the highest-ranking candidate.
      Lennon = {Beatles, McCartney, rock, guitar, ...}
      Lennon = {tf(Beatles)=320, tf(McCartney)=100, ...}
      cos(Input, Lennon) = 0.12
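The rank-by-similarity step above can be sketched with plain term-frequency vectors and cosine similarity. A minimal sketch with made-up counts (not Spotlight's code, which uses Lucene and richer weighting):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term->weight vectors (dicts)."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def disambiguate(context_tf, candidate_vectors):
    # Pick the candidate whose Wikipedia-derived vector best matches the context.
    return max(candidate_vectors,
               key=lambda uri: cosine(context_tf, candidate_vectors[uri]))

# Hypothetical candidate vectors learned from Wikipedia text.
candidates = {
    "John_Lennon": {"Beatles": 320, "McCartney": 100, "guitar": 50},
    "Lennon,_Michigan": {"township": 40, "Michigan": 80},
}
context = {"Beatles": 2, "McCartney": 1, "album": 1}
print(disambiguate(context, candidates))  # John_Lennon
```

A context mentioning "Beatles" shares no terms with the Michigan township's vector, so the musician wins even with these tiny toy counts.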
  30. Scoring Strategies
      TF*IDF (Term Frequency * Inverse Document Frequency)
      TF: insight into the relevance of the term in the context of a DBpedia resource.
      IDF: insight into the rarity of the term; co-occurrence of rare terms is more informative.
      ICF: Inverse Candidate Frequency
      IDF measures rarity across the whole of Wikipedia; ICF measures rarity relative to the possible senses only.
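The IDF-vs-ICF distinction can be made concrete: ICF is computed over just the candidate set of one surface form. A sketch under that reading (illustrative only; the candidate vectors are hypothetical):

```python
import math

def icf(term, candidate_vectors):
    """Inverse Candidate Frequency: rarity of a term among the candidates
    of one surface form, rather than across all of Wikipedia (IDF)."""
    n_candidates = len(candidate_vectors)
    n_with_term = sum(1 for vec in candidate_vectors.values() if term in vec)
    # Terms absent from every candidate carry no signal here.
    return math.log(n_candidates / n_with_term) if n_with_term else 0.0

# Hypothetical candidates for the surface form "apple".
candidates = {
    "Apple_Inc": {"computer": 10, "fruit": 1},
    "Apple_(fruit)": {"fruit": 12, "tree": 5},
    "Apple_Corps": {"Beatles": 8, "computer": 1},
}
print(icf("fruit", candidates))    # in 2 of 3 candidates: log(3/2)
print(icf("Beatles", candidates))  # in 1 of 3 candidates: log(3)
```

"Beatles" discriminates between the senses of "apple" far better than "fruit", which ICF captures even though "fruit" might be globally rarer than many words.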
  31. Context-Independent Strategies
      NAIVE: use the surface form to build a URI: "berlin" -> dbpedia:Berlin
      PROMINENCE: P(u) = n(u) / N (the 'popularity'/importance of this URI)
        n(u): number of times URI u occurred; N: total number of occurrences
        Intuition: URIs that have appeared a lot are more likely to appear again.
      DEFAULT SENSE: P(u|s) = n(u,s) / n(s)
        n(u,s): number of times URI u occurred with surface form s
        Intuition: some surface forms are strongly associated with specific URIs.
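The two probability estimates above are simple corpus counts. A minimal sketch over a hypothetical list of (URI, surface form) link occurrences:

```python
from collections import Counter

# Hypothetical link occurrences harvested from a corpus.
links = [
    ("New_York_City", "New York"),
    ("New_York_City", "New York"),
    ("New_York_State", "New York"),
    ("Apple_Corps", "Apple Corps"),
]

n_u = Counter(uri for uri, _ in links)   # n(u)
n_us = Counter(links)                    # n(u, s)
n_s = Counter(sf for _, sf in links)     # n(s)
N = len(links)                           # total occurrences

def prominence(uri):
    """P(u) = n(u) / N"""
    return n_u[uri] / N

def default_sense(uri, surface_form):
    """P(u|s) = n(u,s) / n(s)"""
    return n_us[(uri, surface_form)] / n_s[surface_form]

print(prominence("New_York_City"))                 # 2/4 = 0.5
print(default_sense("New_York_City", "New York"))  # 2/3
```

Note how conditioning on the surface form changes the estimate: the city accounts for half of all links, but two-thirds of the links spelled "New York".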
  32. Linking (Configuration)
      Decide which spots to annotate with links to the disambiguated resources.
      Different use cases have different needs:
      Only annotate prominent resources?
      Only when you are sure the disambiguation is correct?
      Only people? Only things related to Berlin?
  33. Linking in DBpedia Spotlight
      Can be configured based on:
      Thresholds: confidence, prominence (support)
      Whitelist or blacklist of types: hide all people, show only organizations
      Complex definition of a "type" through a SPARQL query.
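The threshold-and-type configuration above amounts to a filter over disambiguated annotations. A sketch of that idea (the field names and values here are hypothetical, not Spotlight's actual API):

```python
def link(annotations, confidence=0.5, support=20, whitelist=None):
    """Keep only annotations meeting the confidence/support thresholds
    and, if a whitelist of types is given, matching at least one type."""
    kept = []
    for a in annotations:
        if a["confidence"] < confidence or a["support"] < support:
            continue
        if whitelist and not (set(a["types"]) & whitelist):
            continue
        kept.append(a["uri"])
    return kept

annotations = [
    {"uri": "Apple_Corps", "confidence": 0.9, "support": 500,
     "types": ["Organisation"]},
    {"uri": "Lennon_(album)", "confidence": 0.3, "support": 40,
     "types": ["Album"]},
]
print(link(annotations, whitelist={"Organisation"}))  # ['Apple_Corps']
```

The low-confidence album annotation is dropped twice over: it fails the confidence threshold and is not on the type whitelist.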
  34. How well did it work?
  35. Evaluation: Disambiguation
      Used held-out (unseen) Wikipedia occurrences as test data.
      Evaluates the accuracy of the disambiguation stage.
      Baselines:
      Random: performs well with low ambiguity.
      Default Sense: prominence only, without context.
      Default Similarity (TF*IDF): Lucene implementation.
  36. Disambiguation Evaluation Results
  37. Evaluation: Annotation
      News text on different topics.
      Examples hand-annotated by 4 annotators; gold standard built from their agreement.
      Evaluates precision and recall of annotations.
  38. Annotation Evaluation Results (2)
  39. Annotation Evaluation Results
  40. Conclusions
      DBpedia Spotlight: a configurable annotation tool supporting a variety of use cases.
      Very simple methods work surprisingly well for disambiguation.
      More work is needed to alleviate sparsity.
      The most challenging step is linking.
      More evaluation on larger annotation datasets is needed.
  41. What are the next steps?
  42. A preview of the next release
      CORS-enabled service + jQuery client
      One line to annotate any web page: $("div").annotate()
      A new demo interface based on the plugin
      Types: DBpedia 3.7, Freebase, Schema.org
      New configuration parameters, e.g. to perform smarter spotting
      Easier install: maven2, jar, Debian package
  43. Preview: temporarily available for I-SEMANTICS 2011
      http://spotlight.dbpedia.org/dev/demo
  44. Future work
      Internationalization (German, Spanish, ...)
      More sophisticated spotting
      New disambiguation strategies
      Global disambiguation: one disambiguation decision helps the others
      Sparsity problems: try smoothing, dimensionality reduction, etc.
      Store user feedback; learn from mistakes
  45. We are open (licensed as Apache v2.0, business-friendly)
      Tell us about your use cases; hack something with us:
      Drupal/WordPress plugin, Semantic MediaWiki integration
      Are you a good engineer? Help us make it faster and smaller!
      Are you a good researcher? Let's collaborate on your/our ideas.
  46. Thank you!
      On Twitter: @pablomendes
      E-mail: pablo.mendes@fu-berlin.de
      Web: http://pablomendes.com
      http://spotlight.dbpedia.org
      Special thanks to Jo Daiber (working with us on the next release).
      Partially funded by LOD2.eu and Neofonie GmbH.
