DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight: a configurable annotation tool to support a variety of use cases. Given input text in English, we extract DBpedia Resources and generate annotations according to user-provided configuration parameters. These parameters can include score thresholds, entity types, and even arbitrary "type" definitions through SPARQL queries.

This is the presentation given at the best paper award session of I-SEMANTICS 2011.

Slide notes:
  • This use case requires merging streaming data with background knowledge (e.g. from DBpedia). Examples of ?category include category:Wi-Fi devices and category:Touchscreen portable media players, amongst others. As a result, without having to elicit all products of interest as keywords to filter a stream, a user is able to leverage relationships in background knowledge to more effectively narrow down the stream of tweets to a subset of interest.
  • $ gunzip -c MostCommon-surfaceForm.count.gz | grep -Pc "\t1$"
    4258908
    $ gunzip -c MostCommon-surfaceForm.count.gz | wc -l
    7244289
    4258908 / 7244289 ≈ 0.588
  • Max = 200,474 (log = 12.2); Min = 1; Mean = 8.343878
  • Lexicalized: uses a list of resource names. These come from titles, redirects, disambiguation pages, and anchor texts.
  • The agreement between individual annotators is:
    Annotator 1 vs. Annotator 2: Kappa = 0.674
    Annotator 1 vs. Annotator 3: Kappa = 0.606
    Annotator 2 vs. Annotator 3: Kappa = 0.577
    Annotator 2 vs. Annotator 4: Kappa = 0.528
    Annotator 1 vs. Annotator 4: Kappa = 0.469
    Annotator 3 vs. Annotator 4: Kappa = 0.385
  • Transcript

    • 1. DBpedia Spotlight: Shedding Light on the Web of Documents
      Pablo N. Mendes, Max Jakob, Andrés Garcia-Silva, Christian Bizer
      pablo.mendes@fu-berlin.de
      I-SEMANTICS, Graz, Austria
      September 9th 2011
    • 2. Agenda
      What is text annotation?
      What can you build with it?
      Why is it difficult?
      How did we approach the challenge?
      How well did it work?
      What are the next steps?
    • 3. What is it?
    • 4. Text Annotation
      From:
      To:
      (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
      (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
      http://dbpedia.org/resource/New_York_City
      http://dbpedia.org/resource/Apple_Corps
    • 5. Challenge: Term Ambiguity
      ...this apple on the palm of my hand...
      ...Apple tried to acquire Palm Inc....
      ...eating an apple sitting by a palm tree...
      What do “apple” and “palm” mean in each case?
      Our objective is to recognize entities and disambiguate their meaning, generating DBpedia annotations in text.
    • 6. What can you do with annotations?
      Links to complementary information
      “More about this”
      Faceted browsing of blog posts
      Show only posts with topics related to Sports
      Rich snippets on Google
      Search engines start to display info from annotations
      More expressive filtering of information streams
      Twarql (entry at I-SEMANTICS 2010 Challenge)
    • 7. Rich Snippets
      Search Engines already benefit from some kinds of annotations
      http://www.google.com/webmasters/tools/richsnippets
    • 8. Twarql Example Use Case
      What competitors of my product are being mentioned with my product on Twitter?
      - comparative opinion!
      SELECT ?competitor
      WHERE {
        dbpedia:IPad skos:subject ?category .
        ?competitor skos:subject ?category .
        ?tweet moat:taggedWith dbpedia:IPad .
        ?tweet moat:taggedWith ?competitor .
      }
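      As a rough sketch of the background-knowledge half of this query (the moat:taggedWith triples are matched against the annotated tweet stream by Twarql itself), one might look up category-sharing products with Python's SPARQLWrapper against the public DBpedia endpoint. The skos:subject predicate follows the slide's 2011-era data; newer DBpedia releases expose categories via dcterms:subject:

      # Hypothetical sketch: products "competing" with dbpedia:IPad, i.e.
      # sharing a Wikipedia category (skos:subject per the 2011-era data).
      from SPARQLWrapper import SPARQLWrapper, JSON

      sparql = SPARQLWrapper("http://dbpedia.org/sparql")
      sparql.setQuery("""
          PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
          PREFIX dbpedia: <http://dbpedia.org/resource/>
          SELECT DISTINCT ?competitor WHERE {
              dbpedia:IPad skos:subject ?category .
              ?competitor skos:subject ?category .
              FILTER (?competitor != dbpedia:IPad)
          } LIMIT 50
      """)
      sparql.setReturnFormat(JSON)
      for row in sparql.query().convert()["results"]["bindings"]:
          print(row["competitor"]["value"])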
    • 9. Twarql Example Use Case (2)
      Incoming microposts… Background Knowledge (e.g. DBpedia)
      [Diagram: a micropost (?tweet) is tagged via moat:taggedWith with dbpedia:IPad; dbpedia:IPad and ?competitor are connected to the same ?category via skos:subject.]
      Competition is modeled as two products in the same category in DBpedia
    • 10. Twarql Example Use Case (3)
      Incoming microposts… Background Knowledge (e.g. DBpedia)
      [Diagram: the same pattern, now with concrete categories (category:Wi-Fi, category:Touchscreen) attached to dbpedia:IPad via skos:subject.]
      Background knowledge is dynamically “brought into” microposts.
    • 11. Twarql Example Use Case (4)
      Background Knowledge (e.g. DBpedia)
      [Diagram: the same graph pattern matched against the micropost stream.]
      Trigger action if micropost matches constraints.
    • 12. DBpedia Spotlight
      DBpedia is a collection of entity descriptions extracted from Wikipedia & shared as linked data
      DBpedia Spotlight uses data from DBpedia and text from associated Wikipedia pages
      Learns how to recognize that a DBpedia resource was mentioned
      Given plain text as input, generates annotated text
    • 13. Why is it difficult?
    • 14. Dataset overview
      Volume of Wikipedia
      56.9 GB of raw text data
      Occurrences of Ambiguous Terms in Wikipedia: 58.8%
      Sparsity: less data for some DBpedia resources
    • 15. Histogram: URI occurrences
      Many “rare” URIs,
      (few links on Wikipedia)
      Most previous work deals with these entities:
      People, Organization, Location
      Few “popular” URIs
      (lots of links on Wikipedia)
      [Histogram; x-axis: log(n(uri))]
    • 16. Histogram: Surface Form Ambiguity
      Many “unambiguous” surface forms
      Max: 1199 (log=7.08)
      Min: 1
      Mean: 1.328949
      Few very “ambiguous” surface forms
      [Histogram; x-axis: log(n(uri,sf))]
    • 17. Ambiguity
      What are the most ambiguous surface forms?
    • 18. Name Variation
      What are the URIs with many surface forms?
    • 19. How did we approach the challenge?
    • 20. A 4-stage approach
      Spotting
      Candidate Mapping
      Disambiguation
      Linking
    • 21. Stage 1: Spotting
      Find substrings that seem worthy of annotation
      Naïve implementation (impractical)
      all n-grams of length (1,|text|)
      Input:
      (…) Upon their return, Lennon and McCartney went to New York
      to announce the formation of Apple Corps.
      Output:
      “Lennon”, “McCartney”, “New York”, “Apple Corps”
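      For concreteness, a minimal sketch of the naive spotter (toy lexicon, hypothetical names): enumerate every n-gram and keep those matching a known surface form. The quadratic number of candidate substrings is why this is impractical at scale:

      def naive_spot(text, surface_forms):
          # Generate all n-grams and keep those found in the lexicon.
          tokens = text.split()
          spots = []
          for i in range(len(tokens)):              # O(|text|^2) n-grams
              for j in range(i + 1, len(tokens) + 1):
                  ngram = " ".join(tokens[i:j])
                  if ngram in surface_forms:
                      spots.append(ngram)
          return spots

      lexicon = {"Lennon", "McCartney", "New York", "Apple Corps"}
      print(naive_spot("Lennon and McCartney went to New York", lexicon))
      # -> ['Lennon', 'McCartney', 'New York']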
    • 22. Spotting in DBpedia Spotlight
      Detect that the label (surface form) of a DBpedia Resource was mentioned
      Lexicalized, Aho-Corasick algorithm (LingPipe)
      Name variations from redirects, disambiguation pages, anchor texts
      Advantages:
      Simple implementation, well-studied problem,
      Produces a reduced set of spots,
      Relies on user-provided terms.
      Drawback:
      High memory requirements (~7 GB)
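      A sketch of the same dictionary-based spotting with the Aho-Corasick algorithm, here via the pyahocorasick package as a stand-in for the LingPipe implementation the slide mentions: one pass over the text finds all lexicon matches, but the automaton built over millions of surface forms is what drives the memory cost:

      import ahocorasick  # pip install pyahocorasick

      automaton = ahocorasick.Automaton()
      for surface_form in ["Lennon", "McCartney", "New York", "Apple Corps"]:
          automaton.add_word(surface_form, surface_form)
      automaton.make_automaton()  # compile the Aho-Corasick transitions

      text = "Upon their return, Lennon and McCartney went to New York"
      for end_index, surface_form in automaton.iter(text):
          print(end_index - len(surface_form) + 1, surface_form)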
    • 23. Stage 2: Candidate Mapping
      What are the possible senses of a given surface form (the candidate DBpedia resources)?
      Input:
      “Lennon”, “McCartney”, “New York”, “Apple Corps”
      Output:
      “Lennon”: { Lennon_(album), Lennon,_Michigan, … }
      “McCartney”: { McCartney_(surname), Paul_McCartney, … }
      “New York”: { New_York_State, New_York_City, … }
      “Apple Corps”: { Apple_Corps }
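      Operationally, the candidate map is little more than a dictionary from surface form to the set of DBpedia resources it may denote, merged from titles, redirects, and disambiguation pages (next slide); a sketch with a toy extract:

      from collections import defaultdict

      # surface form -> candidate DBpedia resources (toy extract).
      candidates = defaultdict(set)
      for surface_form, uri in [
          ("Lennon", "Lennon_(album)"), ("Lennon", "Lennon,_Michigan"),
          ("McCartney", "McCartney_(surname)"), ("McCartney", "Paul_McCartney"),
          ("New York", "New_York_State"), ("New York", "New_York_City"),
          ("Apple Corps", "Apple_Corps"),
      ]:
          candidates[surface_form].add(uri)

      print(sorted(candidates["Lennon"]))  # ['Lennon,_Michigan', 'Lennon_(album)']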
    • 24. Candidate Mapping in DBpedia Spotlight
      Sources of mappings between surface forms and DBpedia Resources
      Page titles offer “chosen names” for resources
      Redirects offer alternative spellings, aliases, etc.
      Disambiguation Pages: link a common term to many resources
    • 25. Candidate Map: Disambiguation Pages
      Collectively provide a list of ambiguous terms and meanings for each
    • 26. Candidate Map: Redirects
      AAPL
      Apple (Company)
      Apple (Computers)
      Apple (company)
      Apple (computer)
      Apple Company
      Apple Computer
      Apple Computer Co.
      Apple Computer Inc.
      Apple Computer Incorporated
      Apple Computer, Inc
      Apple Computer, Inc.
      Apple Computers
      Apple Inc
      Apple Incorporate
      Apple Incorporated
      Apple India
      Apple comp
      Apple compputer
      Apple computer
      Apple computer Inc
      Apple computers
      Apple inc
      Apple inc.
      Apple incoporated
      Apple incorporated
      Apple pc
      Apple's
      Apple, Inc
      Apple, Inc.
      Apple,inc.
      Apple.com
      AppleComputer
      Bowman Bank
      Cripple Inc.
      Inc. Apple Computer
      Jobs and Wozniak
      Option-Shift-K
       Inc.
      All of these redirect to Apple_Inc
    • 27. Stage 3: Disambiguation
      Select the correct candidate DBpedia Resource for a given surface form.
      The decision is made based on the context(1) in which the surface form was mentioned
      con·text, n.
      1. The parts of a discourse that surround a word or passage and can throw light on its meaning.
      2. The circumstances in which an event occurs; a setting.
      http://mw1.merriam-webster.com/dictionary/context
    • 28. Learning the Context for a resource
      Collect context for DBpedia Resources from Wikipedia
      Types of context
      Wikipedia Pages
      Definitions from disambiguation pages
      Paragraphs that link to resources
      (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
    • 29. Disambiguation in DBpedia Spotlight
      Model DBpedia Resources as vectors of terms found in Wikipedia text
      Define functions for term scoring and vector similarity (e.g. frequency and cosine)
      Rank candidate resource vectors based on their similarity with vector of input text
      Choose highest ranking candidate
      Lennon = {Beatles,McCartney,rock,guitar,...}
      Lennon = {tf(Beatles)=320,tf(McCartney)=100,...}
      Cos(Input,Lennon) = 0.12
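      A minimal sketch of this ranking step, with toy term-frequency vectors and hypothetical candidates (the real system weights terms with TF*ICF, next slide): score each candidate by cosine similarity to the input context and keep the best one:

      import math
      from collections import Counter

      def cosine(a, b):
          # Cosine similarity between two sparse term-frequency vectors.
          dot = sum(a[t] * b.get(t, 0) for t in a)
          norm_a = math.sqrt(sum(v * v for v in a.values()))
          norm_b = math.sqrt(sum(v * v for v in b.values()))
          return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

      context = Counter("Beatles announce the formation of Apple Corps".split())
      candidates = {
          "John_Lennon":      Counter({"Beatles": 320, "McCartney": 100}),
          "Lennon,_Michigan": Counter({"village": 40, "county": 25}),
      }
      print(max(candidates, key=lambda uri: cosine(context, candidates[uri])))
      # -> John_Lennon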
    • 30. Scoring Strategies
      TF*IDF (Term Freq. * Inverse Doc. Freq.)
      TF: insight into the relevance of the term in the context of a DBpedia Resource
      IDF: insight into the rarity of the term. Co-occurrence of rare terms is more informative
      ICF: Inverse Candidate Frequency
      IDF is the “rarity” in the entire Wikipedia
      ICF is the rarity of a word with relation to the possible senses only
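      Assuming the definition ICF(w) = log(|Rs| / n(w)), with Rs the set of candidate resources for the current surface form and n(w) the number of those candidates whose context contains w (this matches the paper's formulation up to the log base), a word shared by every candidate gets zero discriminating weight:

      import math

      def icf(word, candidate_contexts):
          # Inverse Candidate Frequency over the candidate set only:
          # log(#candidates / #candidates whose context contains `word`).
          containing = sum(1 for ctx in candidate_contexts if word in ctx)
          return math.log(len(candidate_contexts) / containing) if containing else 0.0

      contexts = [{"Beatles", "rock"}, {"village", "county"}, {"Beatles", "album"}]
      print(icf("village", contexts))  # rare among candidates: log(3) ≈ 1.10
      print(icf("Beatles", contexts))  # shared by 2 of 3: log(1.5) ≈ 0.41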
    • 31. Context-Independent Strategies
      NAÏVE
      Use surface form to build URI: “berlin” -> dbpedia:Berlin
      PROMINENCE
      P(u): n(u) / N (what is the ‘popularity’/importance of this URI)
      n(u): number of times URI u occurred
      N: total number of occurrences
      Intuition: URIs that have appeared a lot are more likely to appear again
      DEFAULT SENSE
      P(u|s): n(u,s) / n(s)
      n(u,s): number of times URI u occurred with surface form s
      Intuition: some surface forms are strongly associated with specific URIs
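      A sketch of the DEFAULT SENSE baseline with hypothetical link counts: estimate P(u|s) = n(u,s) / n(s) from how often each URI was the link target when the surface form appeared as anchor text, and pick the argmax:

      # Hypothetical Wikipedia link counts for the surface form "apple": n(u,s)
      link_counts = {"Apple_Inc": 500, "Apple": 300, "Apple_Corps": 20}
      n_s = sum(link_counts.values())  # n(s): all occurrences of the surface form
      default_sense = max(link_counts, key=link_counts.get)
      print(default_sense, link_counts[default_sense] / n_s)  # Apple_Inc ≈ 0.61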
    • 32. Linking (Configuration)
      Decide which spots to annotate with links to the disambiguated resources
      Different use cases have different needs
      Only annotate prominent resources?
      Only if you’re sure disambiguation is correct?
      Only people?
      Only things related to Berlin?
    • 33. Linking in DBpedia Spotlight
      Can be configured based on:
      Thresholds
      Confidence
      Prominence (support)
      Whitelist or Blacklist of types
      Hide all people, Show only organizations
      Complex definition of a “type” through a SPARQL query.
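      For illustration, these knobs correspond to parameters of the Spotlight web service as documented around this release; a hedged sketch with Python's requests (endpoint URL, parameter names, and the JSON shape are assumptions that may have changed since 2011):

      import requests

      # Annotate text, keeping only confident, prominent Person/Organisation spots.
      response = requests.get(
          "http://spotlight.dbpedia.org/rest/annotate",  # assumed 2011-era endpoint
          params={
              "text": "Lennon and McCartney went to New York.",
              "confidence": 0.4,   # threshold on disambiguation confidence
              "support": 20,       # prominence: minimum number of inlinks
              "types": "Person,Organisation",  # whitelist of DBpedia types
          },
          headers={"Accept": "application/json"},
      )
      for resource in response.json().get("Resources", []):
          print(resource["@surfaceForm"], resource["@URI"])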
    • 34. How well did it work?
    • 35. Evaluation: Disambiguation
      Used held-out (unseen) Wikipedia occurrences as test data
      Evaluates accuracy of disambiguation stage
      Baselines
      Random: performs well with low ambiguity
      Default Sense: only prominence, without context
      Default Similarity (TF*IDF) : Lucene implementation
    • 36. Disambiguation Evaluation Results
    • 37. Evaluation: Annotation
      News text, different topics
      Hand-annotated examples by 4 annotators
      Gold standard from agreement
      Evaluates precision and recall of annotations.
    • 38. Annotation Evaluation Results (2)
    • 39. Annotation Evaluation Results
    • 40. Conclusions
      DBpedia Spotlight: a configurable annotation tool to support a variety of use cases
      Very simple methods work surprisingly well for disambiguation
      More work is needed to alleviate sparsity
      Most challenging step is linking
      More evaluation on larger annotation datasets is needed
    • 41. What are the next steps?
    • 42. A preview of next release
      CORS-enabled + jQuery client
      One line to annotate any web page: $("div").annotate()
      A new demo interface: based on the plugin
      Types: DBpedia 3.7, Freebase, Schema.org
      New configuration parameters
      E.g. perform smarter spotting
      Easier install: Maven2, JAR, Debian package
    • 43. Preview:
      Temporarily available for I-SEMANTICS 2011
      http://spotlight.dbpedia.org/dev/demo
    • 44. Future work
      Internationalization (German, Spanish,...)
      More sophisticated spotting
      New disambiguation strategies
      Global disambiguation: one disambiguation decision helps the other decisions
      Sparsity problems: try smoothing, dimensionality reduction, etc.
      Store user feedback, learn from mistakes
    • 45. We are open
      Tell us about your use cases
      Hack something with us
      Drupal/WordPress Plugin
      Semantic Media Wiki integration
      Are you a good engineer?
      Help us make it faster, smaller!
      Are you a good researcher?
      Let’s collaborate on your/our ideas.
      Licensed as Apache v2.0
      (Business friendly)
    • 46. Thank you!
      On Twitter: @pablomendes
      E-mail: pablo.mendes@fu-berlin.de
      Web: http://pablomendes.com
      Special thanks to Jo Daiber (working with us for the next release)
      Partially funded by LOD2.eu and Neofonie GmbH
      http://spotlight.dbpedia.org