DBpedia Spotlight at I-SEMANTICS 2011
 


DBpedia Spotlight: a configurable annotation tool to support a variety of use cases. Given input text in English, we extract DBpedia Resources and generate annotations according to user-provided configuration parameters. These parameters can include score thresholds, entity types, and even arbitrary "type" definitions through SPARQL queries.

This presentation was given at the best paper award session at I-SEMANTICS 2011.

  • This use case requires merging streaming data with background knowledge (e.g. from DBpedia). Examples of ?category include category:Wi-Fi devices and category:Touchscreen portable media players, amongst others. As a result, without having to list all products of interest as keywords to filter a stream, a user can leverage relationships in background knowledge to narrow down the stream of tweets to a subset of interest more effectively.
  • $ gunzip -c MostCommon-surfaceForm.count.gz | grep -Pc "\\t1$"
    4258908
    $ gunzip -c MostCommon-surfaceForm.count.gz | wc -l
    7244289
    4258908 / 7244289 = 0.58789868819424514952399055311018
  • Max = 200,474 (log = 12.2)
    Min = 1
    Mean = 8.343878
  • Lexicalized: uses a list of resource names
    Comes from titles, redirects, disambiguation pages, anchor texts
  • The agreement between individual annotators is:
    Annotator 1 vs Annotator 2 (Kappa = 0.674)
    Annotator 1 vs Annotator 3 (Kappa = 0.606)
    Annotator 2 vs Annotator 3 (Kappa = 0.577)
    Annotator 2 vs Annotator 4 (Kappa = 0.528)
    Annotator 1 vs Annotator 4 (Kappa = 0.469)
    Annotator 3 vs Annotator 4 (Kappa = 0.385)

Presentation Transcript

  • DBpedia Spotlight: Shedding Light on the Web of Documents
    Pablo N. Mendes, Max Jakob, Andrés Garcia-Silva, Christian Bizer
    pablo.mendes@fu-berlin.de
    I-SEMANTICS, Graz, Austria
    September 9th 2011
  • Agenda
    What is text annotation?
    What can you build with it?
    Why is it difficult?
    How did we approach the challenge?
    How well did it work?
    What are the next steps?
    Mendes, Jakob, Garcia-Silva, Bizer. DBpedia Spotlight: Shedding Light on the Web of Documents
  • What is it?
  • Text Annotation
    From:
    (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
    To:
    (…) Upon their return, Lennon and McCartney went to New York [http://dbpedia.org/resource/New_York_City] to announce the formation of Apple Corps [http://dbpedia.org/resource/Apple_Corps].
  • Challenge: Term Ambiguity
    ...this apple on the palm of my hand...
    ...Apple tried to acquire Palm Inc....
    ...eating an apple sitting by a palm tree...
    What do “apple” and “palm” mean in each case?
    Our objective is to recognize entities and disambiguate their meaning, generating DBpedia annotations in text.
  • What can you do with annotations?
    Links to complementary information
    “More about this”
    Faceted browsing of blog posts
    Show only posts with topics related to Sports
    Rich snippets on Google
    Search engines start to display info from annotations
    More expressive filtering of information streams
    Twarql (entry at I-SEMANTICS 2010 Challenge)
  • Rich Snippets
    Search Engines already benefit from some kinds of annotations
    http://www.google.com/webmasters/tools/richsnippets
  • Twarql Example Use Case
    What competitors of my product are being mentioned with my product on Twitter?
    - comparative opinion!
    SELECT ?competitor
    WHERE {
      dbpedia:IPad skos:subject ?category .
      ?competitor skos:subject ?category .
      ?tweet moat:taggedWith ?competitor .
      ?tweet moat:taggedWith dbpedia:IPad .
    }
  • Twarql Example Use Case (2)
    [Diagram: incoming microposts (an example tweet from @anonymized) are connected to background knowledge (e.g. DBpedia) through the graph pattern: dbpedia:IPad skos:subject ?category; ?competitor skos:subject ?category; ?tweet moat:taggedWith ?competitor]
    Competition is modeled as two products in the same category in DBpedia
  • Twarql Example Use Case (3)
    [Diagram: the same graph pattern, with ?category bound to category:Wi-Fi and category:Touchscreen]
    Background knowledge is dynamically “brought into” microposts.
  • Twarql Example Use Case (4)
    [Diagram: the same graph pattern, matched against the example tweet]
    Trigger action if micropost matches constraints.
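The competitor pattern from the Twarql use case can be sketched in plain Python as a join between tagged tweets and category data. This is only an illustration under stated assumptions: the `CATEGORIES` table, the resources `dbpedia:ExampleTablet` and `dbpedia:ExampleToaster`, the tweets, and both function names are made up for the example. Twarql itself evaluates SPARQL over a stream, not Python dictionaries.

```python
# Hypothetical background knowledge: resource -> categories (skos:subject).
CATEGORIES = {
    "dbpedia:IPad": {"category:Wi-Fi", "category:Touchscreen"},
    "dbpedia:ExampleTablet": {"category:Touchscreen"},  # made-up resource
    "dbpedia:ExampleToaster": {"category:Kitchen"},     # made-up resource
}


def competitors_of(product, categories=CATEGORIES):
    """Resources that share at least one category with `product`.

    Unlike the raw graph pattern, the product itself is excluded.
    """
    own = categories.get(product, set())
    return {
        other
        for other, cats in categories.items()
        if other != product and own & cats
    }


def matching_tweets(tweets, product, categories=CATEGORIES):
    """Yield (tweet_id, competitor) for tweets tagged with the product and a competitor."""
    rivals = competitors_of(product, categories)
    for tweet_id, tags in tweets:
        if product in tags:
            for rival in sorted(rivals & set(tags)):
                yield tweet_id, rival


# Hypothetical stream of (tweet id, moat:taggedWith annotations).
TWEETS = [
    ("t1", {"dbpedia:IPad", "dbpedia:ExampleTablet"}),
    ("t2", {"dbpedia:IPad"}),
    ("t3", {"dbpedia:ExampleTablet", "dbpedia:ExampleToaster"}),
]

print(list(matching_tweets(TWEETS, "dbpedia:IPad")))
```

Only tweet t1 mentions the product together with a resource from one of its categories, so it is the only one that would trigger an action.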
  • DBpedia Spotlight
    DBpedia is a collection of entity descriptions extracted from Wikipedia & shared as linked data
    DBpedia Spotlight uses data from DBpedia and text from associated Wikipedia pages
    Learns how to recognize that a DBpedia resource was mentioned
    Given plain text as input, generates annotated text
  • Why is it difficult?
  • Dataset overview
    Volume of Wikipedia
    56.9 GB of raw text data
    Occurrences of Ambiguous Terms in Wikipedia: 58.8%
    Sparsity: less data for some DBpedia resources
  • Histogram: URI occurrences
    [Histogram of URI occurrences; x-axis: log(n(uri))]
    Many “rare” URIs (few links on Wikipedia)
    Most previous work deals with these entities: People, Organization, Location
    Few “popular” URIs (lots of links on Wikipedia)
  • Histogram: Surface Form Ambiguity
    [Histogram of surface form ambiguity; x-axis: log(n(uri,sf))]
    Many “unambiguous” surface forms
    Few very “ambiguous” surface forms
    Max: 1199 (log=7.08)
    Min: 1
    Mean: 1.328949
  • Ambiguity
    What are the most ambiguous surface forms?
  • Name Variation
    What are the URIs with many surface forms?
  • How did we approach the challenge?
  • A 4-stage approach
    Spotting
    Candidate Mapping
    Disambiguation
    Linking
  • Stage 1: Spotting
    Find substrings that seem worthy of annotation
    Naïve implementation (impractical): all n-grams of length 1 to |text|
    Input:
    (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
    Output:
    “Lennon”, “McCartney”, “New York”, “Apple Corps”
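To see why the naïve spotter is impractical, here is a minimal sketch that enumerates every n-gram of a short input. It assumes word-level tokens split on whitespace (the slide does not say whether n-grams are over words or characters), and the function name `all_ngrams` is made up for the illustration.

```python
def all_ngrams(text):
    """Return every contiguous token sequence of length 1..len(tokens)."""
    tokens = text.split()
    return [
        " ".join(tokens[start:start + size])
        for size in range(1, len(tokens) + 1)
        for start in range(len(tokens) - size + 1)
    ]


sentence = "Lennon and McCartney went to New York"
candidates = all_ngrams(sentence)

# A text of n tokens yields n * (n + 1) / 2 candidate spots.
print(len(candidates))
print("New York" in candidates)
```

The number of candidates grows quadratically with the text length, and almost all of them (for example "and McCartney went") are not worth annotating.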
  • Spotting in DBpedia Spotlight
    Detect that the label (surface form) of a DBpedia Resource was mentioned
    Lexicalized, Aho-Corasick algorithm (LingPipe)
    Name variations from redirects, disambiguation pages, anchor texts
    Advantages:
    Simple implementation, well studied problem,
    Produces a reduced set of spots,
    Relies on user provided terms.
    Drawback:
    High memory requirements (~7 GB)
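Lexicon-based spotting can be illustrated with a small token trie and a greedy leftmost-longest match. This is a simplified stand-in, not the Aho-Corasick automaton or the LingPipe implementation named on the slide; the lexicon contents and the names `build_trie` and `spot` are assumptions made for the sketch.

```python
def build_trie(surface_forms):
    """Build a token-level trie; the key None marks the end of a surface form."""
    root = {}
    for form in surface_forms:
        node = root
        for token in form.split():
            node = node.setdefault(token, {})
        node[None] = form
    return root


def spot(text, trie):
    """Scan left to right, keeping the longest lexicon match at each position."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    spots, i = [], 0
    while i < len(tokens):
        node, match, match_end = trie, None, i
        j = i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:  # a complete surface form ends here
                match, match_end = node[None], j
        if match is not None:
            spots.append(match)
            i = match_end  # continue after the matched span
        else:
            i += 1
    return spots


# Hypothetical lexicon of surface forms.
LEXICON = ["Lennon", "McCartney", "New York", "New York City", "Apple Corps"]
text = ("Upon their return, Lennon and McCartney went to New York "
        "to announce the formation of Apple Corps.")
print(spot(text, build_trie(LEXICON)))
```

Because only known surface forms can match, the output is a much smaller set of spots than the full n-gram enumeration, at the cost of holding the whole lexicon in memory.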
  • Stage 2: Candidate Mapping
    What are the possible senses of a given surface form (the candidate DBpedia resources)?
    Input:
    “Lennon”, “McCartney”, “New York”, “Apple Corps”
    Output:
    “Lennon”: { Lennon_(album), Lennon,_Michigan, … }
    “McCartney”: { McCartney_(surname), Paul_McCartney, … }
    “New York”: { New_York_State, New_York_City, … }
    “Apple Corps”: { Apple_Corps }
  • Candidate Mapping in DBpedia Spotlight
    Sources of mappings between surface forms and DBpedia Resources
    Page titles offer “chosen names” for resources
    Redirects offer alternative spellings, aliases, etc.
    Disambiguation Pages: link a common term to many resources
  • Candidate Map: Disambiguation Pages
    Collectively provide a list of ambiguous terms and meanings for each
  • Candidate Map: Redirects
    AAPL
    Apple (Company)
    Apple (Computers)
    Apple (company)
    Apple (computer)
    Apple Company
    Apple Computer
    Apple Computer Co.
    Apple Computer Inc.
    Apple Computer Incorporated
    Apple Computer, Inc
    Apple Computer, Inc.
    Apple Computers
    Apple Inc
    Apple Incorporate
    Apple Incorporated
    Apple India
    Apple comp
    Apple compputer
    Apple computer
    Apple computer Inc
    Apple computers
    Apple inc
    Apple inc.
    Apple incoporated
    Apple incorporated
    Apple pc
    Apple's
    Apple, Inc
    Apple, Inc.
    Apple,inc.
    Apple.com
    AppleComputer
    Bowman Bank
    Cripple Inc.
    Inc. Apple Computer
    Jobs and Wozniak
    Option-Shift-K
     Inc.
    All of the above redirect to: Apple_Inc
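The three sources of mappings (page titles, redirects, disambiguation pages) can be merged into one candidate map. The sketch below assumes tiny hand-written inputs: the redirect and "New York" entries follow the slides, while the "Apple" disambiguation entry and the function name `build_candidate_map` are made up for the illustration.

```python
from collections import defaultdict


def build_candidate_map(titles, redirects, disambiguations):
    """Map each surface form to the set of DBpedia resources it may refer to."""
    candidates = defaultdict(set)
    for uri in titles:
        # A page title is the resource name with underscores shown as spaces.
        candidates[uri.replace("_", " ")].add(uri)
    for alias, uri in redirects.items():
        candidates[alias].add(uri)
    for term, uris in disambiguations.items():
        candidates[term].update(uris)
    return candidates


TITLES = ["Apple_Inc", "Apple_Corps", "New_York_City"]
REDIRECTS = {"AAPL": "Apple_Inc", "Apple Computer": "Apple_Inc"}
DISAMBIGUATIONS = {
    "Apple": ["Apple_Inc", "Apple_Corps"],  # hypothetical entry
    "New York": ["New_York_State", "New_York_City"],
}

candidate_map = build_candidate_map(TITLES, REDIRECTS, DISAMBIGUATIONS)
for surface_form in ["AAPL", "Apple", "New York", "Apple Corps"]:
    print(surface_form, "->", sorted(candidate_map[surface_form]))
```

Surface forms with a single candidate, such as "AAPL", need no disambiguation; the others are passed on to the next stage.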
  • Stage 3: Disambiguation
    Select the correct candidate DBpedia Resource for a given surface form.
    The decision is based on the context (sense 1 below) in which the surface form was mentioned
    con·text (n.)
    1. The parts of a discourse that surround a word or passage and can throw light on its meaning
    2. The circumstances in which an event occurs; a setting.
    Source: http://mw1.merriam-webster.com/dictionary/context
  • Learning the Context for a resource
    Collect context for DBpedia Resources from Wikipedia
    Types of context
    Wikipedia Pages
    Definitions from disambiguation pages
    Paragraphs that link to resources
    (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.
  • Disambiguation in DBpedia Spotlight
    Model DBpedia Resources as vectors of terms found in Wikipedia text
    Define functions for term scoring and vector similarity (e.g. frequency and cosine)
    Rank candidate resource vectors based on their similarity with the vector of the input text
    Choose the highest-ranking candidate
    Example:
    Lennon = {Beatles, McCartney, rock, guitar, ...}
    Lennon = {tf(Beatles)=320, tf(McCartney)=100, ...}
    Cos(Input, Lennon) = 0.12
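The vector-space ranking described on this slide can be sketched with raw term frequencies and cosine similarity. The two term frequencies for Beatles and McCartney come from the slide; every other number, the candidate `Lennon,_Michigan` vector, and the function names are assumptions for the illustration, and the real system uses a Lucene-based index rather than Python dictionaries.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(input_text, candidate_vectors):
    """Return (score, uri) pairs, best first, for the given input context."""
    context = Counter(input_text.lower().split())
    scored = [(cosine(context, vec), uri) for uri, vec in candidate_vectors.items()]
    return sorted(scored, reverse=True)


# Hypothetical context vectors for two candidates of the surface form "Lennon".
CANDIDATES = {
    "John_Lennon": {"beatles": 320, "mccartney": 100, "rock": 80, "guitar": 60},
    "Lennon,_Michigan": {"michigan": 50, "village": 30, "county": 20},
}

ranking = rank_candidates(
    "lennon and mccartney of the beatles went to new york", CANDIDATES
)
best_score, best_uri = ranking[0]
print(best_uri)
```

The input shares terms only with the first candidate, so the second scores zero and the first is chosen.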
  • Scoring Strategies
    TF*IDF (Term Freq. * Inverse Doc. Freq.)
    TF: insight into the relevance of the term in the context of a DBpedia Resource
    IDF: insight into the rarity of the term. Co-occurrence of rare terms is more informative
    ICF: Inverse Candidate Frequency
    IDF is the “rarity” in the entire Wikipedia
    ICF is the rarity of a word relative to the possible senses (candidates) only
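A small worked example may help separate ICF from IDF. The sketch assumes a log-ratio form analogous to IDF, ICF(w) = log(|R_s| / n(w)), where R_s is the set of candidates for surface form s and n(w) is the number of those candidates whose context contains w; the slide does not spell out the formula, so treat this form as an assumption. The candidate contexts are made up.

```python
import math


def icf(term, candidate_contexts):
    """Inverse Candidate Frequency, assuming the form log(|R_s| / n(term))."""
    n = sum(1 for terms in candidate_contexts.values() if term in terms)
    if n == 0:
        return 0.0  # the term gives no evidence for any candidate
    return math.log(len(candidate_contexts) / n)


# Hypothetical context terms for three candidates of the surface form "apple".
CONTEXTS = {
    "Apple_Inc": {"company", "computer", "the"},
    "Apple_Corps": {"company", "beatles", "records", "the"},
    "Apple_(fruit)": {"fruit", "tree", "the"},
}

for term in ["the", "company", "beatles"]:
    print(term, round(icf(term, CONTEXTS), 3))
```

A word that appears with every candidate gets weight zero, while a word tied to a single candidate gets the highest weight, regardless of how common it is in Wikipedia as a whole.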
  • Context-Independent Strategies
    NAÏVE
    Use surface form to build URI: “berlin” -> dbpedia:Berlin
    PROMINENCE
    P(u) = n(u) / N (the ‘popularity’/importance of this URI)
    n(u): number of times URI u occurred
    N: total number of occurrences
    Intuition: URIs that have appeared a lot are more likely to appear again
    DEFAULT SENSE
    P(u|s) = n(u,s) / n(s)
    n(u,s): number of times URI u occurred with surface form s
    n(s): number of times surface form s occurred
    Intuition: some surface forms are strongly associated with specific URIs
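The two count-based baselines are easy to compute from a list of (surface form, URI) occurrences. The occurrence counts below are invented for the illustration, as are the function names.

```python
from collections import Counter

# Hypothetical (surface form, URI) occurrences, e.g. harvested from wiki links.
OCCURRENCES = (
    [("New York", "New_York_City")] * 6
    + [("New York", "New_York_State")] * 2
    + [("NYC", "New_York_City")] * 2
)


def prominence(occurrences):
    """P(u) = n(u) / N for every URI u."""
    n_u = Counter(uri for _, uri in occurrences)
    total = len(occurrences)
    return {uri: count / total for uri, count in n_u.items()}


def default_sense(occurrences):
    """P(u|s) = n(u,s) / n(s) for every observed (URI, surface form) pair."""
    n_us = Counter(occurrences)
    n_s = Counter(sf for sf, _ in occurrences)
    return {(uri, sf): count / n_s[sf] for (sf, uri), count in n_us.items()}


p_u = prominence(OCCURRENCES)
p_u_given_s = default_sense(OCCURRENCES)
print(p_u["New_York_City"])                        # 8 of 10 occurrences
print(p_u_given_s[("New_York_City", "New York")])  # 6 of 8 uses of "New York"
```

Neither score looks at the surrounding text, which is why both serve as context-independent baselines.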
  • Linking (Configuration)
    Decide which spots to annotate with links to the disambiguated resources
    Different use cases have different needs
    Only annotate prominent resources?
    Only if you’re sure disambiguation is correct?
    Only people?
    Only things related to Berlin?
  • Linking in DBpedia Spotlight
    Can be configured based on:
    Thresholds
    Confidence
    Prominence (support)
    Whitelist or Blacklist of types
    Hide all people, Show only organizations
    Complex definition of a “type” through a SPARQL query.
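The linking configuration amounts to a filter over disambiguated annotations. The sketch below is a rough illustration only: the `Annotation` fields, the `link` function and its parameter names, and all scores and type labels are assumptions, not the tool's actual API. Type definitions through SPARQL queries are not modeled; types are plain string sets here.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    surface_form: str
    uri: str
    confidence: float  # how sure the disambiguation is
    support: int       # prominence of the resource
    types: set = field(default_factory=set)


def link(annotations, min_confidence=0.0, min_support=0,
         whitelist=None, blacklist=None):
    """Keep only the annotations that satisfy the user's configuration."""
    kept = []
    for a in annotations:
        if a.confidence < min_confidence or a.support < min_support:
            continue
        if whitelist is not None and not (a.types & whitelist):
            continue
        if blacklist and (a.types & blacklist):
            continue
        kept.append(a)
    return kept


# Hypothetical disambiguation output with made-up scores.
ANNOTATIONS = [
    Annotation("Lennon", "John_Lennon", 0.9, 5000, {"Person"}),
    Annotation("New York", "New_York_City", 0.6, 20000, {"Place"}),
    Annotation("Apple Corps", "Apple_Corps", 0.4, 300, {"Organisation"}),
]

print([a.uri for a in link(ANNOTATIONS, min_confidence=0.5)])
print([a.uri for a in link(ANNOTATIONS, blacklist={"Person"})])
print([a.uri for a in link(ANNOTATIONS, whitelist={"Organisation"}, min_support=100)])
```

The same disambiguation output therefore serves different use cases: "only if you're sure", "hide all people", or "show only organizations".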
  • How well did it work?
  • Evaluation: Disambiguation
    Used held out (unseen) Wikipedia occurrences as test data
    Evaluates accuracy of disambiguation stage
    Baselines
    Random: performs well with low ambiguity
    Default Sense: only prominence, without context
    Default Similarity (TF*IDF) : Lucene implementation
  • Disambiguation Evaluation Results
  • Evaluation: Annotation
    News text, different topics
    Hand-annotated examples by 4 annotators
    Gold standard from agreement
    Evaluates precision and recall of annotations.
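Precision and recall against a gold standard can be computed by treating each annotation as a (surface form, URI) pair. The gold and system annotations below are invented for the illustration; how the actual evaluation matched annotations is not stated on the slide.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted annotations against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and system output.
GOLD = {
    ("Lennon", "John_Lennon"),
    ("New York", "New_York_City"),
    ("Apple Corps", "Apple_Corps"),
}
PREDICTED = {
    ("New York", "New_York_State"),  # spotted, but wrongly disambiguated
    ("Apple Corps", "Apple_Corps"),
}

precision, recall = precision_recall(PREDICTED, GOLD)
print(precision, recall)
```

Raising the confidence threshold typically removes doubtful annotations, which tends to trade recall for precision.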
  • Annotation Evaluation Results (2)
  • Annotation Evaluation Results
  • Conclusions
    DBpedia Spotlight: a configurable annotation tool to support a variety of use cases
    Very simple methods work surprisingly well for disambiguation
    More work is needed to alleviate sparsity
    Most challenging step is linking
    More evaluation on larger annotation datasets is needed
  • What are the next steps?
  • A preview of next release
    CORS-enabled + jQuery client
    One line to annotate any web page: $("div").annotate()
    A new demo interface: based on the plugin
    Types: DBpedia 3.7, Freebase, Schema.org
    New configuration parameters
    E.g. perform smarter spotting
    Easier install: maven2, jar, Debian package
  • Preview:
    Temporarily available for I-SEMANTICS 2011
    http://spotlight.dbpedia.org/dev/demo
  • Future work
    Internationalization (German, Spanish,...)
    More sophisticated spotting
    New disambiguation strategies
    Global disambiguation: one disambiguation decision helps the other decisions
    Sparsity problems: try smoothing, dimensionality reduction, etc.
    Store user feedback, learn from mistakes
  • We are open
    Tell us about your use cases
    Hack something with us
    Drupal/Wordpress Plugin
    Semantic MediaWiki integration
    Are you a good engineer?
    Help us make it faster, smaller!
    Are you a good researcher?
    Let’s collaborate on your/our ideas.
    Licensed as Apache v2.0
    (Business friendly)
  • Thank you!
    On Twitter: @pablomendes
    E-mail: pablo.mendes@fu-berlin.de
    Web: http://pablomendes.com
    Special thanks to Jo Daiber (working with us for the next release)
    Partially funded by LOD2.eu and Neofonie GmbH
    http://spotlight.dbpedia.org