Creating Knowledge out of Interlinked Data




LOD2 Webinar . 26.02.2013 . Page 1                     http://lod2.eu

        LOD2 is a large-scale integrating project co-funded by the European
        Commission within the FP7 Information and Communication Technologies
        Work Programme. This 4-year project comprises leading Linked Open
        Data technology researchers, companies, and service providers. The
        partners come from 12 countries and are coordinated by the Agile
        Knowledge Engineering and Semantic Web Research Group at the
        University of Leipzig, Germany.

        LOD2 will integrate and syndicate Linked Data with existing large-scale
        applications. The project shows the benefits in the scenarios of Media and
        Publishing, Corporate Data intranets and eGovernment.




        Once a month, the LOD2 webinar series offers a free webinar about
        tools and services along the Linked Open Data Life Cycle.

        Stay with us and learn more about acquisition, editing, composing,
        connected applications – and finally publishing Linked Open Data.




Agenda


    Profiles: Pablo N. Mendes and the DBpedia Spotlight team

    Linked Data life cycle and role of DBpedia Spotlight within LOD2

    What is DBpedia Spotlight

    Demonstration

    Lessons Learned and Next steps

    Q&A




Pablo N. Mendes and the DBpedia Spotlight team

       Pablo N. Mendes
       Research Associate at the Open Knowledge Foundation, Germany
       http://okfn.de
       Interests: Information Extraction, Integration, Retrieval and Exploration
       More info: http://pablomendes.com

       Co-maintainers
       - Max Jakob (Neofonie GmbH)
       - Joachim Daiber (MS student at the Rijksuniversiteit Groningen)

       Contributors
       - Sandro Coelho (BS student at UFJF, Brazil)
       - Chris Hokamp (PhD student at University of North Texas, USA)
       - Dirk Weissenborn (MS student at University of Dresden, Germany)
       - Liu Zhengzhong (now PhD student at Carnegie Mellon University, USA)
       - Marcus Nitschke (student at U. Leipzig)
       - ...
       Full list on GitHub.

       Funding
       LOD2, DICODE, Google Summer of Code 2012, IKS

       Hosting
       U. Mannheim, MTA SZTAKI, Globo.com, RNP.br
Linked Data Life Cycle



    [Figure: Linked Data life cycle diagram. Stages: Storage/Querying; Manual
    revision/authoring; Interlinking/Fusing; Classification/Enrichment; Quality
    Analysis; Evolution/Repair; Search/Browsing/Exploration; Extraction.]

                    Shedding Light on the Web of Documents




Named Entity Recognition/Disambiguation
• Automatically add Wikipedia links to (plain) text.

• 1. Recognition: find "interesting" strings
    •    surface forms
• 2. Disambiguation: choose the appropriate Wikipedia page
    •    Each Wikipedia page represents an entity
    •    Every surface form can have multiple candidate entities for linking
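
The two steps can be illustrated with a toy sketch. It is not how DBpedia Spotlight is implemented: the lexicon `CANDIDATES`, the helper names `spot` and `candidates`, and the plain substring scan are all assumptions made for illustration.

```python
# Hypothetical lexicon: surface form -> candidate entities (labels as used on the slides).
CANDIDATES = {
    "Michael Jackson": ["Michael Jackson (singer)", "Michael Jackson (journalist)"],
    "Paris": ["Paris"],
}


def spot(text, lexicon):
    """Step 1, recognition: return (offset, surface form) for each lexicon entry found in text."""
    found = []
    for surface_form in lexicon:
        start = text.find(surface_form)
        while start != -1:
            found.append((start, surface_form))
            start = text.find(surface_form, start + 1)
    return sorted(found)


def candidates(surface_form, lexicon):
    """Input to step 2: every entity the surface form may refer to."""
    return lexicon.get(surface_form, [])


spots = spot("Michael Jackson came to Paris.", CANDIDATES)
for offset, surface_form in spots:
    print(offset, surface_form, candidates(surface_form, CANDIDATES))
```

Disambiguation then has to pick one entity from each candidate list.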
Michael Jackson died in 2007.
• Recognition: Find surface forms

[Michael Jackson] died in 2007.
• Disambiguation: Choose correct entity
   •     Candidates for [Michael Jackson]: Singer, Journalist
   •     Context: the text around the surface form ("died in 2007")

[Michael Jackson] came to Paris.
• Disambiguation: Choose correct entity
   •     Candidates for [Michael Jackson]: Singer, Journalist
   •     Less distinctive context ("came to Paris")
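
A minimal sketch of how context could separate the two candidates, assuming each candidate comes with a bag of associated words. The word bags in `CONTEXT_WORDS` are hypothetical, and the simple overlap count stands in for whatever scoring the real system uses.

```python
# Hypothetical context words per candidate; a real system would learn them
# from text that mentions each entity.
CONTEXT_WORDS = {
    "Michael Jackson (singer)": {"album", "pop", "song", "tour", "died", "2009"},
    "Michael Jackson (journalist)": {"beer", "whisky", "writer", "died", "2007"},
}


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation and brackets stripped."""
    return {token.strip(".,[]").lower() for token in text.split()} - {""}


def rank_by_context(sentence, context_words):
    """Rank candidates by how many sentence words appear in their word bag."""
    words = tokenize(sentence)
    scores = {entity: len(words & bag) for entity, bag in context_words.items()}
    # Highest overlap first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


distinctive = rank_by_context("[Michael Jackson] died in 2007.", CONTEXT_WORDS)
less_distinctive = rank_by_context("[Michael Jackson] came to Paris.", CONTEXT_WORDS)
```

With these toy word bags, "died in 2007" favours the journalist, while "came to Paris" overlaps with neither candidate and leaves the ranking to the tie-break.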




Probabilities
• P(entity | surface form)
   •     Who is typically meant by a name?
   •     For example, given [Michael Jackson] (and ignoring the context), what
         are the probabilities of the candidates?
   •     Michael Jackson (singer) 0.98
   •     Michael Jackson (journalist) 0.02

• Other useful probabilities:
   •     P(surface form | entity), P(entity), P(surface form)

• Maximum-likelihood estimates are computed from Wikipedia page links

  Data Processing
• Two pipelines
      −    Single machine with Scala
      −    MapReduce-style with Apache Pig

• Apache Pig for analyzing large datasets on top of Hadoop
      −    Data-flow language
      −    Think in tuples, bags and maps
      −    Operators: load, filter, join, group by, store, …
      −    Pig derives a MapReduce plan from these operators
      −    We build on pignlproc, started by Olivier Grisel (Stanbol)


 Probability estimation
• P( entity | surface form ) = count( surface form, entity ) / count( surface form )

    •    P( Michael Jackson (singer) | Michael Jackson ) = 0.98
    •    P( Michael Jackson (journalist) | Michael Jackson ) = 0.02

• See the project website for the estimation of other scores
    – Other probabilities...
    – TF*ICF (modification of TF*IDF) and others...
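
The count ratio for P( entity | surface form ) can be sketched in a few lines. This assumes the page links have already been extracted as (surface form, entity) pairs; the function name and the counts of 98 and 2 are made up so that the result matches the 0.98 / 0.02 example.

```python
from collections import Counter


def estimate_p_entity_given_surface_form(links):
    """Maximum-likelihood estimate: count(surface form, entity) / count(surface form).

    links: iterable of (surface form, entity) pairs, one pair per page link.
    """
    links = list(links)
    pair_counts = Counter(links)
    surface_form_counts = Counter(surface_form for surface_form, _ in links)
    return {
        (surface_form, entity): count / surface_form_counts[surface_form]
        for (surface_form, entity), count in pair_counts.items()
    }


# Illustrative counts only, not real Wikipedia statistics.
links = (
    [("Michael Jackson", "Michael Jackson (singer)")] * 98
    + [("Michael Jackson", "Michael Jackson (journalist)")] * 2
)
p = estimate_p_entity_given_surface_form(links)
```

The other probabilities mentioned earlier, such as P(surface form | entity), follow the same pattern with a different denominator.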

 Annotate

                                     http://dbpedia.org/resource/LSU_Tigers




                                                    http://dbpedia.org/resource/No. 4 (album)




 Top K Candidates



      −    LSU_Tigers
      −    Louisiana State University




Demo:
      – http://spotlight.dbpedia.org/demo/
Web Service:
      – http://spotlight.dbpedia.org/rest/{API}
      – APIs:
             • Phrase Recognition (/spot), Disambiguation (/disambiguation)
             • Top K disambiguations (/candidates)
             • Annotation (/annotation)
Source code:
      – https://github.com/dbpedia-spotlight/dbpedia-spotlight/
Apache V2 License!
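
A minimal sketch of building a request URL for the web service, using only the Python standard library. The base URL and the API names come from the list above; the `text` query parameter and the helper `build_request_url` are assumptions not stated on the slide, so check the service documentation before relying on them. No request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "http://spotlight.dbpedia.org/rest"


def build_request_url(api, text, **params):
    """Build a GET URL for one of the listed APIs, e.g. "spot" or "candidates".

    `text` and any extra keyword parameters are assumed query-parameter names.
    """
    query = urlencode({"text": text, **params})
    return f"{BASE_URL}/{api}?{query}"


url = build_request_url("candidates", "Michael Jackson came to Paris.")
print(url)
```

The resulting URL could then be fetched with any HTTP client.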
Lessons learned

    A generic solution to the problem is tough
      – Most of the research focuses on solving very specialized cases
      – Some entity types are harder than others
      – Some types of text are harder than others

      Yet, users expect it to “just work”.

We are focusing on a generic core that can be easily customized.




Next steps

    More experiments with DBpedia Spotlight in the context of the LOD2
     Use Case packages: Wolters Kluwer (legal domain, German
     language), Emergency Response

    Automating build process and release to LOD2 Stack

    Expanding to other languages

    Easier adaptation to other knowledge bases beyond DBpedia

    New algorithms, collective disambiguation, etc.




Credits

Jingle:       R.E.M., Martin Kaltenböck, Florian Kondert
Coordination: Thomas Thurner, Martin Kaltenböck
Moderation:   Martin Kaltenböck
Presented by: Pablo N. Mendes
Slides from:  Pablo N. Mendes, Max Jakob, Joachim Daiber




        We hope you enjoyed staying with us. If you need more detailed
        information, visit us at www.lod2.eu and let us know how we can
        improve to meet your expectations!

        Don’t forget to register for our next webinars:

           27.03.2013 – CKAN and PublicData.eu (OKFN)
           April – Virtuoso 7 (OpenLink Software)

        Have a great day and don’t forget ...




                                                       http://lod2.eu

LOD2 Webinar Series: DBpedia Spotlight
