Page:




Stanbol in the Labs
Semantic CMS Community

Universal Topic Classification
Named Entity Disambiguation

Olivier Grisel, Nuxeo
June 17, 2011

Co-funded by the European Union
Copyright IKS Consortium
www.iks-project.eu
Page:




   1 - Universal Topic Classification




Page:




“Wikipedia is a Web-Scale Controlled Vocabulary”

    – Chris Sizemore, BBC




Page:




A Rather “Simple” Idea

Use Apache Lucene / Solr MoreLikeThis to perform an approximate
k-Nearest Neighbors query in the TF-IDF vector space of Wikipedia

Page:




    Which means:
●   Pick the top 30 terms of the document to categorize
●   Build a fuzzy full-text query
●   Search for indexed articles that share most terms
●   Rank results according to similarity score
●   Use the top-related Wikipedia articles as “Topics”
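
The steps above can be sketched in plain Python. This is a toy stand-in for Lucene / Solr MoreLikeThis, not its actual implementation: the TinyIndex class, its method names, the smoothed IDF formula and the three sample articles are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory index; a real setup would use Solr."""

    def __init__(self, articles):
        # articles: dict mapping an article title to its text
        self.term_counts = {t: Counter(tokenize(x)) for t, x in articles.items()}
        self.doc_freq = Counter()
        for counts in self.term_counts.values():
            self.doc_freq.update(counts.keys())
        self.n_docs = len(articles)

    def idf(self, term):
        # Smoothed inverse document frequency (an assumed formula).
        return math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0

    def top_terms(self, text, n=30):
        """Keep the n highest TF-IDF terms that the index knows about."""
        counts = Counter(tokenize(text))
        weights = {t: tf * self.idf(t)
                   for t, tf in counts.items() if self.doc_freq[t] > 0}
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        return dict(ranked[:n])

    def more_like_this(self, text, n_terms=30, k=5):
        """Rank indexed articles by similarity to the query terms."""
        query = self.top_terms(text, n_terms)
        scores = {}
        for title, counts in self.term_counts.items():
            dot = sum(w * counts[t] * self.idf(t)
                      for t, w in query.items() if t in counts)
            if dot > 0:
                norm = math.sqrt(sum((c * self.idf(t)) ** 2
                                     for t, c in counts.items()))
                scores[title] = dot / norm
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up miniature "Wikipedia" for the example.
index = TinyIndex({
    "Machine learning": "machine learning algorithms learn models from data",
    "Kidney": "the kidney filters blood and regulates sodium transport",
    "Pension": "a pension is a retirement fund for workers",
})
document = "renal sodium transport in the kidney"
results = index.more_like_this(document, n_terms=30, k=3)
print(results)
```

Only the articles sharing at least one query term are scored, which is what makes the search approximate: a true k-Nearest Neighbors query would compare the full document vector against every article.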




Page:




However, Wikipedia has millions of articles:
Navigation Hell

Need hierarchical structure:
from generic to specific

Faceted Browsing!

Page:




    Hierarchical Wikipedia Categorization
●   Group text of all articles categorized for a given Topic
●   Use Wikipedia Categories as Hierarchical Taxonomy
●   Categorize new document with MoreLikeThis on the
    aggregate text of articles
●   Available DBpedia dumps provide:
    ●    Text summaries for each article
    ●    “subject” relationships between articles and topics
    ●    “broader” / “narrower” SKOS hierarchy between topics
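
The grouping step can be sketched as follows, assuming the DBpedia dumps have already been parsed into a dict of article summaries and a list of (article, topic) “subject” pairs. The function name and the sample data are hypothetical; the proof of concept itself relies on Pig scripts for this join.

```python
from collections import defaultdict


def aggregate_text_by_topic(abstracts, subjects):
    """Concatenate the summaries of all articles filed under each topic.

    abstracts: dict mapping an article name to its text summary
    subjects:  iterable of (article, topic) "subject" pairs
    """
    grouped = defaultdict(list)
    for article, topic in subjects:
        text = abstracts.get(article)
        if text:  # skip articles that have no summary in the dump
            grouped[topic].append(text)
    return {topic: "\n".join(texts) for topic, texts in grouped.items()}


# Made-up sample data for the example.
abstracts = {
    "Kidney": "The kidney filters blood.",
    "Nephron": "The nephron is the functional unit of the kidney.",
    "Boosting": "Boosting combines weak learners.",
}
subjects = [
    ("Kidney", "Category:Kidney"),
    ("Nephron", "Category:Kidney"),
    ("Boosting", "Category:Ensemble_learning"),
    ("Missing_article", "Category:Kidney"),
]
topics = aggregate_text_by_topic(abstracts, subjects)
print(topics["Category:Kidney"])
```

Each aggregated text can then be indexed as a single document, so that a MoreLikeThis query returns topics instead of individual articles.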



Page:




    Challenges encountered
●   500k “technical” categories
    “People_with_missing_birth_place”, “Rivers_in_Romania”
●   70k “grounded” categories
    ●   Paths to roots need both “technical” and “grounded”
●   Loops everywhere!
    ●   Death is a subcategory of Life
         –   Life is a subcategory of Death
                ●   …
●   Scale
    ●   1.2M topic / topic links
    ●   30M topic / article links
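
One way to cope with the loops is to walk the “broader” links with a visited set, so that each topic is expanded at most once. The sketch below is an assumption about how such a traversal could look, not the code used in the proof of concept; the function name and the “Nature” topic are made up, and only the Death / Life loop comes from the slide.

```python
from collections import deque


def ancestors(topic, broader):
    """Return {ancestor: shortest distance} following "broader" links.

    broader: dict mapping a topic to the list of its broader topics.
    The visited set guarantees termination even when the graph has loops.
    """
    seen = {topic}
    depths = {}
    queue = deque([(topic, 0)])
    while queue:
        current, depth = queue.popleft()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                depths[parent] = depth + 1
                queue.append((parent, depth + 1))
    return depths


# "Death" and "Life" point at each other, as on the slide.
broader = {
    "Death": ["Life"],
    "Life": ["Death", "Nature"],
    "Nature": [],
}
paths = ancestors("Death", broader)
print(paths)
```

A breadth-first walk also yields the shortest distance to each ancestor, which helps when ordering topics from generic to specific.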
Page:




                     Sample results

Pig / Solr / Python Proof of Concept




Page:




    IKS Workshop Wiki Page
●   Category:Free_web_development_software
●   Category:Semantic_HTML
●   Category:Semantic_Web
●   Category:Web_development_software
●   Category:Office_software
●   Category:World_Wide_Web_Consortium
●   Category:Open_source_project_foundations
●   Category:Free_network-related_software
●   Category:Free_business_software
Page:




    IKS Workshop Wiki Page (cont'd)
●   Category:Knowledge_representation_languages
●   Category:PHP_programming_language
●   Category:XML-based_standards
●   Category:Content_management_systems
●   Category:Knowledge_representation
●   Category:Presentation
●   Category:Cross-platform_software
●   Category:HTML
●   Category:Data_management
Page:




    Yesterday's Wikinews Articles (1/3)
    Hundreds of thousands of British public sector workers
    strike over planned pension changes


●   Category:Retirement_in_the_United_Kingdom
●   Category:United_Kingdom_pensions_and_benefits
●   Category:Pensions_in_the_United_Kingdom
●   Category:Labor_disputes_by_country
●   Category:Labor_disputes


Page:




    Yesterday's Wikinews Articles (2/3)
    US children who celebrate Independence Day more
    likely to become Republicans, says Harvard study


●   Category:Fireworks
●   Category:Voting_theory
●   Category:Republican_Party_%28United_States%29
●   Category:Statistics
●   Category:Electoral_systems

Page:




    Yesterday's Wikinews Articles (3/3)
    U.S. space agency NASA sues ex-astronaut


●   Category:American_astronauts
●   Category:Aviation_halls_of_fame
●   Category:Edwards_Air_Force_Base
●   Category:Apollo_program
●   Category:Exploration_of_the_Moon


Page:




    Scientific Publications (1/2) (PLOS One)
    Metabolic Programming during Lactation Stimulates
    Renal Na+ Transport in the Adult Offspring Due to an
    Early Impact on Local Angiotensin II Pathways


●   Category:Renal_physiology
●   Category:Kidney
●   Category:Nephrology
●   Category:Hypertension
●   Category:Membrane_biology
Page:




    Scientific Publications (2/2)
    International Conference on Machine Learning 2011
    accepted papers abstracts


●   Category:Machine_learning
●   Category:Computational_statistics
●   Category:Data_analysis
●   Category:Classification_algorithms
●   Category:Ensemble_learning

Page:




    Track & Hack
●   https://github.com/ogrisel/pignlproc
●   https://issues.apache.org/jira/browse/STANBOL-201
●   Help integrate into Stanbol EntityHub / Enhancer during the
    Hackathon
●   IKS User Story S10: Automated document categorization
    ●   I create a new document in my CMS by typing in an HTML edit form or
        by uploading a document with textual content (PDF, office file, XML
        file, ...). I want the CMS to suggest a list of at most 3
        controlled properties, such as subjects/topics or geographical
        coverage, out of a list of standardised options (IPTC subjects or
        world countries), based on the text content I gave.


Page:




   2 - Named Entity Disambiguation




Page:




    An example
●   Query for person with name = “George Bush”
    ●    Results: 2 ambiguous possibilities
●   Perform additional MoreLikeThis with surrounding
    paragraph as context:
●   If more like “41st”, “1988”, “Reagan”, “Panama”...
    ●    then: dbpedia:George_H._W._Bush
●   If more like “43rd”, “9/11”, “War on Terror”, “pretzel”...
    ●    then: dbpedia:George_W._Bush
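
The example can be sketched with a plain cosine similarity over raw term counts, standing in for the MoreLikeThis query. Everything here is illustrative: the function names are hypothetical, the candidate descriptions are short made-up texts built from the keywords on the slide, and none of it is the Stanbol EntityHub API.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercase alphanumeric tokens of a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(context, candidates):
    """Pick the candidate whose description best matches the context.

    candidates: dict mapping an entity URI to a descriptive text.
    """
    ctx = bag_of_words(context)
    return max(candidates,
               key=lambda uri: cosine(ctx, bag_of_words(candidates[uri])))


# Made-up descriptions; a real index would hold the Wikipedia text.
candidates = {
    "dbpedia:George_H._W._Bush":
        "41st president elected in 1988 after Reagan, Panama invasion",
    "dbpedia:George_W._Bush":
        "43rd president, 9/11 attacks, War on Terror",
}
first = disambiguate(
    "George Bush succeeded Reagan in 1988 and ordered the invasion of Panama",
    candidates)
second = disambiguate(
    "George Bush launched the War on Terror after the 9/11 attacks",
    candidates)
print(first, second)
```

The name query narrows the search to a handful of candidates, so the more expensive context comparison only runs on those.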



Page:




    Work in Progress
●   EntityHub's SolrYard now has a SimilarityConstraint
●   OpenNLP NamedEntity Engine already extracts context
●   pignlproc is able to extract an occurrence corpus from
    Wikipedia dumps
●   Early prototype during Berlin Buzzwords Hackathon


TODO: build a prepackaged Enhancer Engine & EntityHub index

Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Paris 2011)
