Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Paris 2011)



  Semantic CMS Community
  Stanbol in the Labs: Universal Topic Classification, Named Entity Disambiguation
  Olivier Grisel, Nuxeo
  June 17, 2011
  1 - Universal Topic Classification
  "Wikipedia is a Web-Scale Controlled Vocabulary" – Chris Sizemore, BBC
  A Rather "Simple" Idea
  Use Apache Lucene / Solr MoreLikeThis to perform an approximate k-Nearest Neighbors query in the TF-IDF vector space of Wikipedia.
  Which means:
● Pick the top 30 terms of the document to categorize
● Build a fuzzy full-text query
● Search for indexed articles that share most terms
● Rank results according to similarity score
● Use the top-related Wikipedia articles as "Topics"
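The steps above can be sketched in pure Python. This is only an illustration under stated assumptions: it does not call Lucene or Solr, cosine similarity stands in for Lucene's scoring, and the `ARTICLES` index, the function names, and the tokenizer are all hypothetical.

```python
import math
import re
from collections import Counter

# Toy stand-in for the Wikipedia index (hypothetical titles and texts).
ARTICLES = {
    "Semantic_Web": "The semantic web uses RDF and linked data to describe web resources",
    "Kidney": "The kidney filters blood and regulates sodium transport and blood pressure",
    "Machine_learning": "Machine learning algorithms learn statistical models from data",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def inverse_document_frequencies(articles):
    """Smoothed IDF weight for every term that occurs in the index."""
    n_docs = len(articles)
    doc_freq = Counter()
    for body in articles.values():
        doc_freq.update(set(tokenize(body)))
    return {term: math.log(1.0 + n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf, n_terms=None):
    """TF-IDF weights of the text, optionally truncated to the top n_terms."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: tf * idf[t] for t, tf in counts.items()}
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return {t: weights[t] for t in ranked[:n_terms]}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    if dot == 0.0:
        return 0.0
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


def more_like_this(text, articles, k=3, n_terms=30):
    """Return the titles of the k indexed articles most similar to the text."""
    idf = inverse_document_frequencies(articles)
    query = tfidf_vector(text, idf, n_terms)  # keep only the top terms
    scores = {title: cosine(query, tfidf_vector(body, idf))
              for title, body in articles.items()}
    ranked = sorted((t for t in scores if scores[t] > 0.0),
                    key=lambda t: (-scores[t], t))
    return ranked[:k]


topics = more_like_this("Linked data and RDF vocabularies on the web", ARTICLES, k=2)
```

The first entry of `topics` is the suggested "Topic". On a real index the approximation comes from querying with only the top terms instead of the full document vector.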
  However, Wikipedia has millions of articles: navigation hell.
  A hierarchical structure is needed, from generic to specific: faceted browsing!
  Hierarchical Wikipedia Categorization
● Group text of all articles categorized for a given Topic
● Use Wikipedia Categories as Hierarchical Taxonomy
● Categorize new document with MoreLikeThis on the aggregate text of articles
● Available DBpedia dumps provide: 
 ● Text summaries for each article 
 ● "subject" relationships between articles and topics 
 ● "broader" / "narrower" SKOS hierarchy between topics
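As a rough illustration of the grouping step, the sketch below joins article summaries with "subject" links to build one aggregate text per topic. The data layout (a dict of abstracts and a list of article/topic pairs) and all sample values are assumptions for the example, not the format of the DBpedia dumps or the pignlproc scripts.

```python
from collections import defaultdict

# Hypothetical sample data: article -> summary text.
ABSTRACTS = {
    "RDF": "RDF is a data model for describing web resources.",
    "SPARQL": "SPARQL is a query language for RDF data.",
    "Nephron": "The nephron is the functional unit of the kidney.",
}

# Hypothetical "subject" links as (article, topic) pairs.
SUBJECTS = [
    ("RDF", "Category:Semantic_Web"),
    ("SPARQL", "Category:Semantic_Web"),
    ("Nephron", "Category:Kidney"),
    ("Unknown_article", "Category:Kidney"),  # no abstract available, skipped
]


def aggregate_topic_text(abstracts, subjects):
    """Concatenate the abstracts of all articles filed under each topic."""
    texts_by_topic = defaultdict(list)
    for article, topic in subjects:
        if article in abstracts:
            texts_by_topic[topic].append(abstracts[article])
    return {topic: "\n".join(texts) for topic, texts in texts_by_topic.items()}


topic_documents = aggregate_topic_text(ABSTRACTS, SUBJECTS)
```

Each value of `topic_documents` would then be indexed as a single document, so that a similarity query returns topics rather than individual articles.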
  Challenges encountered
● 500k "technical" categories such as "People_with_missing_birth_place" or "Rivers_in_Romania"
● 70k "grounded" categories 
 ● Paths to the roots need both "technical" and "grounded" categories
● Loops everywhere! 
 ● Death is a subcategory of Life – Life is a subcategory of Death 
 ● …
● Scale 
 ● 1.2M topic / topic links 
 ● 30M topic / article links
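Because of the loops, any walk up the "broader" links has to remember what it has already visited. A minimal sketch follows, assuming the hierarchy fits in a plain dict; the `BROADER` sample reuses the Life / Death loop from above, and "Nature" is a made-up parent added for the example.

```python
from collections import deque

# topic -> list of broader topics (hypothetical sample with a loop).
BROADER = {
    "Death": ["Life"],
    "Life": ["Death", "Nature"],
}


def ancestors(topic, broader):
    """Collect every topic reachable through "broader" links.

    The `seen` set guarantees termination even when the graph has cycles.
    """
    seen = {topic}
    queue = deque([topic])
    while queue:
        current = queue.popleft()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(topic)  # a topic is not its own ancestor
    return seen


death_ancestors = ancestors("Death", BROADER)
```

At the scale quoted above (1.2M topic / topic links) the same traversal would more likely run as a batch job than in memory; the visited-set idea is unchanged.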
  Sample results
  Pig / Solr / Python Proof of Concept
  IKS Workshop Wiki Page
● Category:Free_web_development_software
● Category:Semantic_HTML
● Category:Semantic_Web
● Category:Web_development_software
● Category:Office_software
● Category:World_Wide_Web_Consortium
● Category:Open_source_project_foundations
● Category:Free_network-related_software
● Category:Free_business_software
  IKS Workshop Wiki Page (contd)
● Category:Knowledge_representation_languages
● Category:PHP_programming_language
● Category:XML-based_standards
● Category:Content_management_systems
● Category:Knowledge_representation
● Category:Presentation
● Category:Cross-platform_software
● Category:HTML
● Category:Data_management
  Yesterday's Wikinews Articles (1/3)
  Hundreds of thousands of British public sector workers strike over planned pension changes
● Category:Retirement_in_the_United_Kingdom
● Category:United_Kingdom_pensions_and_benefits
● Category:Pensions_in_the_United_Kingdom
● Category:Labor_disputes_by_country
● Category:Labor_disputes
  Yesterday's Wikinews Articles (2/3)
  US children who celebrate Independence Day more likely to become Republicans, says Harvard study
● Category:Fireworks
● Category:Voting_theory
● Category:Republican_Party_(United_States)
● Category:Statistics
● Category:Electoral_systems
  Yesterday's Wikinews Articles (3/3)
  U.S. space agency NASA sues ex-astronaut
● Category:American_astronauts
● Category:Aviation_halls_of_fame
● Category:Edwards_Air_Force_Base
● Category:Apollo_program
● Category:Exploration_of_the_Moon
  Scientific Publications (1/2)
  PLOS One: Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways
● Category:Renal_physiology
● Category:Kidney
● Category:Nephrology
● Category:Hypertension
● Category:Membrane_biology
  Scientific Publications (2/2)
  International Conference on Machine Learning 2011: abstracts of accepted papers
● Category:Machine_learning
● Category:Computational_statistics
● Category:Data_analysis
● Category:Classification_algorithms
● Category:Ensemble_learning
  Track & Hack
● https://github.com/ogrisel/pignlproc
● https://issues.apache.org/jira/browse/STANBOL-201
● Help integrate into Stanbol EntityHub / Enhancer during the Hackathon
● IKS User Story S10: Automated document categorization 
 ● I create a new document in my CMS by typing in an HTML edit form or by uploading a document with textual content (PDF, office file, XML file, ...). I want the CMS to suggest a list of at most 3 controlled properties, such as subjects/topics or geographical coverage, out of a list of standardised options (IPTC subjects or world countries), based on the text content I gave.
  2 – Named Entity Disambiguation
  An example
● Query for person with name = "George Bush" 
 ● Results: 2 ambiguous possibilities
● Perform additional MoreLikeThis with surrounding paragraph as context:
● If more like "41st", "1988", "Reagan", "Panama"... 
 ● then: dbpedia:George_H._W._Bush
● If more like "43rd", "9/11", "War on Terror", "pretzel"... 
 ● then: dbpedia:George_W._Bush
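The example can be mimicked with a simple term-overlap score. This is a deliberately naive sketch: counting shared terms stands in for the MoreLikeThis similarity query, and the per-candidate context strings are hand-written here, whereas a real index would hold text extracted from Wikipedia.

```python
import re

# Hypothetical context terms per candidate entity, taken from the example.
CANDIDATE_CONTEXTS = {
    "dbpedia:George_H._W._Bush": "41st president 1988 Reagan Panama",
    "dbpedia:George_W._Bush": "43rd president 9/11 War on Terror pretzel",
}


def term_set(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def disambiguate(context, candidates):
    """Return the candidate whose context shares the most terms with the paragraph."""
    context_terms = term_set(context)

    def overlap(uri):
        return len(context_terms & term_set(candidates[uri]))

    # Sorting first makes ties resolve deterministically (alphabetical order).
    return max(sorted(candidates), key=overlap)


first = disambiguate(
    "George Bush served as vice president under Reagan before the 1988 election.",
    CANDIDATE_CONTEXTS,
)
second = disambiguate(
    "After 9/11, George Bush launched the War on Terror.",
    CANDIDATE_CONTEXTS,
)
```

Here `first` resolves to the elder Bush and `second` to the younger one. Raw overlap ignores term weighting, so a TF-IDF based similarity would be the natural next step.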
  Work in Progress
● EntityHub's SolrYard now has a SimilarityConstraint
● OpenNLP NamedEntity Engine already extracts context
● pignlproc is able to extract an occurrence corpus from Wikipedia dumps
● Early prototype during Berlin Buzzwords Hackathon
● TODO: build a prepackaged Enhancer Engine & EntityHub index