Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Paris 2011)



  1. Stanbol in the Labs: Universal Topic Classification and Named Entity Disambiguation. Olivier Grisel, Nuxeo. June 17, 2011. Semantic CMS Community. Co-funded by the European Union. Copyright IKS Consortium. www.iks-project.eu
  2. 1 - Universal Topic Classification
  3. “Wikipedia is a Web-Scale Controlled Vocabulary” – Chris Sizemore, BBC
  4. A Rather “Simple” Idea: use Apache Lucene / Solr MoreLikeThis to perform an approximate k-Nearest Neighbors query in the TF-IDF vector space of Wikipedia
  5. Which means:
     ● Pick the top 30 terms of the document to categorize
     ● Build a fuzzy full-text query
     ● Search for indexed articles that share most terms
     ● Rank results according to similarity score
     ● Use the top-related Wikipedia articles as “Topics”
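
The steps on slide 5 can be sketched in plain Python. This is only an illustration of the idea, not the Lucene / Solr MoreLikeThis implementation: it brute-forces cosine similarity over a tiny in-memory corpus instead of querying an inverted index, and the function names, the IDF smoothing formula and the sample articles used with it are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphabetic tokens only; a real analyzer would also
    # stem words and drop stop words.
    return re.findall(r"[a-z]+", text.lower())


def build_index(articles):
    """articles: dict mapping article title -> text.

    Returns (TF-IDF vector per article, IDF weight per term).
    """
    term_counts = {title: Counter(tokenize(text))
                   for title, text in articles.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(articles)
    # Smoothed IDF (an assumption; Lucene uses its own scoring formula).
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}
    vectors = {title: {t: c * idf[t] for t, c in counts.items()}
               for title, counts in term_counts.items()}
    return vectors, idf


def top_terms(text, idf, max_terms=30):
    # Keep the highest-weighted terms of the document to categorize.
    counts = Counter(tokenize(text))
    weights = {t: c * idf[t] for t, c in counts.items() if t in idf}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_terms])


def norm(vector):
    return math.sqrt(sum(w * w for w in vector.values()))


def more_like_this(text, vectors, idf, max_terms=30, n_results=5):
    # Rank indexed articles by cosine similarity with the query terms.
    query = top_terms(text, idf, max_terms)
    results = []
    for title, doc in vectors.items():
        dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
        if dot > 0:
            results.append((title, dot / (norm(query) * norm(doc))))
    results.sort(key=lambda r: (-r[1], r[0]))
    return results[:n_results]
```

Calling `build_index` once on a dict of article texts and then `more_like_this(document_text, vectors, idf)` returns `(title, score)` pairs; the titles of the best hits play the role of the “Topics”.
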
  6. However, Wikipedia has millions of articles: Navigation Hell. Need a hierarchical structure, from generic to specific: Faceted Browsing!
  7. Hierarchical Wikipedia Categorization
     ● Group the text of all articles categorized under a given Topic
     ● Use Wikipedia Categories as a Hierarchical Taxonomy
     ● Categorize a new document with MoreLikeThis on the aggregate text of articles
     ● Available DBpedia dumps provide:
       ● Text summaries for each article
       ● “subject” relationships between articles and topics
       ● “broader” / “narrower” SKOS hierarchy between topics
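
The grouping step on slide 7 can be sketched as follows. The proof of concept in the slides runs on Pig; this in-memory Python version only shows the shape of the join, and the function name and input layout (a dict of summaries plus `(article, topic)` pairs standing in for the DBpedia “subject” links) are assumptions.

```python
from collections import defaultdict


def aggregate_topic_text(summaries, subject_links):
    """Build one aggregate text per topic.

    summaries: dict mapping article id -> text summary.
    subject_links: iterable of (article id, topic id) pairs.
    """
    grouped = defaultdict(list)
    for article, topic in subject_links:
        text = summaries.get(article)
        if text:  # skip links to articles without a summary
            grouped[topic].append(text)
    # Each aggregate can then be indexed as a single "topic document".
    return {topic: " ".join(texts) for topic, texts in grouped.items()}
```

Each resulting topic document can be indexed in place of individual articles, so that a similarity query returns categories rather than article titles.
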
  8. Challenges encountered
     ● 500k “technical” categories: “People_with_missing_birth_place”, “Rivers_in_Romania”
     ● 70k “grounded” categories
       ● Paths to roots need both “technical” and “grounded”
     ● Loops everywhere!
       ● Death is a subcategory of Life – Life is a subcategory of Death
       ● …
     ● Scale
       ● 1.2M topic / topic links
       ● 30M topic / article links
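
One way to cope with the loops mentioned on slide 8 is to walk the “broader” links breadth-first while remembering visited topics, so that a Life / Death cycle cannot trap the traversal. This is a minimal sketch under assumed names (`path_to_root`, a dict of parent sets, an explicit set of root topics); the slides do not say how the proof of concept handles cycles.

```python
from collections import deque


def path_to_root(topic, broader, roots):
    """Return the shortest chain of broader links from topic to a root.

    broader: dict mapping topic -> set of broader (parent) topics.
    roots: set of topics accepted as taxonomy roots.
    Returns None when no root is reachable.
    """
    queue = deque([[topic]])
    seen = {topic}  # visited set: this is what makes cycles harmless
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in roots:
            return path
        for parent in sorted(broader.get(node, ())):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None
```

With `broader = {"Death": {"Life"}, "Life": {"Death", "Nature"}}` and a hypothetical root `"Nature"`, the call `path_to_root("Death", broader, {"Nature"})` terminates despite the loop.
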
  9. Sample results: Pig / Solr / Python Proof of Concept
  10. IKS Workshop Wiki Page
      ● Category:Free_web_development_software
      ● Category:Semantic_HTML
      ● Category:Semantic_Web
      ● Category:Web_development_software
      ● Category:Office_software
      ● Category:World_Wide_Web_Consortium
      ● Category:Open_source_project_foundations
      ● Category:Free_network-related_software
      ● Category:Free_business_software
  11. IKS Workshop Wiki Page (contd)
      ● Category:Knowledge_representation_languages
      ● Category:PHP_programming_language
      ● Category:XML-based_standards
      ● Category:Content_management_systems
      ● Category:Knowledge_representation
      ● Category:Presentation
      ● Category:Cross-platform_software
      ● Category:HTML
      ● Category:Data_management
  12. Yesterday's Wikinews Articles (1/3): “Hundreds of thousands of British public sector workers strike over planned pension changes”
      ● Category:Retirement_in_the_United_Kingdom
      ● Category:United_Kingdom_pensions_and_benefits
      ● Category:Pensions_in_the_United_Kingdom
      ● Category:Labor_disputes_by_country
      ● Category:Labor_disputes
  13. Yesterday's Wikinews Articles (2/3): “US children who celebrate Independence Day more likely to become Republicans, says Harvard study”
      ● Category:Fireworks
      ● Category:Voting_theory
      ● Category:Republican_Party_%28United_States%29
      ● Category:Statistics
      ● Category:Electoral_systems
  14. Yesterday's Wikinews Articles (3/3): “U.S. space agency NASA sues ex-astronaut”
      ● Category:American_astronauts
      ● Category:Aviation_halls_of_fame
      ● Category:Edwards_Air_Force_Base
      ● Category:Apollo_program
      ● Category:Exploration_of_the_Moon
  15. Scientific Publications (1/2): (PLOS One) “Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways”
      ● Category:Renal_physiology
      ● Category:Kidney
      ● Category:Nephrology
      ● Category:Hypertension
      ● Category:Membrane_biology
  16. Scientific Publications (2/2): International Conference on Machine Learning 2011, abstracts of accepted papers
      ● Category:Machine_learning
      ● Category:Computational_statistics
      ● Category:Data_analysis
      ● Category:Classification_algorithms
      ● Category:Ensemble_learning
  17. Track & Hack
      ● https://github.com/ogrisel/pignlproc
      ● https://issues.apache.org/jira/browse/STANBOL-201
      ● Help integrate into Stanbol EntityHub / Enhancer during the Hackathon
      ● IKS User Story S10: Automated document categorization
        ● I create a new document in my CMS by typing in an HTML edit form or by uploading a document with textual content (PDF, office file, XML file, ...). I want the CMS to suggest a list of at most 3 controlled properties, such as subjects/topics or geographical coverage, out of a list of standardised options (IPTC subjects or world countries), based on the text content I provided.
  18. 2 – Named Entity Disambiguation
  19. An example
      ● Query for person with name = “George Bush”
        ● Results: 2 ambiguous possibilities
      ● Perform an additional MoreLikeThis query with the surrounding paragraph as context:
      ● If more like “41st”, “1988”, “Reagan”, “Panama”...
        ● then: dbpedia:George_H._W._Bush
      ● If more like “43rd”, “911”, “War on Terror”, “bretzel”...
        ● then: dbpedia:George_W._Bush
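
The example on slide 19 amounts to scoring each candidate entity against the paragraph surrounding the mention and keeping the best match. The sketch below uses plain bag-of-words cosine similarity as a stand-in for a MoreLikeThis query; it is not the Stanbol EntityHub API, and the function names and the shape of the `candidates` dict (entity URI mapped to some descriptive text) are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    # Keep digits so that cues such as "1988" or "911" count as terms.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norms = (math.sqrt(sum(c * c for c in a.values()))
             * math.sqrt(sum(c * c for c in b.values())))
    return dot / norms if norms else 0.0


def disambiguate(context, candidates):
    """Pick the candidate whose text is most similar to the context.

    context: paragraph surrounding the ambiguous mention.
    candidates: dict mapping entity URI -> descriptive text.
    Returns (best URI, list of (score, URI) sorted best first).
    """
    ctx = bag_of_words(context)
    scored = sorted(((cosine(ctx, bag_of_words(text)), uri)
                     for uri, text in candidates.items()),
                    key=lambda s: (-s[0], s[1]))
    return scored[0][1], scored
```

In practice the candidate texts would come from the occurrence corpus or article summaries rather than hand-written cue words, and ties or all-zero scores would need an explicit fallback.
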
  20. Work in Progress
      ● EntityHub's SolrYard now has a SimilarityConstraint
      ● OpenNLP NamedEntity Engine already extracts context
      ● pignlproc is able to extract an occurrence corpus from Wikipedia dumps
      ● Early prototype during Berlin Buzzwords Hackathon
      TODO: build a prepackaged Enhancer Engine & EntityHub index