Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Paris 2011)

Transcript

  • 1. Semantic CMS Community. Stanbol in the Labs: Universal Topic Classification, Named Entity Disambiguation. Olivier Grisel, Nuxeo. June 17, 2011. Co-funded by the European Union. Copyright IKS Consortium. www.iks-project.eu
  • 2. 1 - Universal Topic Classification
  • 3. “Wikipedia is a Web-Scale Controlled Vocabulary” – Chris Sizemore, BBC
  • 4. A Rather “Simple” Idea: use Apache Lucene / Solr MoreLikeThis to perform an approximate k-Nearest Neighbors query in the TF-IDF vector space of Wikipedia
  • 5. Which means:
      ● Pick the top 30 terms of the document to categorize
      ● Build a fuzzy full-text query
      ● Search for indexed articles that share most terms
      ● Rank results according to similarity score
      ● Use the top-related Wikipedia articles as “Topics”
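
The steps above can be sketched in plain Python. This is only an illustration of the idea, not Lucene's MoreLikeThis implementation: the tokenizer, the smoothed IDF formula, the cosine scoring, the three toy articles and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Toy stand-in for the indexed Wikipedia articles (hypothetical data).
ARTICLES = {
    "Machine_learning": "machine learning algorithms learn models from data",
    "Kidney": "the kidney filters blood and regulates sodium transport",
    "Semantic_Web": "the semantic web links data with rdf and ontologies",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(articles):
    """Return (TF-IDF vector per article, IDF weight per term)."""
    counts = {title: Counter(tokenize(text)) for title, text in articles.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n_docs = len(articles)
    # Smoothed IDF: an assumption, not Lucene's exact formula.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        title: {t: tf * idf[t] for t, tf in term_counts.items()}
        for title, term_counts in counts.items()
    }
    return vectors, idf


def top_terms(text, idf, max_terms=30):
    """Keep the max_terms highest TF-IDF terms of the document to categorize."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    weighted = {t: freq * idf[t] for t, freq in tf.items()}
    best = sorted(weighted, key=lambda t: (-weighted[t], t))[:max_terms]
    return {t: weighted[t] for t in best}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(text, vectors, idf, k=3, max_terms=30):
    """Return up to k (title, score) pairs ranked by similarity to the text."""
    query = top_terms(text, idf, max_terms)
    scored = [(cosine(query, vec), title) for title, vec in vectors.items()]
    scored = [(score, title) for score, title in scored if score > 0.0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(title, score) for score, title in scored[:k]]


if __name__ == "__main__":
    vectors, idf = build_index(ARTICLES)
    for title, score in more_like_this("Renal sodium transport in the kidney", vectors, idf):
        print(title, round(score, 3))
```

In a real deployment the index and the ranking would be delegated to Solr; the sketch only shows why sharing high-weight terms with an article makes it a candidate topic.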
  • 6. However: Wikipedia has millions of articles: Navigation Hell. Need hierarchical structure: from generic to specific. Faceted Browsing!
  • 7. Hierarchical Wikipedia Categorization
      ● Group the text of all articles categorized under a given Topic
      ● Use Wikipedia Categories as a Hierarchical Taxonomy
      ● Categorize a new document with MoreLikeThis on the aggregate text of articles
      ● Available DBpedia dumps provide:
          ● Text summaries for each article
          ● “subject” relationships between articles and topics
          ● “broader” / “narrower” SKOS hierarchy between topics
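
A minimal sketch of the grouping step, assuming the DBpedia dumps have already been parsed into an article-to-summary mapping and a list of (article, topic) “subject” pairs. The function name, the data layout and the sample records are hypothetical.

```python
from collections import defaultdict


def aggregate_topic_text(summaries, subjects):
    """Concatenate the summaries of all articles attached to each topic.

    summaries: dict mapping article id -> summary text
    subjects:  iterable of (article id, topic id) pairs
    """
    grouped = defaultdict(list)
    for article, topic in subjects:
        text = summaries.get(article)
        if text:  # skip articles that have no summary in the dump
            grouped[topic].append(text)
    return {topic: " ".join(texts) for topic, texts in grouped.items()}


# Hypothetical sample records, for illustration only.
SUMMARIES = {
    "Kidney": "The kidney filters blood.",
    "Nephron": "The nephron is the functional unit of the kidney.",
    "HTML": "HTML is a markup language for web pages.",
}
SUBJECTS = [
    ("Kidney", "Category:Nephrology"),
    ("Nephron", "Category:Nephrology"),
    ("HTML", "Category:World_Wide_Web_Consortium"),
    ("Missing_article", "Category:Nephrology"),
]

if __name__ == "__main__":
    for topic, text in aggregate_topic_text(SUMMARIES, SUBJECTS).items():
        print(topic, "->", text)
```

Each aggregate text can then be indexed as one document per topic, so that a similarity query returns topics instead of individual articles.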
  • 8. Challenges encountered
      ● 500k “technical” categories: “People_with_missing_birth_place”, “Rivers_in_Romania”
      ● 70k “grounded” categories
          ● Paths to roots need both “technical” and “grounded”
      ● Loops everywhere!
          ● Death is a subcategory of Life – Life is a subcategory of Death
          ● …
      ● Scale
          ● 1.2M topic / topic links
          ● 30M topic / article links
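
One way to cope with loops when walking from a topic up to the roots is to refuse to revisit a topic already on the current path. The sketch below assumes the “broader” links are held in a dict; the function name, the depth limit and the “Nature” root are hypothetical, only the Death / Life loop comes from the slide.

```python
def paths_to_roots(topic, broader, max_depth=10):
    """Return every loop-free path from topic up to a topic without parents.

    broader: dict mapping a topic to the list of its broader topics.
    """
    paths = []

    def walk(node, path):
        parents = broader.get(node, ())
        if not parents:
            paths.append(path)  # reached a root
            return
        if len(path) >= max_depth:
            return  # give up on very deep chains
        for parent in sorted(parents):
            if parent in path:
                continue  # loop detected: do not follow this link again
            walk(parent, path + [parent])

    walk(topic, [topic])
    return paths


# "Nature" is a made-up root used to terminate the example.
BROADER = {
    "Death": ["Life"],
    "Life": ["Death", "Nature"],
    "Nature": [],
}

if __name__ == "__main__":
    print(paths_to_roots("Death", BROADER))
```

With 1.2M topic / topic links a recursive walk per topic would not be practical as written; the example only shows the loop guard.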
  • 9. Sample results: Pig / Solr / Python Proof of Concept
  • 10. IKS Workshop Wiki Page
      ● Category:Free_web_development_software
      ● Category:Semantic_HTML
      ● Category:Semantic_Web
      ● Category:Web_development_software
      ● Category:Office_software
      ● Category:World_Wide_Web_Consortium
      ● Category:Open_source_project_foundations
      ● Category:Free_network-related_software
      ● Category:Free_business_software
  • 11. IKS Workshop Wiki Page (contd)
      ● Category:Knowledge_representation_languages
      ● Category:PHP_programming_language
      ● Category:XML-based_standards
      ● Category:Content_management_systems
      ● Category:Knowledge_representation
      ● Category:Presentation
      ● Category:Cross-platform_software
      ● Category:HTML
      ● Category:Data_management
  • 12. Yesterday's Wikinews Articles (1/3): “Hundreds of thousands of British public sector workers strike over planned pension changes”
      ● Category:Retirement_in_the_United_Kingdom
      ● Category:United_Kingdom_pensions_and_benefits
      ● Category:Pensions_in_the_United_Kingdom
      ● Category:Labor_disputes_by_country
      ● Category:Labor_disputes
  • 13. Yesterday's Wikinews Articles (2/3): “US children who celebrate Independence Day more likely to become Republicans, says Harvard study”
      ● Category:Fireworks
      ● Category:Voting_theory
      ● Category:Republican_Party_%28United_States%29
      ● Category:Statistics
      ● Category:Electoral_systems
  • 14. Yesterday's Wikinews Articles (3/3): “U.S. space agency NASA sues ex-astronaut”
      ● Category:American_astronauts
      ● Category:Aviation_halls_of_fame
      ● Category:Edwards_Air_Force_Base
      ● Category:Apollo_program
      ● Category:Exploration_of_the_Moon
  • 15. Scientific Publications (1/2): (PLOS One) “Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways”
      ● Category:Renal_physiology
      ● Category:Kidney
      ● Category:Nephrology
      ● Category:Hypertension
      ● Category:Membrane_biology
  • 16. Scientific Publications (2/2): International Conference on Machine Learning 2011, abstracts of accepted papers
      ● Category:Machine_learning
      ● Category:Computational_statistics
      ● Category:Data_analysis
      ● Category:Classification_algorithms
      ● Category:Ensemble_learning
  • 17. Track & Hack
      ● https://github.com/ogrisel/pignlproc
      ● https://issues.apache.org/jira/browse/STANBOL-201
      ● Help integrate into Stanbol EntityHub / Enhancer during the Hackathon
      ● IKS User Story S10: Automated document categorization
          ● I create a new document in my CMS by typing in an HTML edit form or by uploading a document with textual content (PDF, office file, XML file, ...). I want the CMS to suggest a list of at most 3 controlled properties, such as subjects/topics or geographical coverage, out of a list of standardised options (IPTC subjects or world countries), based on the text content I gave.
  • 18. 2 – Named Entity Disambiguation
  • 19. An example
      ● Query for person with name = “George Bush”
          ● Results: 2 ambiguous possibilities
      ● Perform an additional MoreLikeThis with the surrounding paragraph as context:
      ● If more like “41st”, “1988”, “Reagan”, “Panama”...
          ● then: dbpedia:George_H._W._Bush
      ● If more like “43rd”, “911”, “War on Terror”, “bretzel”...
          ● then: dbpedia:George_W._Bush
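
The example above can be sketched as a context-overlap score. This is a deliberately crude stand-in for a MoreLikeThis query: each candidate entity is represented by a few cue words taken from the slide, and the candidate whose cue words occur most often in the surrounding paragraph wins. The function names, the profile strings and the sample sentences are assumptions.

```python
import re
from collections import Counter

# Candidate entities with hypothetical context profiles built from the
# cue words quoted on the slide.
CANDIDATES = {
    "dbpedia:George_H._W._Bush": "41st 1988 Reagan Panama",
    "dbpedia:George_W._Bush": "43rd 911 War on Terror",
}


def tokens(text):
    """Lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def disambiguate(context, candidates):
    """Return (best candidate URI, score per candidate) for a context paragraph."""
    context_counts = Counter(tokens(context))
    scores = {}
    for uri, profile in candidates.items():
        profile_terms = set(tokens(profile))
        # Count every context token that also appears in the profile.
        scores[uri] = sum(n for term, n in context_counts.items() if term in profile_terms)
    # Ties are broken alphabetically by URI so the result is deterministic.
    best = max(sorted(scores), key=lambda uri: scores[uri])
    return best, scores


if __name__ == "__main__":
    paragraph = "George Bush served under Reagan and won the 1988 election."
    print(disambiguate(paragraph, CANDIDATES))
```

A real index would weight terms by TF-IDF, so that common words such as “on” contribute far less than rare ones such as “Panama”.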
  • 20. Work in Progress
      ● EntityHub's SolrYard now has a SimilarityConstraint
      ● OpenNLP NamedEntity Engine already extracts context
      ● pignlproc is able to extract an occurrence corpus from Wikipedia dumps
      ● Early prototype during Berlin Buzzwords Hackathon
      ● TODO: build a prepackaged Enhancer Engine & EntityHub index