Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Paris 2011)


Stanbol
Semantic CMS Community in the Labs

Universal Topic Classification
Named Entity Disambiguation

Olivier Grisel
Nuxeo

June 17, 2011

Co-funded by the European Union
Copyright IKS Consortium

www.iks-project.eu


1 - Universal Topic Classification


Wikipedia is a Web-Scale Controlled
Vocabulary

– Chris Sizemore, BBC


A Rather “Simple” Idea

Use Apache Lucene / Solr MoreLikeThis to perform an approximate
k-Nearest Neighbors query in the TF-IDF vector space of Wikipedia


Which means:
● Pick the top 30 terms of the document to categorize
● Build a fuzzy full-text query
● Search for indexed articles that share most terms
● Rank results according to similarity score
● Use the top-related Wikipedia articles as “Topics”
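
The steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not the Lucene / Solr MoreLikeThis implementation: the `TinyIndex` class, its method names, the IDF smoothing and the three sample articles are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory stand-in for an index of Wikipedia articles."""

    def __init__(self, articles):
        # articles: dict mapping an article title to its text
        self.term_counts = {t: Counter(tokenize(x)) for t, x in articles.items()}
        self.doc_freq = Counter()
        for counts in self.term_counts.values():
            self.doc_freq.update(counts.keys())
        self.n_docs = len(articles)

    def idf(self, term):
        # Smoothed inverse document frequency (one common variant).
        return math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0

    def top_terms(self, text, max_terms=30):
        """Keep the highest TF-IDF terms of the document to categorize."""
        counts = Counter(tokenize(text))
        weights = {t: tf * self.idf(t) for t, tf in counts.items()
                   if t in self.doc_freq}
        best = sorted(weights, key=lambda t: (-weights[t], t))[:max_terms]
        return {t: weights[t] for t in best}

    def more_like_this(self, text, max_terms=30, k=3):
        """Rank indexed articles by similarity to the query terms."""
        query = self.top_terms(text, max_terms)
        scores = {}
        for title, counts in self.term_counts.items():
            doc_vec = {t: tf * self.idf(t) for t, tf in counts.items()}
            dot = sum(w * doc_vec.get(t, 0.0) for t, w in query.items())
            if dot > 0:
                norm = math.sqrt(sum(w * w for w in doc_vec.values()))
                scores[title] = dot / norm
        return sorted(scores.items(), key=lambda kv: -kv[1])[:k]


# Made-up miniature "Wikipedia" for illustration only.
index = TinyIndex({
    "Pension": "pension retirement benefits workers pension fund",
    "Astronaut": "astronaut space nasa moon mission",
    "Statistics": "statistics data analysis probability",
})
results = index.more_like_this("NASA sues former astronaut over moon mission camera")
print([title for title, _ in results])
```

A real index would rely on the search engine's inverted index rather than scanning every article, which is what makes the query approximate but fast at Wikipedia scale.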


However, Wikipedia has millions of articles: Navigation Hell

Need a hierarchical structure: from generic to specific

Faceted Browsing!


Hierarchical Wikipedia Categorization
● Group the text of all articles categorized under a given Topic
● Use Wikipedia Categories as Hierarchical Taxonomy
● Categorize a new document with MoreLikeThis on the aggregate text of articles
● Available DBpedia dumps provide:
  ● Text summaries for each article
  ● “subject” relationships between articles and topics
  ● “broader” / “narrower” SKOS hierarchy between topics
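
A minimal sketch of the grouping step, assuming the article summaries and the “subject” links have already been parsed from the dumps into Python structures. The function name and the sample data are hypothetical, and plain Python is used here only for readability; it is not the pipeline used for the proof of concept.

```python
from collections import defaultdict


def aggregate_topic_text(abstracts, subjects):
    """Concatenate the summaries of all articles attached to each topic.

    abstracts: dict mapping an article id to its summary text
    subjects:  iterable of (article id, topic id) pairs
    """
    grouped = defaultdict(list)
    for article, topic in subjects:
        text = abstracts.get(article)
        if text:  # skip articles without a summary
            grouped[topic].append(text)
    return {topic: "\n".join(texts) for topic, texts in grouped.items()}


# Made-up sample data for illustration only.
abstracts = {
    "Kidney": "The kidneys filter blood and produce urine.",
    "Nephron": "The nephron is the functional unit of the kidney.",
}
subjects = [
    ("Kidney", "Category:Kidney"),
    ("Nephron", "Category:Kidney"),
    ("Nephron", "Category:Renal_physiology"),
    ("Missing_article", "Category:Kidney"),
]
topic_text = aggregate_topic_text(abstracts, subjects)
```

Each aggregated text can then be indexed as one document per topic, so that a similarity query returns topics instead of individual articles.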


Challenges encountered
● 500k “technical” categories, e.g. “People_with_missing_birth_place”, “Rivers_in_Romania”
● 70k “grounded” categories
● Paths to roots need both “technical” and “grounded” categories
● Loops everywhere!
  ● Death is a subcategory of Life
    – Life is a subcategory of Death
  ● …
● Scale
  ● 1.2M topic / topic links
  ● 30M topic / article links
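
Because of such loops, any walk up the “broader” links has to remember the topics it has already visited. The sketch below is one possible way to do that with a breadth-first traversal; the `ancestors` function and the tiny `broader` mapping are assumptions made for the example, not code from the proof of concept.

```python
from collections import deque


def ancestors(topic, broader):
    """Return every topic reachable through 'broader' links, nearest first.

    broader: dict mapping a topic to the list of its broader topics.
    The `seen` set guarantees termination even when the graph has cycles.
    """
    seen = {topic}
    order = []
    queue = deque([topic])
    while queue:
        current = queue.popleft()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Made-up graph containing the Death / Life loop mentioned above.
broader = {
    "Death": ["Life"],
    "Life": ["Death", "Nature"],
    "Nature": [],
}
print(ancestors("Death", broader))  # ['Life', 'Nature']
```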

Sample results

Pig / Solr / Python Proof of Concept


IKS Workshop Wiki Page
● Category:Free_web_development_software
● Category:Semantic_HTML
● Category:Semantic_Web
● Category:Web_development_software
● Category:Office_software
● Category:World_Wide_Web_Consortium
● Category:Open_source_project_foundations
● Category:Free_network-related_software
● Category:Free_business_software

IKS Workshop Wiki Page (cont'd)
● Category:Knowledge_representation_languages
● Category:PHP_programming_language
● Category:XML-based_standards
● Category:Content_management_systems
● Category:Knowledge_representation
● Category:Presentation
● Category:Cross-platform_software
● Category:HTML
● Category:Data_management

Yesterday's Wikinews Articles (1/3)
Hundreds of thousands of British public sector workers strike over planned pension changes

● Category:Retirement_in_the_United_Kingdom
● Category:United_Kingdom_pensions_and_benefits
● Category:Pensions_in_the_United_Kingdom
● Category:Labor_disputes_by_country
● Category:Labor_disputes


US children who celebrate Independence Day more likely to become Republicans, says Harvard study

● Category:Fireworks
● Category:Voting_theory
● Category:Republican_Party_(United_States)
● Category:Statistics
● Category:Electoral_systems


U.S. space agency NASA sues ex-astronaut

● Category:American_astronauts
● Category:Aviation_halls_of_fame
● Category:Edwards_Air_Force_Base
● Category:Apollo_program
● Category:Exploration_of_the_Moon


Scientific Publications (1/2) (PLOS One)
Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways

● Category:Renal_physiology
● Category:Kidney
● Category:Nephrology
● Category:Hypertension
● Category:Membrane_biology

Scientific Publications (2/2)
International Conference on Machine Learning 2011 accepted papers abstracts

● Category:Machine_learning
● Category:Computational_statistics
● Category:Data_analysis
● Category:Classification_algorithms
● Category:Ensemble_learning


Track & Hack
● https://github.com/ogrisel/pignlproc
● https://issues.apache.org/jira/browse/STANBOL-201
● Help integrate into Stanbol EntityHub / Enhancer during the Hackathon
● IKS User Story S10: Automated document categorization
  ● I create a new document in my CMS by typing in an HTML edit form or by uploading a document with textual content (PDF, office file, XML file, ...). I want the CMS to suggest a list of at most 3 controlled properties, such as subjects/topics or geographical coverage, out of a list of standardised options (IPTC subjects or world countries), based on the text content I gave.


2 - Named Entity Disambiguation


An example
● Query for person with name = “George Bush”
● Results: 2 ambiguous possibilities
● Perform an additional MoreLikeThis with the surrounding paragraph as context:
  ● If more like “41st”, “1988”, “Reagan”, “Panama”...
    then: dbpedia:George_H._W._Bush
  ● If more like “43rd”, “9/11”, “War on Terror”, “pretzel”...
    then: dbpedia:George_W._Bush
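
The example above can be illustrated with a small context-similarity sketch. It assumes each candidate entity comes with some descriptive text and uses a plain bag-of-words cosine similarity instead of a real MoreLikeThis query; the function names, candidate descriptions and context sentence are all made up for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(context, candidates):
    """Pick the candidate whose description is most similar to the context.

    candidates: dict mapping an entity id to a descriptive text
    """
    ctx = bag_of_words(context)
    return max(candidates,
               key=lambda entity: cosine(ctx, bag_of_words(candidates[entity])))


# Hypothetical candidate descriptions built from the cue words above.
candidates = {
    "dbpedia:George_H._W._Bush": "41st president 1988 Reagan Panama",
    "dbpedia:George_W._Bush": "43rd president 9/11 War on Terror pretzel",
}
context = ("George Bush served as vice president under Reagan "
           "before winning the 1988 election")
best = disambiguate(context, candidates)
print(best)  # dbpedia:George_H._W._Bush
```

In practice the candidate descriptions would come from indexed article text, and the similarity would be computed by the search engine rather than in application code.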


Work in Progress
● EntityHub's SolrYard now has a SimilarityConstraint
● OpenNLP NamedEntity Engine already extracts context
● pignlproc is able to extract an occurrence corpus from Wikipedia dumps
● Early prototype during Berlin Buzzwords Hackathon

TODO: build a prepackaged Enhancer Engine & EntityHub index