This is an introduction to an algorithm and methodology for extracting semantics from one or several documents using Natural Language Processing and Machine Learning techniques. The presentation describes the different components of the semantic analyzer, using Wikipedia and DBpedia as data sets.
This presentation describes some key features of Scala used in the creation of machine learning algorithms:
1. Functorial definition of tensors for learning non-linear models (manifolds)
2. Monads to compose explicit kernel functions in Euclidean space
3. Implicit classes to extend the Scala standard library
4. Stackable traits and dependency injection to build formal models and dynamic workflows
5. Tail recursion to implement dynamic programming techniques
6. Streaming to reduce memory consumption for big data
7. Control of back pressure in data flows
http://patricknicolas.blogspot.com
http://bit.ly/12GjRu9
Non-linear classification models commonly rely on kernel functions. Models are highly dependent on the training (labeled) data set, so models, and therefore their underlying kernels, have to adapt to the most recent labeled observations.
This presentation describes a solution to automate the evaluation and selection of a kernel function appropriate to a specific training set in online training.
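A minimal sketch of such automated kernel selection, using kernel-target alignment as the evaluation criterion (the candidate kernels, the toy data, and the alignment criterion itself are illustrative assumptions, not necessarily the presentation's method; shown in Python for consistency across this document):

```python
import math

# Two candidate kernels (illustrative choices, not the talk's actual set)
def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly(x, y, degree=2):
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** degree

def alignment(kernel, xs, ys):
    """Kernel-target alignment: how well the Gram matrix K matches
    the ideal matrix built from the labels (y_i * y_j)."""
    n = len(xs)
    k_dot_y = k_norm = y_norm = 0.0
    for i in range(n):
        for j in range(n):
            k = kernel(xs[i], xs[j])
            t = ys[i] * ys[j]
            k_dot_y += k * t
            k_norm += k * k
            y_norm += t * t
    return k_dot_y / math.sqrt(k_norm * y_norm)

def select_kernel(candidates, xs, ys):
    """Pick the candidate kernel with the highest alignment score."""
    return max(candidates, key=lambda kv: alignment(kv[1], xs, ys))

# Toy online-training snapshot: two well-separated classes
xs = [(0.0, 0.0), (0.1, 0.2), (3.0, 3.1), (2.9, 3.0)]
ys = [-1, -1, 1, 1]
name, _ = select_kernel([("rbf", rbf), ("poly", poly)], xs, ys)
```

In an online setting, this scoring would be rerun as new labeled observations arrive, letting the model switch kernels when the data distribution drifts.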
Let's explore some other fundamental programming concepts.
Chapter 2 focuses on:
character strings
primitive data
the declaration and use of variables
expressions and operator precedence
data conversions
Fuel Up JavaScript with Functional Programming – Shine Xavier
JavaScript has been the lingua franca of web development for over a decade. It has evolved tremendously along with the Web and has become entrenched in modern browsers, complex web applications, mobile development, server-side programming, and emerging platforms like the Internet of Things.
Even though JavaScript has come a long way, a reinforced makeover will help it build concurrent, massive systems that handle Big Data, IoT peripherals and many other complex ecosystems. Functional Programming is the paradigm that could empower JavaScript to enable more effective, robust, and flexible software development.
These days, Functional Programming is at the heart of every new generation of programming technologies, and its inclusion in JavaScript will lead to advanced and futuristic systems.
The need of the hour is to unwrap the underlying concepts and their implementation in the software development process.
The 46th edition of FAYA:80 provides a unique opportunity for JavaScript developers and technology enthusiasts to shed light on the functional programming paradigm and on writing efficient functional code in JavaScript.
Join us for the session to know more.
Topics Covered:
· Functional Programming Core Concepts
· Function Compositions & Pipelines
· Use of JS in Functional Programming
· Techniques for Functional Coding in JS
· Live Demo
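The composition and pipeline ideas listed above can be sketched compactly (shown in Python for consistency with the other examples in this document; the same pattern translates directly to JavaScript arrow functions and `reduce`):

```python
from functools import reduce

def compose(*fns):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), fns)

def pipeline(*fns):
    """Left-to-right pipeline: pipeline(f, g)(x) == g(f(x))."""
    return compose(*reversed(fns))

# Build a small text-processing pipeline from pure functions
slugify = pipeline(str.strip, str.lower, lambda s: s.replace(" ", "-"))
slugify("  Functional JavaScript ")  # "functional-javascript"
```

The pipeline reads in execution order, which is usually the more natural direction for data-transformation chains.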
Introduction to Matlab
Lecture 1:
Introduction: What is Matlab, history of Matlab, strengths, weaknesses
Getting familiar with the interface: Layout, Pull down menus
Creating and manipulating objects: Variables (scalars, vectors, matrices, text strings), Operators (arithmetic, relational, logical) and built-in functions
3-day hands-on workshop on MATLAB/SIMULINK for Engineering Applications:
This workshop aims to make students aware of MATLAB so that they can carry out their own engineering projects using the best available Simulink software and tools.
In this PDF you can learn about Kotlin at the basic as well as the intermediate level, and also how to develop Android apps and publish them on the Google Play Store.
Object Oriented Programming Lab Manual – Abdul Hannan
An object-oriented programming lab manual for practicing and improving object-oriented coding skills.
Published by Mohammad Ali Jinnah University, Islamabad.
Research on the sentiment left on the Web by private individuals who wrote about EXPO 2015 in Italian, English, French, German and Spanish, before the inauguration, during the Universal Exposition and after its closing.
The text is an excerpt from a larger report. This part concerns the application of Social Network Analysis as a marketing tool, for developing a methodology and a database capable of identifying the subjects most important for business within social networks such as Facebook, Twitter or MySpace. Such identification makes it possible to exploit the word-of-mouth (WOM) phenomenon by applying it to the network of acquaintances that an individual possesses within those social networks.
A Comparison of Different Strategies for Automated Semantic Document Annotation – Ansgar Scherp
We introduce a framework for automated semantic document annotation that is composed of four processes, namely concept extraction, concept activation, annotation selection, and evaluation. The framework is used to implement and compare different annotation strategies motivated by the literature. For concept extraction, we apply entity detection with semantic hierarchical knowledge bases, Tri-gram, RAKE, and LDA. For concept activation, we compare a set of statistical, hierarchy-based, and graph-based methods. For selecting annotations, we compare top-k as well as kNN. In total, we define 43 different strategies including novel combinations like using graph-based activation with kNN. We have evaluated the strategies using three different datasets of varying size from three scientific disciplines (economics, politics, and computer science) that contain 100,000 manually labeled documents in total. We obtain the best results on all three datasets by our novel combination of entity detection with graph-based activation (e.g., HITS and Degree) and kNN. For the economic and political science datasets, the best F-measure is .39 and .28, respectively. For the computer science dataset, the maximum F-measure of .33 can be reached. These are by far the largest experiments on scholarly content annotation, where datasets typically contain only up to a few hundred documents.
Gregor Große-Bölting, Chifumi Nishioka, and Ansgar Scherp. 2015. A Comparison of Different Strategies for Automated Semantic Document Annotation. In Proceedings of the 8th International Conference on Knowledge Capture (K-CAP 2015). ACM, New York, NY, USA, Article 8, 8 pages. DOI: http://dx.doi.org/10.1145/2815833.2815838
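The two annotation-selection strategies compared above, top-k and kNN, can be sketched as follows (the scores, neighbor annotations, and majority threshold are illustrative assumptions, not the paper's exact procedure):

```python
from collections import Counter

def top_k(scores, k):
    """Select the k concepts with the highest activation scores."""
    return [c for c, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

def knn_vote(neighbors, k):
    """Select concepts that annotate a majority of the k nearest documents."""
    counts = Counter(c for labels in neighbors[:k] for c in labels)
    return [c for c, n in counts.items() if n > k / 2]

# Hypothetical activation scores and nearest-neighbor annotations
scores = {"inflation": 0.9, "trade": 0.7, "voting": 0.2}
neighbors = [{"inflation"}, {"inflation", "trade"}, {"voting"}]
top_k(scores, 2)        # ['inflation', 'trade']
knn_vote(neighbors, 3)  # ['inflation']
```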
A study of the digital positioning of four Max Mara Group brands (Max Mara, Max&co., Marella and Pennyblack) and two competitors (Liu Jo and Pinko).
The analysis is based on social network conversations collected over the period 1-14 October 2014.
New social scenarios, digital networks and the accumulation of big data. A sys... approach – Valerio Eletti
We live in increasingly complex environments. A look at the current tangle of alarming global phenomena, with a focus on one of them: the growth of digital social networks and the resulting accumulation of big data. How can we face this scenario? New cognitive prostheses and a new complex cognitive paradigm.
Concept-Based Information Retrieval using Explicit Semantic Analysis – Ofer Egozi
My master's thesis seminar at the Technion, summarizing my research work, which was partly published in an AAAI-08 paper and has now been submitted to TOIS. Download and read the notes for more details. Comments/questions are very welcome!
Case Study in Linked Data and Semantic Web: Human Genome – David Portnoy
The National Human Genome Research Institute's "GWAS Catalog" (Genome-Wide Association Studies) project is a successful implementation of Linked Data (http://linkeddata.org/) and Semantic Web (http://www.w3.org/standards/semanticweb/) concepts. This deck discusses how this project has been implemented, challenges faced and possible paths for the future.
Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and Singular Value Decomposition (SVD).
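A minimal, single-machine sketch of Lloyd's K-means iteration, the algorithm that Spark's MLlib implementation parallelizes (the 1-D points and initial centers are made up for illustration):

```python
def kmeans(points, centers, iters=10):
    """Lloyd's algorithm on 1-D points: alternate assignment and update."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])  # ≈ [1.0, 9.0]
```

In Spark, the assignment step is a map over a distributed dataset and the update step a reduce, which is what lets the same algorithm scale to huge datasets.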
Bio:
Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.
Representation Learning of Vectors of Words and Phrases – Felipe Moraes
A talk about representation learning using word vectors such as Word2Vec and Paragraph Vector. It also introduces neural network language models (NNLMs) and presents some applications of NNLMs, such as sentiment analysis and information retrieval.
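Word vectors from models like Word2Vec are typically compared with cosine similarity; a minimal sketch (the 3-dimensional "embeddings" below are invented for illustration; real vectors have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity, the standard way to compare word embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical tiny embeddings: related words point in similar directions
king, queen, banana = [0.9, 0.8, 0.1], [0.85, 0.82, 0.12], [0.1, 0.05, 0.99]
cosine(king, queen) > cosine(king, banana)  # True
```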
This presentation is aimed at fitting a Simple Linear Regression model in a Python program. The IDE used is Spyder. Screenshots from a working example are used for demonstration.
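As a minimal, dependency-free sketch of what such a fit computes (closed-form least squares; the data is made up, and the presentation itself uses Spyder screenshots rather than this code):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed-form sketch)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # exactly y = 2x + 1
```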
Bytewise Approximate Match: Theory, Algorithms and Applications – Liwei Ren 任力偉
Byte-wise approximate matching has become an important field in computer science, with not only practical value but also theoretical significance. This talk uses six cases to define and describe the concept of approximate matching rigorously: identicalness, containment, cross-sharing, similarity, approximate containment and approximate cross-sharing. Based on this concept, one can propose a theoretical framework that consists of many problems of approximate matching, searching & clustering. Algorithmic solutions and challenges of the matching problems are briefed, along with theoretical analysis. This framework also includes elements of our previous work on both the document fingerprinting problem and the mathematical evaluation of similarity digest schemes {TLSH, ssdeep, sdhash}. In the end, we discuss applications in various security disciplines.
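A toy illustration of byte-wise similarity via Jaccard over byte n-grams. This is a simplistic stand-in for the similarity digest schemes mentioned (TLSH, ssdeep, sdhash), which use far more sophisticated, compressed digests:

```python
def ngrams(data: bytes, n: int = 4):
    """Set of byte n-grams, a common feature for byte-wise matching."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def similarity(a: bytes, b: bytes, n: int = 4) -> float:
    """Jaccard similarity over byte n-grams: 1.0 for identical inputs,
    near 0.0 for unrelated content."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

doc  = b"the quick brown fox jumps over the lazy dog"
near = b"the quick brown fox jumped over the lazy dog"
far  = b"completely unrelated byte content here!!"
similarity(doc, near) > similarity(doc, far)  # True
```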
Facilitating Data Curation: a Solution Developed in the Toxicology Domain – Christophe Debruyne
Christophe Debruyne, Jonathan Riggio, Emma Gustafson, Declan O'Sullivan, Mathieu Vinken, Tamara Vanhaecke, Olga De Troyer.
Presented at the 2020 IEEE 14th International Conference on Semantic Computing, San Diego, California, 3-5 February 2020
Toxicology aims to understand the adverse effects of chemical compounds or physical agents on living organisms. For chemicals, much information regarding safety testing of cosmetic ingredients is now scattered in a plethora of safety evaluation reports. Toxicologists in our university intend to collect this information into a single repository. Their current approach uses spreadsheets, does not scale well, and makes data curation and querying cumbersome. Semantic technologies (e.g., RDF, OWL, and Linked Data principles) would be more appropriate for this purpose. However, this technology is not very accessible to toxicologists without extensive training. In this paper, we report on a tool that supports subject matter experts in the construction of an RDF-based knowledge base for the toxicology domain. The tool is using the jigsaw metaphor for guiding the subject matter experts. We demonstrate that the jigsaw metaphor is a viable option for generating RDF. Future work includes investigating appropriate methods and tools for knowledge evolution and data analysis.
Interest in Deep Learning has been growing in the past few years. With advances in software and hardware technologies, Neural Networks are making a resurgence. With interest in AI based applications growing, and companies like IBM, Google, Microsoft, NVidia investing heavily in computing and software applications, it is time to understand Deep Learning better!
In this workshop, we will discuss the basics of Neural Networks and discuss how Deep Learning Neural networks are different from conventional Neural Network architectures. We will review a bit of mathematics that goes into building neural networks and understand the role of GPUs in Deep Learning. We will also get an introduction to Autoencoders, Convolutional Neural Networks, Recurrent Neural Networks and understand the state-of-the-art in hardware and software architectures. Functional Demos will be presented in Keras, a popular Python package with a backend in Theano and Tensorflow.
Similar to Semantic Analysis using Wikipedia Taxonomy
Autonomous medical coding with discriminative transformers – Patrick Nicolas
Application of transformers and deep learning to the extraction of medical codes and insurance claims from electronic health records. This presentation lists modeling challenges and pitfalls, analyzes various configurations of the BERT encoder, and compares techniques for pre-training and fine-tuning in the context of classification.
Open Source Lambda Architecture for deep learning – Patrick Nicolas
This presentation describes the various layers and open-source components that can be used to design and implement a lambda architecture that supports batch processing for model training and streaming for prediction.
Comparison of rule-based/ontology systems and machine learning models for the extraction of insights from electronic health records and related charts. Inference and prediction.
Stock Market Prediction using Hidden Markov Models and Investor Sentiment – Patrick Nicolas
This presentation describes hidden Markov Models to predict financial markets indices using the weekly sentiment survey from the American Association of Individual Investors.
The first section describes the hidden Markov model (HMM), followed by selection of features (investors' sentiment) and labeled data (S&P 500 index).
The second section dives into HMMs for continuous observations and the detection of regime shifts/structural breaks using an auto-regressive Markov chain.
The last section is devoted to alternative models to HMM.
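A minimal sketch of the forward algorithm at the core of HMM inference, with a hypothetical two-regime market model (the states, probabilities, and sentiment observations below are invented for illustration, not the presentation's fitted model):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: probability of an observation sequence under a
    discrete HMM. pi: initial state probabilities, A[i][j]: transition
    probabilities, B[i][o]: emission probabilities."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical 2-state model: state 0 = bullish regime, 1 = bearish regime
pi = [0.6, 0.4]
A = [[0.8, 0.2], [0.3, 0.7]]   # regimes are "sticky"
B = [[0.7, 0.3], [0.2, 0.8]]   # obs 0 = optimistic survey, 1 = pessimistic
forward(pi, A, B, [0, 0, 1])
```

Decoding the most likely regime sequence (Viterbi) follows the same recursion with `max` in place of `sum`.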
Adaptive Intrusion Detection Using Learning Classifiers – Patrick Nicolas
This is an introduction to adaptive intrusion detection systems using rule-based learning classifiers. After listing the limitations of current clustering and supervised learning techniques, the presentation describes a new class of learning algorithms, combining genetic algorithms and reinforcement learning, used for detecting and preventing intrusion in computer networks and data centers, where security policies are constantly upgraded or downgraded to adjust to an ever-changing IT environment, organization and regulations.
This is an introduction to the concept of symbolic regression for effectively managing data streams. Symbolic regression combines genetic algorithms, reinforcement learning and flexible policies to extract meaning or knowledge from data in an ever-changing environment. As the knowledge extracted from real-time data is human readable and consumable, decision makers can validate the findings of the algorithm and act appropriately. Symbolic regression is used in signal processing, process monitoring and adaptive caching in data centers.
There is a lot more to Hadoop than Map-Reduce. An increasing number of engineers and researchers involved in processing and analyzing large amounts of data regard Hadoop as an ever-expanding ecosystem of open-source libraries, including NoSQL, scripting and analytics tools.
This presentation introduces the different modes of deployment of applications on a private cloud. Each solution is evaluated in terms of access control, performance and scalability.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview – Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
The Art of the Pitch: WordPress Relationships and Sales – Laura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Neuro-symbolic is not enough, we need neuro-*semantic* – Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Generating a custom Ruby SDK for your web service or Rails API using Smithy – g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
GraphRAG is All You need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Key Trends Shaping the Future of Infrastructure.pdf – Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Epistemic Interaction - tuning interfaces to provide information for AI support – Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
DevOps and Testing slides at DASA Connect – Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also held a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Semantic Analysis using Wikipedia Taxonomy
1. Creating a taxonomy for Wikipedia
Patrick Nicolas
Feb 11, 2012
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas
2. Introduction
The goal of the study is to build a Taxonomy Graph for the 3+ million Wikipedia entries by leveraging the WordNet hyponyms as a training set.
This model can be used in a wide variety of commercial applications, from context extraction and automated wiki classification to text summarization.
Notes:
• Definitions and notations are defined in the appendices
• The presentation assumes the reader has basic knowledge of information retrieval, Natural Language Processing and Machine Learning.
Copyright Patrick Nicolas 2012 - All rights reserved http://patricknicolas.blogspot.com
3. Process
The computation flow for the generation of a taxonomy for Wikipedia is summarized in the following 5 simple steps:
1. Extract abstracts & categories from the Wikipedia datasets
2. Generate the hypernym lineages for the Wikipedia entries which overlap with WordNet synsets
3. Extract, reduce and order N-Grams and their tags (NNP, NN, ...) from each Wikipedia abstract
4. Create a training set of weighted graphs for each Wikipedia abstract that has a corresponding hypernym hierarchy
5. Optimize and apply the model to generate taxonomy lineages for each Wikipedia entry
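Step 3 of the process above can be sketched as follows (a simplified illustration: the tokenizer is naive and the NNP/NN tagging is omitted; this is not the study's actual implementation):

```python
import re
from collections import Counter

def extract_ngrams(abstract: str, max_n: int = 3):
    """Extract 1..max_n grams from an abstract and order them by frequency
    (the reduction and POS-tag filtering steps are omitted here)."""
    tokens = re.findall(r"[A-Za-z]+", abstract.lower())
    counts = Counter(
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1))
    return counts.most_common()

abstract = "Italy is a European country. Italy is a country in Europe."
extract_ngrams(abstract)[0]  # most frequent n-gram with its count
```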
4. Semantic Data Sources
Terms Frequency Corpora: The Reuters corpus and Google N-Grams frequencies are used to compute the inverse document frequency values.
WordNet Hypernyms: The WordNet database of synsets is used to generate the hierarchy of hypernyms, e.g. entity/physical entity/object/location/region/district/country/European country/Italy
Wikipedia Datasets: The entry (label), long abstract and categories are extracted from the Wikipedia reference database.
5. N-Grams Extraction Model
The relevancy (or weight ω) of an N-Gram to the context of a document depends on syntactic, semantic and probabilistic features.
Fig. 1 Illustration of the features of the N-Gram Extraction Model (diagram labels: frequency fD of the N-Gram in the document; similarity β of the N-Gram with the categories; N-Gram tag; terms 1..n and their frequencies; idf; frequency ρ of the N-Gram in the categories' abstracts; contained in 1st sentence? φ; semantic definition?; frequency of the N-Gram in the universe (corpus); parameter α)
6. Computation Flow
The computation flow is broken down into 'plug & play' processing units to enable design of experiments and audit.
Fig. 2 Typical computation flow for generation of the taxonomy (diagram nodes: Wikipedia datasets → abstract, categories, label; N-Grams corpus → N-Gram frequencies, idf; WordNet synsets → hypernyms; weighted N-Grams and N-Gram tags → normalized N-Gram weights; semantic match → labeled lineage; outputs: taxonomy graph, trained model)
7. N-Grams Frequency Analysis
Let’s define an N-Gram w(n) (i.e. w(3) for a 3-Gram). The frequency of
the N-Gram within the corpus C and its inverse document frequency
(IDF) are computed from the corpus counts.
Let w(n) be an N-Gram with a frequency count(w(n)), composed of terms
wj, j = 1..n, each with a frequency count(wj) within a document D. The
frequency of the N-Gram within the document is computed from these
term counts.
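The slide's exact formulas are not reproduced here, so the sketch below uses the textbook definitions: IDF as the log ratio of corpus size to document frequency, and the N-Gram's in-document frequency derived from the counts of its terms. The combination rule in `ngram_doc_freq` is an assumption, one plausible reading of the slide.

```python
import math

def idf(term, corpus):
    """Standard inverse document frequency over a corpus of documents
    (each a set of terms); the slide's exact variant is not shown."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def ngram_doc_freq(ngram_terms, doc_counts, total):
    """Assumed combination: relative frequency of the N-Gram's terms
    within a document of `total` tokens."""
    return sum(doc_counts.get(t, 0) for t in ngram_terms) / total

# Toy corpus of three "documents" represented as term sets.
corpus = [{"italy", "country", "rome"},
          {"france", "country", "paris"},
          {"rome", "empire"}]
print(round(idf("country", corpus), 3))  # term appears in 2 of 3 documents
```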
8. Weighting N-Grams
Most Wikipedia concepts are well described in the first sentence
of their abstract. We can therefore attribute a greater weight to
N-Grams contained in the first sentence. The frequency f1D of
an N-Gram in the 1st sentence of a document D is defined accordingly.
A simple regression analysis showed that a square-root function
provides a more accurate contribution (weight) of an N-Gram in a
document D.
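A minimal sketch of such a weighting, assuming a sub-linear (square-root) contribution and a simple multiplicative boost for first-sentence N-Grams; the boost constant and the exact formula are assumptions, not the slide's regression result.

```python
import math

def ngram_weight(count_in_doc, doc_length, in_first_sentence, boost=2.0):
    """Hedged sketch: square-root contribution of the N-Gram's relative
    frequency, boosted when the N-Gram occurs in the first sentence."""
    base = math.sqrt(count_in_doc / doc_length)  # sub-linear contribution
    return base * (boost if in_first_sentence else 1.0)

w_first = ngram_weight(4, 100, in_first_sentence=True)
w_other = ngram_weight(4, 100, in_first_sentence=False)
print(w_first > w_other)  # first-sentence N-Grams weigh more
```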
9. Tagging N-Grams
Although Conditional Random Fields are the predominant discriminative
classifiers for predicting sentence boundaries and token tags, we found
that Maximum Entropy with binary features is more appropriate to
classify the first term in a sentence (NNP or NN).
The model's feature functions ft(w) => {0,1} are extracted by
maximizing the entropy H(p) of the probability that a word w has a
specific tag t, subject to a set of constraints.
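The maximum-entropy objective sketched above can be written in its standard textbook form; the slide's exact feature set and constraints are not shown, so this is the generic formulation:

```latex
\max_{p}\; H(p) = -\sum_{w,\,t} \tilde{p}(w)\, p(t \mid w)\, \log p(t \mid w)
\quad \text{subject to} \quad
\sum_{w,\,t} \tilde{p}(w)\, p(t \mid w)\, f_i(w,t)
  = \sum_{w,\,t} \tilde{p}(w,t)\, f_i(w,t),
\qquad \sum_{t} p(t \mid w) = 1
```

where p̃ denotes the empirical distributions over the training corpus and the fi are the binary feature functions.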
10. Wikipedia Tags Distribution
We extract the tags of Wikipedia entries (1- to 4-Grams) in the
context of their abstracts. The distribution of the frequency of
the tags shows that proper nouns (NNP tags) are the predominant
tags.
The frequency distribution is used as the prior probability of
finding a Wikipedia entry with a specific tag.
11. Tag Predictive Model
We use a multinomial Naïve Bayes classifier to predict the tag of
any given Wikipedia entry.
Let’s define a set of classes Ck = { w(n) | tg(w(n)) = k } of
Wikipedia entries with a specific tag (e.g. CNNP, CNN) and p(t | Ck)
the prior probability that a tag t belongs to a class.
The likelihood that a given Wikipedia entry has a tag k follows
the Naïve Bayes formula.
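A minimal sketch of the multinomial Naïve Bayes prediction in log space. The priors and per-term likelihood tables below are illustrative assumptions; in the presentation the priors come from the Wikipedia tag-frequency distribution of the previous slide.

```python
import math

def predict_tag(term_tags, priors, likelihood):
    """Hedged sketch: score each class Ck as
    log p(Ck) + sum_j log p(t_j | Ck), with a small floor for unseen tags."""
    scores = {}
    for k, prior in priors.items():
        scores[k] = math.log(prior) + sum(
            math.log(likelihood[k].get(t, 1e-6)) for t in term_tags)
    return max(scores, key=scores.get)

# Illustrative priors and likelihoods (assumptions, not trained values).
priors = {"NNP": 0.6, "NN": 0.4}
likelihood = {"NNP": {"NNP": 0.8, "NN": 0.2},
              "NN":  {"NNP": 0.3, "NN": 0.7}}
print(predict_tag(["NNP", "NNP"], priors, likelihood))
```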
12. Taxonomy Weighted Graph
Let’s define:
• a taxonomy class (or taxon) as a
graph node representing a
hypernym (e.g. class = ‘person’)
• a taxonomy instance as an entity
name (e.g. instance = ‘Peter’, i.e.
Peter IS-A Person)
• a taxonomy lineage as the list
of ancestors (hypernyms) of
an instance
Fig. Example of taxonomy lineage
13. Document Taxonomy
Any document can be represented as a weighted graph of
taxonomy classes and instances.
Fig. Example of taxonomy graph
14. Propagation Rule for Taxonomy Weights
The flow model is applied to the taxonomy weighted graph to compute
the weight of each taxonomy class from the normalized weights of the
semantic N-Grams. The weights of the taxonomy classes are normalized
with respect to the root ‘entity’ (ω = 1). The taxonomy instances
(N-Grams) are ordered and normalized by their respective weights ω(wk(n)).
Fig. Weight propagation in Taxonomy Graph
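The propagation rule can be sketched as follows. The accumulation rule (summing each instance's weight into every ancestor, then rescaling so the root 'entity' has weight 1) is an assumption consistent with the slide's normalization, not the author's exact flow model.

```python
def propagate(lineages):
    """lineages: list of (hypernym lineage, instance weight) pairs.
    Returns taxonomy class weights normalized so 'entity' == 1."""
    class_w = {}
    for lineage, w in lineages:
        for node in lineage:            # every ancestor receives the weight
            class_w[node] = class_w.get(node, 0.0) + w
    root = class_w["entity"]            # normalize against the root
    return {node: w / root for node, w in class_w.items()}

# Two toy instances with normalized N-Gram weights summing to 1.
weights = propagate([
    (["entity", "object", "location", "Italy"], 0.7),
    (["entity", "object", "artifact"], 0.3),
])
print(weights["entity"], weights["object"], weights["Italy"])
```

Classes shared by several lineages (here 'object') accumulate the weights of all instances below them.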
15. Normalized Taxonomy Weight in Wikipedia
We analyze the distribution of weights along the taxonomy lineage
for all Wikipedia entries.
16. Lineage Weights Estimator
The training using the initial set of WordNet hypernyms shows
that the distribution of normalized weights ωk along the taxonomy
lineage, for a specific similarity class C, can be approximated with
a polynomial function (spline).
This estimator is used in the classification of the taxonomy
lineages of a Wikipedia abstract.
17. Similarity Metrics
In order to train a model using labeled WordNet hypernyms, a
similarity (or distance) metric needs to be defined. Let’s consider two
taxonomy lineages Vj and Vk of respective lengths n(j) and n(k).
Cosine Distance
Shortest Path Distance
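The cosine metric can be sketched by treating each lineage as a bag of hypernym classes; this flat bag-of-classes representation is a simplifying assumption (the slide may weight components by ωk), and the slide's distance would be 1 minus the similarity computed here.

```python
import math

def cosine_similarity(lineage_a, lineage_b):
    """Cosine similarity between two lineages represented as bags of
    taxonomy classes (a simplifying assumption)."""
    vocab = sorted(set(lineage_a) | set(lineage_b))
    va = [lineage_a.count(t) for t in vocab]
    vb = [lineage_b.count(t) for t in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = (math.sqrt(sum(x * x for x in va))
            * math.sqrt(sum(y * y for y in vb)))
    return dot / norm

a = ["entity", "object", "location", "country", "Italy"]
b = ["entity", "object", "location", "country", "France"]
print(round(cosine_similarity(a, b), 3))  # lineages share 4 of 5 classes
```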
18. Taxonomy Generation Model
Let’s consider m classes of taxonomy lineage similarity and a labeled
lineage VH. A class Ci is defined by a range of similarity values to VH.
A taxonomy lineage Vj is classified using Naïve Bayes.
20. Appendix: References
• “Introduction to Information Retrieval”, C. Manning, P. Raghavan,
H. Schütze, Cambridge University Press
• “The Elements of Statistical Learning”, T. Hastie, R. Tibshirani,
J. Friedman, Springer
• “Semantic Taxonomy Induction from Heterogeneous Evidence”,
R. Snow, D. Jurafsky, A. Ng
• “A Study on Linking Wikipedia Categories to WordNet Synsets
using Text Similarity”, A. Toral, O. Fernandez, E. Agirre, R. Muñoz
• “Regularization Predicts While Discovering Taxonomy”, Y. Mroueh,
T. Poggio, L. Rosasco
• “Natural Language Semantics Term Project”, M. Tao
• “A Maximum Entropy Approach to Natural Language Processing”,
A. Berger, V. Della Pietra, S. Della Pietra