This document presents CHESA (Compact Hierarchical Explicit Semantic Representation), a novel approach for representing word semantics in a compact hierarchical structure based on a predefined ontology (Wikipedia categories and articles). CHESA constructs semantic representations using a greedy algorithm that selects concepts based on a conditional overrepresentation criterion to balance coverage and size. Empirical evaluations show CHESA achieves state-of-the-art performance in measuring word semantic relatedness. It outperforms other methods when representation resources are limited, because its hierarchical structure generalizes semantics at varying levels of abstraction.
1. Compact Hierarchical Explicit Semantic Representation (CHESA)
Sonya Liberman and Shaul Markovitch
In Proceedings of the IJCAI 2009 Workshop on
User-Contributed Knowledge and Artificial
Intelligence: An Evolving Synergy (WikiAI09),
Pasadena, CA, 2009
5. Can We Endow Computers with Such Capabilities?
- Generate a semantic representation
- Decide whether the two representations are related
[Figure: semantic representations of Headache, Neurology, and Table; pairs of representations are labeled Related or Unrelated]
6. Semantic Representation
- Language is the main communication medium between people
- People use common world knowledge to communicate
- Humans often organize knowledge within hierarchical ontologies
7. Semantic Representation
Representation of semantics should be based on
- Natural human-defined concepts (world knowledge)
- Inner organization of these concepts as perceived by humans
[Figure: the songs "Across the Universe", "Yesterday", "Hey Jude", "Hit Me Baby One More Time", and "Crazy", grouped under two "Songs" nodes]
8. Our Approach
Semantics is represented as a compact hierarchical structure of pre-defined natural concepts
[Figure: Headache]
11. Hierarchical Representation of Semantics
- Assume a pre-defined global hierarchical ontology
- Assume each node is associated with textual content
13. Wikipedia as a Hierarchical Ontology of Natural Concepts
- Almost 3 million English articles
- A hierarchical inner structure of categories
14. Wikipedia as a Hierarchical Ontology of Natural Concepts
[Figure: hierarchy with levels labeled Wikipedia Categories, Wikipedia Articles, and Article Content]
Category content is the content of its sub-tree
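The rule that a category's content is the content of its sub-tree can be sketched as a short recursive collection. This is only an illustration: the `Node` class, its field names, and the assumption that the ontology is a strict tree are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ontology node: a category or an article."""
    title: str
    text: str = ""  # article text; left empty for a category node
    children: list = field(default_factory=list)


def subtree_content(node):
    """Return the texts of all nodes in the sub-tree rooted at `node`."""
    parts = [node.text] if node.text else []
    for child in node.children:
        # A category has no text of its own here; it inherits its descendants' text.
        parts.extend(subtree_content(child))
    return parts
```

With this layout, a category node with two article children yields exactly the two article texts, and its parent yields those plus the texts of any sibling sub-trees.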
17. The Conditional Overrepresentation Criterion φw
[Figure: a parent concept and a child concept, annotated with the counts N, M, n, and k]
Performing a hypergeometric test:
X ~ HG(N, M, n)
φw = 1 - Pr(X ≥ k)
High φw: low probability that k occurrences were observed by chance
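As a rough illustration of the test above, the sketch below computes φw = 1 - Pr(X ≥ k) for X ~ HG(N, M, n) with the standard library only. The reading of the symbols is an assumption, since the slide does not define them: N is taken as the number of documents under the parent concept, M as the number of those containing the word w, n as the number of documents under the child concept, and k as the number of the child's documents containing w. The function name `phi_w` is hypothetical.

```python
from math import comb


def phi_w(N: int, M: int, n: int, k: int) -> float:
    """Return 1 - Pr(X >= k) for X ~ Hypergeometric(N, M, n).

    Assumed meaning (not given on the slide): N documents under the parent,
    M of them contain the word, n documents under the child, and k of the
    child's documents contain the word.
    """
    if not (0 <= M <= N and 0 <= n <= N):
        raise ValueError("need 0 <= M <= N and 0 <= n <= N")
    total = comb(N, n)  # number of ways to draw n documents out of N
    upper = min(n, M)   # largest feasible number of hits
    # Pr(X >= k): sum the probability mass from k up to the largest feasible count.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(max(k, 0), upper + 1))
    return 1.0 - tail / total
```

Under these assumptions, a child whose 5 documents all contain the word, drawn from a parent where only 5 of 10 documents do, gets `phi_w(10, 5, 5, 5)` = 251/252, which is close to 1 because such a draw is unlikely by chance.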
18. Compact Hierarchical Explicit Semantic Representation (CHESA)
Greedy Top-Down CHESA Algorithm
1. Represent semantics with the root concept only
2. Traverse the conceptual hierarchy top-down
3. In each iteration, add the concept with maximal φw
[Figure: Benchmark]
19. Compact Hierarchical Explicit Semantic Representation (CHESA)
Greedy Top-Down CHESA Algorithm
4. Terminate when reaching size k or a threshold for φw
[Figure: Benchmark, for k = 15]
The greedy bottom-up algorithm (Bottom-Up CHESA) prunes concepts according to φw
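The four steps of the greedy top-down algorithm can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the candidate set is assumed to be the children of concepts already selected, the φw scores are assumed to be precomputed and passed in as a dictionary, and the names `top_down_chesa`, `children`, `phi`, and `min_phi` are hypothetical.

```python
def top_down_chesa(root, children, phi, k, min_phi=0.0):
    """Greedily select at most k concepts, starting from the root.

    children: dict mapping a concept to its child concepts (hypothetical input).
    phi: dict mapping a concept to its precomputed phi_w score (hypothetical input).
    """
    representation = [root]                    # step 1: root concept only
    frontier = list(children.get(root, []))    # candidates: children of selected concepts
    while len(representation) < k and frontier:    # step 4: stop at size k
        best = max(frontier, key=lambda c: phi[c])  # step 3: maximal phi_w
        if phi[best] < min_phi:                     # step 4: stop at the phi_w threshold
            break
        frontier.remove(best)
        representation.append(best)
        frontier.extend(children.get(best, []))     # step 2: descend one level
    return representation
```

Because only children of selected concepts are candidates, a deep concept with a high score is reached only after its ancestors have been added, which keeps the result a connected sub-hierarchy.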
20. Assigning Association Scores to Concepts
The association score for the word w and a concept c is
[Equation: association score of w and c]
[Figure: Benchmark, with association scores 0.56, 2.49, 9.27, 10.03, 8.34, 1.56, 0.77, 1.42, 2.71, 2.35, 4.35, 3.95, 0.10, 1.01]
22. Using CHESA for Semantic Relatedness
Words are related when
- Their representations intersect
- Intersecting concepts have high association scores
[Figure: representations of Neurology (K = 30) and Headache (K = 30); Biology and Neurological disorders appear in both; Cognition and Medical treatment also appear]
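The two conditions above (intersecting representations, high association scores on the intersection) can be illustrated with a small sketch. The slides do not give the actual relatedness formula, so the scoring used here, a sum of score products over shared concepts, is only one plausible stand-in. The function names and the dictionary layout `{concept: association score}` are likewise assumptions.

```python
def shared_concepts(rep_a: dict, rep_b: dict) -> dict:
    """Map each concept present in both representations to its pair of scores."""
    return {c: (rep_a[c], rep_b[c]) for c in rep_a.keys() & rep_b.keys()}


def overlap_score(rep_a: dict, rep_b: dict) -> float:
    """Illustrative relatedness: sum of score products over shared concepts.

    This is an assumed scoring rule, not the formula from the original work.
    """
    return sum(a * b for a, b in shared_concepts(rep_a, rep_b).values())
```

Under this rule, representations that share no concept score 0, and the score grows with both the number of shared concepts and their association scores.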
23. Empirical Evaluation
- Testing on WordSimilarity-353
  - 353 word pairs judged by humans for semantic relatedness
- Measuring correlation with human judgments
  - With varying values of representation size k
  - With an unlimited representation size
- Comparing results to Explicit Semantic Analysis (ESA)
  - E. Gabrilovich and S. Markovitch 2005, 2006, 2007
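Measuring correlation with human judgments can be sketched as below. The slide does not say which correlation coefficient is used. Spearman rank correlation is assumed here because it is a common choice for this kind of benchmark, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)
```

In use, `xs` would hold the relatedness scores produced by a method for each word pair and `ys` the corresponding human ratings, in the same order.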
24. Explicit Semantic Analysis (ESA)
Gabrilovich and Markovitch (2005, 2006, 2007)
The semantics of a word is a vector of its associations with Wikipedia articles
Semantic relatedness is measured by the cosine similarity between the two vectors
[Figure: Benchmark]
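The cosine similarity between two association vectors can be sketched as follows. Storing each vector sparsely as a dictionary `{article title: weight}` is an assumption made for illustration, and the function name is hypothetical.

```python
from math import sqrt


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors stored as {concept: weight}."""
    # Only concepts present in both vectors contribute to the dot product.
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)
```

Two vectors with no concept in common always score 0, whatever their weights, because the dot product has no terms.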
25. ESA-Based Semantic Relatedness
ESA Top 20 Concepts
Cat (Unix)
Cheshire Cat
Cool Cat
Plasan Sand Cat
Claude Cat
Big cat
Stray Cats
Felidae
Cat's Eye (film)
Cat scratch fever
Saber-toothed cat
New Britain Rock Cats
Cats (musical)
Cats & Dogs
Clan Nova Cat
Cat on a Hot Tin Roof
Sacramento River Cats
Wildcat
Jungle Cat
Leopard Cat
No intersecting concepts between the two top-20 lists, so the cosine similarity is zero
ESA Top 20 Concepts
Mouse
Modest Mouse
Stanley Mouse
Mickey's Magical Christmas
Danger Mouse
Disney's House of Mouse
Apple Mighty Mouse
Natal Multimammate Mouse
Harvest Mouse
Wood mouse
Chevrotain
Mouse (computing)
Wild Mouse roller coaster
Josephine the Singer, or the Mouse Folk
Mighty Mouse
Mouse on Mars
The Mickey Mouse Club
Mickey Mouse
Minnie Mouse
Mickey Mouse Clubhouse
26. CHESA-Based Semantic Relatedness
[Figure: two Top-Down CHESA representations (k = 20); Zoology, Entertainment, and Natural sciences appear in both; Cell Biology and Domestication also appear]
29. Empirical Evaluation - Results
Evaluation when resources are unlimited
- Using ESA full interpretation vectors
- Using CHESA full hierarchical representation

Algorithm     Correlation
WordNet       0.35
LSA           0.56
WikiRelate!   0.50
MarkovLink    0.55
ESA           0.74
CHESA         0.72
30. Conclusions
- CHESA: a novel methodology for compact hierarchical representation of semantics
- A flexible algorithm that constructs semantic representations at any given size
- Significantly improves semantic relatedness results when resources are limited
  - Captures semantics when representation size is limited by performing generalizations
  - Uses a conditional overrepresentation criterion to create a compact and comprehensible representation