Word sense disambiguation is an essential, yet very difficult, task in natural language processing. While several other NLP tasks, such as POS tagging, achieve very good results (highly accurate, with close to 100% of words labeled successfully), disambiguation is far from such performance. Nevertheless, we will demonstrate the need for word sense disambiguation when computing lexical chains on a special kind of text (chats) using a WordNet-based approach. In addition, we will try to identify the bottlenecks of such an approach (mostly with respect to accuracy) and propose possible improvements.
1. Word Sense Disambiguation and Lexical Chains Construction Using WordNet
The Second Workshop on Natural Language Processing in Support of Learning: Metrics, Feedback and Connectivity
September 14th, 2010, Bucharest, Romania
Costin-Gabriel Chiru, Andrei Jancă, Traian Rebedea
Politehnica University of Bucharest
2. Summary
The corpus
Lexical Chains
Semantic Distances
WordNet
Word sense disambiguation
Results
Further research
3. The corpus
Chats with 4 or 5 participants, debating subjects related to collaborative learning
High percentage of misspelled words
Keywords – “wiki” and “forum” – not present in WordNet with the proper sense
Utterances not necessarily correct text with respect to English grammar
4. Lexical chains
Lexical chain: a set of words where each word has an acceptable degree of semantic relatedness to every other word in the set
Each word must be fitted into a lexical chain – how?
Criterion: the percentage of words in the chain that are “related” to the word
Over 90% => the word “belongs” to that chain (a sketch of this rule follows below)
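A minimal sketch of this membership rule; the `related` predicate is a hypothetical stand-in for the WordNet-based relatedness test described on the next slides:

```python
def belongs_to_chain(word, chain, related, threshold=0.9):
    """True if `word` is "related" to at least `threshold` (here 90%)
    of the words already in the chain; `related` is assumed to wrap
    the semantic-distance test discussed on the following slides."""
    if not chain:
        return False
    hits = sum(1 for other in chain if related(word, other))
    return hits / len(chain) >= threshold
```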
5. Semantic distance
When are two words considered to be “related”?
Methods of computing a semantic distance between words – a number reflecting the strength of the semantic connection
Most methods use word frequency and the lowest superordinate => the need for an ontology
6. WordNet
Ontology containing information about the semantic relationships between words
Words defining the same concept are grouped into sets – synsets
Directed acyclic graph – synsets are nodes, semantic relationships are edges => consistent with the lowest superordinate concept
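To make this concrete, a minimal sketch of querying WordNet, assuming NLTK’s WordNet interface and the classic ambiguous word “bank” (neither is named in the presentation):

```python
# Minimal sketch, assuming NLTK's WordNet corpus is installed
# (nltk.download('wordnet')); the word "bank" is an illustrative choice.
from nltk.corpus import wordnet as wn

# Every sense of an ambiguous word is a synset
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())

river_bank = wn.synset('bank.n.01')   # sloping land beside water
money_bank = wn.synset('bank.n.02')   # financial institution

# Lowest superordinate (least common subsumer) of the two senses
print(river_bank.lowest_common_hypernyms(money_bank))

# Path length between the two senses in the hypernym graph
print(river_bank.shortest_path_distance(money_bank))
```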
7. Semantic distances in WN
Path length
• If an edge exists between two nodes, the two synsets are related through a semantic relationship
• The length of such a path shows the strength of the relationship between two senses
Jiang-Conrath measure
• Uses the lowest superordinate and word frequency
• Word frequency – the number of hits returned by a search engine
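For reference, the standard Jiang-Conrath distance (the slide does not spell it out) combines the information content (IC) of the two senses with that of their lowest superordinate (lso), where P(c) is the corpus probability of concept c, estimated from word frequencies:

```latex
\mathrm{dist}_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\,IC\bigl(\mathrm{lso}(c_1, c_2)\bigr),
\qquad IC(c) = -\log P(c)
```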
8. Word sense disambiguation
Assigning a sense to an ambiguous word, based on its context
Semantic distance is computed between senses, not between words
The following scenario might occur: two words are found to be semantically close, but not through their correct senses
9. Word sense disambiguation (2)
Context = a window of words
Window size – a trade-off between time and quality of results
Our corpus: the ideas and the subject of the utterances often alternate => a large window size is not necessary
10. Word sense disambiguation (3)
Each word in the context has a list of senses
A set of word-sense pairs: for each word in the window, a sense is assigned
We must choose the best such set => a score must be computed for each set (a sketch of the enumeration follows below)
What evaluation function should compute the score?
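A minimal sketch of this enumeration; `senses_of` and `score` are hypothetical placeholders for the WordNet lookup and the evaluation function discussed next, and the window-size trade-off above is what keeps the combinatorial search feasible:

```python
from itertools import product

def best_assignment(window, senses_of, score):
    """Try every combination of one sense per word in the window
    and keep the word-sense set with the highest score."""
    candidates = (
        list(zip(window, senses))
        for senses in product(*(senses_of(w) for w in window))
    )
    return max(candidates, key=score)
```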
11. Evaluation function for WSD
The degree of a word-sense pair: the number of senses in the set that are related to that sense
We use WN and semantic distances to determine whether two senses are related
High degree – higher probability of a correct word-sense assignment
The average of all degrees in a set – high average = high score
12. Evaluation function for WSD (2)
Problem: many word-sense pairs with a very low degree and few with a very high degree
All degrees should be “packed” around the average => use the standard deviation of all degrees; low standard deviation = high score
Low semantic distances = all senses are closely related => we also need the average and standard deviation of all distances (a combined sketch follows below)
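A minimal sketch of such an evaluation function; only the degree terms are shown (the distance terms would enter analogously), `related` is a hypothetical sense-level predicate, and subtracting the deviation from the average is our assumption, not the authors’ exact formula:

```python
from statistics import mean, stdev

def degree(pair, assignment, related):
    """Number of other senses in the set related to this pair's sense."""
    _, sense = pair
    return sum(1 for other in assignment
               if other is not pair and related(sense, other[1]))

def score(assignment, related):
    """High average degree and low spread of degrees => high score."""
    degrees = [degree(pair, assignment, related) for pair in assignment]
    spread = stdev(degrees) if len(degrees) > 1 else 0.0
    return mean(degrees) - spread
```

Plugged into the earlier enumeration sketch, this could be passed as, e.g., `functools.partial(score, related=...)`.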
13. Results
Window size: 3-4 utterances
High threshold for computing semantic relatedness
The lowest superordinate is part of the shortest path between two senses – we ignore path lengths > 4
A word is included in a chain if it is related to 90% of the words present in that chain
14. Results (2)
Vocabulary size for the corpus – 1696 words
With WSD
Average chain length for the entire corpus = 52.68 words
Number of chains = 649
Longest chain: 95 distinct words
Number of unitary chains (chains with a single distinct word) = 394 => 23.21% of all words are part of such a chain
Without WSD
Average chain length for the entire corpus = 94.48 words
Number of chains = 358, of which 260 are unitary chains
Therefore, some chains are very long and probably inaccurate
Longest chain: 756 distinct words (over 40% of the vocabulary size)
15. Further research
There is a high dependency between the linguistic tool (WN is used now), the corpus and the algorithms for tasks like lexical chaining and WSD
The bottlenecks and key points of this system must be identified
Wikipedia can be used as an ontology (Wikipedia has a category graph), as well as a relevant corpus
Implementing a spell-checker to increase the number of words taken into account
16. Further research (2)
Each word must be fitted into a lexical chain – how?
Ask when a set of chains is stronger, rather than where a word best belongs
• “Strength” of a chain = how closely related its words are
• The output is a set of chains => we need the “strength” of a set of chains
• State-space searching: the output of the current lexical chaining algorithm is the initial state, while the final state is an acceptably strong set of chains (a sketch follows below)
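A minimal sketch of this idea as greedy hill-climbing; `set_strength` and the neighbour generator `moves` (e.g., moving one word to another chain) are hypothetical, and the presentation does not commit to a particular search strategy:

```python
def strengthen(chains, set_strength, moves):
    """Start from the current algorithm's output and greedily accept
    any neighbouring set of chains that is stronger, stopping at a
    local optimum."""
    best, best_score = chains, set_strength(chains)
    improved = True
    while improved:
        improved = False
        for candidate in moves(best):
            candidate_score = set_strength(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                improved = True
                break
    return best
```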