Measuring Similarity Between Contexts and Concepts
An invited talk from 2005 about measuring similarity of raw text and concepts.

Presentation Transcript

  • Measuring Similarity Between Concepts and Contexts Ted Pedersen Department of Computer Science University of Minnesota, Duluth http://www.d.umn.edu/~tpederse
  • The problems…
    • Recognize similar (or related) concepts
      • frog : amphibian
      • Duluth : snow
    • Recognize similar contexts
      • I bought some food at the store :
      • I purchased something to eat at the market
  • Similarity and Relatedness
    • Two concepts are similar if they are connected by is-a relationships.
      • A frog is-a-kind-of amphibian
      • An illness is-a health_condition
    • Two concepts can be related many ways…
      • A human has-a-part liver
      • Duluth receives-a-lot-of snow
    • … similarity is one way to be related
  • The approaches…
    • Measure conceptual similarity using a structured repository of knowledge
      • Lexical database WordNet
    • Measure contextual similarity using knowledge lean methods that are based on co-occurrence information from large corpora
  • Why measure conceptual similarity?
    • A word will take the sense that is most related to the surrounding context
      • I love Java, especially the beaches and the weather.
      • I love Java, especially the support for concurrent programming.
      • I love java, especially first thing in the morning with a bagel.
  • Word Sense Disambiguation
    • … can be performed by finding the sense of a word most related to its neighbors
    • Here, we define similarity and relatedness with respect to WordNet
      • WordNet::Similarity
      • http://wn-similarity.sourceforge.net
    • WordNet::SenseRelate
      • AllWords – assign a sense to every content word
      • TargetWord – assign a sense to a given word
      • http://senserelate.sourceforge.net
  • SenseRelate
    • For each sense of a target word in context
      • For each content word in the context
        • For each sense of that content word
          • Measure similarity/relatedness between sense of target word and sense of content word with WordNet::Similarity
          • Keep running sum for score of each sense of target
    • Pick sense of target word with highest score with words in context
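A minimal Python sketch of this loop, assuming hypothetical senses() and similarity() callables standing in for the WordNet sense inventory and for WordNet::Similarity scores (the real packages are Perl modules):

```python
# Sketch of the SenseRelate loop above: score each sense of the target
# word by summing its relatedness to every sense of every content word
# in the context, then pick the highest-scoring sense.
def sense_relate(target, context_words, senses, similarity):
    best_sense, best_score = None, float("-inf")
    for t_sense in senses(target):
        score = 0.0
        for word in context_words:           # each content word in the window
            for w_sense in senses(word):     # each sense of that word
                score += similarity(t_sense, w_sense)
        if score > best_score:
            best_sense, best_score = t_sense, score
    return best_sense
```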
  • WordNet::Similarity
    • Path based measures
      • Shortest path (path)
      • Wu & Palmer (wup)
      • Leacock & Chodorow (lch)
      • Hirst & St-Onge (hso)
    • Information content measures
      • Resnik (res)
      • Jiang & Conrath (jcn)
      • Lin (lin)
    • Gloss based measures
      • Banerjee and Pedersen (lesk)
      • Patwardhan and Pedersen (vector, vector_pairs)
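For reference, the standard published definitions behind several of these measures (not spelled out on the slide), where len(c1, c2) is the shortest is-a path between the concepts, D the maximum depth of the taxonomy, lcs their least common subsumer, and IC(c) = -log P(c) the information content estimated from a corpus:

$$\mathrm{lch}(c_1,c_2) = -\log\frac{\mathrm{len}(c_1,c_2)}{2D} \qquad \mathrm{wup}(c_1,c_2) = \frac{2\,\mathrm{depth}(\mathrm{lcs})}{\mathrm{depth}(c_1)+\mathrm{depth}(c_2)}$$

$$\mathrm{res}(c_1,c_2) = \mathrm{IC}(\mathrm{lcs}) \qquad \mathrm{lin}(c_1,c_2) = \frac{2\,\mathrm{IC}(\mathrm{lcs})}{\mathrm{IC}(c_1)+\mathrm{IC}(c_2)} \qquad \mathrm{jcn}(c_1,c_2) = \frac{1}{\mathrm{IC}(c_1)+\mathrm{IC}(c_2)-2\,\mathrm{IC}(\mathrm{lcs})}$$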
  • Why don’t path finding and info. content solve the problem?
    • Concepts must be organized in a hierarchy, and connected in that hierarchy
      • Limited to comparing nouns with nouns, or maybe verbs with verbs
      • Limited to similarity measures (is-a)
    • What about mixed parts of speech?
      • Murder (noun) and horrible (adjective)
      • Tobacco (noun) and drinking (verb)
  • Using Dictionary Glosses to Measure Relatedness
    • Lesk (1986) Algorithm – measure the relatedness of two concepts by counting the number of shared words in their definitions
      • Cold - a mild viral infection involving the nose and respiratory passages (but not the lungs)
      • Flu - an acute febrile highly contagious viral disease
    • Adapted Lesk (Banerjee & Pedersen, 2003) – expand glosses to include those concepts directly related
      • Cold - a common cold affecting the nasal passages and resulting in congestion and sneezing and headache; mild viral infection involving the nose and respiratory passages (but not the lungs); a disease affecting the respiratory system
      • Flu - an acute and highly contagious respiratory disease of swine caused by the orthomyxovirus thought to be the same virus that caused the 1918 influenza pandemic; an acute febrile highly contagious viral disease; a disease that can be communicated from one person to another
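A minimal sketch of the original Lesk overlap on the glosses quoted above (real implementations also match multi-word phrases and apply a stop list; Adapted Lesk would first expand each gloss with the glosses of related concepts):

```python
# Lesk-style relatedness: count the word types two definitions share.
def gloss_overlap(gloss1: str, gloss2: str) -> int:
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

cold = ("a mild viral infection involving the nose "
        "and respiratory passages (but not the lungs)")
flu = "an acute febrile highly contagious viral disease"
print(gloss_overlap(cold, flu))  # 1 -- "viral" is the only shared word
```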
  • Context/Gloss Vectors
    • Leskian approaches require exact matches in glosses
      • Glosses are short, use related but not identical words
    • Solution? Expand glosses by replacing each content word with a co-occurrence vector derived from corpora
      • Rows are words in glosses, columns are the co-occurring words in a corpus, cell values are their log-likelihood ratios
    • Average the word vectors to create a single vector that represents the gloss/sense (Patwardhan & Pedersen, 2003)
      • 2nd order co-occurrences
    • Measure relatedness using cosine rather than exact match!
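A sketch of the gloss-vector construction with toy co-occurrence vectors (the dictionary below is a stand-in for rows of a real log-likelihood co-occurrence matrix):

```python
import numpy as np

def gloss_vector(gloss_words, word_vectors):
    """Average the co-occurrence vectors of the gloss's content words."""
    return np.mean([word_vectors[w] for w in gloss_words if w in word_vectors],
                   axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors; columns might be corpus co-occurrences like nose/virus/fever.
word_vectors = {
    "viral":     np.array([1.0, 9.0, 3.0]),
    "infection": np.array([2.0, 7.0, 4.0]),
    "febrile":   np.array([0.0, 2.0, 8.0]),
    "disease":   np.array([1.0, 6.0, 5.0]),
}

cold_vec = gloss_vector(["viral", "infection"], word_vectors)
flu_vec  = gloss_vector(["febrile", "viral", "disease"], word_vectors)
print(cosine(cold_vec, flu_vec))  # high even without exact word matches
```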
  • Gloss/Context Vectors
  • 2nd order co-occurrences
    • Two words that occur together (within some number of positions of each other) are first order co-occurrences
    • Words A and B that co-occur with C separately but not with each other are second order co-occurrences (i.e., a friend of a friend)
      • “military intelligence” and “military armor” are first order co-occurrences
      • “intelligence” and “armor” are 2nd order co-occurrences (via military)
  • WSD Experiment
    • Senseval-2 data consists of 73 nouns, verbs, and adjectives, approximately 8,600 “training” examples and 4,300 “test” examples.
      • Best supervised system 64%
      • SenseRelate 53% (lesk, vector)
      • Most frequent sense 48%
    • The window of context is defined by position: it includes two content words to both the left and right, which are measured against the word being disambiguated.
      • Positional proximity is not always associated with semantic similarity.
  • Human Relatedness Experiment
    • Miller and Charles (1991) created 30 pairs of nouns that were scored on a relatedness scale by over 50 human subjects
    • Vector measure correlates at over 80% with human relatedness judgements
    • Next closest measure is lesk (at 70%)
    • All other measures at less than 65%
  • Why gloss based measures don’t solve the problem..
    • WordNet
      • Nouns – 80,000 concepts
      • Verbs – 13,000 concepts
      • Adjectives – 18,000 concepts
      • Adverbs – 4,000 concepts
    • Words not found in WordNet can’t be disambiguated by SenseRelate
  • Knowledge Lean Methods
    • Can measure similarity between two words by comparing co-occurrence vectors created for each.
    • Can measure similarity of two contexts by representing them as 2nd order co-occurrence vectors and comparing.
  • Word Sense Discrimination
    • Cluster different senses of words like line or interest based on contextual similarity.
      • Pedersen & Bruce, 1997
      • Schutze, 1998
      • Purandare & Pedersen, 2004
    • Hard to evaluate: word senses are somewhat ill defined, and the distinctions made by clustering methods may or may not correspond to human intuitions
    • http://senseclusters.sourceforge.net
  • Name Discrimination
    • Names that occur in similar contexts may refer to the same person.
      • George Miller is an eminent psychologist.
      • George Miller is one of the founders of modern cognitive science.
      • George Miller is a member of the US House of Representatives.
  • Objective
    • Given some number of contexts containing “John Smith”, identify those that are similar to each other
    • Group similar contexts together, and assume each group is associated with a single individual
    • Generate an identifying label from the content of the different clusters
  • Similarity of Context?
      • Context 1: He drives his car fast
      • Context 2: Jim speeds in his auto
      • Car -> motor, garage, gasoline, insurance
      • Auto -> motor, insurance, gasoline, accident
      • Car and Auto occur with many of the same words.
      • They share many first order co-occurrences.
        • They are therefore similar!
      • Less direct relationship, more resistant to sparsity!
  • Representing a Context
    • Represent a context as the average of all the first order vectors of the words in the context
    • Very similar to the vector measure – the only difference is that the context here comes from raw corpora, whereas in the vector measure the context is a dictionary definition
  • Feature Selection
    • Bigrams – two-word sequences that may have one intervening word between them
      • Frequency > 1
      • Log-likelihood ratio > 3.841
      • OR stop list
    • Must occur within Ft positions of target, Ft typically set to 5 or 20
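A sketch of this selection step, simplified to adjacent bigrams (the slide also allows one intervening word). The 3.841 cutoff is the chi-squared critical value at p = 0.05 with one degree of freedom:

```python
import math
from collections import Counter

def log_likelihood(n11, n1p, np1, n):
    """G^2 statistic for a bigram: n11 = joint count, n1p/np1 = marginal
    counts of the two words, n = total number of bigrams."""
    g2 = 0.0
    for obs, exp in (
        (n11,                 n1p * np1 / n),
        (n1p - n11,           n1p * (n - np1) / n),
        (np1 - n11,           (n - n1p) * np1 / n),
        (n - n1p - np1 + n11, (n - n1p) * (n - np1) / n),
    ):
        if obs > 0:
            g2 += obs * math.log(obs / exp)
    return 2.0 * g2

def select_bigrams(tokens):
    bigrams = Counter(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (w1, w2), c in bigrams.items():
        first[w1] += c
        second[w2] += c
    n = sum(bigrams.values())
    # Keep bigrams with frequency > 1 and G^2 above the 3.841 cutoff.
    return [bg for bg, c in bigrams.items()
            if c > 1 and log_likelihood(c, first[bg[0]], second[bg[1]], n) > 3.841]
```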
  • Second Order Context Representation
    • Bigrams used to create a word matrix
      • Cell values = log-likelihood of word pair
    • Each row is the first order co-occurrence vector for a word
    • Represent context by averaging vectors of words in that context
      • Context includes the Cxt positions around the target, where Cxt is typically 5 or 20.
  • 2nd Order Context Vectors
    • He won an Oscar, but Tom Hanks is still a nice guy.
                  needle    family      war       movie      actor    football   baseball
    won             0          0        8.7399    51.7812    30.520   3324.98     18.5533
    Oscar           0          0        0        136.0441    29.576      0         0
    guy             0      18818.55     0          0          0       205.5469   134.5102
    O2 context      0       6272.85     2.9133    62.6084    20.032  1176.84      51.021
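The "O2 context" row above is just the average of the three word rows; copying the numbers from the slide (assuming numpy as in the other sketches):

```python
import numpy as np

# Columns: needle, family, war, movie, actor, football, baseball
won   = np.array([0.0, 0.0,      8.7399, 51.7812,  30.520, 3324.98,  18.5533])
oscar = np.array([0.0, 0.0,      0.0,    136.0441, 29.576, 0.0,      0.0])
guy   = np.array([0.0, 18818.55, 0.0,    0.0,      0.0,    205.5469, 134.5102])

# Averaging the first-order vectors of the context words reproduces the
# "O2 context" row: [0, 6272.85, 2.9133, 62.6084, 20.032, 1176.84, 51.021]
print((won + oscar + guy) / 3)
```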
  • Limits of co-occurrence vectors
    • [Slide shows two flattened co-occurrence tables, one over the features Weapon, Missile, Shoot, Fire, Destroy, Murder, Kill and one over Execute, Command, Bomb, Pipe, Fire, CD, Burn; the row labels did not survive extraction.]
  • Singular Value Decomposition
    • What it does (for sure):
      • Smoothes out zeroes
      • Finds Principal Components
    • What it might do:
      • Capture Polysemy
      • Word Space to Semantic Space
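A minimal sketch of this step with numpy: factor the contexts-by-features matrix and keep the top k principal components (k is a tunable assumption here, not a value from the talk):

```python
import numpy as np

def reduce_with_svd(contexts: np.ndarray, k: int) -> np.ndarray:
    """Project each row (context vector) onto its top-k principal
    components; the dense result smooths over the zero cells."""
    u, s, vt = np.linalg.svd(contexts, full_matrices=False)
    return u[:, :k] * s[:k]
```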
  • After context representation…
    • Second order vector is an average of word vectors that make up context, captures indirect relationships
      • Reduced by SVD to principal components
    • Now, cluster the vectors!
      • We use the method of repeated bisections
      • CLUTO
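The talk clusters with CLUTO's method of repeated bisections; as a rough stand-in, scikit-learn's BisectingKMeans (available since scikit-learn 1.1) likewise splits one cluster in two repeatedly until k clusters remain. Choosing k automatically is listed under ongoing work below.

```python
from sklearn.cluster import BisectingKMeans

def cluster_contexts(vectors, k):
    """vectors: the (contexts x components) matrix from the SVD step."""
    return BisectingKMeans(n_clusters=k, random_state=0).fit_predict(vectors)
```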
  • Experimental Data
    • Created from AFE GigaWord corpus
    • 170,969,000 words
    • May 1994-May 1997
    • December 2001-June 2002
    • Created name-conflated pseudo-words
      • 25 words to left and right of target
  • Name Conflated Data

        Name           Count     Name                 Count     New       Total     Maj.
        Japan          118,712   France               112,357   JapAnce   231,069   51.4%
        Jordan          25,539   Egyptian              21,762   JorGypt    46,431   53.9%
        Shimon Peres     7,846   Slobodan Milosevic     6,176   MonSlo     13,734   56.0%
        Microsoft        3,401   IBM                    2,406   MSIIBM      5,807   58.6%
        Tajik            3,002   Rolf Ekeus             1,071   JikRol      4,073   73.7%
        Ronaldo          1,652   David Beckham            740   RoBeck      2,452   69.3%
  • Discrimination results (accuracy %)

        Name      #         Maj.   Cxt5/Ft5   Cxt5/Ft20   Cxt20/Ft5   Cxt20/Ft20
        JapAnce   231,069   51.4   51.1       51.1        50.3        50.3
        JorGypt    46,431   53.9   56.6       59.1        57.0        53.0
        MonSlo     13,734   56.0   62.8       96.6        54.6        91.4
        MSIIBM      5,807   58.6   47.7       51.3        68.0        60.0
        JikRol      4,073   73.7   94.7       96.2        91.0        90.4
        RoBeck      2,452   69.3   57.3       72.7        85.9        54.7
  • Conclusions
    • Tradeoff between size of context and feature selection space
      • Small context, large feature space: a narrow window around the target word where many possible features are represented
      • Large context, small feature space: a large window around the target word where only a selective set of features is represented
    • SVD didn’t help/hurt
      • Results shown are without SVD
  • Ongoing work
    • Creating Path Finding Measures of Relatedness
    • Stopping Clustering Automatically
    • Cluster labeling
    • … bringing together conceptual similarity and contextual similarity
  • Thanks to…
    • WordNet::Similarity and SenseRelate
    • http://wn-similarity.sourceforge.net
    • http://senserelate.sourceforge.net
      • Siddharth Patwardhan
      • Satanjeev Banerjee
      • Jason Michelizzi
    • SenseClusters
    • http://senseclusters.sourceforge.net
      • Anagha Kulkarni
      • Amruta Purandare