Name Discrimination by Clustering Similar Contexts
Ted Pedersen & Anagha Kulkarni, University of Minnesota, Duluth
Amruta Purandare, now at University of Pittsburgh
Research supported by National Science Foundation Faculty Early Career Development Award (#0092784)
Name Discrimination
Different people have the same name: George (HW) Bush and George (W) Bush
Different places have the same name: Duluth (Minn) and Duluth (GA)
Different things have the same abbreviation: UMD (Duluth) and UMD (College Park)
 
 
 
 
Our goals
Given 1000 contexts with “John Smith”, identify those that are similar to each other
Group similar contexts together, assuming each group is associated with a single individual
Generate an identifying label from the content of each cluster
Measuring Similarity of Words and Contexts with Large Corpora
Second-order co-occurrences:
Jim drives his car fast / Jim speeds in his auto
Car -> motor, garage, gasoline, insurance
Auto -> motor, insurance, gasoline, accident
Car and Auto occur with many of the same words; they are therefore similar!
The relationship is less direct, and thus more resistant to sparsity.
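As a toy illustration of the second-order idea, the sketch below scores two words by the overlap of their co-occurrence sets (the word lists are the hypothetical examples from the slide; Jaccard overlap is just one plausible choice of overlap score):

```python
# Sketch of the second-order idea: car and auto never co-occur directly,
# but their co-occurrence sets overlap heavily, so we call them similar.
cooccurrences = {
    "car":  {"motor", "garage", "gasoline", "insurance"},
    "auto": {"motor", "insurance", "gasoline", "accident"},
}

def second_order_similarity(w1, w2, cooc):
    """Jaccard overlap of the two words' co-occurrence sets."""
    a, b = cooc[w1], cooc[w2]
    return len(a & b) / len(a | b)

print(second_order_similarity("car", "auto", cooccurrences))  # 3 shared / 5 distinct = 0.6
```

car and auto share 3 of their 5 distinct co-occurring words, so they come out similar even though they never occur together.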
Word sense discrimination
Given 1000 contexts that include a particular target word (e.g., shell)
Cluster those contexts so that similar contexts come together
Similar contexts have similar meanings
Label each cluster with something that describes its content, and maybe even provides a definition
Methodology
Feature Selection
Context Representation
Measuring Similarities
Clustering
Evaluation
Feature Selection
Identify features in a large (separate) training corpus, or in the data to be clustered
Rely on lexical features: unigrams, bigrams, co-occurrences
Lexical features
Unigrams: words that occur more than X times
Bigrams: ordered pairs of words, separated by at most 2-3 intervening words, that score above a cutoff on a measure of association
Co-occurrences: same as bigrams, but unordered
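A minimal sketch of extracting these lexical features from a tokenized context, assuming a window that allows at most two intervening words and ignoring the frequency and association-score cutoffs:

```python
from collections import Counter

def extract_features(tokens, window=3):
    """Unigrams, ordered bigrams allowing up to window-1 intervening
    words, and the corresponding unordered co-occurrences."""
    unigrams = Counter(tokens)
    bigrams, coocs = Counter(), Counter()
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + window]:
            bigrams[(w1, w2)] += 1            # ordered pair
            coocs[frozenset((w1, w2))] += 1   # unordered pair
    return unigrams, bigrams, coocs

uni, bi, co = extract_features("the shell store by the sea shore".split())
print(uni["the"], bi[("sea", "shore")])  # 2 1
```

Real systems would keep only unigrams above the frequency threshold and only pairs above the association-measure cutoff.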
Context representation
First order: unigrams, bigrams, and co-occurrences that occur in the training corpus and also occur in the context to be clustered
The context is represented as a vector that shows whether (or how often) these features occur in the context to be clustered
Context Representation
Second order: bigrams or co-occurrences are used to create a matrix whose cells hold counts or a measure of association for each word pair
Each row serves as a co-occurrence vector for a word
A context is represented by averaging the vectors of the words in that context
2nd Order Context Vectors
The largest shell store by the sea shore

            Artillery  Sales     Bombs   Sandy     North-West  Water     Sells
Sea         0          0         8.7399  51.7812   30.520      3324.98   18.5533
Shore       0          0         0       136.0441  29.576      0         0
Store       0          18818.55  0       0         0           205.5469  134.5102
O2 context  0          6272.85   2.9133  62.6084   20.032      1176.84   51.021

The O2 context vector is the average of the vectors for sea, shore, and store.
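The O2 context row can be reproduced by averaging the three word vectors; a sketch using the values from this slide (column order: Artillery, Sales, Bombs, Sandy, North-West, Water, Sells):

```python
# Word vectors (rows of the co-occurrence matrix) from the slide.
vectors = {
    "sea":   [0, 0,        8.7399, 51.7812,  30.520, 3324.98,  18.5533],
    "shore": [0, 0,        0,      136.0441, 29.576, 0,        0],
    "store": [0, 18818.55, 0,      0,        0,      205.5469, 134.5102],
}

def context_vector(words, vecs):
    """Second-order representation: the mean of the word vectors."""
    rows = [vecs[w] for w in words]
    return [round(sum(col) / len(rows), 4) for col in zip(*rows)]

print(context_vector(["sea", "shore", "store"], vectors))
# ≈ [0.0, 6272.85, 2.9133, 62.6084, 20.032, 1176.8423, 51.0212]
```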
2nd Order Context Vectors
Context: sea, shore, store
Measuring Similarities
c1: {file, unix, commands, system, store}
c2: {machine, os, unix, system, computer, dos, store}
Matching = |X ∩ Y| = |{unix, system, store}| = 3
Cosine = |X ∩ Y| / (√|X| × √|Y|) = 3/(√5 × √7) = 3/(2.2361 × 2.6458) = 0.5071
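These two measures are easy to sketch in code, treating each context as the set of words it contains:

```python
import math

def matching(x, y):
    """Matching coefficient: size of the intersection."""
    return len(x & y)

def cosine(x, y):
    """Set cosine: |X intersect Y| / (sqrt(|X|) * sqrt(|Y|))."""
    return len(x & y) / (math.sqrt(len(x)) * math.sqrt(len(y)))

c1 = {"file", "unix", "commands", "system", "store"}
c2 = {"machine", "os", "unix", "system", "computer", "dos", "store"}

print(matching(c1, c2))          # 3
print(round(cosine(c1, c2), 4))  # 0.5071
```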
Limitations of 1st or 2nd Order
[Two co-occurrence matrices, one over the columns Weapon, Missile, Shoot, Fire, Destroy, Murder, Kill and one over Execute, Command, Bomb, Pipe, Fire, CD, Burn; the row labels are not recoverable.]
Latent Semantic Analysis
Singular Value Decomposition
Captures polysemy and synonymy(?)
Conceptual fuzzy feature matching
From word space to semantic space
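A sketch of the idea with NumPy, using an invented co-occurrence matrix (the words and counts below are hypothetical, not from the talk): car and auto share only one direct co-occurrence, but after truncating the SVD they land on essentially the same semantic direction.

```python
import numpy as np

# Toy word-by-word co-occurrence matrix;
# rows/cols are car, auto, motor, garage, accident.
M = np.array([
    [0, 0, 4, 5, 0],   # car
    [0, 0, 5, 0, 4],   # auto
    [4, 5, 0, 0, 0],   # motor
    [5, 0, 0, 0, 0],   # garage
    [0, 4, 0, 0, 0],   # accident
], dtype=float)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# First-order rows: car and auto share only "motor".
print(cos(M[0], M[1]))  # ≈ 0.49

# Truncated SVD: keep the top k singular values, mapping
# word space into a k-dimensional semantic space.
U, S, Vt = np.linalg.svd(M)
k = 2
semantic = U[:, :k] * S[:k]   # each row is a word in semantic space

# In semantic space, car and auto collapse together.
print(cos(semantic[0], semantic[1]))  # ≈ 1.0
```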
After context representation…
Each context is represented by a vector of some sort
A first-order vector shows the direct occurrence of features in the context
A second-order vector is an average of the word vectors that make up the context, capturing indirect relationships
Now, cluster the vectors!
Clustering
UPGMA: hierarchical, agglomerative
Repeated Bisections: hybrid, divisive + partitional
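The experiments used CLUTO's implementations; a minimal pure-Python sketch of UPGMA (average-link agglomerative clustering) over a toy similarity matrix shows the idea:

```python
def upgma(sim, k):
    """Merge the two most similar clusters until k remain.
    Cluster-to-cluster similarity is the average pairwise similarity."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sum(sim[a][b] for a in clusters[i] for b in clusters[j])
                s /= len(clusters[i]) * len(clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters

# Four contexts: 0 and 1 are similar, 2 and 3 are similar.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(upgma(sim, 2))  # [[0, 1], [2, 3]]
```

Repeated Bisections works in the opposite direction: start with everything in one cluster and split until k clusters remain.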
Evaluation (before mapping)

       c1  c2  c4  c3
C1      2   3   0  10
C2      1   7   1   1
C3      6   1   1   2
C4      2   1  15   2
Evaluation (after mapping)

        c3  c2  c1  c4 | total
C1      10   3   2   0 |    15
C2       1   7   1   1 |    10
C3       2   1   6   1 |    10
C4       2   1   2  15 |    20
total   15  12  11  17 |    55

Accuracy = (10 + 7 + 6 + 15)/55 = 38/55 = 0.69
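The mapping step can be sketched by brute force: try every assignment of discovered clusters to senses and keep the one that puts the most contexts on the diagonal (fine for 4 clusters, though a real system would use a proper assignment algorithm). The matrix below is the before-mapping confusion matrix from these slides, columns in the order c1, c2, c4, c3:

```python
from itertools import permutations

# Rows are senses C1..C4, columns are discovered clusters (unmapped).
confusion = [
    [2, 3, 0, 10],   # C1
    [1, 7, 1, 1],    # C2
    [6, 1, 1, 2],    # C3
    [2, 1, 15, 2],   # C4
]

def map_and_score(m):
    """Try every cluster-to-sense assignment, keep the one that
    maximizes the diagonal, and report accuracy."""
    n, total = len(m), sum(map(sum, m))
    best = max(sum(m[r][p[r]] for r in range(n))
               for p in permutations(range(n)))
    return best / total

print(round(map_and_score(confusion), 2))  # 0.69
```

The best assignment puts 10 + 7 + 6 + 15 = 38 of the 55 contexts on the diagonal, matching the slide.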
Majority Sense Classifier
Majority = 17/55 = 0.31 (always choosing the most frequent sense, which covers 17 of the 55 contexts)
Data
Line, Hard, Serve: 4000+ instances per word, 60:40 split, 3-5 senses per word
SENSEVAL-2: 73 words = 28 verbs + 29 nouns + 15 adjectives; approx. 50-100 test and 100-200 training instances per word; 8-12 senses per word
Experimental comparison of 1st and 2nd order representations

Pedersen & Bruce (1st Order Contexts)           Schütze (2nd Order Contexts)
PB1: Co-occurrences, UPGMA, Similarity Space    SC1: Co-occurrence Matrix, SVD, RB, Vector Space
PB2: PB1 except RB, Vector Space                SC2: SC1 except UPGMA, Similarity Space
PB3: PB1 with Bigram Features                   SC3: SC1 with Bigram Matrix
Experimental Conclusions

Nature of Data                                 Recommendation
Large, homogeneous (like Line, Hard, Serve)    1st order, UPGMA
Smaller data (like SENSEVAL-2)                 2nd order, RB
Software
SenseClusters - http://senseclusters.sourceforge.net/
Ngram Statistics Package - http://www.d.umn.edu/~tpederse/nsp.html
Cluto - http://www-users.cs.umn.edu/~karypis/cluto/
SVDPack - http://netlib.org/svdpack/
Making Free Software
Mostly Perl, all copyleft
SenseClusters: identify similar contexts
Ngram Statistics Package: identify interesting sequences of words
WordNet::Similarity: measure similarity among concepts
Google-Hack: find sets of related words
WordNet::SenseRelate: all-words sense disambiguation
SyntaLex and Duluth systems: supervised WSD
http://www.d.umn.edu/~tpederse/code.html

CICLing 2005
