Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth  http://www.d.umn.edu/~tpederse [email_address]
The Problem A context is a short unit of text, often a phrase to a paragraph in length, although it can be longer Input: N contexts Output: K clusters Where the contexts in each cluster are more similar to each other than to the contexts found in other clusters
Language Independent Methods Do not utilize syntactic information No parsers, part of speech taggers, etc. required Do not utilize dictionaries or other manually created lexical resources Based on lexical features selected from corpora  No manually annotated data of any kind, methods are completely unsupervised in the strictest sense Assumption: word segmentation can be done by looking for white spaces between strings
Outline (Tutorial) Background and motivations Identifying lexical features Measures of association & tests of significance Context representations First & second order Dimensionality reduction Singular Value Decomposition Clustering methods Agglomerative & partitional techniques Cluster labeling Evaluation techniques  Gold standard comparisons
Outline (Practical Session) Headed contexts Name Discrimination Word Sense Discrimination Abbreviations Headless contexts Email/Newsgroup Organization Newspaper text Identifying Sets of Related Words
SenseClusters A package designed to cluster contexts Integrates with various other tools Ngram Statistics Package Cluto SVDPACKC http://senseclusters.sourceforge.net
Many thanks… Satanjeev (“Bano”) Banerjee (M.S., 2002) Founding developer of the Ngram Statistics Package (2000-2001) Now PhD student in the Language Technology Institute at Carnegie Mellon University  http://www-2.cs.cmu.edu/~banerjee/ Amruta Purandare (M.S., 2004) Founding developer of SenseClusters (2002-2004) Now PhD student in Intelligent Systems at the University of Pittsburgh  http://www.cs.pitt.edu/~amruta/ Anagha Kulkarni (M.S., 2006, expected) Enhancing SenseClusters since Fall 2004! http://www.d.umn.edu/~kulka020/ National Science Foundation (USA) for supporting Bano, Amruta, Anagha and me (!) via CAREER award #0092784
Practical Session Experiment with SenseClusters http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi Has both a command line and web interface (above) Can be installed on Linux/Unix machine without too much work http://senseclusters.sourceforge.net Has some dependencies that must be installed, so having supervisor access and/or sysadmin experience helps Complete system (SenseClusters plus dependencies) is available on CD
Background and Motivations
Headed and Headless Contexts A headed context includes a target word Our goal is to collect multiple contexts that mention a particular target word, in order to try to identify different senses of that word  A headless context has no target word Our goal is to identify the contexts that are similar to each other
Headed Contexts (input) I can hear the ocean in that  shell.   My operating system  shell  is bash. The  shells  on the shore are lovely. The  shell  command line is flexible. The oyster  shell   is very hard and black.
Headed Contexts (output) Cluster 1:  My operating system  shell  is bash. The  shell  command line is flexible. Cluster 2: The  shells  on the shore are lovely. The oyster  shell  is very hard and black. I can hear the ocean in that  shell.
Headless Contexts (input) The new version of Linux is more stable and better support for cameras. My Chevy Malibu has had some front end troubles. Osborne made one of the first personal computers. The brakes went out, and the car flew into the house.  With the price of gasoline, I think I’ll be taking the bus more often!
Headless Contexts (output) Cluster 1: The new version of Linux is more stable and better support for cameras. Osborne made one of the first personal computers. Cluster 2:  My Chevy Malibu has had some front end troubles. The brakes went out, and the car flew into the house.  With the price of gasoline, I think I’ll be taking the bus more often!
Applications Web search results are headed contexts Term you search for is included in snippet Web search results are often disorganized – two people sharing same name, two organizations sharing same abbreviation, etc. often have their pages “mixed up”  Organizing web search results is an important problem.  If you click on search results or follow links in pages found, you will encounter headless contexts too…
Applications Email (public or private) is made up of headless contexts Short, usually focused… Cluster similar email messages together  Automatic email foldering Take all messages from sent-mail file or inbox and organize into categories
Applications News articles are another example of headless contexts Entire article or first paragraph Short, usually focused Cluster similar articles together
Underlying Premise… You shall know a word by the company it keeps Firth, 1957 ( Studies in Linguistic Analysis ) Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis) Harris, 1968 ( Mathematical Structures of Language ) Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis) Miller and Charles, 1991 ( Language and Cognitive Processes ) Various extensions… Similar contexts will have similar meanings, etc. Names that occur in similar contexts will refer to the same underlying person, etc.
Identifying Lexical Features Measures of Association and  Tests of Significance
What are features? Features represent the (hopefully) salient characteristics of the contexts to be clustered Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features Vectors/contexts that include many of the same features will be similar to each other
Where do features come from?  In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered This is not cheating, since data to be clustered does not have any labeled classes that can be used to assist feature selection It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step Email or news articles
Feature Selection “Test” data – the contexts to be clustered Assume that the feature selection data is the same as the test data, unless otherwise indicated  “Training” data – a separate corpus of held out feature selection data (that will not be clustered) May need to be used if you have a small number of contexts to cluster (e.g., web search results) This sense of “training” is due to Schütze (1998)
Lexical Features Unigram – a single word that occurs more than a given number of times Bigram – an ordered pair of words that occur together more often than expected by chance Consecutive or may have intervening words Co-occurrence – an unordered bigram Target Co-occurrence – a co-occurrence where one of the words is the target word
Bigrams fine wine (window size of 2) baseball bat house  of  representatives (window size of 3) president  of the  republic (window size of 4) apple orchard Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?)
Co-occurrences tropics water boat fish law president train travel Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations
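As a concrete illustration of the window idea, here is a minimal sketch (not the NSP implementation) that collects word pairs occurring within a given window; ordered=True gives bigrams, ordered=False gives co-occurrences:

```python
from collections import Counter

def window_pairs(tokens, window=2, ordered=True):
    """Collect pairs of words that occur within `window` positions of each other.
    window=2 keeps only consecutive pairs; larger windows allow intervening words."""
    pairs = Counter()
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            w2 = tokens[j]
            pair = (w1, w2) if ordered else tuple(sorted((w1, w2)))
            pairs[pair] += 1
    return pairs

tokens = "president of the republic".split()
print(window_pairs(tokens, window=4))  # includes ('president', 'republic')
```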
Bigrams and Co-occurrences Pairs of words tend to be much less ambiguous than unigrams “bank” versus “river bank” and “bank card” “dot” versus “dot com” and “dot product” Trigrams and longer ngrams occur much less frequently (Ngrams are very Zipfian) Unigrams are noisy, but bountiful
“Occur together more often than expected by chance…” Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix Throw out bigrams that include one or two stop words Expected values are calculated from the observed values, based on the model of independence How often would you expect these words to occur together, if they only occurred together by chance? If two words occur together “significantly” more often than expected, then we conclude that they do not occur together merely by chance.
2x2 Contingency Table (observed counts and marginal totals)

                Intelligence   !Intelligence    Totals
  Artificial         100                           400
  !Artificial
  Totals             300                       100,000

2x2 Contingency Table (all observed counts filled in)

                Intelligence   !Intelligence    Totals
  Artificial         100             300           400
  !Artificial        200          99,400        99,600
  Totals             300          99,700       100,000

2x2 Contingency Table (observed and, in parentheses, expected counts)

                Intelligence        !Intelligence         Totals
  Artificial     100.0 (1.2)         300.0 (398.8)           400
  !Artificial    200.0 (298.8)    99,400.0 (99,301.2)     99,600
  Totals             300                99,700           100,000
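For reference, each expected value above comes from the independence model: multiply the corresponding row and column totals and divide by the grand total. For the “Artificial Intelligence” cell:

$$E_{ij} = \frac{R_i \times C_j}{N}, \qquad E_{11} = \frac{400 \times 300}{100{,}000} = 1.2$$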
Measures of Association
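The next slide refers to the scores G^2 (log-likelihood ratio) and X^2 (Pearson's chi-squared); in terms of the observed counts O and expected counts E of the 2x2 table, their standard definitions are:

$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad G^2 = 2 \sum_{i,j} O_{ij} \, \ln \frac{O_{ij}}{E_{ij}}$$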
Interpreting the Scores… G^2 and X^2 are asymptotically approximated by the chi-squared distribution… This means… if you fix the marginal totals of a table, randomly generate internal cell values for the table, calculate the G^2 or X^2 score of each resulting table, and plot the distribution of those scores, you *should* get the chi-squared distribution (with one degree of freedom for a 2x2 table).
Interpreting the Scores… Scores above a certain critical value are grounds for rejecting the null hypothesis  H0: the words in the bigram are independent For a 2x2 table, a score above 3.841 lets us reject the null hypothesis with 95% confidence
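A small self-contained sketch (not NSP itself) that computes both scores for the “Artificial Intelligence” table above and checks them against the 3.841 critical value:

```python
import math

def association_scores(n11, n1p, np1, npp):
    """Pearson's X^2 and log-likelihood G^2 for a 2x2 contingency table,
    given the joint count (n11), the two marginal totals (n1p, np1),
    and the total number of bigrams (npp)."""
    observed = [[n11, n1p - n11],
                [np1 - n11, npp - n1p - np1 + n11]]
    rows = [n1p, npp - n1p]
    cols = [np1, npp - np1]
    x2 = g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / npp
            o = observed[i][j]
            x2 += (o - expected) ** 2 / expected
            if o > 0:
                g2 += 2 * o * math.log(o / expected)
    return x2, g2

x2, g2 = association_scores(n11=100, n1p=400, np1=300, npp=100_000)
print(round(x2, 1), round(g2, 1), x2 > 3.841)  # both scores are far above 3.841
```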
Measures of Association There are numerous measures of association that can be used to identify bigram and co-occurrence features Many of these are supported in the Ngram Statistics Package (NSP) http://www.d.umn.edu/~tpederse/nsp.html
Measures Supported in NSP Log-likelihood Ratio (ll) True Mutual Information (tmi) Pearson’s Chi-squared Test (x2) Pointwise Mutual Information (pmi) Phi coefficient (phi) T-test (tscore) Fisher’s Exact Test (leftFisher, rightFisher) Dice Coefficient (dice) Odds Ratio (odds)
NSP Will explore NSP during practical session Integrated into SenseClusters, may also be used in stand-alone mode Can be installed easily on a Linux/Unix system from CD or download from http://www.d.umn.edu/~tpederse/nsp.html I’m told it can also be installed on Windows (via cygwin or ActivePerl), but I have no personal experience of this…
Summary Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data Language independent Unigrams usually only selected by frequency Remember, no labeled data from which to learn, so somewhat less effective as features than in supervised case Bigrams and co-occurrences can also be selected by frequency, or better yet measures of association Bigrams and co-occurrences need not be consecutive Stop words should be eliminated Frequency thresholds are helpful (e.g., unigram/bigram that occurs once may be too rare to be useful)
Related Work Moore, 2004 (EMNLP) follow-up to Dunning and Pedersen on log-likelihood and exact tests http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf Pedersen, 1996 (SCSUG) explanation of exact tests, and comparison to log-likelihood http://arxiv.org/abs/cmp-lg/9608010 (also see Pedersen, Kayaalp, and Bruce, AAAI-1996) Dunning, 1993 ( Computational Linguistics ) introduces log-likelihood ratio for collocation identification http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf
Context Representations First and Second Order Methods
Once features selected… We will have a set of unigrams, bigrams, co-occurrences or target co-occurrences that we believe are somehow interesting and useful We also have the frequency counts and measures of association scores that were used to select them Convert the contexts to be clustered into a vector representation based on these features
First Order Representation Each context is represented by a vector with M dimensions, each of which indicates whether or not a particular feature occurred in that context Value may be binary, a frequency count, or an association score Context by Feature representation
Contexts C1: There was an island curse of black magic cast by that voodoo child.  C2: Harold, a known voodoo child, was gifted in the arts of black magic. C3: Despite their military might, it was a serious error to attack. C4: Military might is no defense against a voodoo child or an island curse.
Unigram Feature Set  island  1000 black  700 curse  500 magic  400 child  200 (assume these are frequency counts obtained from some corpus…)
First Order Vectors of Unigrams

        island   black   curse   magic   child
  C1       1       1       1       1       1
  C2       0       1       0       1       1
  C3       0       0       0       0       0
  C4       1       0       1       0       1
Bigram Feature Set island curse  189.2 black magic  123.5 voodoo child  120.0 military might  100.3 serious error  89.2 island child  73.2 voodoo might  69.4 military error  54.9 black child  43.2 serious curse  21.2 (assume these are log-likelihood scores based on frequency counts from some corpus)
First Order Vectors of Bigrams

        black magic   island curse   military might   serious error   voodoo child
  C1         1              1               0               0               1
  C2         1              0               0               0               1
  C3         0              0               1               1               0
  C4         0              1               1               0               1
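A toy sketch of building this binary context-by-feature representation for the example contexts and bigram features above (simple substring matching stands in for real tokenization):

```python
contexts = {
    "C1": "There was an island curse of black magic cast by that voodoo child.",
    "C2": "Harold, a known voodoo child, was gifted in the arts of black magic.",
    "C3": "Despite their military might, it was a serious error to attack.",
    "C4": "Military might is no defense against a voodoo child or an island curse.",
}
features = ["black magic", "island curse", "military might", "serious error", "voodoo child"]

def first_order_vector(text, features):
    """1 if the feature occurs in the context, 0 otherwise."""
    text = text.lower()
    return [1 if f in text else 0 for f in features]

for cid, text in contexts.items():
    print(cid, first_order_vector(text, features))  # matches the table above
```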
First Order Vectors Can have binary values or weights associated with frequency, etc. May optionally be smoothed/reduced with Singular Value Decomposition  More on that later… The contexts are ready for clustering… More on that later…
Second Order Representation Build a word by word matrix from the features  Must be bigrams or co-occurrences (Optionally) reduce dimensionality with SVD Each row represents the first order co-occurrences of a word Represent a context by replacing each word that has an entry in the word by word matrix with its associated vector (row) Average the word vectors found for the context  Due to Schütze (1998)
Word by Word Matrix

             magic   curse   might   error   child
  black      123.5     0       0       0      43.2
  island       0     189.2     0       0      73.2
  military     0       0     100.3    54.9     0
  serious      0      21.2     0      89.2     0
  voodoo       0       0      69.4     0     120.0
Word by Word Matrix … can also be used to identify sets of related words In the case of bigrams, rows represent the first word in a bigram and columns represent the second word Matrix is asymmetric In the case of co-occurrences, rows and columns are equivalent Matrix is symmetric The vector (row) for each word represents a set of first order features for that word Each word in a context to be clustered for which a vector exists (in the word by word matrix) is replaced by that vector in that context
There was an  island  curse of  black  magic cast by that  voodoo  child.

             magic   curse   might   error   child
  black      123.5     0       0       0      43.2
  island       0     189.2     0       0      73.2
  voodoo       0       0      69.4     0     120.0
Second Order Representation There was an  [curse, child]  curse of  [magic, child]  magic cast by that  [might, child]  child [curse, child]  +  [magic, child]  +  [might, child]
There was an  island  curse of  black  magic cast by that  voodoo  child.

        magic   curse   might   error   child
  C1     41.2    63.1    24.4     0      78.8
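A minimal sketch of this second order construction: each context word that has a row in the word by word matrix is replaced by that row, and the rows are averaged:

```python
# Rows of the word by word matrix built earlier (columns: magic, curse, might, error, child)
word_vectors = {
    "black":  [123.5,   0.0,  0.0, 0.0,  43.2],
    "island": [  0.0, 189.2,  0.0, 0.0,  73.2],
    "voodoo": [  0.0,   0.0, 69.4, 0.0, 120.0],
}

def second_order_vector(context, word_vectors):
    """Average the word vectors of the context words that have one."""
    rows = [word_vectors[w] for w in context.lower().split() if w in word_vectors]
    return [round(sum(col) / len(rows), 1) for col in zip(*rows)]

c1 = "There was an island curse of black magic cast by that voodoo child"
print(second_order_vector(c1, word_vectors))  # compare with the C1 vector shown above
```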
First versus Second Order First Order represents a context by showing which features occurred in that context This is what feature vectors normally do… Second Order allows for additional information about a word to be incorporated into the representation  Feature values based on information found outside of the immediate context
Second Order Co-Occurrences “black” and “island” show similarity because both words have occurred with “child”  “black” and “island” are second order co-occurrences of each other, since both occur with “child” but not with each other (i.e., “black island” is not observed)
Second Order Co-occurrences Imagine a co-occurrence graph (a word network) First order co-occurrences are directly connected Second order co-occurrences are connected to each other via one other word The kocos.pl program in the Ngram Statistics Package finds kth order co-occurrences
Summary First order representations are intuitive, but… Can suffer from sparsity Contexts represented based on the features that occur in those contexts Second order representations are harder to visualize, but… Allow a word to be represented by the words it co-occurs with (i.e., the company it keeps) Allows a context to be represented by the words that occur with the words in the context  Helps combat sparsity…
Related Work Pedersen and Bruce 1997 (EMNLP) presented first order method of discrimination http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf Schütze 1998 ( Computational Linguistics ) introduced second order method  http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf Purandare and Pedersen 2004 (CoNLL) compared first and second order methods http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf First order better if you have lots of data Second order better with smaller amounts of data
Dimensionality Reduction Singular Value Decomposition
Motivation First order matrices are very sparse Word by word Context by feature NLP data is noisy No stemming is performed, and synonyms are treated as distinct features
Many Methods  Singular Value Decomposition (SVD) SVDPACKC  http://www.netlib.org/svdpack/ Multi-Dimensional Scaling (MDS) Principal Components Analysis (PCA) Independent Components Analysis (ICA) Linear Discriminant Analysis (LDA) etc…
Effect of SVD SVD reduces a matrix to a given number of dimensions This may convert a word level space into a semantic or conceptual space If “dog” and “collie” and “wolf” are dimensions/columns in a word co-occurrence matrix, after SVD they may be a single dimension that represents “canines”
Effect of SVD The dimensions of the matrix after SVD are principal components that represent the meaning of concepts Similar columns are grouped together  SVD is a way of smoothing a very sparse matrix, so that there are very few zero valued cells after SVD
How can SVD be used? SVD on first order contexts will reduce a context by feature representation down to a smaller number of features Latent Semantic Analysis typically performs SVD on a word by context representation, where the contexts are reduced SVD used in creating second order context representations Reduce word by word matrix  SVD could also be used on resultant second order context representations (although not supported)
Word by Word Matrix

           apple   blood   cells   ibm   data   tissue   graphics   plasma
  pc         2       0       0      1     3       0         0         0
  body       0       3       0      0     0       2         0         1
  disk       1       0       0      2     0       0         1         0
  petri      0       2       1      0     0       2         0         1
  lab        0       0       3      0     2       2         0         3
  sales      0       0       0      2     3       0         1         0
  linux      2       0       0      1     3       0         1         0
  debt       0       0       0      2     3       0         2         0
  organ      0       2       0      0     1       0         0         0
  memory     0       0       2      1     2       2         1         0
  box        1       0       3      0     0       0         2         4
Singular Value Decomposition A=UDV’
U

  -.52   .39  -.48   .02   .09   .41  -.09   .40
  -.30   .08   .31   .43  -.26  -.39  -.6    .20
   .00  -.00  -.00  -.02  -.01   .00  -.02  -.00
  -.07  -.3    .14  -.49  -.07   .30   .25   .56
  -.01   .08   .05  -.01   .24  -.08   .11   .46
   .08   .03  -.04   .72   .09  -.31  -.01   .37
  -.07   .01  -.21  -.31  -.34  -.45  -.68   .29
   .00   .05   .83   .17  -.02   .25  -.45   .08
   .03   .20  -.22   .31  -.60   .39   .13   .35
  -.01  -.04  -.44   .08   .44   .59  -.49   .05
  -.02   .63   .02  -.09   .52  -.2    .09   .35
D (singular values)

  0.00  0.00  0.00  0.66  1.26  2.30  2.52  3.25  3.99  6.36  9.19
V

  -.20   .22  -.07  -.10  -.87  -.07  -.06   .17   .19  -.26   .04
   .03   .17  -.32   .02   .13  -.26  -.17   .06  -.04   .86   .50
  -.58   .12   .09  -.18  -.27  -.18  -.12  -.47   .11  -.03   .12
   .31  -.32  -.04   .64  -.45  -.14  -.23   .28   .07  -.23  -.62
  -.59   .05   .02  -.12   .15   .11   .25  -.71  -.31  -.04   .08
   .29  -.05   .05   .20  -.51   .09  -.03   .12   .31  -.01   .02
  -.45  -.32   .50   .27   .49  -.02   .08   .21  -.06   .08  -.09
   .52  -.45  -.01   .63   .03  -.12  -.31   .71  -.13   .39  -.12
   .12   .15   .37   .07   .58  -.41   .15   .17  -.30  -.32  -.27
  -.39   .11   .44   .25   .03  -.02   .26   .23   .39   .57  -.37
   .04   .03  -.12  -.31  -.05  -.05   .04   .28  -.04   .08   .21
Word by Word Matrix After SVD

           apple   blood   cells   ibm   data   tissue   graphics   plasma
  pc        .73     .00     .11    1.3    2.0     .01       .86       .09
  body      .00     1.2     1.3    .00    .33     1.6       .00       1.5
  disk      .76     .00     .01    1.3    2.1     .00       .91       .00
  germ      .00     1.1     1.2    .00    .49     1.5       .00       1.4
  lab       .21     1.7     2.0    .35    1.7     2.5       .18       2.3
  sales     .73     .15     .39    1.3    2.2     .35       .85       .41
  linux     .96     .00     .16    1.7    2.7     .03       1.1       .13
  debt      1.2     .00     .00    2.1    3.2     .00       1.5       .00
  organ     .00     .84     .00    .77    1.2     .17       .00       .00
  memory    .77     .85     .72    .86    1.7     .98       1.0       1.1
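A rank-k SVD of a word by word (or context by feature) matrix, as in the before/after example above, can be sketched with numpy (SenseClusters itself calls SVDPACKC); the matrix here is just a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((11, 8))                  # stand-in for a word by word co-occurrence matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # number of dimensions to keep
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]     # rank-k smoothed reconstruction of A
word_reps = U[:, :k] * s[:k]             # k-dimensional representation of each word (row)
print(A_k.shape, word_reps.shape)        # (11, 8) (11, 2)
```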
Second Order Representation

I got a new  disk  today!
What do you think of  linux?

These two contexts share no words in common, yet they are similar!  disk  and  linux  both occur with “Apple”, “IBM”, “data”, “graphics”, and “memory”  The two contexts are similar because they share many  second order co-occurrences

           apple   blood   cells   ibm   data   tissue   graphics   plasma
  disk      .76     .00     .01    1.3    2.1     .00       .91       .00
  linux     .96     .00     .16    1.7    2.7     .03       1.1       .13
Clustering Methods Agglomerative and  Partitional
Many many methods… Cluto supports a wide range of different clustering methods Agglomerative Average, single, complete link… Partitional K-means Hybrid Repeated bisections SenseClusters integrates with Cluto http://www-users.cs.umn.edu/~karypis/cluto/
General Methodology Represent the contexts to be clustered as first or second order vectors Cluster the vectors directly (Cluto’s vcluster), or convert to a similarity matrix and then cluster (Cluto’s scluster)
Agglomerative Clustering Create a similarity matrix of instances to be discriminated Results in a symmetric “instance by instance” matrix, where each cell contains the similarity score between a pair of instances Typically a first order representation, where similarity is based on the features observed in the pair of instances
Measuring Similarity Integer Values Matching Coefficient Jaccard Coefficient Dice Coefficient Real Values Cosine
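Minimal implementations of the coefficients listed above (a sketch, not the SenseClusters code), applied to two of the first order bigram vectors from earlier:

```python
import math

def matching(x, y):
    return sum(a & b for a, b in zip(x, y))          # shared features (binary vectors)

def jaccard(x, y):
    union = sum(a | b for a, b in zip(x, y))
    return matching(x, y) / union if union else 0.0

def dice(x, y):
    total = sum(x) + sum(y)
    return 2 * matching(x, y) / total if total else 0.0

def cosine(x, y):                                    # also works for real-valued vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

c1, c2 = [1, 1, 0, 0, 1], [1, 0, 0, 0, 1]            # C1 and C2 from the earlier slides
print(matching(c1, c2), jaccard(c1, c2), dice(c1, c2), round(cosine(c1, c2), 2))
```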
Agglomerative Clustering Apply the agglomerative clustering algorithm to the similarity matrix To start, each instance is its own cluster Form a cluster from the most similar pair of instances Repeat until the desired number of clusters is obtained Advantages: high quality clustering  Disadvantages: computationally expensive, must carry out exhaustive pairwise comparisons
Average Link Clustering (worked example) Starting from a pairwise similarity matrix over instances S1, S2, S3 and S4, the most similar pair (S1 and S3) is merged first, then S2 is merged with {S1, S3}, leaving S4 as its own cluster
Partitional Methods Select some number of contexts in feature space to act as centroids Assign each context to nearest centroid, forming cluster After all contexts assigned, recompute centroids Repeat until stable clusters found Centroids don’t shift from iteration to iteration
Partitional Methods Advantages: fast Disadvantages: very dependent on the initial placement of centroids
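A bare-bones sketch of the partitional idea described above (Cluto's own algorithms are far more refined), clustering the first order bigram vectors from earlier into two clusters:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Assign each context to the nearest centroid, recompute centroids,
    and repeat until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # centroids are stable
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels

X = np.array([[1, 1, 0, 0, 1],     # C1
              [1, 0, 0, 0, 1],     # C2
              [0, 0, 1, 1, 0],     # C3
              [0, 1, 1, 0, 1]],    # C4
             dtype=float)
print(kmeans(X, k=2))              # one cluster label per context
```

Depending on the initial centroids this toy example can settle into different clusterings, which is exactly the initialization sensitivity noted on the slide above.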
Cluster Labeling
Results of Clustering Each cluster consists of some number of contexts Each context is a short unit of text Apply measures of association to the contents of each cluster to determine N most significant bigrams Use those bigrams as a label for the cluster
Label Types The N most significant bigrams for each cluster will act as a descriptive label The M most significant bigrams that are unique to each cluster will act as a discriminating label
Evaluation Techniques Comparison to gold standard data
Evaluation If sense-tagged text is available, it can be used for evaluation But don’t use the sense tags for clustering or feature selection! Assume that sense tags represent “true” clusters, and compare these to the discovered clusters Find the mapping of clusters to senses that attains maximum accuracy
Evaluation Pseudo words are especially useful, since it is hard to find data that is discriminated Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate. http://www.d.umn.edu/~tpederse/tools.html
Evaluation Pseudo words are especially useful, since it is hard to find data that is discriminated Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate. http://www.d.umn.edu/~kulka020/kanaghaName.html
Baseline Algorithm Baseline Algorithm – group all instances into one cluster, this will reach “accuracy” equal to majority classifier What if the clustering said everything should be in the same cluster?
Baseline Performance

           S1    S2    S3   Totals
  C1        0     0     0       0
  C2        0     0     0       0
  C3       80    35    55     170
  Totals   80    35    55     170

(0+0+55)/170 = .32 if C3 is labeled S3
(0+0+80)/170 = .47 if C3 is labeled S1
Evaluation Suppose that C1 is labeled S1, C2 as S2, and C3 as S3 Accuracy =  (10 + 0 + 10) / 170 = 12%  Diagonal shows how many members of the cluster actually belong to the sense given on the column  Can the “columns” be rearranged to improve the overall accuracy? Optimally assign clusters to senses

           S1    S2    S3   Totals
  C1       10    30     5      45
  C2       20     0    40      60
  C3       50     5    10      65
  Totals   80    35    55     170
Evaluation The assignment of C1 to S2, C2 to S3, and C3 to S1 results in 120/170 = 71% Find the ordering of the columns in the matrix that maximizes the sum of the diagonal.  This is an instance of the Assignment Problem from Operations Research, or finding the Maximal Matching of a Bipartite Graph from Graph Theory.

           S2    S3    S1   Totals
  C1       30     5    10      45
  C2        0    40    20      60
  C3        5    10    50      65
  Totals   35    55    80     170
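The optimal mapping can be computed directly; a small sketch using scipy's Hungarian-algorithm solver (not part of SenseClusters) on the confusion matrix above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = clusters C1..C3, columns = senses S1..S3 (the matrix from the slide)
confusion = np.array([[10, 30,  5],
                      [20,  0, 40],
                      [50,  5, 10]])

rows, cols = linear_sum_assignment(confusion, maximize=True)
accuracy = confusion[rows, cols].sum() / confusion.sum()
print(list(zip(rows, cols)), accuracy)   # C1->S2, C2->S3, C3->S1; 120/170 ≈ 0.71
```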
Analysis Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning Evaluation based on assuming that sense tags represent the “true” clusters is likely a bit harsh. Alternatives? Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share Use the contents of the cluster to generate a descriptive label that could be inspected by a human
Practical Session Experiments with SenseClusters
Experimental Data Available on Web Site http://senseclusters.sourceforge.net Available on CD Data/SenseClusters-Data SenseClusters requires data to be in the Senseval-2 lexical sample format Plenty of such data available on CD and from web site
Creating Experimental Data NameConflate program Creates name conflated data from English GigaWord corpus Text2Headless program  Convert plain text into headless contexts http://www.d.umn.edu/~tpederse/tools.html
Name Conflation Data Smaller Data Set (also on Web as SC-Web…) Country - Noun Name - Name Noun - Noun Larger Data Sets (also on Web as Split-Smaller…) Adidas - Puma Emile Lahoud – Askar Akayev CICLING data (CD only) David Beckham – Ronaldo  Microsoft – IBM ACL 2005 demo data (CD only) Name - Name
Clustering Contexts ACL 2005 Demo (also on Web as Email…) Various partitions of 20 news groups data sets Spanish Data (web only) News articles each of which mention abbreviations PP or PSOE
Name Discrimination
George Millers!
Headed Clustering Name Discrimination Tom Hanks Russell Crowe
Headless Contexts Email / 20 newsgroups data Spanish Text
If after all these matrices you crave knowledge based resources… read on…
WordNet-Similarity Not language independent Based on English WordNet But it can be combined with distributional methods to good effect McCarthy et al., ACL-2004 Perl module http://search.cpan.org/dist/WordNet-Similarity Web interface http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
Many thanks! Satanjeev “Bano” Banerjee (M.S., 2002) Inventor of Adapted Lesk Algorithm (IJCAI-2003), which is the earliest origin and motivation for WordNet-Similarity… Now PhD student at LTI/CMU… Siddharth Patwardhan (M.S., 2003) Founding developer of WordNet-Similarity (2001-2003) Now PhD student at University of Utah http://www.cs.utah.edu/~sidd/ Jason Michelizzi (M.S., 2005) Enhanced WordNet-Similarity in many ways and applied it to all words sense disambiguation (2003-2005) http://www.d.umn.edu/~mich0212 NSF for supporting Bano, and University of Minnesota for supporting Bano, Sid and Jason via various internal sources
Vector measure Build a word by word matrix from WordNet Gloss Corpus 1.4 million words Treat glosses as contexts, and use second order representation where words are replaced with vectors from matrix Average together all vectors to represent concept/definition High correlation with human relatedness judgements
Many other measures Path Based Path Leacock & Chodorow Wu and Palmer Information Content Based Resnik Lin Jiang & Conrath Relatedness Hirst & St-Onge Adapted Lesk Vector
Thank you! Questions are welcome at any time. Feel free to contact me in person or via email ( [email_address] ) at any time! All of our software is free and open source, you are welcome to download, modify, redistribute, etc.  http://www.d.umn.edu/~tpederse/code.html
