Icon 2007 Pedersen
Presentation Transcript

  • The Semantic Quilt: Contexts, Co-occurrences, Kernels, and Ontologies. Ted Pedersen, University of Minnesota, Duluth. http://www.umn.edu/home/tpederse
  • Create by stitching together
  • Sew together different materials
  • Semantics in NLP
    • Believed to be useful for many applications
      • Machine Translation
      • Document or Story Understanding
      • Text Generation
      • Web Search
    • Can come from many sources
    • Not well integrated
    • Not well defined?
  • What do we mean by semantics? It depends on our resources…
    • Ontologies – semantics of a word are provided by relationship to other concepts
      • similar words are related to similar concepts
    • Dictionary – semantics of a word are provided by a definition
      • similar words have similar definitions
    • Contexts – a word is defined by the surrounding context
      • similar words occur in similar contexts
    • Co-occurrences – a word is defined by the company it keeps
      • words that occur with the same words are similar
  • What level of granularity?
    • words
    • terms / collocations
    • phrases
    • sentences
    • paragraphs
    • documents
    • books
  • The Terrible Tension: ambiguity versus granularity
    • Words are potentially very ambiguous
      • but we can list them (sort of)
      • we can define their meanings (sort of)
      • not ambiguous to a human reader, but hard for a computer to know which meaning is intended
    • Terms / collocations are less ambiguous
      • harder to enumerate in general because there are so many, but can be done in a domain (e.g., medicine)
    • Phrases can still be ambiguous, but usually only when that is the intent of the speaker / writer
  • The Current State of Affairs
    • Most of our resources and methods focus on word or term semantics
      • makes it possible to build resources (manually or automatically) that have reasonable coverage
      • but, techniques are very resource dependent
      • but, resources introduce language dependencies
      • but, introduces a lot of ambiguity
      • but, not clear how to bring together resources
    • Similarity is an important organizing principle
      • but, there are lots of ways to be similar
  • Things we can do now…
    • Identify associated words
      • fine wine
      • baseball bat
    • Identify similar contexts
      • I bought some food at the store
      • I purchased something to eat at the market
    • Identify similar (or related) concepts
      • frog : amphibian
      • Duluth : snow
    • Assign meanings to words
      • I went to the bank/[financial-inst.] to deposit my check
  • Things we want to do…
    • Integrate different resources and methods
    • Solve bigger problems
      • much of what we do now is a means to an unclear end
    • Language Independence
    • Broad coverage
    • Less reliance on manually built resources
      • ontologies, dictionaries, training data…
  • Semantic Patches to Sew Together
    • Co-Occurrences
      • Ngram Statistics Package : measures association between words, can identify collocations or terms
    • Contexts
      • SenseClusters : measures similarity between written texts (i.e., contexts)
    • Ontologies
      • WordNet-Similarity : measures similarity between concepts found in WordNet
    • Disclosure: All of these are projects at the University of Minnesota, Duluth
  • Co-occurrences Ngram Statistics Package http://ngram.sourceforge.net
  • Things we can do now…
    • Identify associated words
      • fine wine
      • baseball bat
    • Identify similar contexts
      • I bought some food at the store
      • I purchased something to eat at the market
    • Identify similar (or related) concepts
      • frog : amphibian
      • Duluth : snow
    • Assign meanings to words
      • I went to the bank/[financial-inst.] to deposit my check
  • Co-occurrences and semantics?
    • single words are very ambiguous
      • bat
      • line
    • pairs of words disambiguate each other
      • baseball bat
      • vampire … Transylvania
      • product line
      • speech … line
  • Why pairs of words?
    • Zipf's Law
      • most words are very rare
      • most bigrams are even more rare
      • longer ngrams are rarer still
    • Pairs of words are much less ambiguous than individual words, and yet can be found with reasonable frequency even in relatively small corpora
  • Bigrams
    • Window Size of 2
      • baseball bat, fine wine, apple orchard, bill clinton
    • Window Size of 3
      • house of representatives, bottle of wine,
    • Window Size of 4
      • president of the republic, whispering in the wind
    • Selected using a small window size (2-4 words)
    • Objective is to capture a regular or localized pattern between two words (collocation?)
  • “Occur together more often than expected by chance…”
    • Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix
    • Expected values are calculated, based on the model of independence and observed values
      • How often would you expect these words to occur together, if they only occurred together by chance?
      • If two words occur “significantly” more often than the expected value, we conclude that they do not occur together merely by chance.
  • 2x2 Contingency Table (observed counts for the bigram “Artificial Intelligence”)

                           Intelligence   not Intelligence      Total
        Artificial                  100                            400
        not Artificial
        Total                       300                        100,000

  • 2x2 Contingency Table (remaining cells filled in from the marginal totals)

                           Intelligence   not Intelligence      Total
        Artificial                  100                300         400
        not Artificial              200             99,400      99,600
        Total                       300             99,700     100,000

  • 2x2 Contingency Table (observed counts, with expected values under independence in parentheses)

                               Intelligence        not Intelligence           Total
        Artificial              100.0 (1.2)           300.0 (398.8)             400
        not Artificial          200.0 (298.8)      99,400.0 (99,301.2)       99,600
        Total                     300                  99,700               100,000
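The expected values in the last table follow directly from the marginal totals. A minimal Python sketch (illustration only, not NSP's own code) that reproduces them:

```python
# Sketch (illustration only, not NSP itself): derive the expected counts
# for the "Artificial Intelligence" table above from its marginal totals.
observed = {
    ("Artificial", "Intelligence"): 100,
    ("Artificial", "not Intelligence"): 300,
    ("not Artificial", "Intelligence"): 200,
    ("not Artificial", "not Intelligence"): 99_400,
}
total = sum(observed.values())  # 100,000

row_totals, col_totals = {}, {}
for (row, col), n in observed.items():
    row_totals[row] = row_totals.get(row, 0) + n
    col_totals[col] = col_totals.get(col, 0) + n

# Under independence: expected = (row total * column total) / grand total
for (row, col), n in observed.items():
    expected = row_totals[row] * col_totals[col] / total
    print(f"{row:15s} {col:17s} observed={n:9.1f} expected={expected:10.1f}")
# The expected cells come out to 1.2, 398.8, 298.8, and 99301.2.
```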
  • Measures of Association
  • Interpreting the Scores…
    • Values above a certain level of significance can be considered grounds for rejecting the null hypothesis
      • H0: the words in the bigram are independent
      • for a test such as Pearson’s chi-squared or the log-likelihood ratio with one degree of freedom, a score above 3.84 corresponds to rejecting the null hypothesis with 95% confidence
  • Measures of Association all supported in NSP http://ngram.sourceforge.net
    • Log-likelihood Ratio
    • Mutual Information
    • Pointwise Mutual Information
    • Pearson’s Chi-squared Test
    • Phi coefficient
    • Fisher’s Exact Test
    • T-test
    • Dice Coefficient
    • Odds Ratio
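As an illustration of how a few of these measures are computed from the same 2x2 table, here is a hedged Python sketch. It is not NSP's implementation, just the standard formulas applied to the counts from the earlier slides:

```python
import math

# Sketch of a few association measures over the same 2x2 table
# (n11, n12, n21, n22 = observed counts). Not NSP's code.
n11, n12, n21, n22 = 100, 300, 200, 99_400
n1p, n2p = n11 + n12, n21 + n22          # row totals
np1, np2 = n11 + n21, n12 + n22          # column totals
npp = n1p + n2p                          # grand total

def expected(row_total, col_total):
    return row_total * col_total / npp

cells = [(n11, expected(n1p, np1)), (n12, expected(n1p, np2)),
         (n21, expected(n2p, np1)), (n22, expected(n2p, np2))]

pmi = math.log2(n11 / expected(n1p, np1))                 # pointwise MI
ll = 2 * sum(n * math.log(n / m) for n, m in cells if n)  # log-likelihood G^2
chi2 = sum((n - m) ** 2 / m for n, m in cells)            # Pearson's chi-squared
dice = 2 * n11 / (n1p + np1)                              # Dice coefficient
odds = (n11 * n22) / (n12 * n21)                          # odds ratio

print(f"PMI={pmi:.2f}  G^2={ll:.1f}  chi^2={chi2:.1f}  dice={dice:.4f}  odds={odds:.1f}")
# With one degree of freedom, a G^2 or chi-squared score above 3.84
# rejects the independence hypothesis at the 95% confidence level.
```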
  • What do we get at the end?
    • A list of bigrams or co-occurrences that are significant or interesting (meaningful?)
      • automatic
      • language independent
    • These can be viewed as fundamental building blocks for systems that do semantic processing
      • relatively unambiguous
      • often have high information content
      • can serve as a fingerprint for a document or book
    • What can we do with these, though?
  • Contexts SenseClusters http://senseclusters.sourceforge.net
  • Things we can do now…
    • Identify associated words
      • fine wine
      • baseball bat
    • Identify similar contexts
      • I bought some food at the store
      • I purchased something to eat at the market
    • Identify similar (or related) concepts
      • frog : amphibian
      • Duluth : snow
    • Assign meanings to words
      • I went to the bank/[financial-inst.] to deposit my check
  • Similar Contexts may have the same meaning…
      • Context 1: He drives his car fast
      • Context 2: Jim speeds in his auto
      • Car -> motor, garage, gasoline, insurance
      • Auto -> motor, insurance, gasoline, accident
      • Car and Auto share many co-occurrences…
  • Clustering Similar Contexts
    • A context is a short unit of text
      • often a phrase to a paragraph in length, although it can be longer
    • Input: N contexts
    • Output: k clusters
      • Where the contexts in each cluster are more similar to each other than to the contexts found in other clusters
  • Contexts (input)
    • I can hear the ocean in that shell .
    • My operating system shell is bash.
    • The shells on the shore are lovely.
    • The shell command line is flexible.
    • An oyster shell is very hard and black.
  • Contexts (output)
    • Cluster 1:
      • My operating system shell is bash.
      • The shell command line is flexible.
    • Cluster 2:
      • The shells on the shore are lovely.
      • An oyster shell is very hard and black.
      • I can hear the ocean in that shell.
  • General Methodology
    • Represent contexts using second order co-occurrences
    • Reduce dimensionality of vectors
      • Singular value decomposition
    • Cluster the context vectors
      • Find the number of clusters
      • Label the clusters
    • Evaluate and/or use the contexts!
  • Second Order Features
    • Second order features encode something ‘extra’ about a feature that occurs in a context, something not available in the context itself
      • Native SenseClusters : each feature is represented by a vector of the words with which it occurs
      • Latent Semantic Analysis : each feature is represented by a vector of the contexts in which it occurs
  • Second Order Context Representation
    • Bigrams used to create a word matrix
      • Cell values = log-likelihood of word pair
    • Each row is the first order co-occurrence vector for a word
    • Represent context by averaging vectors of words in that context
      • Context includes the Cxt positions around the target, where Cxt is typically 5 or 20.
  • 2nd Order Context Vectors
    • He won an Oscar, but Tom Hanks is still a nice guy.

                      baseball   football    actor      movie       war      family   needle
        won            18.5533    3324.98   30.520     51.7812    8.7399         0        0
        Oscar                0          0   29.576    136.0441         0         0        0
        guy           134.5102   205.5469        0           0         0  18818.55        0
        O2 context      51.021    1176.84   20.032     62.6084    2.9133   6272.85        0
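The “O2 context” row in the table above is simply the average of the vectors for won, Oscar, and guy. A small sketch (NumPy assumed, numbers taken from the table) that reproduces it:

```python
import numpy as np

# Reproduce the "O2 context" row above: the second-order context vector
# is the average of the first-order vectors of the context words that
# appear in the word matrix (here: won, Oscar, guy).
columns = ["baseball", "football", "actor", "movie", "war", "family", "needle"]
word_vectors = {
    "won":   np.array([18.5533, 3324.98,   30.520,  51.7812, 8.7399,     0.0,  0.0]),
    "Oscar": np.array([ 0.0,       0.0,    29.576, 136.0441, 0.0,        0.0,  0.0]),
    "guy":   np.array([134.5102, 205.5469,  0.0,     0.0,    0.0,    18818.55, 0.0]),
}

context_vector = np.mean([word_vectors[w] for w in ["won", "Oscar", "guy"]], axis=0)
print(dict(zip(columns, context_vector.round(2))))
# -> baseball 51.02, football 1176.84, actor 20.03, movie 62.61, ...
```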
  • Second Order Co-Occurrences
    • I got a new disk today!
    • What do you think of linux?
    • These two contexts share no words in common, yet they are similar: disk and linux both occur with “Apple”, “IBM”, “data”, “graphics”, and “memory”
    • The two contexts are similar because they share many second order co-occurrences

               apple   blood   cells    ibm   data   tissue   graphics   plasma   organ   memory
      disk       .76     .00     .01    1.3    2.1      .00        .91      .00     .00      .72
      linux      .96     .00     .16    1.7    2.7      .03       1.1       .13     .00     1.0
  • After context representation…
    • Second order vector is an average of word vectors that make up context, captures indirect relationships
      • Reduced by SVD to principal components
    • Now, cluster the vectors!
      • We use the method of repeated bisections
      • CLUTO
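SenseClusters itself relies on SVD and CLUTO's repeated bisections. As a rough stand-in for that step, the sketch below reduces a matrix of context vectors with truncated SVD and clusters it with k-means (scikit-learn assumed, placeholder data):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Rough stand-in for this step (SenseClusters uses SVD plus CLUTO's
# repeated bisections): reduce an (N contexts x M features) matrix of
# second-order context vectors, then cluster. Random placeholder data.
rng = np.random.default_rng(0)
context_vectors = rng.random((50, 200))

reduced = TruncatedSVD(n_components=10, random_state=0).fit_transform(context_vectors)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)   # cluster id assigned to each of the 50 contexts
```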
  • What do we get at the end?
    • contexts organized into some number of clusters based on the similarity of their co-occurrences
    • contexts which share words that tend to co-occur with the same other words are clustered together
      • 2nd order co-occurrences
  • Finding Similar Contexts
    • Find phrases that say the same thing using different words
      • I went to the store
      • Ted drove to Wal-Mart
    • Find words that have the same meaning in different contexts
      • The line is moving pretty fast
      • I stood in line for 12 hours
    • Find different words that have the same meaning in different contexts
      • The line is moving pretty fast
      • I stood in the queue for 12 hours
  • Semantic Similarity WordNet-Similarity http://wn-similarity.sourceforge.net
  • Things we can do now…
    • Identify associated words
      • fine wine
      • baseball bat
    • Identify similar contexts
      • I bought some food at the store
      • I purchased something to eat at the market
    • Identify similar (or related) concepts
      • frog : amphibian
      • Duluth : snow
    • Assign meanings to words
      • I went to the bank/[financial-inst.] to deposit my check
  • Similarity and Relatedness
    • Two concepts are similar if they are connected by is-a relationships.
      • A frog is-a-kind-of amphibian
      • An illness is-a health_condition
    • Two concepts can be related many ways…
      • A human has-a-part liver
      • Duluth receives-a-lot-of snow
    • … similarity is one way to be related
  • Similarity as Organizing Principle
    • Measure word association using knowledge lean methods that are based on co-occurrence information from large corpora
    • Measure contextual similarity using knowledge lean methods that are based on co-occurrence information from large corpora
    • Measure conceptual similarity using a structured repository of knowledge
      • Lexical database WordNet
  • Why measure conceptual similarity?
    • A word will take the sense that is most related to the surrounding context
      • I love Java, especially the beaches and the weather.
      • I love Java, especially the support for concurrent programming.
      • I love java, especially first thing in the morning with a bagel.
  • Word Sense Disambiguation
    • … can be performed by finding the sense of a word most related to its neighbors
    • Here, we define similarity and relatedness with respect to WordNet
      • WordNet::Similarity
      • http://wn-similarity.sourceforge.net
    • WordNet::SenseRelate
      • AllWords – assign a sense to every content word
      • TargetWord – assign a sense to a given word
      • http://senserelate.sourceforge.net
  • WordNet-Similarity http://wn-similarity.sourceforge.net
    • Path based measures
      • Shortest path (path)
      • Wu & Palmer (wup)
      • Leacock & Chodorow (lch)
      • Hirst & St-Onge (hso)
    • Information content measures
      • Resnik (res)
      • Jiang & Conrath (jcn)
      • Lin (lin)
    • Gloss based measures
      • Banerjee and Pedersen (lesk)
      • Patwardhan and Pedersen (vector, vector_pairs)
  • Path Finding
    • Find the shortest is-a path between two concepts
      • Rada et al. (1989)
      • Scaled by depth of hierarchy
        • Leacock & Chodorow (1998)
      • Depth of subsuming concept scaled by sum of the depths of individual concepts
        • Wu and Palmer (1994)
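For reference, these path-based measures are usually written as follows (formulations added here, not taken from the slides), where len(c1, c2) is the shortest is-a path length, depth(c) the depth of a concept, and D the maximum depth of the hierarchy:

```latex
% Common formulations of the path-based measures (added for reference):
% len(c_1, c_2) is the shortest is-a path length, depth(c) the depth of
% a concept, D the maximum depth of the hierarchy, lcs the least common
% subsumer.
\begin{align*}
  \mathrm{sim}_{path}(c_1, c_2) &= \frac{1}{\mathrm{len}(c_1, c_2)} \\
  \mathrm{sim}_{lch}(c_1, c_2)  &= -\log \frac{\mathrm{len}(c_1, c_2)}{2D} \\
  \mathrm{sim}_{wup}(c_1, c_2)  &= \frac{2 \cdot \mathrm{depth}(\mathrm{lcs}(c_1, c_2))}
                                        {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
\end{align*}
```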
  • [Figure: a fragment of a WordNet is-a hierarchy, from Jiang and Conrath (1997), with concepts such as object, artifact, instrumentality, conveyance, vehicle, motor-vehicle, car, watercraft, boat, ark, article, ware, table-ware, cutlery, and fork]
  • Information Content
    • Measure of specificity in is-a hierarchy (Resnik, 1995)
      • -log (probability of concept)
      • High information content values mean very specific concepts (like pitch-fork and basketball shoe)
    • Count how often a concept occurs in a corpus
      • Increment the count associated with that concept, and propagate the count up!
      • If based on word forms, increment all concepts associated with that form
  • Observed “car”: increment car (73 + 1) and its ancestors motor vehicle (327 + 1) and *root* (32,783 + 1); the counts for cab (23), minicab (6), stock car (12), and bus (17) are unchanged
  • Observed “stock car”: increment stock car (12 + 1) and its ancestors car (74 + 1), motor vehicle (328 + 1), and *root* (32,784 + 1); cab (23), minicab (6), and bus (17) are unchanged
  • After Counting Concepts: *root* (32,785), motor vehicle (329, IC = 1.9), car (75), bus (17), cab (23), minicab (6), stock car (13); IC values of 3.5 and 3.1 label two of the more specific concepts
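The counting scheme is easy to sketch: observing a concept increments that concept and every ancestor up to the root, and IC is then the negative log probability. The toy hierarchy and starting counts below mirror the car / stock car example (illustrative numbers only, not real WordNet counts):

```python
import math

# Sketch of the counting scheme above: observing a concept increments
# that concept and all of its ancestors, and IC(c) = -log P(c).
parent = {
    "minicab": "cab", "cab": "car", "stock car": "car",
    "car": "motor vehicle", "bus": "motor vehicle",
    "motor vehicle": "*root*", "*root*": None,
}
counts = {"minicab": 6, "cab": 23, "car": 73, "stock car": 12,
          "bus": 17, "motor vehicle": 327, "*root*": 32_783}

def observe(concept):
    """Increment a concept's count and propagate it up to the root."""
    while concept is not None:
        counts[concept] += 1
        concept = parent[concept]

observe("car")        # car, motor vehicle, *root* each go up by one
observe("stock car")  # stock car, car, motor vehicle, *root* each go up by one

total = counts["*root*"]
ic = {c: -math.log(n / total) for c, n in counts.items()}
print(counts["car"], counts["*root*"], round(ic["stock car"], 2))  # 75 32785 ...
```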
  • Similarity and Information Content
    • Resnik (1995) uses the information content of the least common subsumer to express the similarity between two concepts
    • Lin (1998) scales the information content of the least common subsumer by the sum of the information content of the two concepts
    • Jiang & Conrath (1997) measure the distance between two concepts as the sum of their information content minus twice that of the least common subsumer
  • Why doesn’t this solve the problem?
    • Concepts must be organized in a hierarchy, and connected in that hierarchy
      • Limited to comparing nouns with nouns, or maybe verbs with verbs
      • Limited to similarity measures (is-a)
    • What about mixed parts of speech?
      • Murder (noun) and horrible (adjective)
      • Tobacco (noun) and drinking (verb)
  • Using Dictionary Glosses to Measure Relatedness
    • Lesk (1986) Algorithm – measure the relatedness of two concepts by counting the number of shared words in their definitions
      • Cold - a mild viral infection involving the nose and respiratory passages (but not the lungs)
      • Flu - an acute febrile highly contagious viral disease
    • Adapted Lesk (Banerjee & Pedersen, 2003) – expand glosses to include those concepts directly related
      • Cold - a common cold affecting the nasal passages and resulting in congestion and sneezing and headache; mild viral infection involving the nose and respiratory passages (but not the lungs); a disease affecting the respiratory system
      • Flu - an acute and highly contagious respiratory disease of swine caused by the orthomyxovirus thought to be the same virus that caused the 1918 influenza pandemic; an acute febrile highly contagious viral disease; a disease that can be communicated from one person to another
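A minimal sketch of the basic Lesk overlap on the two short glosses above (the stop word list is invented for illustration; real systems such as Adapted Lesk also match multi-word phrases and the glosses of related concepts):

```python
# Minimal sketch of the basic Lesk overlap on the two glosses above.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "but", "not", "that", "to"}

def gloss_words(gloss):
    return {w.strip("().,;").lower() for w in gloss.split()} - STOPWORDS

def lesk_overlap(gloss1, gloss2):
    return len(gloss_words(gloss1) & gloss_words(gloss2))

cold = ("a mild viral infection involving the nose and respiratory "
        "passages (but not the lungs)")
flu = "an acute febrile highly contagious viral disease"

print(lesk_overlap(cold, flu))   # the glosses share only "viral" -> 1
```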
  • Context/Gloss Vectors
    • Leskian approaches require exact matches in glosses
      • Glosses are short, use related but not identical words
    • Solution? Expand glosses by replacing each content word with a co-occurrence vector derived from corpora
      • Rows are words in glosses, columns are the co-occurring words in a corpus, cell values are their log-likelihood ratios
    • Average the word vectors to create a single vector that represents the gloss/sense (Patwardhan & Pedersen, 2003)
      • 2nd order co-occurrences
    • Measure relatedness using cosine rather than exact match!
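A sketch of the gloss vector idea: each content word in a gloss is replaced by its corpus-derived co-occurrence vector, the vectors are averaged, and two glosses are compared with the cosine. The tiny word matrix below is invented purely for illustration:

```python
import numpy as np

# Sketch: represent each gloss by the average of the co-occurrence
# vectors of its content words, then compare glosses with the cosine.
# The word matrix here is made-up illustration data, not corpus counts.
word_vectors = {
    "viral":       np.array([9.1, 2.3, 1.2, 1.5, 0.0]),
    "infection":   np.array([6.4, 3.0, 0.8, 2.1, 0.0]),
    "respiratory": np.array([1.1, 0.4, 3.5, 7.2, 0.0]),
    "febrile":     np.array([2.0, 8.8, 0.1, 0.3, 0.0]),
    "contagious":  np.array([7.7, 1.9, 0.2, 0.9, 0.0]),
    "disease":     np.array([5.5, 4.1, 0.5, 1.8, 0.2]),
}

def gloss_vector(words):
    return np.mean([word_vectors[w] for w in words if w in word_vectors], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cold = gloss_vector(["mild", "viral", "infection", "respiratory"])
flu = gloss_vector(["acute", "febrile", "contagious", "viral", "disease"])
print(round(cosine(cold, flu), 2))   # high despite few shared gloss words
```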
  • Gloss/Context Vectors
  • Word Sense Disambiguation
    • … can be performed by finding the sense of a word most related to its neighbors
    • Here, we define similarity and relatedness with respect to WordNet-Similarity
    • WordNet-SenseRelate
      • AllWords – assign a sense to every content word
      • TargetWord – assign a sense to a given word
      • http://senserelate.sourceforge.net
  • WordNet-SenseRelate http://senserelate.sourceforge.net
    • For each sense of a target word in context
      • For each content word in the context
        • For each sense of that content word
          • Measure similarity/relatedness between sense of target word and sense of content word with WordNet::Similarity
          • Keep running sum for score of each sense of target
    • Pick the sense of the target word with the highest total score relative to the words in its context
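That loop can be sketched as follows; senses() and relatedness() are hypothetical stand-ins for WordNet sense lookup and a WordNet::Similarity measure, not real APIs:

```python
# Sketch of the loop above. senses() and relatedness() are hypothetical
# stand-ins for WordNet sense lookup and a WordNet::Similarity measure;
# they are assumptions, not real APIs.

def disambiguate(target, context_words, senses, relatedness):
    """Return the sense of `target` most related to its neighbors' senses."""
    best_sense, best_score = None, float("-inf")
    for target_sense in senses(target):
        score = 0.0                              # running sum for this sense
        for word in context_words:               # content words near the target
            for neighbor_sense in senses(word):  # every sense of each neighbor
                score += relatedness(target_sense, neighbor_sense)
        if score > best_score:
            best_sense, best_score = target_sense, score
    return best_sense
```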
  • WSD Experiment
    • Senseval-2 data consists of 73 nouns, verbs, and adjectives, approximately 8,600 “training” examples and 4,300 “test” examples.
      • Best supervised system 64%
      • SenseRelate 53% (lesk, vector)
      • Most frequent sense 48%
    • The window of context is defined by position: it includes the two content words to the left and the two to the right, each measured against the word being disambiguated.
      • Positional proximity is not always associated with semantic similarity.
  • Human Relatedness Experiment
    • Miller and Charles (1991) created 30 pairs of nouns that were scored on a relatedness scale by over 50 human subjects
    • Vector measure correlates at over 80% with human relatedness judgements
    • Next closest measure is lesk (at 70%)
    • All other measures at less than 65%
  • Coverage…
    • WordNet
      • Nouns – 80,000 concepts
      • Verbs – 13,000 concepts
      • Adjectives – 18,000 concepts
      • Adverbs – 4,000 concepts
    • Words not found in WordNet can’t be disambiguated by SenseRelate
    • language and resource dependent…
  • Supervised WSD http://www.d.umn.edu/~tpederse/supervised.html
  • Things we can do now…
    • Identify associated words
      • fine wine
      • baseball bat
    • Identify similar contexts
      • I bought some food at the store
      • I purchased something to eat at the market
    • Identify similar (or related) concepts
      • frog : amphibian
      • Duluth : snow
    • Assign meanings to words
      • I went to the bank/[financial-inst.] to deposit my check
  • Machine Learning Approach
    • Annotate text with sense tags
      • must select sense inventory
    • Find interesting features
      • bigrams and co-occurrences quite effective
    • Learn a model
    • Apply model to untagged data
    • Works very well…given sufficient quantities of training data and sufficient coverage of your sense inventory
  • Clever Ways to Get Training Data
    • Parallel Text
      • Senseval-3 task used Hindi translations of English words as sense tags
    • Mining the Web for contexts that include an unambiguous collocation
      • line is ambiguous, product line is not
  • Where does this leave us?
    • Ngram Statistics Package
      • Finding Co-occurrences and bigrams that carry some semantic weight
    • SenseClusters
      • Clustering Similar Contexts
    • WordNet-Similarity
      • Measuring Similarity between Concepts
    • SenseRelate Word Sense Disambiguation
      • knowledge/resource based method
    • Supervised WSD
      • Building models that assign a sense to a given word
  • Integration that already exists…
    • NSP feeds SenseClusters
    • NSP feeds Supervised WSD
    • WordNet-Similarity feeds SenseRelate
    • We could do a lot more. One example: give supervised WSD information beyond what it finds in annotated text, perhaps reducing the amount of such text needed
  • Kernels are similarity matrices
    • NSP produces word by word similarity matrices, for use by SenseClusters
    • SenseClusters produces context by context similarity matrices based on co-occurrences
    • WordNet-Similarity produces concept by concept similarity matrices
    • SenseRelate produces context by context similarity matrices based on concept similarity
    • All of these could be used as kernels for Supervised WSD
  • Conclusion
    • Time to integrate what we have at the word and term level
      • look for ways to stitch semantic patches together
    • This will increase our coverage and decrease language dependence
      • make the quilt bigger and sturdier
    • We will then be able to look at a broader range of languages and semantic problems
      • calm problems with the warmth of your lovely quilt…
  • Many Thanks …
    • SenseClusters
      • Amruta Purandare
        • MS 2004, now Pitt PhD
      • Anagha Kulkarni
        • MS 2006, now CMU PhD
      • Mahesh Joshi
        • MS 2006, now CMU MS
    • WordNet Similarity
      • Siddharth Patwardhan
        • MS 2003, now Utah PhD
      • Jason Michelizzi
        • MS 2005, now US Navy
    • Ngram Statistics Package
      • Satanjeev Banerjee
        • MS 2002, now CMU PhD
      • Saiyam Kohli
        • MS 2006, now Beckman-Coulter
      • Bridget Thomson-McInnes
        • MS 2004, now Minnesota PhD
    • Supervised WSD
      • Saif Mohammad
        • MS 2003, now Toronto PhD
      • Mahesh Joshi
      • Amruta Purandare
    • SenseRelate
      • Satanjeev Banerjee
      • Siddharth Patwardhan
      • Jason Michelizzi
  • URLs
    • Ngram Statistics Package
      • http://ngram.sourceforge.net
    • SenseClusters
      • http://senseclusters.sourceforge.net
      • IJCAI Tutorial on Jan 6 (afternoon)
    • WordNet-Similarity
      • http://wn-similarity.sourceforge.net
    • SenseRelate WSD
      • http://senserelate.sourceforge.net
    • Supervised WSD
      • http://www.d.umn.edu/~tpederse/supervised.html