Mehran Sahami
        Timothy D. Heilman
                Google Inc.

          Presented by Beibei Yang

                   With credits to:
Mehran Sahami, Stanford University,
                               and
         Ellen Spertus, Google Inc.
Mehran Sahami
 Associate Professor, Stanford Univ.,
 2007—
 Google Inc., 2002-2007



Timothy D. Heilman
 Sr. Software Engineer, Google Inc.




  Presented By: Beibei Yang   2/19/2009   2
Semantic Web
 It’s all about understanding!
Semantic similarity
 A concept whereby a set of documents or terms
 within term lists are assigned a metric based on
 the likeness of their meaning / semantic content.
Semantic Relatedness
 Publicly available means for approximating the
 relative meaning of words/documents.
 Have been used for essay-grading by the
 Educational Testing Service, search engine
 technology, predicting which links people are
 likely to click on, etc.
What are they?
Example: Amazon




“What to do when your TiVo thinks you’re
gay”, Wall Street Journal, Nov. 26, 2002




         http://tinyurl.com/2qyepg
Wal-Mart DVD recommendations




        http://tinyurl.com/2gp2hm
It’s the degree to which text passages have
the same meaning.
Quite often we want to find how similar two
short text snippets are:
 Search engine queries
 Course descriptions
 Policies of two insurance companies
 You name it!




The simplest way to calculate the similarity of two
words is to find the length of the shortest path
connecting them. (This has its limitations.)
For example:
 Similarity(boy,girl) = 4
 Similarity(boy,teacher) = 6




Fig 1: An ISA hierarchical semantic knowledge base
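
As a rough sketch of the path-length measure, the snippet below runs a breadth-first search over a small ISA hierarchy. The hierarchy is a hypothetical fragment chosen so that it reproduces the two values quoted above; it is not a transcription of Fig 1, and the names ISA_EDGES, build_graph and path_length are illustrative.

```python
from collections import deque

# Hypothetical ISA links (child, parent); assumed for illustration only,
# not copied from the figure.
ISA_EDGES = [
    ("boy", "male"), ("girl", "female"),
    ("male", "person"), ("female", "person"),
    ("adult", "person"), ("professional", "adult"),
    ("educator", "professional"), ("teacher", "educator"),
]


def build_graph(edges):
    """Turn (child, parent) pairs into an undirected adjacency map."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_length(graph, source, target):
    """Number of edges on the shortest path, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


GRAPH = build_graph(ISA_EDGES)
print(path_length(GRAPH, "boy", "girl"))     # 4
print(path_length(GRAPH, "boy", "teacher"))  # 6
```

Note that the value is a path length, so a smaller number means the two words are closer in the hierarchy.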
Search Engine
 Sahami and Heilman's web-based kernel function.
 Bollegala, Matsuo, and Ishizuka's algorithm using page
 counts and text snippets.
 Iosif and Potamianos's two-metric approach.
 Liu and Birnbaum's approach using Google Directory.
WordNet
 Varelas et al.'s ontology mapping approach.
 Yang and Powers's two-variant based approach: bidirectional
 depth-limit search (BDLS) and unidirectional breadth-first
 search (UBFS).
Text Corpus
 Islam and Inkpen's modified LCS (Longest Common
 Subsequence) string-matching algorithm.
Others
 Li et al.'s approach using multiple information sources.
Chris Buckley, 1994
 Buckley, C., Salton, G., Allan, J., and Singhal, A.
 Automatic query expansion using SMART: TREC 3. In TREC
 (1994), pp. 0-.
The definition and emphasis changed along the way.
The process of reformulating a seed query to improve
retrieval performance in information retrieval
operations.
Involves:
 Finding synonyms of words, and searching for the
 synonyms
 Finding all the various morphological forms of words by
 stemming each word in the search query
 Fixing spelling errors and automatically searching for the
 corrected form or suggesting it in the results
 Re-weighting the terms in the original query
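
A toy sketch of two of the items above, synonym lookup and term re-weighting, might look like the following. The synonym table, the function name expand_query and the 0.5 weight for synonyms are all assumptions made for illustration; stemming and spelling correction are left out.

```python
# Hypothetical synonym table; a real system would draw on a thesaurus
# or a resource such as WordNet.
SYNONYMS = {
    "car": ["automobile"],
    "cheap": ["inexpensive", "budget"],
}


def expand_query(query, synonyms, synonym_weight=0.5):
    """Return a term -> weight map for the expanded query.

    Original terms keep weight 1.0; added synonyms get a lower,
    assumed weight so they count for less than what the user typed.
    """
    weights = {}
    for term in query.lower().split():
        weights[term] = max(weights.get(term, 0.0), 1.0)
        for alt in synonyms.get(term, []):
            weights[alt] = max(weights.get(alt, 0.0), synonym_weight)
    return weights


expanded = expand_query("cheap car", SYNONYMS)
print(expanded)
```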

Let x represent a short text snippet. We compute
the query expansion of x, QE(x), as follows:
1.   Issue x as a query to a search engine S.
2.   Let R(x) be the set of (at most) n retrieved
     documents d1, d2, … , dn.
3.   Compute the TFIDF term vector vi for each
     document di ∈ R(x).
4.   Truncate each vector vi to include its m highest
     weighted terms.
5.   Let C(x) be the centroid of the L2 normalized
     vectors vi:
       C(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\|v_i\|_2}
6.   Let QE(x) be the L2 normalization of the centroid
     C(x):
       QE(x) = \frac{C(x)}{\|C(x)\|_2}
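
Steps 4 to 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: steps 1 to 3 (querying the search engine and building TFIDF vectors) are assumed to have been done already, vectors are represented as plain dictionaries mapping term to weight, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in vec.items()}


def truncate(vec, m):
    """Keep only the m highest-weighted terms (step 4)."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:m]
    return dict(top)


def query_expansion(doc_vectors, m):
    """Steps 4-6: truncate, normalize, average, then normalize again.

    doc_vectors is assumed to hold one TFIDF vector per document in R(x).
    """
    n = len(doc_vectors)
    if n == 0:
        return {}
    centroid = {}
    for vec in doc_vectors:
        for term, w in l2_normalize(truncate(vec, m)).items():
            centroid[term] = centroid.get(term, 0.0) + w / n
    return l2_normalize(centroid)


# Two made-up document vectors standing in for R(x).
docs = [{"a": 3.0, "b": 4.0, "c": 0.1}, {"a": 1.0}]
qe = query_expansion(docs, m=2)
print(qe)
```

With m = 2 the low-weight term "c" is dropped before averaging, and the result has unit length, so inner products between two such vectors behave like cosine similarities.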


By G. Salton and C. Buckley, 1988
Weight wi,j associated with term ti in
document dj is defined to be:
  w_{i,j} = tf_{i,j} \times \log(N / df_i)
tfi,j is the frequency of ti in dj
N is the total number of documents in the
corpus
dfi is the total number of documents that
contain ti.
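
The weighting can be tried out on a tiny corpus. The sketch below assumes the common form w_{i,j} = tf_{i,j} * log(N / df_i) with a natural logarithm and pre-tokenized documents; the function name tfidf_vectors and the sample corpus are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns one dict term -> weight per document."""
    n_docs = len(documents)
    # df[t] = number of documents that contain term t at least once
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # raw frequency of each term in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


corpus = [
    ["web", "kernel", "kernel"],
    ["web", "search"],
    ["web", "query"],
]
vectors = tfidf_vectors(corpus)
print(vectors[0])
```

A term that occurs in every document ("web" here) gets weight 0, while a term concentrated in one document ("kernel") gets a high weight.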

Define the semantic kernel function K as the
inner product of the query expansions for
two text snippets.
Given two short text snippets x and y, we
define the semantic similarity kernel
between them as:
            K(x, y) = QE(x)·QE(y)
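
Given two query expansions, the kernel itself is a sparse dot product. In this sketch the QE vectors are assumed to be dictionaries mapping term to weight and already L2-normalized; the sample vectors are made up for illustration.

```python
def kernel(qe_x, qe_y):
    """Inner product of two sparse vectors (dict term -> weight)."""
    # Iterate over the shorter vector for efficiency.
    if len(qe_x) > len(qe_y):
        qe_x, qe_y = qe_y, qe_x
    return sum(w * qe_y.get(term, 0.0) for term, w in qe_x.items())


# Hypothetical, already normalized query expansions.
qe_a = {"web": 0.6, "search": 0.8}
qe_b = {"web": 1.0}
qe_c = {"insurance": 1.0}

print(kernel(qe_a, qe_b))  # 0.6
print(kernel(qe_a, qe_c))  # 0 (no shared terms)
```

Because the vectors have unit length and non-negative weights, K(x, y) falls between 0 and 1 and equals the cosine of the angle between the two expansions.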




Acronyms:




Individuals and their positions:




Multi-faceted terms:




Search engine: this approach could be used
to generate related query suggestions in
a large-scale system.
Question-answering system: the question
could be matched against a list of candidate
answers to determine which is the most
similar semantically.
Since this kernel is not limited to use on the
web, it can also be computed using query
expansions generated over domain-specific
corpora in order to better capture contextual
semantics in particular domains.
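
As one possible reading of the question-answering use case, candidates can be ranked by their kernel value against the question. The sketch assumes query-expansion vectors have already been computed as dictionaries mapping term to weight; the vectors, labels and function names below are invented for illustration.

```python
def dot(u, v):
    """Sparse inner product of two dict term -> weight vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def rank_candidates(question_qe, candidate_qes):
    """Return candidate labels ordered from most to least similar."""
    scored = [(dot(question_qe, qe), label) for label, qe in candidate_qes.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored]


# Hypothetical precomputed query expansions.
question = {"kernel": 0.8, "similarity": 0.6}
candidates = {
    "answer_1": {"similarity": 1.0},
    "answer_2": {"kernel": 0.8, "similarity": 0.6},
    "answer_3": {"insurance": 1.0},
}
ranking = rank_candidates(question, candidates)
print(ranking)
```

The same ranking step would apply to related-query suggestion, with stored queries taking the place of candidate answers.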

Berners-Lee, T., Hendler, J., and Lassila, O. The Semantic Web. Scientific American
284, 5 (2001), 34-43.
Sahami, M., and Heilman, T. D. A web-based kernel function for measuring the
similarity of short text snippets. In WWW '06: Proceedings of the 15th international
conference on World Wide Web (Edinburgh, Scotland, May 23-26, 2006), ACM, New
York, NY.
Buckley, C., Salton, G., Allan, J., and Singhal, A. Automatic query expansion using
SMART: TREC 3. In TREC (1994), pp. 0-.
Abhishek, V., and Hosanagar, K. Keyword generation for search engine advertising
using semantic similarity between terms. In ICEC '07: Proceedings of the ninth
international conference on electronic commerce (New York, NY, USA, 2007), ACM,
pp. 89-94.
Bollegala, D., Matsuo, Y., and Ishizuka, M. Measuring semantic similarity between
words using web search engines. In WWW '07: Proceedings of the 16th international
conference on World Wide Web (New York, NY, USA, 2007), ACM, pp. 757-766.
Iosif, E., and Potamianos, A. Unsupervised semantic similarity computation using
web search engines. IEEE/WIC/ACM international conference on web intelligence
(Nov. 2007), 381-387.
Li, Y., Bandar, Z., and McLean, D. An approach for measuring semantic similarity
between words using multiple information sources. IEEE transactions on knowledge
and data engineering 15, 4 (July-Aug. 2003), 871-882.



