
EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?

Published in: Technology, Education

  1. The Advanced E-Discovery Institute
     November 12-13, 2009
     What’s Under the Hood of your Favorite Search System?
     Ellen Voorhees
     ellen.voorhees@nist.gov
  2. So you want to build a search engine
     • What is the collection to be searched?
     • How will the content (text, other media) be represented? [indexing]
     • How will the information need be represented? [query language]
     • How will the respective representations be matched? [retrieval model]
     • How effective is the search?
     The Advanced E-Discovery Institute  November 13, 2009
  3. Boolean Retrieval
     The Model
     • documents are represented by descriptors
       • descriptors were originally manually assigned concepts from a controlled vocabulary
       • modern implementations generally use the words in the text as descriptors
     • the information need is represented by descriptors structured with Boolean operators
       • modern implementations include more operators than just AND, OR, NOT
     • a match occurs if and only if the doc satisfies the Boolean expression
       • “fuzzy match” systems use descriptor weights and relax the strict binary interpretation
     Pros and cons
     • good: transparency, since it is clear exactly why a doc was retrieved
     • bad: little control over retrieved set size; no ranking; searchers must learn the query language
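The Boolean model on this slide can be sketched in a few lines of Python. This is a minimal illustration, not any particular system's implementation: the three documents, their ids, and the helper names (`build_index`, `term`, `and_`, `or_`, `not_`) are all made up for the example, and words in the text serve as descriptors.

```python
from typing import Dict, Set

# Hypothetical toy collection (not from the talk).
DOCS: Dict[str, str] = {
    "d1": "contract breach notice sent to vendor",
    "d2": "vendor invoice and payment schedule",
    "d3": "notice of contract renewal",
}


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: each word (descriptor) maps to the ids of docs containing it."""
    index: Dict[str, Set[str]] = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)


def term(word: str) -> Set[str]:
    """Docs whose descriptors include this word."""
    return set(INDEX.get(word.lower(), set()))


def and_(a: Set[str], b: Set[str]) -> Set[str]:
    return a & b


def or_(a: Set[str], b: Set[str]) -> Set[str]:
    return a | b


def not_(a: Set[str]) -> Set[str]:
    """Complement with respect to the whole collection."""
    return ALL_IDS - a


# Query: contract AND (notice OR invoice) AND NOT renewal
result = and_(
    and_(term("contract"), or_(term("notice"), term("invoice"))),
    not_(term("renewal")),
)
print(sorted(result))  # ['d1']
```

A document is either in the result set or not, which is the transparency the slide credits to the model, and also why there is no ranking and no direct control over how many documents come back.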
  4. Vector Space Model
     The Model
     • documents are represented as vectors in N-dimensional space, where N is the number of ‘terms’ in the document set
       • a term is usually a word (stem), but might be a phrase or thesaurus class
       • terms are weighted based on the frequency and distribution of their occurrences
     • the information need is natural language text mapped into the same space
     • matching is the similarity between query and doc vectors
       • example similarity: cosine of the angle between the vectors
       • allows documents to be ranked by decreasing similarity
     Pros and Cons
     • good: less brittle than pure Boolean
     • bad: less transparency, since, depending on weights, a doc with few query terms can be ranked higher than a doc with many
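A rough sketch of vector space ranking follows, assuming one common weighting choice (term frequency times a smoothed inverse document frequency) and cosine similarity. The slide does not specify a weighting scheme, so the `log(n / df) + 1` formula, the toy documents, and all function names here are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy collection (not from the talk).
DOCS: Dict[str, str] = {
    "d1": "contract breach notice sent to vendor",
    "d2": "vendor invoice and payment schedule",
    "d3": "notice of contract renewal",
}


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def idf_table(docs: Dict[str, str]) -> Dict[str, float]:
    """Weight terms by how they are distributed across the collection."""
    n = len(docs)
    df: Counter = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    # Assumed weighting: log(n / df) + 1; real systems use many variants.
    return {t: math.log(n / d) + 1.0 for t, d in df.items()}


def vectorize(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse tf-idf vector; terms unknown to the collection are dropped."""
    tf = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query: str, docs: Dict[str, str]) -> List[Tuple[float, str]]:
    """Return (score, doc_id) pairs by decreasing similarity to the query."""
    idf = idf_table(docs)
    q_vec = vectorize(query, idf)
    scored = [(cosine(q_vec, vectorize(text, idf)), doc_id)
              for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)


for score, doc_id in rank("contract notice", DOCS):
    print(doc_id, round(score, 3))
```

Both d1 and d3 contain the two query terms, yet d3 ranks first because its vector is shorter; this is the kind of weight-dependent behavior the slide lists as a loss of transparency.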
  5. Vector Similarities
     Document-Document similarity
     • docs are similar to the extent they contain the same terms
     • doc pairs with maximal similarity indicate duplicates
     • document clustering
       • cluster hypothesis: “Closely associated documents tend to be relevant to the same requests.”
       • thus, retrieve by returning whole clusters, since there is usually much more information in a doc-doc comparison than in a doc-query comparison
     Term-Term similarity
     • terms are similar to the extent they occur in the same documents
     • term clustering
       • query expansion
       • provides a bottom-up description of the document set

           T1   T2   T3   T4   …
     D1     5    0   33    0   …
     D2     0    0    8    0   …
     D3     1    4    0    2   …
     D4     0    3    0    4   …
     D5     0    1    0    0   …
     D6     3    2    0   …
     …
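Both kinds of similarity on this slide come from the same matrix: compare rows for document-document similarity and columns for term-term similarity. The sketch below uses only the fully visible part of the slide's matrix (D1 to D5 over T1 to T4); the elided entries and the incomplete D6 row are left out rather than guessed. Using cosine as the similarity measure and the helper names are assumptions for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

TERMS = ["T1", "T2", "T3", "T4"]

# Visible portion of the slide's document-term matrix (rows = documents).
MATRIX: Dict[str, List[int]] = {
    "D1": [5, 0, 33, 0],
    "D2": [0, 0, 8, 0],
    "D3": [1, 4, 0, 2],
    "D4": [0, 3, 0, 4],
    "D5": [0, 1, 0, 0],
}


def cosine(u: List[int], v: List[int]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(vectors: Dict[str, List[int]]) -> Tuple[str, str]:
    """Return the pair of labels whose vectors have the highest cosine."""
    return max(
        combinations(sorted(vectors), 2),
        key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
    )


# Term vectors are the columns of the same matrix.
COLUMNS: Dict[str, List[int]] = {
    t: [row[j] for row in MATRIX.values()] for j, t in enumerate(TERMS)
}

print("closest documents:", most_similar(MATRIX))
print("closest terms:", most_similar(COLUMNS))
```

On this fragment the closest documents are D1 and D2 (both dominated by T3) and the closest terms are T1 and T3. A high cosine flags candidates for duplicate detection or clustering; it does not by itself prove two documents are duplicates.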
  6. Further Matrix Manipulation: Latent Semantic Indexing
     • Mathematically, the axes in a vector space are orthogonal to one another
       • so the vector space model technically assumes words occur in documents independently of any other words (which is nonsense)
       • this vector space is very large and very sparse
     • Perform a singular value decomposition of the original matrix and select the first X eigenvectors as new axes
       • X is chosen to be much smaller than the number of terms, producing a much smaller, denser vector space
     • project document vectors into the new space
       • elements in a vector no longer correspond to words
       • the new axes capture some (but which?) dependencies among the original word occurrences
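The decomposition and projection described on this slide can be written out as follows. This is a sketch under stated assumptions: $A$ is taken to be the document-term matrix with documents as rows, $X$ is the slide's number of retained axes, and the projection shown is one common convention (some formulations additionally scale by $\Sigma_X^{-1}$). The columns of $V$ are eigenvectors of $A^{\mathsf T} A$, which is the sense in which the new axes are eigenvectors.

```latex
% A: m x n document-term matrix (rows = documents, columns = terms), rank r
A = U \Sigma V^{\mathsf T},
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),
\quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Keep only the X largest singular values and their vectors (X \ll n)
A_X = U_X \Sigma_X V_X^{\mathsf T}

% Project a document (or query) vector d \in \mathbb{R}^n onto the X new axes
\hat{d} = V_X^{\mathsf T} d \in \mathbb{R}^{X}
```

Each coordinate of $\hat{d}$ is a weighted combination of many original term weights, which is why the elements of the projected vector no longer correspond to individual words.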
  7. How Effective is the Search?
     • Evaluation for technology development:
       • comparative evaluation using mean scores on test collections
     • Absolute evaluation of current e-discovery search:
       • very little guidance in IR literature: you don’t know what you don’t know!
       • too much variability for test collections to predict tight bounds

     [Figure: axes “Number retrieved” and “Number relevant”; labels “num_rel = num_ret” and “R”]

     $$\text{Precision} = \frac{\text{number relevant retrieved}}{\text{number retrieved}}$$

     $$\text{Recall} = \frac{\text{number relevant retrieved}}{\text{total relevant}}$$

     $$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
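As a worked illustration of the precision, recall, and F measures, the sketch below computes all three from a retrieved set and a set of known relevant documents. The function name `evaluate`, the document ids, and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
from typing import Iterable, Tuple


def evaluate(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F) for one search result set."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    relevant_retrieved = len(retrieved_set & relevant_set)

    # Assumed convention: a measure with a zero denominator is reported as 0.0.
    precision = relevant_retrieved / len(retrieved_set) if retrieved_set else 0.0
    recall = relevant_retrieved / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Hypothetical example: 4 docs retrieved, 5 docs relevant in total, 2 in common.
precision, recall, f = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
print(precision, recall, round(f, 4))  # 0.5 0.4 0.4444
```

Note that recall needs the total number of relevant documents in the collection, which is exactly what is unknown in a live e-discovery search; that is the "you don't know what you don't know" problem the slide raises.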
