CSE509 Lecture 2

Lecture 2 of CSE509: Web Science and Technology Summer Course

Notes
  • Highly successful: all major businesses use an RDB system.
  • Popular approach in the 1980s, but it has not been successful for general-purpose information retrieval.
  • System returns the best-matching documents given the query; this approach had limited appeal until the Web became popular.
  • What documents are good matches for a query? How do we represent the complexities of language, keeping in mind that computers don't "understand" documents or queries? A simple yet effective approach is the "bag of words": treat all the words in a document as index terms for that document, assign a "weight" to each term based on its "importance", and disregard order, structure, meaning, etc. of the words.
  • Bag-of-words approach: information retrieval is all (and only) about matching words in documents with words in queries. Obviously not true… but it works pretty well!
  • Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
  • MapReduce + Cloud Computing Debate

    1. CSE509: Introduction to Web Science and Technology
        Lecture 2: Basic Information Retrieval Models
        Muhammad Atif Qureshi and Arjumand Younus
        Web Science Research Group
        Institute of Business Administration (IBA)
        July 16, 2011
    2. Last Time…
        - What is Web Science?
        - Why Do We Need Web Science?
        - Implications of Web Science
        - Web Science Case-Study: Diff-IE
    3. Today
        - Basic Information Retrieval Approaches
        - Bag of Words Assumption
        - Information Retrieval Models
          - Boolean model
          - Vector-space model
          - Topic/Language models
    4. Computer-based Information Management
        Basic problem: how can we use computers to help humans store, organize, and retrieve information?
        What approaches have been taken, and what has been successful?
    5. Three Major Approaches
        - Database approach
        - Expert-system approach
        - Information retrieval approach
    6. Database Approach
        - Information is stored in a highly structured way: data stored in relational tables as tuples
        - Simple data model and query language: the relational model and SQL
        - Clear interpretation of data and query
        - No ambition to be "intelligent" like humans; main focus on system efficiency
        Pros and cons?
    7. Expert-System Approach
        - Information stored as a set of logical predicates: Bird(X), Cat(X), Fly(X)
        - Given a query, the system infers the answer through logical inference: Bird(Ostrich) → Fly(Ostrich)?
        Pros and cons?
    8. Information Retrieval Approach
        - Uses existing text documents as the information source; no special structuring or database construction required
        - Text-based query language: keyword-based query or natural language query
        Pros and cons?
    9. Database vs. Information Retrieval
        What we're retrieving:
          IR: Mostly unstructured. Free text with some metadata.
          Databases: Structured data. Clear semantics based on a formal model.
        Queries we're posing:
          IR: Vague, imprecise information needs (often expressed in natural language).
          Databases: Formally (mathematically) defined queries. Unambiguous.
        Results we get:
          IR: Sometimes relevant, often not.
          Databases: Exact. Always correct in a formal sense.
        Interaction with system:
          IR: Interaction is important.
          Databases: One-shot queries.
        Other issues:
          IR: Issues downplayed.
          Databases: Concurrency, recovery, atomicity are all critical.
    10. Main Challenge of the Information Retrieval Approach
        Interpretation of query and data is not straightforward: both queries and data are fuzzy (unstructured text and natural language queries).
    11. Need for an Information Retrieval Model
        Computers do not understand the document or the query. To implement the information retrieval approach, a model that a computer can execute is essential.
    12. What is a Model?
        A model is a construct designed to help us understand a complex system: a particular way of "looking at things".
        Models work by adopting simplifying assumptions.
        Different types of models: conceptual models, physical analog models, mathematical models, …
    13. Major Simplification: Bag of Words
        Consider each document as a "bag of words"; consider queries as a "bag of words" as well.
        A great oversimplification, but it works adequately in many cases.
        Still, how do we match documents and queries?
    14. Sample Document
        McDonald's slims down spuds
        Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
        NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
        But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
        But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
        Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
        …
        "Bag of Words" (term counts):
        16 × said
        14 × McDonalds
        12 × fat
        11 × fries
         8 × new
         6 × company, french, nutrition
         5 × food, oil, percent, reduce, taste, Tuesday
        …
    15. Vector Representation
        "Bags of words" can be represented as vectors:
        - Computational efficiency
        - Ease of manipulation
        A vector is a set of values recorded in any consistent order.

        "The quick brown fox jumped over the lazy dog's back"
            → [ 1 1 1 1 1 1 1 1 2 ]
        1st position corresponds to "back"
        2nd position corresponds to "brown"
        3rd position corresponds to "dog"
        4th position corresponds to "fox"
        5th position corresponds to "jump"
        6th position corresponds to "lazy"
        7th position corresponds to "over"
        8th position corresponds to "quick"
        9th position corresponds to "the"
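        A minimal Python sketch (mine, not from the slides) that reproduces this example; the tokenization, the handling of "dog's", and the vocabulary order are assumptions (the slide stems "jumped" to "jump"; here the surface form is used):

            from collections import Counter

            def to_vector(text, vocabulary):
                """Count term occurrences and record them in a fixed vocabulary order."""
                counts = Counter(text.lower().replace("'s", "").split())
                return [counts[term] for term in vocabulary]

            # Alphabetical vocabulary matching the slide's position list
            vocab = ["back", "brown", "dog", "fox", "jumped",
                     "lazy", "over", "quick", "the"]
            print(to_vector("The quick brown fox jumped over the lazy dog's back", vocab))
            # -> [1, 1, 1, 1, 1, 1, 1, 1, 2]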
    19. Representing Documents
        Document 1: "The quick brown fox jumped over the lazy dog's back."
        Document 2: "Now is the time for all good men to come to the aid of their party."
        Stopword list: for, is, of, the, to

        Term     Document 1   Document 2
        aid          0            1
        all          0            1
        back         1            0
        brown        1            0
        come         0            1
        dog          1            0
        fox          1            0
        good         0            1
        jump         1            0
        lazy         1            0
        men          0            1
        now          0            1
        over         1            0
        party        0            1
        quick        1            0
        their        0            1
        time         0            1
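        An illustrative sketch of this binary, stopword-filtered representation (the tokenizer and function names are my own, not from the deck):

            STOPWORDS = {"for", "is", "of", "the", "to"}  # stopword list from the slide

            def binary_vector(tokens, vocabulary):
                """1 if the term occurs in the document after stopword removal, else 0."""
                present = set(tokens) - STOPWORDS
                return [int(term in present) for term in vocabulary]

            doc2 = "now is the time for all good men to come to the aid of their party".split()
            vocab = ["aid", "all", "back", "brown", "come", "dog", "fox", "good", "jump",
                     "lazy", "men", "now", "over", "party", "quick", "their", "time"]
            print(binary_vector(doc2, vocab))
            # -> [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]  (Document 2 column)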
    20. Boolean Model
        Return all documents that contain the words in the query.
        Weights assigned to terms are either "0" or "1":
        - "0" represents "absence": term isn't in the document
        - "1" represents "presence": term is in the document
        No notion of "ranking": a document is either a match or a non-match.
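        As an illustration, a toy conjunctive (AND) Boolean retrieval over an inverted index might look like this; the index layout is an assumption, since the slides do not prescribe one:

            def boolean_and(query_terms, index):
                """Return the documents containing *all* query terms (no ranking).
                `index` maps each term to the set of document ids containing it."""
                result = None
                for term in query_terms:
                    postings = index.get(term, set())
                    result = postings if result is None else result & postings
                return result or set()

            index = {"quick": {1}, "fox": {1}, "time": {2}, "good": {2}}  # toy index
            print(boolean_and(["quick", "fox"], index))   # -> {1}
            print(boolean_and(["quick", "time"], index))  # -> set(): match or non-match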
    21. Evaluation: Precision and Recall
        Q: Are all matching documents what users want?
        Basic idea: a model is good if it returns a document if and only if it is relevant.
        R: set of relevant documents
        D: set of documents returned by the model
        Precision = |R ∩ D| / |D|: the fraction of returned documents that are relevant.
        Recall = |R ∩ D| / |R|: the fraction of relevant documents that are returned.
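        A small sketch of these two measures (the set names follow the slide; the function and the example sets are illustrative):

            def precision_recall(relevant, returned):
                """Precision = |R ∩ D| / |D|,  Recall = |R ∩ D| / |R|."""
                hits = len(relevant & returned)
                precision = hits / len(returned) if returned else 0.0
                recall = hits / len(relevant) if relevant else 0.0
                return precision, recall

            R = {1, 2, 3, 4}   # relevant documents
            D = {3, 4, 5}      # documents the model returned
            print(precision_recall(R, D))  # -> (0.666..., 0.5)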
    22. Ranked Retrieval
        Order documents by how likely they are to be relevant to the information need.
        Arranging documents by relevance is:
        - Closer to how humans think: some documents are "better" than others
        - Closer to user behavior: users can decide when to stop reading
        Best (partial) match: documents need not have all query terms, although documents with more query terms should be "better".
    23. Similarity-Based Queries
        Let's replace relevance with "similarity": rank documents by their similarity with the query.
        Treat the query as if it were a document:
        - Create a query bag-of-words
        - Find its similarity to each document
        - Rank order the documents by similarity
        Surprisingly, this works pretty well!
    24. Vector Space Model
        Documents "close together" in vector space "talk about" the same things.
        Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness").
        [Figure: documents d1–d5 plotted as vectors over terms t1, t2, t3; the angles θ and φ between vectors indicate similarity.]
    25. Similarity Metric
        Use the angle between the query vector q and the document vector dj; rank by its cosine:
        sim(q, dj) = (q · dj) / (|q| |dj|)
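        A direct, illustrative transcription of this cosine measure into Python (the slides give only the formula):

            import math

            def cosine_similarity(query, doc):
                """Cosine of the angle between two term-weight vectors."""
                dot = sum(q * d for q, d in zip(query, doc))
                norm_q = math.sqrt(sum(q * q for q in query))
                norm_d = math.sqrt(sum(d * d for d in doc))
                if norm_q == 0 or norm_d == 0:
                    return 0.0
                return dot / (norm_q * norm_d)

            print(cosine_similarity([1, 1, 0], [2, 1, 1]))
            # -> 3 / (sqrt(2) * sqrt(6)) ≈ 0.866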
    26. How do we Weight Document Terms?
        Here's the intuition:
        - Terms that appear often in a document should get high weights
        - Terms that appear in many documents should get low weights
        How do we capture this mathematically?
        - Term frequency
        - Inverse document frequency

    TF.IDF Term Weighting
        Simple, yet effective!
        wi,j = tfi,j × log10(N / ni)
        where
        wi,j  = weight assigned to term i in document j
        tfi,j = number of occurrences of term i in document j
        N     = number of documents in the entire collection
        ni    = number of documents with term i
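        A sketch of this weighting scheme, using log base 10 to match the worked example below; the data layout (term → {doc id: raw count}) is my own assumption:

            import math

            def tfidf_weights(tf, n_docs):
                """w(i,j) = tf(i,j) * log10(N / n(i)); n(i) is the document frequency."""
                weights = {}
                for term, counts in tf.items():
                    idf = math.log10(n_docs / len(counts))
                    weights[term] = {doc: count * idf for doc, count in counts.items()}
                return weights

            # Two terms from the example collection on the next slide (N = 4 documents)
            tf = {"complicated": {3: 5, 4: 2}, "nuclear": {1: 3, 3: 7}}
            print(tfidf_weights(tf, 4))
            # -> complicated: {3: 1.505, 4: 0.602}, nuclear: {1: 0.903, 3: 2.107}
            #    (the slide rounds these to 1.51, 0.60, 0.90, 2.11)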
    32. TF.IDF Example (N = 4 documents)

        Term           idf     tf: d1  d2  d3  d4      Wi,j: d1    d2    d3    d4
        complicated    0.301       -   -   5   2             -     -     1.51  0.60
        contaminated   0.125       4   1   3   -             0.50  0.13  0.38  -
        fallout        0.125       5   -   4   3             0.63  -     0.50  0.38
        information    0.000       6   3   3   2             0     0     0     0
        interesting    0.602       -   1   -   -             -     0.60  -     -
        nuclear        0.301       3   -   7   -             0.90  -     2.11  -
        retrieval      0.125       -   6   1   4             -     0.75  0.13  0.50
        siberia        0.602       2   -   -   -             1.20  -     -     -
    33. Normalizing Document Vectors
        Recall our similarity function: the cosine divides by the vector lengths.
        Normalize document vectors in advance using the "cosine normalization" method: divide each term weight by the length of the document vector,
        W'i,j = Wi,j / sqrt(Σi Wi,j²)
    34. Normalization Example
        (tf and idf as in the TF.IDF example above)

        Term           Wi,j: d1    d2    d3    d4      W'i,j: d1    d2    d3    d4
        complicated         -      -     1.51  0.60         -      -     0.57  0.69
        contaminated        0.50   0.13  0.38  -            0.29   0.13  0.14  -
        fallout             0.63   -     0.50  0.38         0.37   -     0.19  0.44
        information         0      0     0     0            0      0     0     0
        interesting         -      0.60  -     -            -      0.62  -     -
        nuclear             0.90   -     2.11  -            0.53   -     0.79  -
        retrieval           -      0.75  0.13  0.50         -      0.77  0.05  0.57
        siberia             1.20   -     -     -            0.71   -     -     -
        Length              1.70   0.97  2.67  0.87
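        A sketch of cosine normalization, checked against Document 4 of this example; the dictionary layout is an assumption, and small differences from the slide come from its two-decimal intermediate rounding:

            import math

            def normalize(doc_weights):
                """Divide each term weight by the Euclidean length of the vector."""
                length = math.sqrt(sum(w * w for w in doc_weights.values()))
                return {term: w / length for term, w in doc_weights.items()}, length

            # Document 4 from the example: complicated, fallout, retrieval
            d4 = {"complicated": 0.602, "fallout": 0.375, "retrieval": 0.50}
            norm, length = normalize(d4)
            print(round(length, 2))  # -> 0.87
            print({t: round(w, 2) for t, w in norm.items()})
            # -> {'complicated': 0.69, 'fallout': 0.43, 'retrieval': 0.58}
            #    (the slide shows 0.69, 0.44, 0.57 after rounding W first)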
    35. Retrieval Example
        Query: contaminated retrieval (as a bag of words: weight 1 for "contaminated" and "retrieval", 0 for every other term)

        Normalized weights W'i,j:
        Term           d1     d2     d3     d4
        complicated    -      -      0.57   0.69
        contaminated   0.29   0.13   0.14   -
        fallout        0.37   -      0.19   0.44
        information    0      0      0      0
        interesting    -      0.62   -      -
        nuclear        0.53   -      0.79   -
        retrieval      -      0.77   0.05   0.57
        siberia        0.71   -      -      -

        Similarity scores: d1 = 0.29, d2 = 0.90, d3 = 0.19, d4 = 0.57
        Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
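        Putting the pieces together, this illustrative snippet reproduces the ranking above from the normalized weights, treating the query as a binary vector as the slides do (the index layout is my own):

            def score(query_terms, normalized_index):
                """Sum the normalized weights of the query terms in each document
                (a dot product with a binary query vector)."""
                scores = {}
                for term in query_terms:
                    for doc, weight in normalized_index.get(term, {}).items():
                        scores[doc] = scores.get(doc, 0.0) + weight
                return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

            # Normalized weights W' for the two query terms, from the slide
            index = {"contaminated": {1: 0.29, 2: 0.13, 3: 0.14},
                     "retrieval":    {2: 0.77, 3: 0.05, 4: 0.57}}
            print(score(["contaminated", "retrieval"], index))
            # -> doc 2 (≈0.90), doc 4 (0.57), doc 1 (0.29), doc 3 (≈0.19)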
    36. Language Models
        Based on the notion of probabilities and of processes for generating text.
        Documents are ranked based on the probability that they generated the query.
        Best/partial match.
    37. What is a Language Model?
        A probability distribution over strings of text: how likely is a string in a given "language"?
        Probabilities depend on what language we're modeling.
        p1 = P("a quick brown dog")
        p2 = P("dog quick a brown")
        p3 = P("быстрая brown dog")
        p4 = P("быстрая собака")
        In a language model for English: p1 > p2 > p3 > p4
        In a language model for Russian: p1 < p2 < p3 < p4
    38. How do we Model a Language?
        - Brute force counts? Think of all the things that have ever been said or will ever be said, of any length, and count how often each one occurs.
        - Is understanding the path to enlightenment? Figure out how meaning and thoughts are expressed, and build a model based on this.
        - Throw up our hands and admit defeat?

    Unigram Language Model
        Assume each word is generated independently.
        - Obviously this is not true…
        - But it seems to work well in practice!
        The probability of a string given a model: the probability of a sequence of words decomposes into a product of the probabilities of the individual words,
        P(w1 w2 … wn | M) = P(w1|M) × P(w2|M) × … × P(wn|M)
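        An illustrative query-likelihood scorer under this unigram assumption (unsmoothed; the dictionary format for the model M is my own). Documents would be ranked by this probability of generating the query:

            import math

            def query_log_likelihood(query_terms, model):
                """log P(q | M) under a unigram model: sum of per-term log probabilities.
                Returns -inf if any term has zero probability (no smoothing here)."""
                total = 0.0
                for term in query_terms:
                    p = model.get(term, 0.0)
                    if p == 0.0:
                        return float("-inf")
                    total += math.log(p)
                return total

            # Model M from the example two slides below
            M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
                 "said": 0.03, "likes": 0.02}
            print(math.exp(query_log_likelihood("the man likes the woman".split(), M)))
            # -> ≈ 8e-08  (0.2 * 0.01 * 0.02 * 0.2 * 0.01)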
    48. Physical Metaphor
        Colored balls are randomly drawn from an urn M (with replacement); generating text is drawing word "balls" one at a time.
        [Figure: the probability of a particular sequence of four balls is the product of the individual draw probabilities, e.g. P(sequence | M) = (4/9) × (2/9) × (4/9) × (3/9) for an urn of 9 balls.]
    49. An Example
        Model M:
        P(w)   w
        0.2    the
        0.1    a
        0.01   man
        0.01   woman
        0.03   said
        0.02   likes
        …

        P("the man likes the woman" | M)
        = P(the|M) × P(man|M) × P(likes|M) × P(the|M) × P(woman|M)
        = 0.2 × 0.01 × 0.02 × 0.2 × 0.01
        = 0.00000008
    50. QUESTIONS?
