Information Retrieval


Published on

This talk features the basics behind the science of Information Retrieval with a story-mode on information and its various aspects. It then takes you through a quick journey into the process behind building of the search engine.

Published in: Technology, Education
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Information Retrieval

  1. 1. INFORMATION RETRIEVAL<br />A Look into the Science of Web Search Engines<br />1<br />Reach me on Twitter: @matifq Email<br />Muhammad AtifQureshi<br />
  2. 2. Contents<br />Story Mode Learning<br />Learning by Imagination<br />Appendix<br />2<br />Reach me on Twitter: @matifq Email<br />
  3. 3. Story Mode Learning<br />(Borrowed from Prof. Jimmy Lin,<br />University of Maryland, Scientist in Twitter)<br />3<br />Reach me on Twitter: @matifq Email<br />
  4. 4. Information Retrieval Systems<br />Information<br />What is “information”?<br />Retrieval<br />What do we mean by “retrieval”?<br />What are different types information needs?<br />Systems<br />How do computer systems fit into the human information seeking process?<br />4<br />Reach me on Twitter: @matifq Email<br />
  5. 5. What is Information?<br />What do you think?<br />There is no “correct” definition<br />Cookie Monster’s definition:<br /> “news or facts about something”<br />Different approaches:<br />Philosophy<br />Psychology<br />Linguistics<br />Electrical engineering<br />Physics<br />Computer science<br />Information science<br />5<br />Reach me on Twitter: @matifq Email<br />
  6. 6. Dictionary says…<br />Oxford English Dictionary<br />information: informing, telling; thing told, knowledge, items of knowledge, news<br />knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known<br />Random House Dictionary<br />information: knowledge communicated or received concerning a particular fact or circumstance; news<br />6<br />Reach me on Twitter: @matifq Email<br />
  7. 7. Intuitive Notions<br />Information must<br />Be something, although the exact nature (substance, energy, or abstract concept) is not clear;<br />Be “new”: repetition of previously received messages is not informative<br />Be “true”: false or counterfactual information is “mis-information”<br />Be “about” something<br />Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269.<br />7<br />Reach me on Twitter: @matifq Email<br />
  8. 8. Three Views of Information<br />Information as process<br />Information as communication<br />Information as message transmission and reception<br />8<br />Reach me on Twitter: @matifq Email<br />
  9. 9. One View<br />Information = characteristics of the output of a process<br />Tells us something about the process and the input<br />Information-generating process do not occur in isolation<br />Input<br />Output<br />Process<br />Input<br />Output<br />Input<br />Output<br />Process1<br />Process2<br />Input<br />Output<br />…<br />9<br />Reach me on Twitter: @matifq Email<br />
  10. 10. Where’s the human?<br />If a tree falls in the forest, and no one is around to hear it, is information transmitted?<br />In the “information as process”: Yes, but that’s not very interesting to us<br />We’re concerned about information for human consumption<br />Transmission of information from one person to another<br />Recording of information<br />Reconstruction of stored information<br />10<br />Reach me on Twitter: @matifq Email<br />
  11. 11. Another View<br />Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient”<br />This implies that the sender has knowledge of the recipient's structure<br />Text = “a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient”<br />Information = “the structure of any text which is capable of changing the image-structure of a recipient”<br />11<br />Reach me on Twitter: @matifq Email<br />
  12. 12. Transfer of Information<br />Communication = transmission of information<br />Thoughts<br />Thoughts<br />Telepathy?<br />Words<br />Words<br />Writing<br />Sounds<br />Sounds<br />Speech<br />Encoding<br />Decoding<br />12<br />Reach me on Twitter: @matifq Email<br />
  13. 13. Information Theory<br />Better called “communication theory”<br />Developed by Claude Shannon in 1940’s<br />Concerned with the transmission of electrical signals over wires<br />How do we send information quickly and reliably?<br />Underlies modern electronic communication:<br />Voice and data traffic…<br />Over copper, fiber optic, wireless, etc.<br />Famous result: Channel Capacity Theorem<br />Formal measure of information in terms of entropy<br />Information = “reduction in surprise”<br />13<br />Reach me on Twitter: @matifq Email<br />
  14. 14. The Noisy Channel Model<br />Communication = producing the same message at the destination that was sent at the source<br />The message must be encoded for transmission across a medium (called channel)<br />But the channel is noisy and can distort the message<br />Semantics (meaning) is irrelevant<br />channel<br />Receiver<br />message<br />Transmitter<br />noise<br />Source<br />Destination<br />message<br />14<br />Reach me on Twitter: @matifq Email<br />
  15. 15. A Synthesis<br />Information retrieval as communication over time and space, across a noisy channel<br />Sender<br />Recipient<br />Encoding<br />Decoding<br />Transmitter<br />Receiver<br />channel<br />storage<br />message<br />message<br />indexing/writing<br />retrieval/reading<br />noise<br />Source<br />Destination<br />message<br />message<br />noise<br />15<br />Reach me on Twitter: @matifq Email<br />
  16. 16. “Retrieval?”<br />“Fetch something” that’s been stored<br />Recover a stored state of knowledge<br />Search through stored messages to find some messages relevant to the task at hand<br />Encoding<br />Decoding<br />storage<br />Sender<br />Recipient<br />message<br />message<br />indexing/writing<br />Retrieval/reading<br />noise<br />16<br />Reach me on Twitter: @matifq Email<br />
  17. 17. What is IR?<br />Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user<br />Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143.<br />17<br />Reach me on Twitter: @matifq Email<br />
  18. 18. Types of Information Needs<br />Retrospective<br />“Searching the past”<br />Different queries posed against a static collection<br />Time invariant<br />Prospective<br />“Searching the future”<br />Static query posed against a dynamic collection<br />Time dependent<br />18<br />Reach me on Twitter: @matifq Email<br />
  19. 19. Retrospective Searches (I)<br />Ad hoc retrieval: find documents “about this”<br />Known item search<br />Directed exploration<br />Identify positive accomplishments of the Hubble telescope since it was launched in 1991.<br />Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.<br />Find Jimmy Lin’s homepage.<br />What’s the ISBN number of “Modern Information Retrieval”?<br />Who makes the best chocolates?<br />What video conferencing systems exist for digital reference desk services?<br />19<br />Reach me on Twitter: @matifq Email<br />
  20. 20. Retrospective Searches (II)<br />Question answering<br />Who discovered Oxygen?<br />When did Hawaii become a state?<br />Where is Ayer’s Rock located?<br />What team won the World Series in 1992?<br />“Factoid”<br />What countries export oil?<br />Name U.S. cities that have a “Shubert” theater.<br />“List”<br />Who is Aaron Copland?<br />What is a quasar?<br />“Definition”<br />20<br />Reach me on Twitter: @matifq Email<br />
  21. 21. Prospective “Searches”<br />Filtering<br />Make a binary decision about each incoming document<br />Routing<br />Sort incoming documents into different bins?<br />Spam or not spam?<br />Categorize news headlines: World? Nation? Metro? Sports?<br />21<br />Reach me on Twitter: @matifq Email<br />
  22. 22. What types of information?<br />Text (Documents and portions thereof)<br />XML and structured documents<br />Images<br />Audio (sound effects, songs, etc.) <br />Video<br />Source code<br />Applications/Web services<br />22<br />Reach me on Twitter: @matifq Email<br />
  23. 23. Content-Based Search<br />This is a relative new concept!<br />What else would you search on?<br />What’s more effective?<br />Why is this hard in many applications?<br />23<br />Reach me on Twitter: @matifq Email<br />
  24. 24. Interesting Examples<br />Google image search<br />Google video search<br />Query by humming<br /><br /><br /><br />24<br />Reach me on Twitter: @matifq Email<br />
  25. 25. What about databases?<br />What are examples of databases?<br />Banks storing account information<br />Retailers storing inventories<br />Universities storing student grades<br />What exactly is a (relational) database?<br />Think of them as a collection of tables<br />They model some aspect of “the world”<br />25<br />Reach me on Twitter: @matifq Email<br />
  26. 26. A (Simple) Database Example<br />Student Table<br />Department Table<br />Course Table<br />Enrollment Table<br />26<br />Reach me on Twitter: @matifq Email<br />
  27. 27. Database Queries<br />What would you want to know from a database?<br />What classes is John Arrow enrolled in?<br />Who has the highest grade in LBSC 690?<br />Who’s in the history department?<br />Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters and were born on a Monday, who has the longest email address?<br />27<br />Reach me on Twitter: @matifq Email<br />
  28. 28. Databases vs. IR<br />28<br />IR<br />Databases<br />What we’re retrieving<br />Mostly unstructured. Free text with some metadata.<br />Structured data. Clear semantics based on a formal model.<br />Queries we’re posing<br />Vague, imprecise information needs (often expressed in natural language).<br />Formally (mathematically) defined queries. Unambiguous.<br />Results we get<br />Sometimes relevant, often not.<br />Exact. Always correct in a formal sense.<br />Interaction with system<br />Interaction is important.<br />One-shot queries.<br />Other issues<br />Issues downplayed.<br />Concurrency, recovery, atomicity are all critical.<br />Reach me on Twitter: @matifq Email<br />
  29. 29. The Big Picture<br />The four components of the information retrieval environment:<br />User<br />Process<br />System<br />Collection<br />What computer geeks care about!<br />What we care about!<br />29<br />Reach me on Twitter: @matifq Email<br />
  30. 30. The Information Retrieval Cycle<br />Resource<br />Query<br />Ranked List<br />Documents<br />query reformulation,<br />vocabulary learning,<br />relevance feedback<br />Documents<br />source reselection<br />Source<br />Selection<br />Query<br />Formulation<br />Search<br />Selection<br />Examination<br />Delivery<br />30<br />Reach me on Twitter: @matifq Email<br />
  31. 31. Supporting the Search Process<br />Source<br />Selection<br />Resource<br />Query<br />Formulation<br />Query<br />Search<br />Ranked List<br />Selection<br />Indexing<br />Documents<br />Index<br />Examination<br />Acquisition<br />Documents<br />Collection<br />Delivery<br />31<br />Reach me on Twitter: @matifq Email<br />
  32. 32. Simplification?<br />Resource<br />Query<br />Ranked List<br />Documents<br />query reformulation,<br />vocabulary learning,<br />relevance feedback<br />Documents<br />source reselection<br />Source<br />Selection<br />Is this itself a vast simplification?<br />Query<br />Formulation<br />Search<br />Selection<br />Examination<br />Delivery<br />32<br />Reach me on Twitter: @matifq Email<br />
  33. 33. Tackling the IR Challenge<br />Divide and conquer!<br />Strategy: use encapsulation to limit complexity<br />Approach:<br />Define interfaces (input and output) for each component<br />Define the functions performed by each component<br />Study each component in isolation<br />Repeat the process within components as needed<br />Make sure that this decomposition makes sense<br />Result: a hierarchical decomposition<br />33<br />Reach me on Twitter: @matifq Email<br />
  34. 34. Where do we make the cut?<br />Study the IR black box in isolation<br />Simple behavior: in goes query, out comes documents<br />Optimize the quality of documents that come out<br />Study everything else around the black box<br />Put the human back in the loop!<br />Search<br />Query<br />Ranked List<br />34<br />Reach me on Twitter: @matifq Email<br />
  35. 35. The IR Black Box<br />Documents<br />Query<br />Hits<br />35<br />Reach me on Twitter: @matifq Email<br />
  36. 36. Inside The IR Black Box<br />Documents<br />Query<br />Representation<br />Function<br />Representation<br />Function<br />Query Representation<br />Document Representation<br />Index<br />Comparison<br />Function<br />Hits<br />36<br />Reach me on Twitter: @matifq Email<br />
  37. 37. The Central Problem in IR<br />Information Seeker<br />Authors<br />Concepts<br />Concepts<br />Query Terms<br />Document Terms<br />Do these represent the same concepts?<br />37<br />Reach me on Twitter: @matifq Email<br />
  38. 38. Learning by Imagination<br />38<br />Reach me on Twitter: @matifq Email<br />
  39. 39. Imagine a System<br />We have 1000s of web pages, what make these web pages different?<br />May be different key terms or key words occurring in different web pages (e.g., sports, education, video sharing)<br />39<br />Reach me on Twitter: @matifq Email<br />
  40. 40. Realize Query Needs<br />What do we expect when query?<br />Query can be single word (no order), collection of words i.e., free sentence (order does not matter) or strict phrase (order matters e.g., "I love Pakistan")<br />How to manage data of web pages<br />Bag of words data structure with/without position of words/terms (simply, posting list of words/terms)<br />What’s the best match?<br />We have many matching results, but what’s the order?<br />40<br />Reach me on Twitter: @matifq Email<br />
  41. 41. Order of Matching Results<br />How could we rank web pages? <br />Via query content matching score against web pages i.e., content based methods <br />Via importance of web pages i.e., link based methods<br />41<br />Reach me on Twitter: @matifq Email<br />
  42. 42. What does Content Tell?<br />Content Information:<br />Rare terms give more information than frequent terms as common terms do not differentiate well between the content of documents (Information entropy)<br />So what does common words make? <br />Stop words (extreme case, e.g., it, a, the) or words with lesser importance (e.g., word science inside scientific documents)<br />42<br />Reach me on Twitter: @matifq Email<br />
  43. 43. Ranking Methods<br />Content based methods:<br />Examples: Tf-idf with cosine similarity, bm25, etc.<br />Link based methods:<br />Examples: PageRank, HITS, etc.<br />43<br />Reach me on Twitter: @matifq Email<br />
  44. 44. What is More in Ranking?<br />What other measures we can take for ranking better?<br />Combining content based methods with link based methods<br />How about learning to rank by user click through data (apply machine learning)<br />How about learning from social web (apply social science theories)<br />44<br />Reach me on Twitter: @matifq Email<br />
  45. 45. Lots of Web Pages<br />How about scalability? <br />We have too many words, can we limit them? <br />Example: Is Studying conceptually different from study or studies? may be not (concept called stemming could simply everything to simple concept study)<br />Stemming may not be sufficient then how about clustering web pages into topics i.e., (terms study, science, arts, university, school, college would single concept or a topic may be called as topic education)<br />45<br />Reach me on Twitter: @matifq Email<br />
  46. 46. Is it sufficient?<br />Can we feel confident about how Web Search Engine works?<br />No, it was just a summary for the day<br />46<br />Reach me on Twitter: @matifq Email<br />
  47. 47. Guess! what next you would see?<br />?<br />47<br />Reach me on Twitter: @matifq Email<br />
  48. 48. Our search engine<br />Yes we are making it<br />48<br />Reach me on Twitter: @matifq Email<br />
  49. 49. Appendix<br />49<br />Reach me on Twitter: @matifq Email<br />
  50. 50. Outline<br />What is Research?<br />How to prepare yourself for IR research?<br />50<br />Reach me on Twitter: @matifq Email<br />
  51. 51. What is Research?<br />51<br />Reach me on Twitter: @matifq Email<br />
  52. 52. What is Research?<br />Research<br />Discover new knowledge <br />Seek answers to questions<br />Basic research<br />Goal: Expand man’s knowledge (e.g., which genes control social behavior of honey bees? )<br />Often driven by curiosity (but not always)<br />High impact examples: relativity theory, DNA, … <br />Applied research<br />Goal: Improve human condition (i.e., improve the world) (e.g., how to cure cancers?)<br />Driven by practical needs<br />High impact examples: computers, transistors, vaccinations, …<br />The boundary is vague; distinction isn’t important<br />52<br />Reach me on Twitter: @matifq Email<br />
  53. 53. Why Research?<br />Funding<br />Curiosity<br />Utility of <br />Applications<br />Advancement of <br />Technology<br />Amount of <br />knowledge<br />Application<br />Development<br />Applied Research<br />Basic Research<br />53<br />Reach me on Twitter: @matifq Email<br />
  54. 54. Where’s IR Research?<br />Information Science<br />Funding<br />Quality of <br />Life<br />Utility of <br />Applications<br />Advancement of <br />Technology<br />Amount of <br />knowledge<br />Computer Science<br />Application<br />Development<br />Applied Research<br />Basic Research<br />54<br />Reach me on Twitter: @matifq Email<br />
  55. 55. Research Process<br />Identification of the topic (e.g., Web search)<br />Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art)<br />Experiment design (measures, data, etc) (e.g., retrieval accuracy on a sample of web data)<br />Test hypothesis (e.g., compare X and Y on the data)<br />Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)<br />55<br />Reach me on Twitter: @matifq Email<br />
  56. 56. Typical IR Research Process<br />Look for a high-impact topic (basic or applied)<br />New problem: define/frame the problem <br />Identify weakness of existing solutions if any<br />Propose new methods <br />Choose data sets (often a main challenge)<br />Design evaluation measures (can be very difficult)<br />Run many experiments (need to have clear research hypotheses)<br />Analyze results and repeat the steps above if necessary<br />Publish research results<br />56<br />Reach me on Twitter: @matifq Email<br />
  57. 57. Research Methods<br />Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”)<br />Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”)<br />Empirical research: evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”)<br /> The “E-C-E cycle”: exploratoryconstructiveempiricalexploratory…<br />57<br />Reach me on Twitter: @matifq Email<br />
  58. 58. Types of Research Questions and Results<br />Exploratory (Framework): What’s out there? <br />Descriptive (Principles): What does it look like? How does it work?<br />Evaluative (Empirical results): How well does a method solve a problem? <br />Explanatory (Causes): Why does something happen the way it happens? <br />Predictive (Models): What would happen if xxx ?<br />58<br />Reach me on Twitter: @matifq Email<br />
  59. 59. Solid and High Impact Research<br />Solid work: <br />A clear hypothesis (research question) with conclusive result (either positive or negative)<br />Clearly adds to our knowledge base (what can we learn from this work?)<br />Implications: a solid, focused contribution is often better than a non-conclusive broad exploration<br />High impact = high-importance-of-problem * high-quality-of-solution<br />high impact = open up an important problem<br />high impact = close a problem with the best solution<br />high impact = major milestones in between<br />Implications: question the importance of the problem and don’t just be satisfied with a good solution, make it the best <br />59<br />Reach me on Twitter: @matifq Email<br />
  60. 60. How to Prepare Yourself for IR Research?<br />60<br />Reach me on Twitter: @matifq Email<br />
  61. 61. What it Takes to do Research?<br />Curiosity: allow you to ask questions<br />Critical thinking: allow you to challenge assumptions<br />Learning: take you to the frontier of knowledge<br />Persistence: so that you don’t give up<br />Respect data and truth: ensure your research is solid<br />Communication: allow you to publish your work<br />…<br />61<br />Reach me on Twitter: @matifq Email<br />
  62. 62. Learning about IR (1/2)<br />Start with an IR text book (e.g., Manning et al., Grossman & Frieder, a forth-coming book from UMass,…)<br />Then read “Readings in IR” by Karen Sparck Jones, Peter Willett <br />And read papers recommended in the following article:<br />Read other papers published in recent IR/IR-related conferences<br />62<br />Reach me on Twitter: @matifq Email<br />
  63. 63. Learning about IR (2/2)<br />Getting more focused <br />Choose your favorite sub-area (e.g., retrieval models)<br />Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization)<br />Stay in frontier:<br />Keep monitoring literature in both IR and related areas<br />Broaden your view: Keep an eye on <br />Industry activities <br />Read about industry trends<br />Try out novel prototype systems<br />Funding trends<br />Read request for proposals<br />63<br />Reach me on Twitter: @matifq Email<br />
  64. 64. Critical Thinking<br />Develop a habit of asking questions, especially why questions<br />Always try to make sense of what you have read/heard; don’t let any question pass by<br />Get used to challenging everything<br />Practical advice<br />Question every claim made in a paper or a talk (can you argue the other way?) <br />Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it)<br />Force yourself to challenge one point in every talk that you attend and raise a question<br />64<br />Reach me on Twitter: @matifq Email<br />
  65. 65. Respect Data and Truth<br />Be honest with the experiment results <br />Don’t throw away negative results! <br />Try to learn from negative results<br />Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data<br />Be objective in data analysis and interpretation; don’t mislead readers <br />Aim at understanding/explanation instead of just good results<br />Be careful not to over-generalize (for both good and bad results); you may be far from the truth<br />65<br />Reach me on Twitter: @matifq Email<br />
  66. 66. Communications<br />General communication skills: <br />Oral and written<br />Formal and informal<br />Talk to people with different level of backgrounds<br />Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction) <br />English proficiency<br />Get used to talking to people from different fields<br />66<br />Reach me on Twitter: @matifq Email<br />
  67. 67. Persistence<br />Work only on topics that you are passionate about<br />Work only on hypotheses that you believe in<br />Don’t draw negative conclusions prematurely and give up easily<br />positive results may be hidden in negative results<br />In many cases, negative results don’t completely reject a hypothesis <br />Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper)<br />Think of possibilities of repositioning a work<br />67<br />Reach me on Twitter: @matifq Email<br />
  68. 68. Optimize Your Training<br />Know your strengths and weaknesses<br />strong in math vs. strong in system development<br />creative vs. thorough<br />…<br />Train yourself to fix weaknesses<br />Find strategic partners<br />Position yourself to take advantage of your strengths<br />68<br />Reach me on Twitter: @matifq Email<br />
  69. 69. Thank You<br />Reach me on Twitter: @matifq<br />Email me:<br />69<br />Reach me on Twitter: @matifq Email<br />