
Information Retrieval intro TMM

Intro IR for the Text and Multimedia Mining course

Published in: Science


  1. Text Mining lecture: Information Retrieval. Prof. dr. ir. Arjen P. de Vries, arjen@acm.org. Nijmegen, October 18, 2017.
  2. A Tutorial on Models of Information Seeking, Searching & Retrieval, by @leifos & @guidozuc
  3. Core Research Questions
     - How to represent information?
       - The information need and search requests
       - The objects to be shown in response to an information request
     - How to match information representations?
     Note: the information objects to be retrieved are not necessarily textual! (Van Rijsbergen, 1979)
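As a concrete (and entirely illustrative) instance of "represent, then match": the sketch below represents a query and a document as bag-of-words vectors and matches them by cosine similarity. The texts and function names are my own, not from the lecture.

```python
from collections import Counter
import math

def bow(text):
    """Represent a text (query or document) as a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Match two representations by the cosine of their term vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

query = bow("information retrieval models")
doc = bow("a survey of models for information retrieval")
```

Both choices are up for grabs: the representation could instead be n-grams, entities, or image features, and the matching function could be any retrieval model.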
  4. Two views on 'search'
     DB:
     - Business applications
     - Deductive reasoning
     - Precise and efficient query processing
     - Users with technical skills (SQL) and precise information needs
     - Selection: books WHERE category = 'CS'
     IR:
     - Digital libraries, patent collections, etc.
     - Inductive reasoning
     - Best-effort processing
     - Untrained users with imprecise information needs
     - Ranking: books about CS
     Note: the Semantic Web is more DB than IR! (DB: symbolic; IR: connectionist.)
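The selection-vs-ranking contrast can be sketched in a few lines of Python (a toy illustration; the book data and the term-overlap score are invented for the example):

```python
books = [
    {"title": "Introduction to Algorithms", "category": "CS"},
    {"title": "Modern Information Retrieval", "category": "CS"},
    {"title": "A Brief History of Time", "category": "Physics"},
]

# DB view: deductive, exact selection -- "books WHERE category = 'CS'".
# The answer set is precise: a book is either in or out.
selection = [b for b in books if b["category"] == "CS"]

# IR view: inductive, best-effort ranking -- "books ABOUT information
# retrieval", scored here by simple term overlap with the title.
def overlap(query, title):
    return len(set(query.lower().split()) & set(title.lower().split()))

ranking = sorted(books,
                 key=lambda b: overlap("information retrieval", b["title"]),
                 reverse=True)
```

The selection either contains a book or it does not; the ranking orders every book by an uncertain estimate of relevance.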
  5. Search Flow Chart (from A Tutorial on Models of Information Seeking, Searching & Retrieval by @leifos & @guidozuc)
  6. IR vs. AI
     Many related topics in AI:
     - Computational Linguistics
     - Natural Language Processing
     - Question Answering
     - Information Extraction
     - Machine Translation
     - Computer Vision / Multimedia
     ...versus Information Retrieval?
  7. IR vs. AI (Kunstmatige Intelligentie, Dutch for 'artificial intelligence')
     "In some sense, of course, classic IR is superhuman: there was no pre-existing human skill, as there was with seeing, talking or even chess playing, that corresponded to the search through millions of words of text on the basis of indices. But if one took the view, by contrast, that theologians, lawyers and, later, literary scholars were able, albeit slowly, to search vast libraries of sources for relevant material, then on that view IR is just the optimisation of a human skill and not a superhuman activity. If one takes that view, IR is a proper part of AI, as traditionally conceived."
     Yorick Wilks, "Unhappy bedfellows: the relationship of AI and IR", an essay in honour of Karen Spärck Jones, 2006
  10. Relevance
     - Inherently dependent on user, context, and task
     - Different relevance criteria:
       - Topicality: is the document about the information request?
       - Readability: can I understand the text?
       - Authoritativeness: can I trust the text?
       - Child-suitability: is the text appropriate for children?
       - Etc.
  11. "Computational Relevance"
     "Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model." (Van Rijsbergen, 1976)
     → the Retrieval Model
  12. 'Computational Relevance'
     - How to combine different indicators of relevance? E.g., topicality, child-suitability, polarity, ...
     - Apply 'copulas' (a technique from econometrics) to model non-linear dependencies (SIGIR 2013, CIKM 2014)
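To give a flavour of the copula idea — this is a generic textbook illustration, not the specific models of the cited SIGIR 2013 / CIKM 2014 papers — here is the Clayton copula, a one-parameter copula with a closed form. It combines two relevance indicators with marginal scores u and v into a joint score that, for theta > 0, exceeds the independent combination u·v:

```python
def clayton_copula(u, v, theta=2.0):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta).

    For theta > 0 it models positive dependence between two relevance
    indicators with marginal scores u, v in (0, 1]; as theta -> 0 it
    approaches independence, C(u, v) = u * v.
    """
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Under independence, two medium indicators combine to 0.5 * 0.5 = 0.25;
# a positively dependent copula assigns a higher joint score.
dependent = clayton_copula(0.5, 0.5, theta=2.0)
independent = 0.5 * 0.5
```

This is exactly the non-linearity the slide alludes to: a linear or product combination cannot express that two indicators tend to fail (or succeed) together.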
  13. Relevance
     - The various aspects of understanding this notion of relevance position information retrieval between computer science and information science.
     - Examples of questions that traditionally do not even presume the involvement of a computer:
       - What makes an information object relevant?
       - What stages constitute a search process?
       - How does relevance evolve during this search process?
       - How do users learn from the search process?
       - Why do users issue short queries even though we know that long ones are more effective?
       - Etc.
  14. NLP in IR
     - Stemming & stopping: the de facto default setting
     - N-grams (bigrams): e.g. the SDM (Sequential Dependence Model)
     - Entity tagging
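The stopping/stemming/bigram pipeline can be sketched as follows. The stemmer here is a deliberately crude suffix stripper for illustration only — a real system would use e.g. the Porter or Snowball stemmer — and SDM additionally scores unordered term windows, which this sketch omits:

```python
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "for"}

def suffix_stem(token):
    """Crude suffix stripping (illustration only); note it mangles
    e.g. "information" -> "informa", unlike a proper stemmer."""
    for suffix in ("ing", "tion", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Stopping + stemming, plus the ordered bigrams an SDM-style
    model would score alongside single terms."""
    stems = [suffix_stem(t) for t in text.lower().split()
             if t not in STOPWORDS]
    bigrams = list(zip(stems, stems[1:]))
    return stems, bigrams

stems, bigrams = preprocess("the retrieval of information in documents")
```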
  15. Footnote in Victor Lavrenko's PhD thesis
     "It is my personal observation that almost every mathematically inclined graduate student in Information Retrieval attempts to formulate some sort of a non-independent model of IR within the first two or three years of his studies. The vast majority of these attempts yield no improvements and remain unpublished."
  16. Take words as they stand!
  17. The Secret
     The user can simply reformulate their information need in response to insufficiently relevant results retrieved by the system!
  18. Why Search Remains Difficult to Get Right
     - Heterogeneous data sources: WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, ...
     - Varying result types: "documents", tweets, courses, people, experts, gene expressions, temperatures, ...
     - Multiple dimensions of relevance: topicality, recency, reading level, ...
     Actual information needs often require a mix within and across dimensions, e.g. "recent news and patents from our top competitors".
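A common baseline for mixing relevance dimensions is a weighted linear combination, sketched below. The documents, scores, and weights are invented for illustration — and, as the copula slide above argues, a linear mix is exactly what may be too weak when the dimensions depend on each other:

```python
def combined_score(doc, weights):
    """Weighted linear mix over named relevance dimensions."""
    return sum(weights[dim] * doc[dim] for dim in weights)

# Hypothetical per-dimension scores in [0, 1] for two results.
docs = [
    {"id": "patent-17", "topicality": 0.9, "recency": 0.2, "reading_level": 0.8},
    {"id": "news-42",   "topicality": 0.6, "recency": 0.9, "reading_level": 0.7},
]

# A need like "recent news and patents from our top competitors"
# would put substantial weight on recency, not just topicality.
weights = {"topicality": 0.5, "recency": 0.4, "reading_level": 0.1}
ranked = sorted(docs, key=lambda d: combined_score(d, weights), reverse=True)
```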
  19. The system's internal information representation
     - Linguistic annotations: named entities, sentiment, dependencies, ...
     - Knowledge resources: Wikipedia, Freebase, ICD-9, IPTC, ...
     - Links to related documents: citations, URLs
     - Anchors that describe the URI: anchor text
     - Queries that lead to clicks on the URI: session, user, dwell time, ...
     - Tweets that mention the URI: time, location, user, ...
     - Other social media that describe the URI: user, rating; tags, organisation of a 'folksonomy'
     + UNCERTAINTY ALL OVER!
