Aggregation for searching complex information spaces



The diversity and complexity of content available on the web have dramatically increased in recent years. Multimedia content such as images, videos, maps, and voice recordings is published more than ever before. Document genres have also diversified: news, blogs, FAQs, wikis. These diversified information sources are often dealt with separately. For example, in web search, users have to switch between search verticals to access different sources. Recently, there has been growing interest in finding effective ways to aggregate these information sources, so as to hide the complexity of the information spaces from users searching for relevant information. For example, so-called aggregated search, investigated by the major search engine companies, provides search results from several sources in a single result page. Aggregation itself is not a new paradigm; for instance, aggregate operators are common in database technology.

This talk presents the challenges faced by the likes of web search engines and digital libraries in providing the means to aggregate information from several, complex information spaces in a way that helps users in their information-seeking tasks. It also discusses how other disciplines, including databases, artificial intelligence, and cognitive science, can be brought to bear on building effective and efficient aggregated search systems.

Published in: Technology
  • Passage retrieval was an active research area in the mid 90s. The idea is that a document is decomposed into passages, and IR is then performed against passages rather than the whole document. A main issue is the actual decomposition of the document into passages. There were three main techniques: sliding windows of words; using the discourse, e.g. every sentence, or more likely every paragraph, is a passage; or using a topic segmentation algorithm such as TextTiling to identify shifts in topics and to assign passage boundaries at such topic shifts.
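The three decomposition techniques above can be sketched as follows; the function names and parameter defaults are illustrative, not taken from any specific system.

```python
def window_passages(words, size=100, stride=50):
    """Sliding windows of words, overlapping by (size - stride) words."""
    return [words[i:i + size]
            for i in range(0, max(1, len(words) - size + 1), stride)]

def discourse_passages(text):
    """Discourse-based decomposition: every paragraph is a passage."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Topic-based decomposition (e.g. TextTiling) would instead place
# passage boundaries at detected topic shifts; NLTK ships one:
#   from nltk.tokenize import TextTilingTokenizer
```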
  • Now I will concentrate on the use of structure, and in particular XML, to perform focused retrieval. But let me first say a few words about structure and documents. Structure is everywhere in a document. I will mainly concentrate on structure in the sense of the hierarchical or logical structure.
  • This illustrates the logical structure of a document, and how this is represented in the XML markup language. XML is the de facto markup language for documents, and represents both the content and the structure of documents. Note that here doc, head, text, and quote are what are called XML elements. Doc is the root element, or in other words the whole document.
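The doc/head/text/quote example can be reproduced with Python's standard xml.etree.ElementTree; this is just a sketch of how the logical structure maps onto a parse tree.

```python
import xml.etree.ElementTree as ET

xml = """<doc>
  <head>This is a heading</head>
  <text>This is some text</text>
  <quote>This is a quote</quote>
</doc>"""

root = ET.fromstring(xml)
assert root.tag == "doc"                       # the root element: the whole document
children = [child.tag for child in root]       # the logical structure below it
assert children == ["head", "text", "quote"]
```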
  • How can the structure be used for focused retrieval? Before describing the use of structure to retrieve so-called XML elements, I want to say a few words about complex information needs.
  • To express information needs in the context of XML retrieval, we need query languages. These can be divided into four main groups, with an increase in expressivity from the first to the last. The appropriate query language depends on the application and its users.
  • Now, what XML retrieval is about: beforehand, we do not know at which level of the hierarchy the best answer to a given query is contained. So we do not know what the retrieval unit actually is, as any element can be returned. Elements are related to each other, and this may be useful in finding those that should be returned. Finally, we may have structural constraints to satisfy.
  • This is an example of the outcome of the techniques described so far, where, as a result to a given query, we have a ranked list of elements.
  • Many models from IR, and approaches from other areas, were used to rank elements for a given query. It is not yet known which model is best. XML element retrieval is more a combination-of-evidence problem, where what matters is what gets combined. We know that, whatever the model, combining the element score (estimated relevance), the document score, and some information about the element size is what brings the best performance.
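A minimal sketch of such a combination of evidence; the linear form and the weights are illustrative assumptions, since the point above is only that element score, document score, and element size get combined.

```python
import math

def combined_score(element_score, document_score, element_length,
                   w_elem=0.6, w_doc=0.3, w_size=0.1):
    """Combine evidence for ranking an XML element.
    A log length prior discourages returning very small elements.
    The weights are hypothetical; in practice they would be tuned
    on relevance data (e.g. previous INEX assessments)."""
    size_prior = math.log(1.0 + element_length)
    return (w_elem * element_score
            + w_doc * document_score
            + w_size * size_prior)
```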
  • At INEX, a number of user studies were carried out (Tassos Tombros at QMUL was heavily involved in these). One outcome was that, although users liked to be returned XML elements, they preferred these to be grouped by the document containing them. This led to the definition of a retrieval task at INEX, called relevance in context: rank articles, and highlight the relevant elements within each. This can be implemented as a heat map. This is an example of an aggregated result.
  • A second outcome of the user studies was that users liked to be shown the context of the element they were reading. This can be done by the following interface, where on one side we display the element, and on the other side the ToC of the document containing that element. As part of his PhD, Zoltan looked at ways to build a ToC that adapts to the element being displayed, where, for example, the structure of the elements around the displayed element is shown in more detail than that of elements further away. This is again an example of an aggregated result.
  • Initial ranking of elements, where elements in the same document are shown in the same colour. We could aggregate elements in various ways to form virtual documents. The terminology of virtual documents came a while ago from a discussion between TR and YC, and Thomas has extended it in the so-called augmentation context. Relevance in context (RiC) is a special case of the above, where we group elements per document. There is nothing stopping us from doing the grouping across documents.
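Grouping per document, with relevance in context as the special case, can be sketched as below; the function name and input shape are illustrative.

```python
from collections import OrderedDict

def group_by_document(ranked_elements):
    """Group a ranked list of (doc_id, element_path) pairs by their
    containing document, keeping documents in the order of their
    best-ranked element (the Relevance-in-Context presentation)."""
    groups = OrderedDict()
    for doc_id, element in ranked_elements:
        groups.setdefault(doc_id, []).append(element)
    return groups
```

Replacing doc_id by any other aggregation key (topic, genre, cluster) gives grouping across documents with the same code.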
  • The notion of virtual documents is not new, for example on the web, although the aggregation aspect is not exactly the same. One example of virtual documents is the organization of results into clusters: a cluster is a virtual document. We can go further and provide a summary of the documents forming the clusters, which is what WebInEssence (no longer working) does. In the news domain, the TDT task at TREC is about detecting topics and tracking them, so as to present the whole news item to users in one go.
  • Aggregated retrieval may also consist of what I call aggregated views. Going back to our example, we could for instance have one window showing one type of result (title elements), another window showing abstract elements, and so on. Note that the ToC is a special case, where we have an element in one window and, in the other, a selected number of elements that correspond to the ToC for that element.
  • Test collection generation:
    • Existing test collections (topic, doc, judgment)
    • Defining a set of verticals (varied across genres and media)
    • Document classification (mapping documents in existing test collections into the verticals defined)
    • Duplicate topic detection (detecting duplicate topics across existing test collections, e.g. “golden gate bridge” in the image and web test collections)
    • Extracting topics with more than one vertical intent
    • Adding topics with only a “General Web” vertical intent
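Two of the steps above (document classification into verticals, duplicate topic detection) might be prototyped as below; the source-to-vertical mapping and the normalised exact-match rule are simplifying assumptions, not the pipeline's actual classifiers.

```python
def assign_vertical(doc_source):
    """Map a document's source collection to a (simulated) vertical.
    The mapping keys here are hypothetical collection identifiers."""
    mapping = {
        "imageclef": "Image",
        "trec-blog": "Blog",
        "inex": "Reference (Encyclopedia)",
        "trec-web": "General Web",
    }
    return mapping.get(doc_source, "General Web")

def duplicate_topics(topics_a, topics_b):
    """Detect topics shared by two collections, using a
    whitespace/case-normalised exact match."""
    norm = lambda t: " ".join(t.lower().split())
    return {norm(t) for t in topics_a} & {norm(t) for t in topics_b}
```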
  • Two tables + one figure. Table 1: statistics of the whole test collection. Figure 1: breakdown of text documents into different genres. Table 2: basic statistics of topics. These results are based on the AIRS paper.

    1. Aggregation for searching complex information spaces
       Mounia Lalmas [email_address]
    2. Outline
       • Document Retrieval
       • Focused Retrieval
       • Aggregated Retrieval
       • (My) Current research on aggregated search
       • Some perspectives on aggregated search
       INEX - INitiative for the Evaluation of XML Retrieval
       Complexity of the information space(s)
    3. A bit about myself
       • 1999-2008: Lecturer to Professor at Queen Mary University of London
       • 2008-2010: Microsoft Research/RAEng Research Professor at the University of Glasgow (and live outside London)
       • 2011-: Visiting Principal Scientist at Yahoo! Research Barcelona
       • Research topics:
         • XML retrieval and evaluation (INEX)
         • Quantum theory to model interactive information retrieval
         • Aggregated search
         • Bridging the digital divide (Eastern Cape in South Africa)
         • Models and measures of user engagement (Yahoo!)
    4. Three retrieval paradigms
       Document Retrieval, Focused Retrieval, Aggregated Retrieval
       Complexity of the information space(s)
    5. Classical document retrieval
       Query + document corpus → retrieval system → ranked documents
       One homogeneous information space
    6. Classical document retrieval process
       Documents and the query each pass through a representation function; the document representations form an index; the retrieval function matches the query representation against the index to produce ranked documents.
    7. Information retrieval process
       The same pipeline, with added dimensions: task, context, interface, interaction, multimodality, genre, media, language, structure, heterogeneity.
       (The Turn, Ingwersen & Jarvelin, 2005)
    8. Focused Retrieval
       • Question & Answering
       • Passage Retrieval
       • (XML) Element Retrieval
       • …
       From one information space to a more complex one and/or several of them
    9. Focused Retrieval - Question & Answering
    10. Focused Retrieval - Passage Retrieval
       A document is segmented into passages; passages are returned as answers to a given query.
       • Passage defined based on:
         • Window
         • Discourse
         • Topic
       Lots of work in the mid 90s
    11. Structure in documents
       • Linear order of words, sentences, paragraphs
       • Hierarchy or logical structure of a book’s chapters, sections
       • Links (hyperlinks), cross-references, citations
       • Temporal and spatial relationships in multimedia documents
       • Fields with factual data (author, date)
    12. Logical structure - XML Document
       <doc>
         <head>This is a heading</head>
         <text>This is some text</text>
         <quote>This is a quote</quote>
       </doc>
       doc is the root element; head, text, and quote are its child elements.
    13. Using the (XML) structure
       • Traditional document retrieval is about finding documents relevant to a user’s information need, e.g. an entire book.
       • Structure allows retrieval of document parts (XML elements) relevant to a user’s information need, e.g. a chapter, a page, or several paragraphs of a book, instead of the entire book.
       • Structure can be exploited to express complex information needs, e.g. a section about wine making in a chapter about German wine.
    14. Query languages for XML Retrieval
       • Keyword search: “football”
       • Tag + keyword search: section: “football”
       • Path expression + keyword search (NEXI): /section[ about(./title, “world cup football”) ]
       • XQuery + complex full-text search: for $b in /book//section let score $s := $b contains text “world” ftand “cup” distance at most 5 words
       (Sihem Amer-Yahia)
    15. XML Retrieval
       • No predefined retrieval units
       • Dependency of retrieval units
       • Structural constraints
       (Book → chapters → sections → subsections)
       • Retrieval aims:
         • Not only to find relevant elements with respect to content and structure
         • But those at the appropriate level of granularity
    16. Evaluation of XML Retrieval: INEX
       • Promote research and stimulate development of XML information access and retrieval, through:
         • Creation of evaluation infrastructure and organisation of regular evaluation campaigns for system testing
         • Building of an XML information access and retrieval research community
         • Construction of test-suites
       Collaborative effort: participants contribute to the development of the collection.
       Ends with a yearly workshop, in December, in Dagstuhl, Germany.
       INEX has allowed a new community in XML information access to emerge. (Fuhr)
    17. XML “Element” Retrieval (Courtesy of Norbert Goevert)
    18. “Element” Ranking algorithms
       Combination of evidence: element score, document score, element size, …
       Models used: vector space model, language model, probabilistic model, logistic regression, Bayesian network, divergence from randomness, Boolean model, machine learning, belief model, statistical model, natural language processing, structured text models, extended DB models, polyrepresentation.
       “Aggregation” in semi-complex information spaces
    19. Machine learning
       • Use of standard machine learning to train a function that combines:
         • a parameter for a given element type / relationship type
         • parameter × score(element)
         • parameter × score(parent(element))
         • parameter × score(document)
       • Training done on relevance data (previous years)
       • Scoring done using OKAPI
       “Aggregation” in semi-complex information spaces
    20. This is not the end
       • XML retrieval is not element retrieval
       • In fact, XML retrieval is aggregated retrieval
         • Relevance in Context task at INEX
         • Element-biased table of contents
         • …
       The complexity of the information space(s) increases:
       • complexity of content/data
       • complexity of retrieval task/information need
       • complexity of context
       • complexity of presentation of results
       • …
    21. Aggregated result - Relevance in context (Courtesy of Jaap Kamps)
    22. Aggregated result - Element-biased table of contents (Courtesy of Zoltan Szlavik)
    23. Aggregated answers in XML retrieval and beyond…
       • Relevance in context
       • Element-biased table of contents
       • …
       Let us be more adventurous and attempt to create the perfect answer… the “beyond” bit
    24. Aggregated (virtual) documents
       Special case: relevance in context (Chiaramella & Roelleke)
    25. Aggregated (virtual) documents
       • Web search
         • Clustering (Yippy)
         • Summarisation (WebInEssence)
       • News domains
         • Topic detection & tracking (TREC track)
    26. Yippy – Clustering search engine from Vivisimo
    27. Multi-document summarization
    28. “Fictitious” document generation (Courtesy of Cecile Paris)
    29. Aggregated views
       Special case: element-biased table of contents
    30. Aggregated views (non-blended)
    31. – Korean search engine
    32. Aggregated views (blended)
    33. Aggregated views (entities and relationships)
    34. Research questions
       • What is the core information the user is seeking?
       • How to express (complex) information needs?
       • What information should be presented to the user?
       • How should the information be presented to the user?
       • How should the user interact with the system?
    35. Current work on aggregated search
       • Web search context
       • Domain/genre (vertical)
         • Understanding: log analysis
         • Result presentation: user studies
         • Evaluation: test collections
    36. Understanding: Log analysis
       • Microsoft 2006 RFP data set: query log of 15 million queries from US users, sampled over one month
       • Hypothesis: aggregated search is most useful for non-navigational queries (two+ click sessions)
       • Three domains: image, video, map
       • Three genres: news, blog, wikipedia
       • What are the frequent combinations of domain and genre intents within a search session?
       • Do domain and genre intents evolve according to some patterns?
       • Is there a relation between query reformulation and a change of intent?
    37. Understanding: Log analysis
       • Using a rule-based and an SVM classifier, approximately 8% of clicks were classified as image, video, map, news, blog, or wikipedia intents.
       • Users do not often mix intents; when they do, they have at most two intents, mostly a web intent plus one other.
       • Except for wikipedia, users tend to follow the same intent for a while and then switch to another.
       • For video, news, and map clicks, completely different queries were often submitted when the intent changed, whereas for blog and wikipedia clicks the same query was reused.
       • Intent-specific terms (“video”, “map”) were often used when the query was modified.
       (Sushmita & Piwowarski)
    38. Result presentation: User studies
       Blended vs non-blended interfaces
       • 3 verticals (image, video, news)
       • 3 positions
       • 3 vertical intents (high, medium, low)
       Layouts studied: images on top, in the middle, at the bottom, at top-right, on the left, at the bottom-right.
    39. Result presentation: User studies
       • Designers of aggregated search interfaces should account for the aggregation style:
         • blended case: accurate estimation of the best position of the “vertical” result
         • non-blended case: accurate selection of the type of “vertical” result
         • for both, vertical intent is key for deciding on the position and type of “vertical” results
       (Sushmita & Hideo)
    40. Evaluation: Test collections
       Existing test collections (ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track, …), each a set of (topic, document, judgment) triples, are mapped into (simulated) verticals: Blog, Reference (Encyclopedia), Image, General Web, Shopping, … For each topic, judgments are then broken down per vertical V1, V2, …, Vk.
    41. Evaluation: Test collections (Zhou)
       Statistics on topics:
         number of topics: 150
         average rel docs per topic: 110.3
         average rel verticals per topic: 1.75
         ratio of “General Web” topics: 29.3%
         ratio of topics with two vertical intents: 66.7%
         ratio of topics with more than two vertical intents: 4.0%
       Quantity per media:        text        image    video   total
         size (G):                2125        41.1     445.5   2611.6
         number of documents:     86,186,315  670,439  1,253*  86,858,007
       * There are on average more than 100 events/shots in each video clip (document).
    42. There is related work
       • Web search: “aggregated search”, “(Google) universal search”, “aggregated retrieval”, “meta-search”
       • Information retrieval and digital libraries: “distributed retrieval”, “data fusion”, “federated search”, “federated digital libraries”, “resource selection”, “heterogeneous collection”
       • Cognitive science: poly-representation, cognitive overlap
       • Databases: complex information needs, structured query languages, aggregate operators
       • Artificial intelligence: “combination of evidence”, “uncertainty theory”, “machine learning”, “agent technology”
    43. Final word - Aggregated search
       • Complex information spaces
         • vertical/domain/genre
         • cross/multi-lingual content
         • …
       • We are still at the beginning…
         • reduce information overload
         • increase result space
         • diversity of results
         • provide/capture context
         • presentation/interface
         • …
    44. Thank you
       [email_address]