Search result diversification is a common technique for tackling ambiguous and multi-faceted queries by maximizing the coverage of query aspects or subtopics in a result list. In some cases, the subtopics associated with such queries are themselves temporally ambiguous; for instance, the query US Open is more likely to target the tennis tournament when issued in September, and the golf tournament when issued in June. More precisely, users' search intent can be identified by the popularity of a subtopic at the time the query is issued. In this paper, we study search result diversification for time-sensitive queries, where the temporal dynamics of query subtopics are explicitly determined and modeled into result diversification. Unlike previous work, which in general considered only static subtopics, we leverage dynamic subtopics by analyzing two data sources (i.e., query logs and a document collection). These data sources provide insights, from different perspectives, into how query subtopics change over time. Moreover, we propose novel time-aware diversification methods that leverage the identified dynamic subtopics; a key idea is to re-rank search results based on the freshness and popularity of subtopics. Our experimental results show that the proposed methods significantly improve diversity and relevance effectiveness for time-sensitive queries in comparison with state-of-the-art methods.
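The re-ranking idea above can be illustrated with a small, greedy xQuAD-style sketch. This is not the paper's method; it only shows how a time-dependent subtopic popularity (e.g., "golf" dominating "US Open" in June) can steer a diversified ranking. All document IDs, scores, and the popularity values are invented for illustration.

```python
# Greedy time-aware diversification sketch: balance overall relevance
# against coverage of subtopics weighted by their popularity at query
# time. rel[(d, s)] is document d's relevance to subtopic s.

def diversify(docs, subtopic_popularity, rel, k, lam=0.5):
    selected = []
    remaining = list(docs)
    # Remaining "coverage credit" per subtopic; shrinks once covered.
    unsat = dict(subtopic_popularity)
    while remaining and len(selected) < k:
        def score(d):
            base = sum(rel.get((d, s), 0.0) * p
                       for s, p in subtopic_popularity.items())
            div = sum(rel.get((d, s), 0.0) * unsat[s] for s in unsat)
            return (1 - lam) * base + lam * div
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        for s in unsat:  # reduce credit for subtopics best already covers
            unsat[s] *= (1.0 - min(1.0, rel.get((best, s), 0.0)))
    return selected

# Example: in June the "golf" subtopic of "US Open" dominates, but a high
# diversity weight still surfaces the minority "tennis" aspect.
pop_june = {"golf": 0.8, "tennis": 0.2}
rel = {("d1", "golf"): 0.9, ("d2", "tennis"): 0.9, ("d3", "golf"): 0.8}
print(diversify(["d1", "d2", "d3"], pop_june, rel, k=2, lam=0.9))  # → ['d1', 'd2']
```

With a low diversity weight the second golf document would be picked instead; the popularity distribution for a different query time would shift the outcome accordingly.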
Impact of Crowdsourcing OCR Improvements on Retrievability Bias Myriam Traub
Digitized document collections often suffer from OCR errors that may impact a document’s readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library’s search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
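The retrievability scores mentioned above can be sketched as follows: a document's retrievability sums, over a sample of queries, a credit for each query that retrieves it within the top ranks (in the spirit of Azzopardi and Vinay's measure). The toy term-overlap ranker, the documents, and the queries below are illustrative, not the study's setup.

```python
# Retrievability sketch: r(d) counts the queries for which d appears in
# the top-c results. A real system would use a proper ranker (e.g. BM25)
# and queries drawn from a search log.

def retrievability(docs, queries, c=2):
    """docs: {doc_id: text}; queries: list of query strings."""
    r = {d: 0 for d in docs}
    for q in queries:
        q_terms = set(q.lower().split())
        overlap = lambda d: len(q_terms & set(docs[d].lower().split()))
        ranked = sorted(docs, key=lambda d: -overlap(d))
        for d in ranked[:c]:
            if overlap(d) > 0:  # only credit documents the query matches
                r[d] += 1
    return r

docs = {
    "clean": "mayor opens new bridge across the river",
    "ocr":   "mayok opens nevv bridze acrozs the riwer",  # OCR-damaged copy
}
print(retrievability(docs, ["mayor bridge", "new bridge river", "opens"]))
```

The OCR-damaged copy is only reachable through the terms that survived recognition, which mirrors the correlation between character error rate and retrievability reported above.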
Determining Time of Queries for Re-ranking Search Results Nattiya Kanhabua
Recent work on analyzing query logs shows that a significant fraction of queries are temporal, i.e., their relevance depends on time, and that temporal queries play an important role in many domains, e.g., digital libraries and document archives. Temporal queries can be divided into two types: 1) those with temporal criteria explicitly provided by users, and 2) those with no temporal criteria provided. In this paper, we deal with the latter type, i.e., queries that comprise only keywords, whose relevant documents are associated with particular time periods not given by the queries. We propose a number of methods for determining the time of queries using temporal language models. We then show how to increase retrieval effectiveness by using the determined time of queries to re-rank the search results. Through extensive experiments we show that our proposed approaches improve retrieval effectiveness.
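The temporal-language-model idea can be sketched compactly: partition a timestamped corpus into time slices, build a unigram language model per slice, and assign a keyword query to the slice under which it is most likely. The corpus snippets and the smoothing constant below are invented for illustration.

```python
# Hedged sketch of dating a keyword query with temporal language models:
# score each time slice by the smoothed log-likelihood of the query terms.
import math
from collections import Counter

def slice_lm(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    return counts, sum(counts.values())

def query_time(query, slices, mu=0.1):
    """slices: {time_label: [doc_text, ...]} -> best-matching time label."""
    vocab = {w for texts in slices.values()
             for t in texts for w in t.lower().split()}
    def loglik(label):
        counts, total = slice_lm(slices[label])
        # Additive smoothing so unseen terms do not zero out a slice.
        return sum(math.log((counts[w] + mu) / (total + mu * len(vocab)))
                   for w in query.lower().split())
    return max(slices, key=loglik)

slices = {
    "2004": ["tsunami hits indian ocean coast", "tsunami relief effort"],
    "2010": ["earthquake strikes haiti", "haiti earthquake relief"],
}
print(query_time("haiti earthquake", slices))  # → 2010
```

The determined time can then be used as a temporal criterion when re-ranking results, as described above.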
We estimate that nearly one third of news articles contain references to future events. While this information can prove crucial to understanding news stories and how events will develop for a given topic, there is currently no easy way to access this information. We propose a new task to address the problem of retrieving and ranking sentences that contain mentions of future events, which we call ranking related news predictions. In this paper, we formally define this task and propose a learning to rank approach based on 4 classes of features: term similarity, entity-based similarity, topic similarity, and temporal similarity. Through extensive evaluations using a corpus consisting of 1.8 million news articles and 6,000 manually judged relevance pairs, we show that our approach is able to retrieve a significant number of relevant predictions related to a given topic.
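The learning-to-rank setup above can be sketched with a minimal pairwise learner over the four named feature classes. The perceptron-style update and the feature values are illustrative; the actual paper's learner and feature extractors are more elaborate.

```python
# Pairwise ranking sketch: learn a weight vector w so that, for each
# training pair, the relevant prediction scores above the irrelevant one.

def train_pairwise(pairs, dim, epochs=20, lr=0.1):
    """pairs: list of (better_vec, worse_vec) feature vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            margin = sum(wi * (b - c) for wi, b, c in zip(w, better, worse))
            if margin <= 0:  # mis-ordered pair: nudge w toward the difference
                w = [wi + lr * (b - c) for wi, b, c in zip(w, better, worse)]
    return w

# Feature vector: [term_sim, entity_sim, topic_sim, temporal_sim]
relevant   = [0.8, 0.7, 0.6, 0.9]
irrelevant = [0.7, 0.1, 0.2, 0.1]
w = train_pairwise([(relevant, irrelevant)], dim=4)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
assert score(relevant) > score(irrelevant)
```

At retrieval time, candidate sentences containing future temporal expressions would be scored with the learned weights and ranked.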
Searching the Temporal Web: Challenges and Current Approaches Nattiya Kanhabua
This talk gives a survey of current approaches to searching the temporal web. In such web collections, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching alone can give unsatisfactory results when searching the temporal web. The reason for this is multifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, i.e., a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely: 1) How can we understand the temporal search intent represented by time-sensitive queries? 2) How can we handle the temporal dynamics of queries and documents? and 3) How can we explicitly model temporal information in retrieval and ranking models? Finally, we will present current approaches to the addressed problems and outline directions for future research.
Dynamics of Web: Analysis and Implications from Search Perspective Nattiya Kanhabua
The dynamicity of the Web and its implications for various components of search systems have attracted considerable attention in the last decade. This course first aims to introduce students to the general and wide topic of Web evolution, and then pinpoints a number of issues related to the temporal aspects of search and IR. We plan to start with an overview of seminal works that shed light on the evolution of the Web over time. Next, we will focus on the impacts of this evolution on search, concentrating on the indexing of versioned document collections and on time-aware retrieval and ranking. We will discuss the evolution of search results and its effects on caching, and wrap up the course with a review of some recent approaches that aim to predict and search the future!
In this talk, we present an event-based Epidemic Intelligence (EI) system framework leveraging social media data, e.g., Twitter messages (or tweets), to provide public health officials with the necessary tools to survey and sift through relevant information, namely disease outbreak events. There exist three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) signal generation producing reliable warnings based on observed term frequency changes in the filtered messages, and 3) providing search and recommendation functionalities to domain experts, for better assessment of the potential outbreak threats associated with the generated signals. We outline possible approaches to solve these important challenges as well as discuss areas where further research is required. The objective is to provide guidance for similar endeavors, and to give prospective event-based Epidemic Intelligence system builders a more realistic view on the benefits and issues of social media stream analysis.
Improving Temporal Language Models For Determining Time of Non-Timestamped Do... Nattiya Kanhabua
Taking the temporal dimension into account in searching, i.e., using the time of content creation as part of the search condition, is now gaining increasing interest. However, in the case of web search and web warehousing, the timestamps (the time of creation or last update of contents) of web pages and documents found on the web are in general not known or cannot be trusted, and must be determined otherwise. In this paper, we describe approaches that enhance and increase the quality of existing techniques for determining timestamps based on a temporal language model. Through a number of experiments on temporal document collections, we show how our new methods improve the accuracy of timestamping compared to the previous models.
Concise Preservation by Combining Managed Forgetting and Contextualized Remem... Nattiya Kanhabua
With the growing volumes of and reliance on digital content, there is a clear need for better information access solutions that keep relevant information accessible and usable in the long term. Inspired by the role of forgetting in the human brain, we envision a concept of managed forgetting for systematically dealing with information that progressively decreases in importance, as well as with redundant information. Although inspired by human memory, managed forgetting is meant to complement rather than copy human remembering and forgetting. It can be regarded as a function of attention and significance dynamics relying on multi-faceted information assessment. This talk introduces our vision for managed forgetting on the conceptual level, as part of an Integrated Cognitive Framework for Time-aware Information Access. We discuss relevant research and application aspects for managed forgetting. Finally, we present our first results and point out issues where further research is required.
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ... Nattiya Kanhabua
In human memory, forgetting plays a crucial role in focusing on important things and neglecting irrelevant details. In digital memories, the idea of systematic forgetting has received little attention so far. At first glance, forgetting seems to contradict the purpose of archival and preservation. However, we are currently facing a tremendous growth in volumes of digital content. Thus, it becomes ever more important to focus, while forgetting irrelevant details, redundancies and noise. This holds true for better organizing the information space as well as for preservation management, i.e., making and revisiting decisions on what to keep. Therefore, we propose the concept of managed forgetting as part of a joint information management and preservation management process in digital memories. Managed forgetting models resource selection as a function of attention and significance dynamics. Based on dynamic, multidimensional information value assessment, it identifies information objects (e.g., documents or images) of decreasing importance and/or topicality and triggers forgetting actions. These actions include a variety of options, namely aggregation and summarization, revised search and ranking behavior, elimination of redundancy, and finally also deletion. In this paper, we present our vision for managed forgetting, discuss the challenges as well as our first ideas for its introduction, and present a case study for its motivation.
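The information-value assessment behind managed forgetting can be made concrete with a minimal sketch: an item's value ("memory buoyancy") decays with time since last access, is boosted by usage, and a threshold maps the value to a forgetting action. The decay half-life, boost factor, thresholds, and action names are all invented for illustration, not part of the proposed framework.

```python
# Minimal "memory buoyancy" sketch: exponential decay with access boosts,
# mapped onto graded forgetting actions.
import math

def buoyancy(days_since_access, accesses, half_life=30.0):
    decay = math.exp(-math.log(2) * days_since_access / half_life)
    return min(1.0, decay * (1 + 0.1 * accesses))

def forgetting_action(value):
    if value > 0.6:
        return "keep"        # still important: leave in the active space
    if value > 0.2:
        return "summarize"   # condense, keep the gist
    return "archive"         # move out of the active space

assert forgetting_action(buoyancy(0, 5)) == "keep"
assert forgetting_action(buoyancy(40, 2)) == "summarize"
assert forgetting_action(buoyancy(200, 0)) == "archive"
```

A real assessment would be multidimensional (topicality, redundancy, relationships between items) rather than a single decay curve.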
Wikipedia is a free multilingual online encyclopedia covering a wide range of general and specific knowledge. Its content is continuously maintained up-to-date and extended by a supporting community. In many cases, real-world events influence the collaborative editing of Wikipedia articles of the involved or affected entities. In this paper, we present Wikipedia Event Reporter, a web-based system that supports the entity-centric, temporal analytics of event-related information in Wikipedia by analyzing the whole history of article updates. For a given entity, the system first identifies peaks of update activity for the entity using burst detection and automatically extracts event-related updates using a machine-learning approach. Further, the system determines distinct events through the clustering of updates by exploiting different types of information such as update time, textual similarity, and the position of the updates within an article. Finally, the system generates a meaningful temporal summarization of event-related updates and automatically annotates the identified events in a timeline.
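The burst-detection step in the pipeline above can be sketched with a simple threshold detector: flag time bins whose update count exceeds the mean by some number of standard deviations. Production systems typically use more robust models (e.g., Kleinberg's burst model); the edit counts below are invented.

```python
# Burst-detection sketch over an entity's daily edit counts: a day is a
# burst if its count exceeds mean + k standard deviations.
import statistics

def detect_bursts(counts, k=2.0):
    mean = statistics.mean(counts)
    sd = statistics.pstdev(counts) or 1.0  # avoid zero for flat series
    return [i for i, c in enumerate(counts) if c > mean + k * sd]

# Daily edit counts for an article; day 5 is an event-driven spike.
edits = [3, 2, 4, 3, 2, 40, 5, 3, 2, 3]
print(detect_bursts(edits))  # → [5]
```

The updates falling inside each detected burst would then be classified and clustered into distinct events as described above.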
On the Value of Temporal Anchor Texts in Wikipedia Nattiya Kanhabua
Wikipedia has become a widely accepted reference point for information of all kinds: real-world events (e.g., natural disasters, man-made incidents, and political events) as well as specific entities like politicians, celebrities, and entities involved in an event. Due to its open construction and negotiation, Wikipedia is an important new cultural and societal phenomenon, and the content of Wikipedia articles is a valuable source for different applications. For instance, the edit history and view logs of Wikipedia can be leveraged for detecting an event and its associated entities. In this study, we analyze temporal anchor texts extracted from the edit history. We propose a model of Wikipedia and anchor texts viewed as a temporal resource, together with a probabilistic method for ranking temporal anchor texts. Our preliminary results show that relevant anchor texts are composed of evolving information (e.g., changes of names and semantic roles, as well as evolving context) that reflects societal trends and perceptions, making them candidates for capturing entity evolution.
Identifying Relevant Temporal Expressions for Real-world Events Nattiya Kanhabua
Event detection is an interesting task for many applications, for instance surveillance, scientific discovery, and Topic Detection and Tracking. Numerous works have focused on detecting events from unstructured text and determining what features constitute an event, e.g., key terms or named entities. Although most works are able to find interesting times associated with an event, there is a lack of research on determining the relevance of time for an event. In this paper, we propose a method for automatically extracting real-world events from unstructured text documents. In addition, we propose a machine learning approach to identifying relevant times (i.e., temporal expressions) for the extracted events using three classes of features: sentence-based, document-based and corpus-specific features. Through experiments using real-world data and 3,500 manually judged relevance pairs, we show that our proposed approach is able to identify the relevant times of events with good accuracy.
Exploiting Time-based Synonyms in Searching Document Archives Nattiya Kanhabua
Query expansion of named entities can be employed in order to increase retrieval effectiveness. A peculiarity of named entities compared to other vocabulary terms is that they are very dynamic in appearance, and synonym relationships between terms change with time. In this paper, we present an approach to extracting synonyms of named entities over time from the whole history of Wikipedia. In addition, we use their temporal patterns as a feature in ranking and classifying them into two types, i.e., time-independent or time-dependent. Time-independent synonyms are invariant to time, while time-dependent synonyms are relevant to a particular time period, i.e., the synonym relationships change over time. Further, we describe how to make use of both types of synonyms to increase retrieval effectiveness, i.e., query expansion with time-independent synonyms for an ordinary search, and query expansion with time-dependent synonyms for a search with respect to temporal criteria. Finally, through an evaluation based on TREC collections, we demonstrate how the retrieval performance of queries consisting of named entities can be improved using our approach.
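The time-dependent vs. time-independent distinction can be illustrated with a crude temporal-pattern test: if most of a synonym's occurrence mass is concentrated in a few time slices, treat it as time-dependent. The concentration heuristic, the threshold, and the counts below are invented for illustration; the paper's classifier is learned from richer features.

```python
# Sketch: classify a named-entity synonym by the concentration of its
# occurrences over time slices (here, years in chronological order).

def is_time_dependent(yearly_counts, threshold=0.8):
    total = sum(yearly_counts) or 1
    top_two = sum(sorted(yearly_counts, reverse=True)[:2])
    return top_two / total > threshold

# A synonym that only appears from a certain period onward (e.g., a name
# adopted after an event) is concentrated in a few slices.
assert is_time_dependent([0, 0, 0, 50, 60]) is True
# A stable alias used evenly across the whole period is time-independent.
assert is_time_dependent([10, 11, 9, 10, 10]) is False
```

At query time, only synonyms whose time period overlaps the query's temporal criteria would be used for expansion.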
Estimating Query Difficulty for News Prediction Retrieval (poster presentation) Nattiya Kanhabua
News prediction retrieval has recently emerged as the task of retrieving predictions related to a given news story (or a query). Predictions are defined as sentences containing time references to future events. Such future-related information is crucially important for understanding the temporal development of news stories, as well as for strategy planning and risk management. Prior work has been shown to retrieve a significant number of relevant predictions; however, only certain news topics achieve good retrieval effectiveness. In this paper, we study how to determine the difficulty of retrieving predictions for a given news story. More precisely, we address the query difficulty estimation problem for news prediction retrieval. We propose different entity-based predictors used for classifying queries into two classes, namely Easy and Difficult. Our prediction model is based on a machine learning approach. Through experiments on real-world data, we show that our proposed approach can predict query difficulty with high accuracy.
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari... Nattiya Kanhabua
With the tremendous amount of news published on the Web every day, helping users explore news events on a given topic of interest is an acute problem. Timeline summaries have recently emerged as a simple and effective solution for users to navigate through temporally related news events. In this paper, we propose an optimization framework and demonstrate the use of Learning To Rank (LTR) to automatically construct timeline summaries from Web news articles. Experimental evaluations show that our approach outperforms existing solutions in producing high quality timeline summaries.
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst... Nattiya Kanhabua
Going beyond its role as an encyclopedia, Wikipedia has become a global memory place for high-impact events, such as natural disasters and man-made incidents, thus influencing collective memory, i.e., the way we remember the past. Due to the importance of collective memory for framing the assessment of new situations, our actions and our value systems, its open construction and negotiation in Wikipedia is an important new cultural and societal phenomenon. The analysis of this phenomenon not only promises new insights into collective memory; it is also an important foundation for technology that more effectively complements the processes of human forgetting and remembering and better enables us to learn from the past. In this paper, we analyse the long-term dynamics of Wikipedia as a global memory place for high-impact events. This complements existing work analysing the collective memory negotiation and construction process in Wikipedia directly following an event. In more detail, we are interested in catalysts for reviving memories, i.e., in the fuel that keeps memories of past events alive, interrupting the general trend of fast forgetting. For this purpose, we study the triggers of revisiting behavior for a large set of event pages by exploiting page views and time series analysis, and identify the most important catalyst features.
Humans are very effective at remembering by abstraction, pattern exploitation, or contextualization. On the other hand, humans are also capable of forgetting irrelevant details, an important function of the human brain that helps us focus on relevant things instead of drowning in details by remembering everything. The research question that we address in this paper is: can we learn from human remembering and forgetting in order to develop more advanced preservation technology? In particular, we aim to study how a managed or controlled form of forgetting can play a role in digital preservation, including personal and organizational archives as well as collective memories. Our research goal is twofold: 1) to establish effective preservation for more concise and accessible digital memories, and 2) to enable the easier and wider adoption of preservation technology. The concept of managed forgetting is discussed in more detail in the research work of the European project ForgetIT, which investigates the proposed concept by means of an integrated information and preservation management approach.
Understanding the Diversity of Tweets in the Time of Outbreaks Nattiya Kanhabua
A microblogging service like Twitter continues to surge in importance as a means of sharing information in social networks. In the medical domain, several works have shown the potential of detecting public health events (i.e., infectious disease outbreaks) using Twitter messages or tweets. Given its real-time nature, Twitter can enhance early outbreak warning for public health authorities so that a rapid response can take place. Most previous works on detecting outbreaks in Twitter simply analyze tweets matching disease names and/or locations of interest. However, the effectiveness of such methods is limited for two main reasons. First, disease names are highly ambiguous, i.e., they can refer to slang or non-health-related contexts. Second, the characteristics of infectious diseases are highly dynamic in time and place, i.e., strongly time-dependent and varying greatly among different regions. In this paper, we propose to analyze the temporal diversity of tweets during the known periods of real-world outbreaks in order to gain insight into a temporary focus on specific events. More precisely, our objective is to understand whether, and to what extent, the temporal diversity of tweets can be used as an indicator of outbreak events. We employ an efficient sampling-based algorithm to compute the diversity statistics of tweets at a particular time. We then conduct experiments correlating temporal diversity with the estimated event magnitude of 14 real-world outbreak events manually created as ground truth. Our analysis shows that correlation results are diverse among different outbreaks, which can reflect the characteristics (severity and duration) of the outbreaks.
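A sampling-based diversity statistic of the kind described above can be sketched as the proportion of distinct tweet contents in a random sample from a time window: when many near-identical messages about one event dominate, diversity drops. The sample size, seed, and synthetic tweet streams are illustrative, not the paper's algorithm or data.

```python
# Diversity sketch: estimate distinct-content proportion from a sample.
import random

def sample_diversity(tweets, sample_size=50, seed=7):
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = [rng.choice(tweets) for _ in range(sample_size)]
    return len(set(sample)) / sample_size

normal_day = [f"tweet {i}" for i in range(1000)]            # varied chatter
outbreak   = ["flu outbreak reported downtown"] * 900 + normal_day[:100]
assert sample_diversity(normal_day) > sample_diversity(outbreak)
```

Sampling keeps the statistic cheap to compute per time window even for high-volume streams, which is the point of the efficiency claim above.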
Exploiting temporal information in retrieval of archived documents (doctoral ... Nattiya Kanhabua
In the text retrieval community, many researchers have demonstrated good quality when searching a current snapshot of the Web. However, only a few have demonstrated good quality when searching a long-term archival domain, where documents are preserved for a long time, i.e., ten years or more. In such a domain, a search application is applicable not only to archivists or historians, but also in the context of national library and enterprise search (searching document repositories, emails, etc.). In the rest of this paper, we explain three problems of searching document archives and propose possible approaches to solve them. Our main research question is: how can we improve the quality of search in a document archive using temporal information?
We address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching alone can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events that happened at particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query. Our contributions are different time-aware approaches within three topics in IR: content analysis, query analysis, and retrieval and ranking models. In particular, we aim at improving retrieval effectiveness by 1) analyzing the contents of temporal document collections, 2) performing an analysis of temporal queries, and 3) explicitly modeling the time dimension in retrieval and ranking.
Leveraging the time dimension in ranking can improve the retrieval effectiveness if information about the creation or publication time of documents is available. We analyze the contents of documents in order to determine the time of non-timestamped documents using temporal language models. We subsequently employ the temporal language models for determining the time of implicit temporal queries, and the determined time is used for re-ranking search results in order to improve the retrieval effectiveness. We study the effect of terminology changes over time and propose an approach to handling terminology changes using time-based synonyms.
In addition, we propose different methods for predicting the effectiveness of temporal queries, so that a particular query enhancement technique can be performed to improve the overall performance. When the time dimension is incorporated into ranking, documents will be ranked according to both textual and temporal similarity. In this case, time uncertainty should also be taken into account. Thus, we propose a ranking model that considers the time uncertainty, and improve ranking by combining multiple features using learning-to-rank techniques. Through extensive evaluation, we show that our proposed time-aware approaches outperform traditional retrieval methods and improve the retrieval effectiveness in searching temporal document collections.
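The mixture-model family of time-aware ranking mentioned above can be sketched as a linear combination of textual similarity and a temporal similarity in which time uncertainty is represented by a decay over the distance between query time and document time. The weight, decay rate, and scores below are invented for illustration; the thesis's models are richer (e.g., learned combinations of multiple features).

```python
# Time-aware ranking sketch: score = lam * textual + (1 - lam) * temporal,
# with temporal similarity decaying in |query_time - doc_time|.
import math

def time_aware_score(text_sim, doc_time, query_time, lam=0.6, decay=0.5):
    # The exponential decay stands in for uncertainty about the exact
    # time period the query intends (measured here in years).
    temporal_sim = math.exp(-decay * abs(query_time - doc_time))
    return lam * text_sim + (1 - lam) * temporal_sim

# Two equally on-topic documents; the one closer to the query's time wins.
assert time_aware_score(0.8, doc_time=2004, query_time=2004) > \
       time_aware_score(0.8, doc_time=1998, query_time=2004)
```

With lam = 1 this reduces to purely textual ranking; lowering lam increases the influence of the (uncertain) temporal match.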
Why Is It Difficult to Detect Outbreaks in Twitter? Nattiya Kanhabua
In this paper, we present an event-based Epidemic Intelligence (EI) system framework leveraging social media data, e.g., Twitter messages (or tweets), to provide public health officials with the necessary tools to survey and sift through relevant information, namely disease outbreak events. There exist three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) signal generation producing reliable warnings based on observed term frequency changes in the filtered messages, and 3) providing search and recommendation functionalities to domain experts, for better assessment of the potential outbreak threats associated with the generated signals. We outline possible approaches to solve these important challenges as well as discuss areas where further research is required. The aim of this paper is to provide guidance for similar endeavors, and to give prospective event-based Epidemic Intelligence system builders a more realistic view on the benefits and issues of social media stream analysis.
Temporal Web Dynamics and Implications for Information Retrieval - Nattiya Kanhabua
In this talk, we will give a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching only can give unsatisfactory results when searching the temporal web. The reason for this is multifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely: 1) How to understand temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? and 3) How to explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to the addressed problems as well as outline directions for future research.
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation) - Nattiya Kanhabua
Retrieval effectiveness of temporal queries can be improved by taking into account the time dimension. Existing temporal ranking models follow one of two main approaches: 1) a mixture model linearly combining textual similarity and temporal similarity, and 2) a probabilistic model generating a query from the textual and temporal parts of a document independently. In this paper, we propose a novel time-aware ranking model based on learning-to-rank techniques. We employ two classes of features for learning a ranking model, entity-based and temporal features, which are derived from annotation data. Entity-based features are aimed at capturing the semantic similarity between a query and a document, whereas temporal features measure the temporal similarity. Through extensive experiments, we show that our ranking model significantly improves the retrieval effectiveness over existing time-aware ranking models.
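The first baseline mentioned above, a mixture model linearly combining textual and temporal similarity, can be sketched in a few lines. The weight `alpha` and the similarity values are illustrative; a real system would compute them from a retrieval model and a temporal distance function.

```python
def mixture_score(text_sim, time_sim, alpha=0.5):
    """Baseline mixture model: a linear combination of textual and
    temporal similarity (alpha is an illustrative interpolation weight)."""
    return alpha * text_sim + (1 - alpha) * time_sim

# Rank candidate documents given (textual similarity, temporal similarity).
docs = {"d1": (0.9, 0.2), "d2": (0.6, 0.8), "d3": (0.4, 0.4)}
ranked = sorted(docs, key=lambda d: mixture_score(*docs[d]), reverse=True)
print(ranked)  # → ['d2', 'd1', 'd3']
```

The learning-to-rank model proposed in the paper replaces this single hand-tuned weight with a model learned over many entity-based and temporal features.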
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly... - Angelo Salatino
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles, yielding a significant improvement over alternative methods.
A task-based scientific paper recommender system for literature review and ma... - Aravind Sesagiri Raamkumar
My PhD oral defense presentation (as of Oct 3rd 2017)
The dissertation can be requested at this link https://www.researchgate.net/publication/323308750_A_task-based_scientific_paper_recommender_system_for_literature_review_and_manuscript_preparation
Improving Temporal Language Models For Determining Time of Non-Timestamped Do... - Nattiya Kanhabua
Taking the temporal dimension into account in searching, i.e., using the time of content creation as part of the search condition, is now gaining increasing interest. However, in the case of web search and web warehousing, the timestamps (the time of creation or publication of contents) of web pages and documents found on the web are in general not known or cannot be trusted, and must be determined otherwise. In this paper, we describe approaches that enhance and increase the quality of existing techniques for determining timestamps based on a temporal language model. Through a number of experiments on temporal document collections, we show how our new methods improve the accuracy of timestamping compared to the previous models.
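The underlying idea of temporal language models can be sketched minimally: partition a timestamped reference corpus into periods, build a smoothed unigram model per period, and assign a non-timestamped document to the period whose model best explains its terms. The toy corpus, Laplace smoothing, and log-likelihood scoring below are illustrative simplifications of the techniques the paper enhances.

```python
import math
from collections import Counter

def date_document(doc_terms, partitions, smoothing=0.1):
    """Assign a non-timestamped document to the time partition whose
    (Laplace-smoothed) unigram language model best explains its terms."""
    vocab = {t for terms in partitions.values() for t in terms} | set(doc_terms)
    best, best_ll = None, float("-inf")
    for period, terms in partitions.items():
        counts, total = Counter(terms), len(terms)
        # Log-likelihood of the document under this period's model.
        ll = sum(math.log((counts[t] + smoothing) /
                          (total + smoothing * len(vocab)))
                 for t in doc_terms)
        if ll > best_ll:
            best, best_ll = period, ll
    return best

partitions = {
    "1990s": ["grunge", "dialup", "web", "browser"],
    "2000s": ["blog", "wiki", "broadband", "web"],
}
print(date_document(["blog", "broadband"], partitions))  # → 2000s
```

Real temporal language models are trained on large timestamped collections (e.g., news archives) and refined with the quality improvements the paper describes.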
Concise Preservation by Combining Managed Forgetting and Contextualized Remem... - Nattiya Kanhabua
With the growing volumes of and reliance on digital content, there is a clear need for better information access solutions that keep relevant information accessible and usable in the long term. Inspired by the role of forgetting in the human brain, we envision a concept of managed forgetting for systematically dealing with information that progressively ceases in importance, as well as with redundant information. Although inspired by human memory, managed forgetting is meant to complement rather than copy human remembering and forgetting. It can be regarded as a function of attention and significance dynamics relying on multi-faceted information assessment. This talk introduces our vision for managed forgetting on the conceptual level as part of an Integrated Cognitive Framework for Time-aware Information Access. We discuss relevant research and application aspects of managed forgetting. To this end, we present our first results and point out issues where further research is required.
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ... - Nattiya Kanhabua
In human memory, forgetting plays a crucial role for focusing on important things and neglecting irrelevant details. In digital memories, the idea of systematic forgetting has received little attention so far. At first glance, forgetting seems to contradict the purpose of archival and preservation. However, we are currently facing a tremendous growth in volumes of digital content. Thus, it becomes ever more important to focus, while forgetting irrelevant details, redundancies and noise. This holds true for better organizing the information space as well as in preservation management for making and revisiting decisions on what to keep. Therefore, we propose the introduction of the concept of managed forgetting as part of a joint information management and preservation management process in digital memories. Managed forgetting models resource selection as a function of attention and significance dynamics. Based on dynamic, multidimensional information value assessment, it identifies information objects, e.g., documents or images, of decreasing importance and/or topicality and triggers forgetting actions. Those actions include a variety of options, namely, aggregation and summarization, revised search and ranking behavior, elimination of redundancy, and finally, also deletion. In this paper, we present our vision for managed forgetting, discuss the challenges as well as our first ideas for its introduction, and present a case study for its motivation.
Wikipedia is a free multilingual online encyclopedia covering a wide range of general and specific knowledge. Its content is continuously kept up to date and extended by a supporting community. In many cases, real-world events influence the collaborative editing of Wikipedia articles of the involved or affected entities. In this paper, we present Wikipedia Event Reporter, a web-based system that supports the entity-centric, temporal analytics of event-related information in Wikipedia by analyzing the whole history of article updates. For a given entity, the system first identifies peaks of update activities for the entity using burst detection and automatically extracts event-related updates using a machine-learning approach. Further, the system determines distinct events through the clustering of updates by exploiting different types of information such as update time, textual similarity, and the position of the updates within an article. Finally, the system generates a meaningful temporal summarization of event-related updates and automatically annotates the identified events in a timeline.
On the Value of Temporal Anchor Texts in Wikipedia - Nattiya Kanhabua
Wikipedia has become a widely accepted reference point for information of all kinds: real-world events (e.g., natural disasters, man-made incidents, and political events) as well as specific entities like politicians, celebrities, and entities involved in an event. Due to its open construction and negotiation, Wikipedia is an important new cultural and societal phenomenon, and the content of Wikipedia articles is a valuable source for different applications. For instance, the edit history and view logs of Wikipedia can be leveraged for detecting an event and its associated entities. In this study, we analyze temporal anchor texts extracted from the edit history. We propose a model of Wikipedia and its anchor texts viewed as a temporal resource, and a probabilistic method for ranking temporal anchor texts. Our preliminary results show that relevant anchor texts are composed of evolving information (e.g., changes of names and semantic roles, as well as evolving context) that reflects societal trends and perceptions, making them candidates for capturing entity evolution.
Identifying Relevant Temporal Expressions for Real-world Events - Nattiya Kanhabua
Event detection is an interesting task for many applications, for instance, surveillance, scientific discovery, and Topic Detection and Tracking. Numerous works have focused on detecting events from unstructured text and determining what features constitute an event, e.g., key terms or named entities. Although most works are able to find interesting times associated with an event, there is a lack of research on determining the relevance of time for an event. In this paper, we propose a method for automatically extracting real-world events from unstructured text documents. In addition, we propose a machine learning approach to identifying the relevant time (i.e., temporal expressions) for the extracted events using three classes of features: sentence-based, document-based and corpus-specific features. Through experiments using real-world data and 3,500 manually judged relevance pairs, we show that our proposed approach is able to identify the relevant time of events with good accuracy.
Exploiting Time-based Synonyms in Searching Document Archives - Nattiya Kanhabua
Query expansion of named entities can be employed in order to increase the retrieval effectiveness. A peculiarity of named entities compared to other vocabulary terms is that they are very dynamic in appearance, and synonym relationships between terms change with time. In this paper, we present an approach to extracting synonyms of named entities over time from the whole history of Wikipedia. In addition, we use their temporal patterns as a feature in ranking and classifying them into two types, i.e., time-independent or time-dependent. Time-independent synonyms are invariant to time, while time-dependent synonyms are relevant to a particular time period, i.e., the synonym relationships change over time. Further, we describe how to make use of both types of synonyms to increase the retrieval effectiveness, i.e., query expansion with time-independent synonyms for an ordinary search, and query expansion with time-dependent synonyms for a search with respect to temporal criteria. Finally, through an evaluation based on TREC collections, we demonstrate how the retrieval performance of queries consisting of named entities can be improved using our approach.
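How the two synonym types are used at query time can be sketched as follows, assuming synonyms have already been extracted and classified; the data structure and the example validity intervals are illustrative, not the paper's actual representation.

```python
def expand_query(query, query_time, synonyms):
    """Expand a named-entity query with its time-independent synonyms
    plus those time-dependent synonyms valid at the query's time of
    interest. `synonyms` maps entity -> list of (synonym, start, end);
    start=None marks a time-independent synonym."""
    terms = [query]
    for syn, start, end in synonyms.get(query, []):
        if start is None or start <= query_time <= end:
            terms.append(syn)
    return terms

synonyms = {"Pope Benedict XVI": [
    ("Joseph Ratzinger", None, None),     # time-independent
    ("Cardinal Ratzinger", 1977, 2005),   # time-dependent (illustrative interval)
]}
print(expand_query("Pope Benedict XVI", 2000, synonyms))
# → ['Pope Benedict XVI', 'Joseph Ratzinger', 'Cardinal Ratzinger']
print(expand_query("Pope Benedict XVI", 2010, synonyms))
# → ['Pope Benedict XVI', 'Joseph Ratzinger']
```

Searching an archive for documents from 2000 thus benefits from the period-specific name, while a present-day search avoids expanding with an outdated one.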
Estimating Query Difficulty for News Prediction Retrieval (poster presentation) - Nattiya Kanhabua
News prediction retrieval has recently emerged as the task of retrieving predictions related to a given news story (or a query). Predictions are defined as sentences containing time references to future events. Such future-related information is crucially important for understanding the temporal development of news stories, as well as for strategic planning and risk management. The aforementioned work has been shown to retrieve a significant number of relevant predictions. However, only certain news topics achieve good retrieval effectiveness. In this paper, we study how to determine the difficulty of retrieving predictions for a given news story. More precisely, we address the query difficulty estimation problem for news prediction retrieval. We propose different entity-based predictors used for classifying queries into two classes, namely, Easy and Difficult. Our prediction model is based on a machine learning approach. Through experiments on real-world data, we show that our proposed approach can predict query difficulty with high accuracy.
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari... - Nattiya Kanhabua
With the tremendous amount of news published on the Web every day, helping users explore news events on a given topic of interest is an acute problem. Timeline summaries have recently emerged as a simple and effective solution for users to navigate through temporally related news events. In this paper, we propose an optimization framework and demonstrate the use of Learning To Rank (LTR) to automatically construct timeline summaries from Web news articles. Experimental evaluations show that our approach outperforms existing solutions in producing high quality timeline summaries.
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst... - Nattiya Kanhabua
Going beyond its role as an encyclopedia, Wikipedia has become a global memory place for high-impact events, such as natural disasters and man-made incidents, thus influencing collective memory, i.e., the way we remember the past. Due to the importance of collective memory for framing the assessment of new situations, our actions and value systems, its open construction and negotiation in Wikipedia is an important new cultural and societal phenomenon. The analysis of this phenomenon does not only promise new insights into collective memory. It is also an important foundation for technology which more effectively complements the processes of human forgetting and remembering and better enables us to learn from the past. In this paper, we analyse the long-term dynamics of Wikipedia as a global memory place for high-impact events. This complements existing work in analysing the collective memory negotiation and construction process in Wikipedia directly following the event. In more detail, we are interested in catalysts for reviving memories, i.e., in the fuel that keeps memories of past events alive, interrupting the general trend of fast forgetting. For this purpose, we study the triggers of revisiting behavior for a large set of event pages by exploiting page views and time series analysis, and identify the most important catalyst features.
Humans are very effective at remembering by abstraction, pattern exploitation, or contextualization. On the other hand, humans are also capable of forgetting irrelevant details, an important function of the human brain that helps us focus on relevant things instead of drowning in details by remembering everything. The research question that we address in this paper is: Can we learn from human remembering and forgetting in order to develop more advanced preservation technology? In particular, we aim at studying how a managed or controlled form of forgetting can play a role in digital preservation, including personal and organizational archives as well as collective memories. Our research goal is twofold: 1) to establish effective preservation for more concise and accessible digital memories, and 2) to enable the easier and wider adoption of preservation technology. The concept of managed forgetting is discussed in more detail in the research work of the European project ForgetIT, which investigates the proposed concept by means of an integrated information and preservation management approach.
Understanding the Diversity of Tweets in the Time of Outbreaks - Nattiya Kanhabua
A microblogging service like Twitter continues to surge in importance as a means of sharing information in social networks. In the medical domain, several works have shown the potential of detecting public health events (i.e., infectious disease outbreaks) using Twitter messages or tweets. Given its real-time nature, Twitter can enhance early outbreak warning for public health authorities so that a rapid response can take place. Most previous works on detecting outbreaks in Twitter simply analyze tweets matching disease names and/or locations of interest. However, the effectiveness of such methods is limited for two main reasons. First, disease names are highly ambiguous, i.e., they can refer to slang or non-health-related contexts. Second, the characteristics of infectious diseases are highly dynamic in time and place, namely, strongly time-dependent and varying greatly among different regions. In this paper, we propose to analyze the temporal diversity of tweets during the known periods of real-world outbreaks in order to gain insight into a temporary focus on specific events. More precisely, our objective is to understand whether, and to what extent, the temporal diversity of tweets can be used as an indicator of outbreak events. We employ an efficient algorithm based on sampling to compute the diversity statistics of tweets at a particular time. To this end, we conduct experiments by correlating temporal diversity with the estimated event magnitude of 14 real-world outbreak events manually created as ground truth. Our analysis shows that correlation results are diverse among different outbreaks, which can reflect the characteristics (severity and duration) of outbreaks.
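The sampling-based diversity computation described above can be sketched minimally: estimate the diversity of tweets at a time point as the entropy of their term distribution, computed over a random sample for efficiency. The whitespace tokenization, sample size, and entropy measure are illustrative simplifications of the paper's diversity statistics.

```python
import math
import random
from collections import Counter

def sampled_diversity(tweets, sample_size=100, seed=0):
    """Estimate the diversity of tweets at one time point as the
    Shannon entropy of the term distribution over a random sample."""
    random.seed(seed)
    sample = random.sample(tweets, min(sample_size, len(tweets)))
    counts = Counter(term for tweet in sample for term in tweet.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

focused = ["flu outbreak", "flu outbreak", "flu outbreak"]
varied = ["flu outbreak", "concert tonight", "match score"]
# A temporary focus on one event yields lower diversity.
print(sampled_diversity(focused) < sampled_diversity(varied))  # → True
```

The intuition being tested in the paper is exactly this drop: during an outbreak, the term distribution of matching tweets narrows, so diversity at that time point falls.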
Exploiting temporal information in retrieval of archived documents (doctoral ... - Nattiya Kanhabua
In the text retrieval community, many researchers have shown good quality in searching a current snapshot of the Web. However, only a small number have demonstrated good quality in searching a long-term archival domain, where documents are preserved for a long time, i.e., ten years or more. In such a domain, a search application is not only applicable for archivists or historians, but also in the context of national library and enterprise search (searching document repositories, emails, etc.). In the rest of this paper, we will explain three problems of searching document archives and propose possible approaches to solve these problems. Our main research question is: How to improve the quality of search in a document archive using temporal information?
We address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching only can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events that happened at particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query. Our contributions are different time-aware approaches within three topics in IR: content analysis, query analysis, and retrieval and ranking models. In particular, we aim at improving the retrieval effectiveness by 1) analyzing the contents of temporal document collections, 2) performing an analysis of temporal queries, and 3) explicitly modeling the time dimension in retrieval and ranking.
Temporal Web Dynamics: Implications from Search Perspective - Nattiya Kanhabua
20160922 Materials Data Facility TMS Webinar - Ben Blaiszik
Fall 2016 TMS Webinar on Data Curation Tools. Slides for the Materials Data Facility presentation on data services (publish and discover) as described by Ben Blaiszik. See http://www.materialsdatafacility.org for more information.
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me... - Yongyao Jiang
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access
A tutorial on query auto-completion (QAC), drawing on more than ten recent conference papers. It covers the development of QAC along with its personalized, time-sensitive, and mobile variants, and discusses directions for future QAC research.
Efficient analysis of large scientific datasets often requires a means to rapidly search and select interesting portions of data based on ad-hoc search criteria. We present our work on integrating an efficient searching technology named FastBit [2, 3] with HDF5. The integrated system, named HDF5-FastQuery, allows users to efficiently generate complex selections on HDF5 datasets using compound range queries of the form (temperature > 1000) AND (70 < pressure < 90). The FastBit technology generates compressed bitmap indices that accelerate searches on HDF5 datasets and can be stored together with those datasets in an HDF5 file. Compared with other indexing schemes, compressed bitmap indices are compact and very well suited for searching over multidimensional data, even for arbitrarily complex combinations of range conditions.
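The core idea behind this acceleration can be illustrated with a toy, uncompressed bitmap index: each value bin gets one bitmask with a bit per record, and compound range queries then reduce to bitwise OR and AND. This is a simplified sketch of the technique, not FastBit's actual compressed implementation, and the bins and data are illustrative.

```python
def build_bitmap_index(values, bins):
    """Build one bitmap (one bit per record) for each half-open value
    bin [lo, hi) - the structure FastBit compresses to speed up queries."""
    index = {}
    for lo, hi in bins:
        bits = 0
        for i, v in enumerate(values):
            if lo <= v < hi:
                bits |= 1 << i
        index[(lo, hi)] = bits
    return index

def range_query(index, lo, hi):
    """Answer a range condition by OR-ing the bitmaps of all bins
    contained in [lo, hi)."""
    bits = 0
    for (blo, bhi), bitmap in index.items():
        if blo >= lo and bhi <= hi:
            bits |= bitmap
    return bits

temperature = [900, 1200, 1500, 800, 1100]
pressure = [75, 95, 80, 72, 60]
t_idx = build_bitmap_index(temperature, [(0, 1000), (1000, 2000)])
p_idx = build_bitmap_index(pressure, [(60, 70), (70, 80), (80, 90), (90, 100)])
# (temperature > 1000) AND (70 < pressure < 90): AND the two result bitmaps.
hits = range_query(t_idx, 1000, 2000) & range_query(p_idx, 70, 90)
print([i for i in range(len(temperature)) if hits >> i & 1])  # → [2]
```

Because each condition collapses to a handful of bitwise operations over precomputed bitmaps, arbitrarily complex combinations of range conditions stay cheap, which is the property the abstract highlights.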
Over the past year the University of New Mexico (UNM) Libraries instituted a new digital preservation initiative that was literally built from the ground up. Initially conceived as a means to preserve the libraries' digital collections, the project involved developing program structure, improving tools and working with vendors. As the project developed, the digital preservation needs of a broader community than originally planned became vividly apparent, and it evolved into a much larger endeavor that includes preservation of research data, university archives and digital cultural heritage collections from partner institutions around the state. The presenters will discuss their experiences implementing digital preservation at UNM, and talk about how the initiative is starting to encompass the preservation needs of partner organizations.
Temporal and semantic analysis of richly typed social networks from user-gene... - Zide Meng
We propose an approach to detect topics, overlapping communities of interest, expertise, trends and activities in user-generated content sites and in particular in question-answering forums such as StackOverflow. We first describe QASM (Question & Answer Social Media), a system based on social network analysis to manage the two main resources in question-answering sites: users and content. We also introduce the QASM vocabulary used to formalize both the level of interest and the expertise of users on topics. We then propose an efficient approach to detect communities of interest. It relies on another method to enrich questions with a more general tag when needed. We compared three detection methods on a dataset extracted from the popular Q&A site StackOverflow. Our method based on topic modeling and user membership assignment is shown to be much simpler and faster while preserving the quality of detection. We then propose an additional method to automatically generate a label for a detected topic by analyzing the meaning and links of its bag of words. We conduct a user study to compare different algorithms to choose a label. Finally we extend our probabilistic graphical model to jointly model topics, expertise, activities and trends. We performed experiments with real-world data to confirm the effectiveness of our joint model, studying user behaviors and topic dynamics.
http://www-sop.inria.fr/members/Zide.Meng/
Chemical Databases and Open Chemistry on the Desktop - Marcus Hanwell
The modern chemist has access to large databases containing both experimental and calculated data. The power of HPC resources continues to increase, with more practitioners having routine access to powerful computational chemistry tools. This places an increasingly high burden on users to assimilate these resources into their workflow in order to effectively utilize resources. The creation of an open, extensible application framework that puts computational tools, data, and domain specific knowledge at the fingertips of chemists is increasingly important. A data-centric approach to chemistry, storing all data in a searchable database, will empower users to efficiently collaborate, innovate, and push the frontiers of research. Providing an open, user-friendly and extensible application will open up new tools to experimental chemists, while providing computational chemists the ability to address greater challenges. Additionally, by distributing experimental and computational data across the research community, incorporating cheminformatics analytics techniques, and providing visual search for chemical structures, the workflow of both groups can be significantly improved. This requires suitable data formats for data exchange, and databases with appropriate APIs for querying, and uploading data in order to effectively share. This talk will discuss recent progress made in developing a suite of open chemistry applications on the desktop. The applications can query online databases, such as the NIH structure resolver service, download and manipulate structures, and prepare input files for standalone computational chemistry codes. Another application developed to submit jobs, monitor and retrieve results from HPC resources will also be shown, and a desktop chemistry database browser. The Quixote project aims to establish standards for data exchange in computational chemistry, along with data repositories for organizations. 
Establishing these standards is important to promote open, reproducible chemistry, and their integration into user-friendly desktop applications will promote their integration in the standard workflow of researchers.
Have you ever wondered how search works while visiting an e-commerce site, internal website, or searching through other types of online resources? Look no further than this informative session on the ways that taxonomies help end-users navigate the internet! Hear from taxonomists and other information professionals who have first-hand experience creating and working with taxonomies that aid in navigation, search, and discovery across a range of disciplines.
This presentation by Morris Kleiner (University of Minnesota) was made during the discussion “Competition and Regulation in Professions and Occupations” held at the Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found at oe.cd/crps.
This presentation was uploaded with the author’s consent.
This presentation, created by Syed Faiz ul Hassan, explores the profound influence of media on public perception and behavior. It delves into the evolution of media from oral traditions to modern digital and social media platforms. Key topics include the role of media in information propagation, socialization, crisis awareness, globalization, and education. The presentation also examines media influence through agenda setting, propaganda, and manipulative techniques used by advertisers and marketers. Furthermore, it highlights the impact of surveillance enabled by media technologies on personal behavior and preferences. Through this comprehensive overview, the presentation aims to shed light on how media shapes collective consciousness and public opinion.
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
OWASP Beja
If you offer a service on the web, odds are that someone will abuse it. Be it an API, a SaaS, a PaaS, or even a static website, someone somewhere will try to figure out a way to turn it to their own ends. In this talk we'll compare measures that are effective against static attackers with ways to battle a dynamic attacker who adapts to your counter-measures.
About the Speaker
===============
Diogo Sousa, Engineering Manager @ Canonical
An opinionated individual with an interest in cryptography and its intersection with secure software development.
Acorn Recovery: Restore IT infra within minutes
IP ServerOne
Introducing Acorn Recovery as a Service, a simple, fast, and secure managed disaster recovery (DRaaS) by IP ServerOne. A DR solution that helps restore your IT infra within minutes.
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Orkestra
UIIN Conference, Madrid, 27-29 May 2024
James Wilson, Orkestra and Deusto Business School
Emily Wise, Lund University
Madeline Smith, The Glasgow School of Art
11. - Probabilistic model:
- Pr(c|q): weight of certain subtopic in a query
- Pr(q|d), Pr(d|q): relation between document and query
- Pr(c|d), Pr(d|c): relation between document and subtopic
- IA-Select objective function:
IA-Select Model [Vallet and Castells. 2012]
[Objective function shown as an image; terms annotated: document relevance, novelty]
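The IA-Select objective function itself appears only as an image in the deck. A reconstruction in the probabilistic notation defined on the slide, assuming the slide follows Agrawal et al.'s greedy marginal-gain formulation as recast by Vallet and Castells:

```latex
g(d \mid q, S) \;=\; \sum_{c} \Pr(c \mid q)\, \Pr(d \mid c) \prod_{d' \in S} \bigl(1 - \Pr(d' \mid c)\bigr)
```

Here Pr(d|c) supplies the document-relevance term and the product over the already selected set S supplies the novelty term, matching the two annotations on the slide.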
12. • MMR [Carbonell and Goldstein. 98]
- diversify based on similarity of document contents
• IA-Select [Agrawal et al. 2009]
- diversify based on the taxonomy of subtopic categories
• xQuaD [Santos et al. 2010]
- general form of IA-Select
- define objective function as a mixture of relevance and diversity
probabilities
• Topic richness [Dou et al. 2011]
- general form of xQuaD and IA-Select models
- accepts topics from multiple sources
Search Result Diversification Models
13. - Probabilistic model:
- Pr(c|q): weight of certain subtopic in a query
- Pr(q|d), Pr(d|q): relation between document and query
- Pr(c|d), Pr(d|c): relation between document and subtopic
- xQuaD objective function:
xQuaD Model [Vallet and Castells. 2012]
[Objective function shown as an image; terms annotated: document-query relevance, document-topic relevance, novelty]
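The xQuaD objective on the slide mixes document-query relevance with subtopic coverage, i.e. (1 − λ) Pr(d|q) + λ Σ_c Pr(c|q) Pr(d|c) Π_{d'∈S}(1 − Pr(d'|c)). A minimal greedy re-ranking sketch of that formula; the function name and the dict-based probability estimates are hypothetical stand-ins, not the authors' implementation:

```python
def xquad_rerank(docs, p_rel, p_topic, p_doc_topic, lam=0.5, k=10):
    """Greedy xQuaD re-ranking (sketch).
    docs: candidate ids; p_rel[d] ~ Pr(d|q); p_topic[c] ~ Pr(c|q);
    p_doc_topic[(d, c)] ~ Pr(d|c). All probabilities are assumed inputs."""
    selected, remaining = [], list(docs)
    # novelty[c] = product over selected d' of (1 - Pr(d'|c))
    novelty = {c: 1.0 for c in p_topic}
    while remaining and len(selected) < k:
        def score(d):
            div = sum(p_topic[c] * p_doc_topic.get((d, c), 0.0) * novelty[c]
                      for c in p_topic)
            return (1 - lam) * p_rel[d] + lam * div
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
        for c in p_topic:
            novelty[c] *= 1 - p_doc_topic.get((best, c), 0.0)
    return selected
```

With λ high enough, a less relevant document covering an untouched subtopic can be promoted above a redundant, more relevant one, which is exactly the diversity/relevance trade-off the mixture expresses.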
16. • TREC Blogs08 Collection
- crawled from Jan 2008 to Feb 2009
- clean HTML tags using HtmlCleaner and Boilerpipe libraries
- index using Lucene Core
- document’s publication date extracted from:
- Blog content
- URL
- Retrieval date
Experiment Settings
17. • Retrieval baseline:
- Okapi BM25
• Relevance assessments:
- human assessment
- binary relevance judgments, following TREC Diversity Track 2009 and 2011
- 2-dimensional assessment: relevance and time
- exclude topics mined from the query log (time gap between AOL and Blogs08)
- the top 10 highest-probability words represent a topic
• Querying-time points:
- how popular the query is at a particular time t
- how different it is from the previous time slice t-1
Experiment Settings
18. Relevance assessments
Columns: Document | Publication Date | Subtopic | Hitting time | Relevance

- Title: It’s The Most Wonderful Time Of The Year
  Content: The greatest sports week of the year is upon us, and Chris’s Sports Blog is ready. Check back daily for coverage of the ACC and NCAA Tournament, tips on how to fill out your bracket …
  Publication date: 2008-03-17 | Subtopic: ncaa basketball tournament | Hitting time: 2008-03 | Relevance: 1
- Title: Is there a bigger joke than the NCAA?
  Content: Each year the NCAA discovers a new way to make a bigger ass of itself than the previous one. This year’s specialty is to bar from NCAA playoffs in every sport any school who persists in using "unacceptable" team names and school mascots, by their exclusive definition…
  Publication date: 2005-08-22 | Subtopic: ncaa basketball tournament | Hitting time: 2008-03 | Relevance: 0
- Title: Apple Quince Jam
  Content: The apple quince is a fruit that ripens in the period from October to November; it is an apple with a strange shape: it looks like a pear and apple, and it is lumpy…
  Publication date: 2007-11-15 | Subtopic: apple jam | Hitting time: 2008-03 | Relevance: 1
19. • α-NDCG
- adds diversity and novelty to nDCG
• Intent-Aware Precision (Precision-IA)
- intent-aware version of Precision
- treats each subtopic as a distinct interpretation of the query
• Intent-Aware Expected Reciprocal Rank (ERR-IA)
- based on the cascade model of search
Evaluation Metrics
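The α-DCG gain underlying α-NDCG can be sketched in a few lines: each document's gain per covered subtopic is discounted by (1 − α) for every earlier document that already covered that subtopic. The function name and the set-of-subtopics input format are illustrative, not from the paper:

```python
import math

def alpha_dcg(ranking_subtopics, alpha=0.5, k=10):
    """alpha-DCG: rewards covering new subtopics, penalizes redundancy.
    ranking_subtopics: for each rank, the set of subtopics the document
    is judged relevant to (hypothetical judgments)."""
    seen = {}   # subtopic -> number of earlier documents covering it
    score = 0.0
    for i, subtopics in enumerate(ranking_subtopics[:k]):
        # gain decays by (1 - alpha) per prior coverage of the subtopic
        gain = sum((1 - alpha) ** seen.get(c, 0) for c in subtopics)
        score += gain / math.log2(i + 2)   # standard DCG rank discount
        for c in subtopics:
            seen[c] = seen.get(c, 0) + 1
    return score
```

Full α-NDCG divides this by the α-DCG of an ideal ranking (usually built greedily, since the exact ideal is intractable); Precision-IA and ERR-IA likewise average their base metric over the subtopic weights Pr(c|q).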
22. Conclusion
• studied temporally ambiguous and multi-faceted queries
- subtopic temporal variability
- subtopic mining from two different sources (query logs, document collection)
• proposed time-aware search result diversification frameworks
Future work
• model and predict subtopic change
• combine diversifying by subtopics and time in a unified framework
24. Settings
• Estimate the natural number of subtopics
– Suresh et al. 2010 view LDA as a matrix factorization mechanism
– C_{d×w} = M1_{d×t} × M2_{t×w}
• d: number of documents in the corpus
• w: size of the vocabulary
• t: number of topics
– the optimal t is the one with minimal divergence value:
• divergence(t) = KL(C_{M1} ‖ C_{M2}) + KL(C_{M2} ‖ C_{M1})
• C_{M1} is the distribution of singular values of the topic-word matrix M2
• C_{M2} is obtained by normalizing the vector L · M1, where L is a 1 × d vector of the lengths of each document in C
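The topic-number heuristic above can be sketched with NumPy. The function name is hypothetical, and the sketch adopts the dimension-consistent reading of the slide (singular values come from the topic-word factor, and the length vector L multiplies the document-topic factor); the eps smoothing and descending sort follow the original heuristic's usual presentation:

```python
import numpy as np

def topic_number_divergence(doc_topic, topic_word, doc_lengths, eps=1e-12):
    """Symmetric KL divergence between two topic distributions derived
    from an LDA factorization C ~ doc_topic (d x t) @ topic_word (t x w).
    The t minimizing this value is taken as the natural topic number."""
    # C_M1: singular values of the topic-word factor, as a distribution
    sv = np.linalg.svd(topic_word, compute_uv=False)       # t values
    c1 = np.sort(sv)[::-1]
    c1 = c1 / c1.sum()
    # C_M2: length-weighted topic proportions, L (1 x d) @ doc_topic
    c2 = np.sort(doc_lengths @ doc_topic)[::-1]            # t values
    c2 = c2 / c2.sum()
    # symmetric KL divergence between the two distributions
    kl = lambda p, q: float(np.sum(p * np.log((p + eps) / (q + eps))))
    return kl(c1, c2) + kl(c2, c1)
```

In use, one would fit LDA for each candidate t in the preset range (5 to 20 on the later slide) and keep the t with the smallest divergence.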
25. • New documents appear all the time
• Document content changes over time
• Queries and query volumes change over time
• Example: [Kulkarni et al. 2011]
[Figure: query volumes of the queries “march madness” and “ncaa” over time]
Motivation
27. Cluster Subtopic Candidates
• Clustering approach [Song et al. 2011]:
– step 1: Construct a similarity matrix of the related queries
– step 2: Cluster using Affinity Propagation algorithm
– step 3: Extract a set of exemplars as subtopics of the query
• Similarity metrics:
– lexical similarity:
• keywords and cosine similarity
– co-click similarity:
• based on fraction of common clicks
– semantic similarity:
• use WordNet as external KB
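Step 1 of the clustering approach, constructing the similarity matrix over related queries, can be sketched for the lexical component (keywords with cosine similarity). The function names are illustrative; the co-click and WordNet-based components would be added as further terms per the slide:

```python
import math
from collections import Counter

def cosine_sim(q1: str, q2: str) -> float:
    """Lexical similarity between two related queries:
    cosine over term-frequency vectors of their keywords."""
    v1, v2 = Counter(q1.split()), Counter(q2.split())
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def similarity_matrix(queries):
    """Pairwise similarity matrix over related queries (lexical part only),
    suitable as precomputed input to an exemplar-based clusterer."""
    return [[cosine_sim(a, b) for b in queries] for a in queries]
```

For steps 2 and 3, this matrix could then be fed to an Affinity Propagation implementation (e.g. scikit-learn's `AffinityPropagation` with `affinity='precomputed'`), whose exemplars become the query's subtopics.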
28. • Vector-based:
– Cosine Similarity
• Bag of words-based:
– Jaccard Coefficient
• Ranked list of words-based:
– Kendall τ Coefficient
• Multinomial distribution-based:
– Kullback-Leibler Divergence
– Jensen-Shannon Divergence
Topic Similarity Metrics [Kim and Oh. 2011]
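The distribution-based metrics on this slide are standard. A small sketch of Kullback-Leibler and Jensen-Shannon divergence over topic word distributions, plus the Jaccard coefficient over bags of words (function names and eps smoothing are my own choices):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q), eps-smoothed."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jaccard(words_a, words_b):
    """Jaccard coefficient over two bags (sets) of topic words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)
```

Unlike KL, JS is symmetric and bounded, which is why it is often preferred for comparing topics across time slices.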
29. Subtopic Mining Approach
• Dynamic queries: select 57 of the 61 queries from the AOL query log, i.e. those that are yearly recurrent or time-independent
• Settings
– partition the collection into 14 one-month time slices
– training data in time slice t_i is the top 2000 documents D with d ∈ D, pubDate(d) ∈ t_i
– the number of subtopics is preset in the range from 5 to 20
• Subtopic weight:
– the weight w(c) is the probability that a given query q implies subtopic c
30. Temporal Document Collection
• TREC Blogs08 Collection
- crawled from Jan 2008 to Feb 2009
- clean HTML tags using HtmlCleaner and Boilerpipe libraries
- index using Lucene Core
- document’s publication date extracted from:
- Blog content
- URL
- Retrieval date
31. Subtopic Evaluation
• 61 queries: 51 event-related queries, 10 standard ambiguous queries
– aspect removed e.g. march madness brackets → march madness
• Subtopic evaluation metrics [Radlinski et al. 2010]:
– coherence
– distinctness
– plausibility
– completeness
32. Subtopic Evaluation
- Perplexity: a measure of a model's ability to generalize to unseen documents
- use holdout validation with 90% of the data for training and 10% for testing
- randomly select 20 out of 57 queries at a random time slice
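The perplexity formula on this slide appears only as an image in the deck. Assuming the standard held-out perplexity for topic models, exp(−Σ log-likelihood / token count), a minimal sketch (the function name is hypothetical):

```python
import math

def perplexity(log_likelihoods, token_count):
    """Perplexity of a model on held-out documents:
    exp(- sum of per-document log-likelihoods / total token count).
    Lower values mean the model generalizes better to unseen text."""
    return math.exp(-sum(log_likelihoods) / token_count)
```

For the holdout validation described above, the log-likelihoods would come from evaluating the trained subtopic model on the 10% test split.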
Editor's Notes
In this work, we use a state-of-the-art related-query finding technique: for a query considered dynamic, we mine its set of related queries, assuming that these related queries expose the underlying subtopics of the dynamic query.
Here we run a Markov random walk with restart (RWR) on the weighted bipartite graph composed of two sets of nodes, queries and URLs. The bipartite graph is partitioned by time into separate parts, from which we obtain a set of (explicit and implicit) related queries at different time intervals.
LDA is a Bayesian multinomial mixture model that has become a state-of-the-art and popular method in text analysis due to its ability to produce interpretable and semantically coherent topics.
A topic is represented by a ranked list of word probabilities.
MMR instantiates the scoring function by estimating the similarity between d ∈ R\S and its most similar document dj ∈ S.
IA-Select investigates the problem of ambiguous queries with the overall objective of maximizing the probability that an average user finds at least one relevant document in the top n search results. The model assumes an explicit taxonomy of subtopics is available, and both documents and queries may fall into multiple subtopics.
xQuaD defines the objective function for diversification as a mixture of relevance and diversity probabilities.
Topic richness is a generalization of the xQuaD and IA-Select models that combines subtopics from different sources.
There are temporal aspects to Web search queries that search engines must account for in order to provide the most relevant results to their users. As an example, in the middle of March 2008, the query march madness suddenly became very popular, occurring thousands of times, when one month before it had occurred infrequently. The rise in popularity was a result of the popular annual college basketball championship in the United States.
In addition to changes in query frequency, there were other changes associated with the query march madness during the championship period. For one, the National Collegiate Athletic Association (NCAA) homepage (http://ncaa.com) became very relevant to the query. The page provides comprehensive coverage of US college sports, but does not typically focus on basketball, except in March during March Madness. Other results were more relevant to the query march madness during March because they provided dynamic content. For example, the CBS Sports college basketball page, which provides real-time game information, became relevant to people seeking to learn the score of a game in progress. In contrast, relatively static pages, like the Wikipedia page about March Madness, became less relevant during this period of high interest. Such pages are useful for learning about March Madness in general, but not for actively monitoring the event, and thus are better suited to satisfy the need of searchers when the query is not spiking. The changes in which pages were relevant to the query march madness during the month of March reflect the fact that people’s query intent was also changing.
The random walk with restart on the clickthrough graph gives us a set of related queries. However, these related queries can be duplicates or near-duplicates in semantic meaning. In the next step, we cluster these related queries in order to eliminate redundancy in the list.
We apply Affinity Propagation (AP) as the clustering method. AP is an exemplar-based clustering method: it takes as input similarities between data points and outputs a set of data points (exemplars) that best represent the data, assigning each non-exemplar point to its most appropriate exemplar, thereby grouping the data points into clusters.