We estimate that nearly one third of news articles contain references to future events. While this information can prove crucial to understanding news stories and how events will develop for a given topic, there is currently no easy way to access it. We propose a new task, which we call ranking related news predictions, to address the problem of retrieving and ranking sentences that contain mentions of future events. In this paper, we formally define this task and propose a learning-to-rank approach based on four classes of features: term similarity, entity-based similarity, topic similarity, and temporal similarity. Through extensive evaluations using a corpus of 1.8 million news articles and 6,000 manually judged relevance pairs, we show that our approach is able to retrieve a significant number of relevant predictions related to a given topic.
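As a rough illustration of how feature classes like these could be combined into a ranking score, the following minimal sketch covers only two of the four classes (term and temporal similarity). The function names, the exponential decay, and the fixed weights are assumptions made for illustration; a learning-to-rank model would learn the weights from the judged relevance pairs rather than use hand-set values.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def temporal_similarity(query_year: int, predicted_year: int, scale: float = 5.0) -> float:
    """Hypothetical temporal feature: decays as the predicted date moves away from the query date."""
    return math.exp(-abs(predicted_year - query_year) / scale)


def rank_predictions(query, query_year, predictions, w_term=0.7, w_time=0.3):
    """Rank (sentence, predicted_year) pairs by a weighted sum of two features.

    The weights are hand-set placeholders, not learned values.
    """
    scored = [
        (w_term * cosine(query, s) + w_time * temporal_similarity(query_year, y), s)
        for s, y in predictions
    ]
    return [s for _, s in sorted(scored, key=lambda p: p[0], reverse=True)]
```

Entity-based and topic similarity would enter the same weighted sum as additional features.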
The Efficacy of Technology Acceptance Model: A Review of Applicable Theoretic... (QUESTJOURNAL)
ABSTRACT: This is a review of the theoretical models most recently used in Information Technology adoption research. A literature review approach has been adopted. More than 25 publications in the area of information adoption, covering the last 30 years, were reviewed. We identified the strengths and weaknesses of each theory used. We found that the Technology Acceptance Model is by far the most used to underpin research work in this area, followed by the Theory of Planned Behaviour.
Practical Opinion Mining for Social Media (Diana Maynard)
This tutorial will introduce the concepts of sentiment analysis and opinion mining from unstructured text in social media, looking at why they are useful and what tools and techniques are available. It will cover both rule-based and machine learning techniques, provide some background information on the key underlying NLP processes required, and look in detail at some of the major problems and solutions, such as detection of sarcasm, use of informal language, spam opinion detection, trustworthiness of opinion holders, and so on. The techniques will be demonstrated with real applications developed in GATE, an open-source language processing toolkit. Links are provided to some hands-on material to try at home.
Text mining to correct missing CRM information: a practical data science project (Jonathan Sedar)
20min talk given at PyData London 2014
A client in the energy sector wanted to create predictive behavioural models of business customers at the company level, but the CRM data was messy, often containing several sub-accounts for each business, without any grouping identifiers, and so aggregation was impossible. In this talk I describe a short project where we used text mining, a handful of unsupervised learning techniques and pragmatic use of human skill, to identify the true company level structures in the CRM data.
Recommender Systems and Active Learning (Dain Kaplan)
This presentation gives a high-level overview of recommender systems and active learning, including the viewpoints of startups vs. established companies, the cold-start problem, etc.
Online recommendations at scale using matrix factorisation (Marcus Ljungblad)
This presentation was used for my thesis defense held at Universidad Politecnica de Catalunya, Spain, for a double-degree master programme in Distributed Computing. The other two universities participating in the programme are Royal Institute of Technology, Stockholm, Sweden and Instituto Tecnico Superior, Lisbon, Portugal.
Data management services outsourcing – data mining, data entry and data proce... (Sam Studio)
Sam Studio is an outsourced data management services provider. We offer data mining, data entry, data processing, data conversion, electronic publication, data analysis and OCR services to our global customers.
To download please go to: http://www.intelligentmining.com/category/knowledge-base/
Slides as presented by Alex Lin to the NYC Predictive Analytics Meetup group: http://www.meetup.com/NYC-Predictive-Analytics/ on Dec. 10, 2009.
Machine Learning for Forecasting: From Data to Deployment (Anant Agarwal)
Forecasting is everywhere. This talk covers:
• Fundamental concepts of time series
• Data preprocessing (imputation and outlier analysis)
• Feature engineering and EDA for time series
• Statistical and machine learning algorithms
• Model evaluation through backtesting
• Model explanation using SHAP
• Model monitoring and deployment considerations
2016 data-science-salary-survey - O’Reilly Data Science (Adam Rabinovitch)
IN THIS FOURTH EDITION of the O’Reilly Data Science Salary Survey, they analyzed input from 983 respondents working in the data space, across a variety of industries, representing 45 countries and 45 US states. Through the results of our 64-question survey, we’ve explored which tools data scientists, analysts, and engineers use, which tasks they engage in, and, of course, how much they make.
Key findings include:
• Python and Spark are among the tools that contribute most to salary.
• Among those who code, the highest earners are the ones who code the most.
• SQL, Excel, R and Python are the most commonly used tools.
• Those who attend more meetings earn more.
• Women make less than men for doing the same thing.
• Country and US state GDP serves as a decent proxy for geographic salary variation (not as a direct estimate, but as an additional input for a model).
• The most salient division between tool and task usage is between those who mostly use Excel, SQL, and a small number of closed source tools, and those who use more open source tools and spend more time coding.
• R is used across this division: even people who don’t code much or use many open source tools use R.
• A secondary division emerges among the coding half, separating a younger, Python-heavy data scientist/analyst group from a more experienced data scientist/engineer cohort that tends to use a high number of tools and earns the highest salaries.
A presentation on the value and the risks of identifying, mining, and visualizing data. All this is described in a big-data-aware setting. The presentation is meant for a wide audience, not requiring deep technical background.
The original presentation was done within the KAS Seminar on Data Journalism in Dec 2017.
Annotation_M1.docx Subject Information Systems Joshi, G. (2.docx (justine1simpson78276)
Annotation_M1.docx
Subject: Information Systems
Joshi, G. (2013). Management information systems (Oxford Higher Education). New Delhi: Oxford University Press.
This work is an informational piece written to provide knowledge about management information systems for individuals interested in management. The author is a very intelligent and reliable source for this topic with a vast knowledge base of all things entailed in information systems. His in-depth coverage of the structures and concepts of this topic contribute greatly to the excellence of this book. Infrastructure information found in the work provides great insight into things such as databases, hardware, software, and other components of information systems. Talented author Joshi wraps up his ideas by highlighting the importance of the development, management, and challenges of management information systems. This is a great work that serves as an extremely reliable source of reference for those interested in this topic.
Gregor, S. (2006). The Nature of Theory in Information Systems. MIS Quarterly, 30(3), 611-642.
In The Nature of Theory in Information Systems, Gregor sheds light on research into the nature of theory in information systems. The article addresses prediction, generalization, and explanation. While it does so in a well-researched and knowledgeable manner, it neglects a few points about information systems, such as their structure. The concentration on prediction, explanation, and related topics is to be praised, as the author provides an intelligent take on them. As for the article as a whole, the author could have done a better job by providing more details about the aforementioned, and other, neglected topics.
Varajão, J. (2013). Enterprise information systems. The Learning Organization, 20(6) doi:10.1108/TLO-10-2013-0059.
In these articles, Varajão gives a great account of the significance of enterprise information systems and their importance to individuals in fields where information systems are frequently used to do the job. The author brings together five articles that all support the topic of enterprise information systems. He is able to successfully cover every aspect of the topic and leaves nothing to the imagination. His expertise serves as a reliable source of information for this topic, as well as all of its supporting information. His vast knowledge of enterprise information systems and how they can benefit many different areas is extremely commendable and allows him to serve as an expert in the field.
Wang, C., Wu, C., Wu, C., Chen, D., & Hu, Q. (2008). Communicating Between Information Systems. Information Sciences, 178(16), 3228-3239. doi:10.1016/j.ins.2008.03.017.
This article serves as an intelligent source of reference for anyone interested in the art of communicating between information systems by highlighting how to do so and why it is important.
This is my class project using the UCI Mashable dataset to determine what constitutes popular news. In this project, I used (1) multiple regression and model building and (2) PCA and factor analysis.
Data Analytics Tools: SAS and R
Corporate bankruptcy prediction using Deep learning techniques (Shantanu Deshpande)
Corporate bankruptcy prediction using recurrent neural networks. The aim is to build a recurrent neural network-based model that predicts whether a company will go bankrupt, using financial ratios of Polish companies.
Methodologies & Tools: CRISP-DM, SMOTE-ENN, GA Algorithm, LSTM network (type of RNN)
Crime Risk Forecasting: Near Repeat Pattern Analysis & Load Forecasting (Azavea)
http://www.azavea.com/hunchlab
This is a rather technical dive into the near repeat pattern analysis and load forecasting features that we've built into HunchLab. Both of these features are aimed at helping a law enforcement agency to better predict risk levels across its jurisdiction and allocate resources accordingly. While no application of predictive analytics will be perfect, forecasting risk based on models of the past can help officers and analysts to anticipate the appropriate next steps.
Near repeat pattern analysis helps officers quantify the risk that arises from multiple incidents happening close to one another in space and time. What we are quantifying is how the fact that your neighbor's house is burgled raises your risk of a burglary in the coming days and weeks.
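The core idea of incidents being close in both space and time can be sketched as a simple pair count. This is only an illustration, not HunchLab's implementation: the coordinate format and the 200 m / 14 day thresholds are assumptions, and a full near repeat analysis would compare the observed counts against a randomized baseline before drawing conclusions about risk.

```python
import math
from itertools import combinations


def near_repeat_pairs(incidents, max_metres=200.0, max_days=14.0):
    """Return index pairs of incidents that are close in both space and time.

    incidents: list of (x_metres, y_metres, day) tuples (a hypothetical format).
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(incidents), 2):
        close_in_space = math.hypot(a[0] - b[0], a[1] - b[1]) <= max_metres
        close_in_time = abs(a[2] - b[2]) <= max_days
        if close_in_space and close_in_time:
            pairs.append((i, j))
    return pairs
```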
With load forecasting we are looking at cyclical temporal patterns in incidents. How does the time of year, time of day, and day of week change the levels of crime incidents that we should expect across a jurisdiction? By modeling these cyclical patterns we can project crime levels into the future, helping law enforcement agencies to allocate resources appropriately as well as better manage organizational accountability.
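A minimal sketch of this kind of cyclical modeling, assuming incidents arrive as a plain list of timestamps: it averages historical counts per (day of week, hour) slot and uses that average as the forecast. The function names are hypothetical, time of year is left out for brevity, and this is not a description of HunchLab's actual model.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def fit_hour_of_week_profile(timestamps):
    """Average number of incidents per (weekday, hour) slot per observed week."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[(ts.weekday(), ts.hour)] += 1
    span_days = (max(timestamps) - min(timestamps)).days + 1
    weeks = math.ceil(span_days / 7)  # number of weeks covered by the data
    return {slot: c / weeks for slot, c in counts.items()}


def forecast(profile, when):
    """Expected incident count for the slot that `when` falls into."""
    return profile.get((when.weekday(), when.hour), 0.0)
```

For example, a history with two incidents every Friday at 22:00 yields an expected load of two incidents for any future Friday at 22:00 and zero for slots with no history.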
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ... (Nattiya Kanhabua)
In human memory, forgetting plays a crucial role in focusing on important things and neglecting irrelevant details. In digital memories, the idea of systematic forgetting has received little attention so far. At first glance, forgetting seems to contradict the purpose of archival and preservation. However, we are currently facing a tremendous growth in volumes of digital content. Thus, it becomes ever more important to focus, while forgetting irrelevant details, redundancies and noise. This holds true for better organizing the information space as well as in preservation management for making and revisiting decisions on what to keep. Therefore, we propose the introduction of the concept of managed forgetting as part of a joint information management and preservation management process in digital memories. Managed forgetting models resource selection as a function of attention and significance dynamics. Based on dynamic, multidimensional information value assessment, it identifies information objects (e.g., documents or images) of decreasing importance and/or topicality and triggers forgetting actions. Those actions include a variety of options, namely, aggregation and summarization, revised search and ranking behavior, elimination of redundancy, and finally, also deletion. In this paper, we present our vision for managed forgetting, discuss the challenges as well as our first ideas for its introduction, and present a case study for its motivation.
Understanding the Diversity of Tweets in the Time of Outbreaks (Nattiya Kanhabua)
A microblogging service like Twitter continues to surge in importance as a means of sharing information in social networks. In the medical domain, several works have shown the potential of detecting public health events (i.e., infectious disease outbreaks) using Twitter messages or tweets. Given its real-time nature, Twitter can enhance early outbreak warning for public health authorities so that a rapid response can take place. Most previous work on detecting outbreaks in Twitter simply analyzes tweets matching disease names and/or locations of interest. However, the effectiveness of such a method is limited for two main reasons. First, disease names are highly ambiguous, i.e., they may appear as slang or in non-health-related contexts. Second, the characteristics of infectious diseases are highly dynamic in time and place, namely, they are strongly time-dependent and vary greatly among different regions. In this paper, we propose to analyze the temporal diversity of tweets during the known periods of real-world outbreaks in order to gain insight into a temporary focus on specific events. More precisely, our objective is to understand whether the temporal diversity of tweets can be used as an indicator of outbreak events, and to what extent. We employ an efficient algorithm based on sampling to compute the diversity statistics of tweets at a particular time. To this end, we conduct experiments by correlating temporal diversity with the estimated event magnitude of 14 real-world outbreak events manually created as ground truth. Our analysis shows that correlation results are diverse among different outbreaks, which can reflect the characteristics (severity and duration) of outbreaks.
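The abstract does not name the diversity statistic or the sampling algorithm, so the following is only a stand-in sketch of the general idea: estimate the diversity of a set of tweets as the mean Jaccard distance over randomly sampled pairs instead of over all pairs. The function names, the whitespace tokenization, and the choice of Jaccard distance are all assumptions.

```python
import random


def jaccard_distance(a: set, b: set) -> float:
    """1 minus the Jaccard similarity of two token sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def sampled_diversity(tweets, n_pairs=1000, seed=0):
    """Estimate mean pairwise Jaccard distance from a random sample of pairs."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    token_sets = [set(t.lower().split()) for t in tweets]
    if len(token_sets) < 2:
        return 0.0
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(token_sets)), 2)
        total += jaccard_distance(token_sets[i], token_sets[j])
    return total / n_pairs
```

Under this stand-in measure, a low value in a time window would suggest that tweets are concentrating on one event, while a high value would suggest scattered topics.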
Why Is It Difficult to Detect Outbreaks in Twitter? (Nattiya Kanhabua)
In this paper, we present an event-based Epidemic Intelligence (EI) system framework leveraging social media data, e.g., Twitter messages (or tweets) for providing public health officials the necessary tools to survey and sift through relevant information, namely, disease outbreak events. There exist three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) signal generation producing reliable warnings based on observed term frequency changes in the filtered messages, and 3) providing search and recommendation functionalities to domain experts, for better assessment of the potential outbreak threats associated with the generated signals. We outline possible approaches to solve these important challenges as well as discuss areas where further research is required. The aim of this paper is to provide guidance for similar endeavors, and to give prospective event-based Epidemic Intelligence system builders a more realistic view on the benefits and issues of social media stream analysis.
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification (Nattiya Kanhabua)
Search result diversification is a common technique for tackling the problem of ambiguous and multi-faceted queries by maximizing the coverage of query aspects or subtopics in a result list. In some special cases, subtopics associated with such queries can be temporally ambiguous; for instance, the query US Open is more likely to be targeting the tennis open in September, and the golf tournament in June. More precisely, users' search intent can be identified by the popularity of a subtopic with respect to the time when the query is issued. In this paper, we study search result diversification for time-sensitive queries, where the temporal dynamics of query subtopics are explicitly determined and modeled into result diversification. Unlike previous work that, in general, considered only static subtopics, we leverage dynamic subtopics by analyzing two data sources (i.e., query logs and a document collection). These data sources provide insights from different perspectives into how query subtopics change over time. Moreover, we propose novel time-aware diversification methods that leverage the identified dynamic subtopics. A key idea is to re-rank search results based on the freshness and popularity of subtopics. Our experimental results show that the proposed methods can significantly improve the diversity and relevance effectiveness for time-sensitive queries in comparison with state-of-the-art methods.
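To make the re-ranking idea concrete, here is a minimal greedy sketch in the spirit of coverage-based diversification: each step picks the document with the best mix of relevance and not-yet-covered subtopics, where subtopic weights depend on when the query is issued. It is an illustration under assumed inputs (the data layout, weights, and the `lam` trade-off are hypothetical), not the methods proposed in the paper.

```python
def gain(relevance, topics, covered, weights, lam):
    """Mix of relevance and the time-dependent weight of uncovered subtopics."""
    novelty = sum(weights.get(s, 0.0) for s in topics - covered)
    return (1 - lam) * relevance + lam * novelty


def diversify(docs, weights, k=3, lam=0.5):
    """Greedy re-ranking.

    docs: dict doc_id -> (relevance, set of subtopics).
    weights: dict subtopic -> popularity at the time the query is issued.
    """
    selected, covered = [], set()
    remaining = dict(docs)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda d: gain(*remaining[d], covered, weights, lam))
        selected.append(best)
        covered |= remaining.pop(best)[1]  # mark the chosen document's subtopics as covered
    return selected
```

With the US Open example, giving the tennis subtopic a high weight in September and the golf subtopic a high weight in June changes which document is ranked first, even though the relevance scores stay the same.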
On the Value of Temporal Anchor Texts in Wikipedia (Nattiya Kanhabua)
Wikipedia has become a widely accepted reference point for information of all kinds: real-world events (e.g., natural disasters, man-made incidents, and political events) as well as specific entities like politicians, celebrities, and entities involved in an event. Due to its open construction and negotiation, Wikipedia is an important new cultural and societal phenomenon, and the content of Wikipedia articles is a valuable source for different applications. For instance, the edit history and view logs of Wikipedia can be leveraged for detecting an event and its associated entities. In this study, we analyze temporal anchor texts extracted from the edit history. We propose a model for Wikipedia and anchor texts viewed as a temporal resource and a probabilistic method for ranking temporal anchor texts. Our preliminary results show that relevant anchor texts are composed of evolving information (e.g., the changes of names and semantic roles, as well as evolving context) that reflects societal trends and perceptions, and are thus candidates for capturing entity evolution.
Wikipedia is a free multilingual online encyclopedia covering a wide range of general and specific knowledge. Its content is continuously kept up to date and extended by a supporting community. In many cases, real-world events influence the collaborative editing of Wikipedia articles of the involved or affected entities. In this paper, we present Wikipedia Event Reporter, a web-based system that supports the entity-centric, temporal analytics of event-related information in Wikipedia by analyzing the whole history of article updates. For a given entity, the system first identifies peaks of update activities for the entity using burst detection and automatically extracts event-related updates using a machine-learning approach. Further, the system determines distinct events through the clustering of updates by exploiting different types of information such as update time, textual similarity, and the position of the updates within an article. Finally, the system generates a meaningful temporal summarization of event-related updates and automatically annotates the identified events in a timeline.
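The burst detection algorithm used by the system is not named here, so the following is a simple stand-in to illustrate the first step: flag time units whose update count exceeds the mean by more than `k` standard deviations. The function name and the threshold rule are assumptions for illustration.

```python
from statistics import mean, pstdev


def detect_bursts(counts, k=2.0):
    """Return indices whose update count exceeds mean + k * standard deviation."""
    threshold = mean(counts) + k * pstdev(counts)
    return [i for i, c in enumerate(counts) if c > threshold]
```

Given daily update counts for one article, the returned indices mark candidate peak days whose updates could then be passed on to classification and clustering.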
Temporal Web Dynamics: Implications from Search Perspective (Nattiya Kanhabua)
In this talk, we will give a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based only on term matching can give unsatisfactory results when searching the temporal web. The reasons for this are manifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely, 1) How to understand temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? and 3) How to explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to the addressed problems as well as outline the directions for future research.
Temporal Web Dynamics and Implications for Information Retrieval (Nattiya Kanhabua)
In this talk, we will give a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based only on term matching can give unsatisfactory results when searching the temporal web. The reasons for this are manifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely, 1) How to understand temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? and 3) How to explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to the addressed problems as well as outline the directions for future research.
Humans are very effective at remembering by abstraction, pattern exploitation, or contextualization. On the other hand, humans are also capable of forgetting irrelevant details, which plays an important role in the human brain by helping us to focus on relevant things instead of drowning in details by remembering everything. The research question that we address in this paper is: Can we learn from human remembering and forgetting in order to develop more advanced preservation technology? In particular, we aim at studying how a managed or controlled form of forgetting can play a role in digital preservation, including personal and organizational archives as well as collective memories. Our research goal is twofold: 1) to establish effective preservation for more concise and accessible digital memories, and 2) to enable the easier and wider adoption of preservation technology. The concept of managed forgetting is discussed in more detail in the research work of the European project ForgetIT, which investigates the proposed concept by means of an integrated information and preservation management approach.
Concise Preservation by Combining Managed Forgetting and Contextualized Remem... (Nattiya Kanhabua)
With the growing volumes of and reliance on digital content, there is a clear need for better information access solutions that keep relevant information accessible and usable in the long term. Inspired by the role of forgetting in the human brain, we envision a concept of managed forgetting for systematically dealing with information that progressively decreases in importance as well as with redundant information. Although inspired by human memory, managed forgetting is meant to complement rather than copy human remembering and forgetting. It can be regarded as a function of attention and significance dynamics relying on multi-faceted information assessment. This talk introduces our vision for managed forgetting on the conceptual level as part of an Integrated Cognitive Framework for Time-aware Information Access. We discuss relevant research and application aspects for managed forgetting. To this end, we present our first results and point out issues where further research is required.
In this talk, we present an event-based Epidemic Intelligence (EI) system framework leveraging social media data, e.g., Twitter messages (or tweets) for providing public health officials the necessary tools to survey and sift through relevant information, namely, disease outbreak events. There exist three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) signal generation producing reliable warnings based on observed term frequency changes in the filtered messages, and 3) providing search and recommendation functionalities to domain experts, for better assessment of the potential outbreak threats associated with the generated signals. We outline possible approaches to solve these important challenges as well as discuss areas where further research is required. The objective is to provide guidance for similar endeavors, and to give prospective event-based Epidemic Intelligence system builders a more realistic view on the benefits and issues of social media stream analysis.
Searching the Temporal Web: Challenges and Current Approaches (Nattiya Kanhabua)
This talk gives a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based only on term matching can give unsatisfactory results when searching the temporal web. The reasons for this are manifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely, 1) How to understand temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? and 3) How to explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to the addressed problems as well as outline the directions for future research.
Improving Temporal Language Models For Determining Time of Non-Timestamped Do... (Nattiya Kanhabua)
Taking the temporal dimension into account in searching, i.e., using time of content creation as part of the search condition, is now gaining increasing interest. However, in the case of web search and web warehousing, the timestamps (time of creation of the page or of its contents) of web pages and documents found on the web are in general not known or cannot be trusted, and must be determined otherwise. In this paper, we describe approaches that enhance and increase the quality of existing techniques for determining timestamps based on a temporal language model. Through a number of experiments on temporal document collections we show how our new methods improve the accuracy of timestamping compared to the previous models.
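The basic idea behind dating a document with a temporal language model can be sketched as follows: build one word-frequency model per time partition from timestamped documents, then assign an undated document to the partition whose model explains its words best. This sketch assumes unigram models, add-one smoothing, and a plain log-likelihood score; the scoring function and the enhancements described in the paper may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def build_models(partitions):
    """partitions: dict time_label -> list of timestamped documents (strings)."""
    return {
        label: Counter(w for d in docs for w in d.lower().split())
        for label, docs in partitions.items()
    }


def estimate_time(doc, models):
    """Return the time label whose unigram model gives the document the highest likelihood."""
    words = doc.lower().split()
    vocab = set().union(*models.values()) | set(words)

    def log_likelihood(counts):
        total = sum(counts.values()) + len(vocab)  # add-one smoothing denominator
        return sum(math.log((counts[w] + 1) / total) for w in words)

    return max(models, key=lambda label: log_likelihood(models[label]))
```

In practice each partition would hold a large number of documents from one time period (for example, one month or one year of a news archive).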
Exploiting temporal information in retrieval of archived documents (doctoral ... (Nattiya Kanhabua)
In the text retrieval community, many researchers have demonstrated good search quality on a current snapshot of the Web. However, only a small number have demonstrated good search quality in a long-term archival domain, where documents are preserved for a long time, i.e., ten years or more. In such a domain, a search application is useful not only for archivists or historians, but also in the context of national libraries and enterprise search (searching document repositories, emails, etc.). In the rest of this paper, we will explain three problems of searching document archives and propose possible approaches to solve these problems. Our main research question is: How to improve the quality of search in a document archive using temporal information?
Determining Time of Queries for Re-ranking Search Results (Nattiya Kanhabua)
Recent work on analyzing query logs shows that a significant fraction of queries are temporal, i.e., relevance is dependent on time, and temporal queries play an important role in many domains, e.g., digital libraries and document archives. Temporal queries can be divided into two types: 1) those with temporal criteria explicitly provided by users, and 2) those with no temporal criteria provided. In this paper, we deal with the latter type of queries, i.e., queries that comprise only keywords and whose relevant documents are associated with particular time periods not given by the queries. We propose a number of methods to determine the time of queries using temporal language models. After that, we show how to increase the retrieval effectiveness by using the determined time of queries to re-rank the search results. Through extensive experiments we show that our proposed approaches improve retrieval effectiveness.
We address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching only can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events happened at particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query. Our contributions are different time-aware approaches within three topics in IR: content analysis, query analysis, and retrieval and ranking models. In particular, we aim at improving the retrieval effectiveness by 1) analyzing the contents of temporal document collections, 2) performing an analysis of temporal queries, and 3) explicitly modeling the time dimension into retrieval and ranking.
Leveraging the time dimension in ranking can improve the retrieval effectiveness if information about the creation or publication time of documents is available. We analyze the contents of documents in order to determine the time of non-timestamped documents using temporal language models. We subsequently employ the temporal language models for determining the time of implicit temporal queries, and the determined time is used for re-ranking search results in order to improve the retrieval effectiveness. We study the effect of terminology changes over time and propose an approach to handling terminology changes using time-based synonyms.
In addition, we propose different methods for predicting the effectiveness of temporal queries, so that a particular query enhancement technique can be performed to improve the overall performance. When the time dimension is incorporated into ranking, documents will be ranked according to both textual and temporal similarity. In this case, time uncertainty should also be taken into account. Thus, we propose a ranking model that considers the time uncertainty, and improve ranking by combining multiple features using learning-to-rank techniques. Through extensive evaluation, we show that our proposed time-aware approaches outperform traditional retrieval methods and improve the retrieval effectiveness in searching temporal document collections.
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)Nattiya Kanhabua
Retrieval effectiveness of temporal queries can be improved by taking into account the time dimension. Existing temporal ranking models follow one of two main approaches: 1) a mixture model linearly combining textual similarity and temporal similarity, and 2) a probabilistic model generating a query from the textual and temporal part of document independently. In this paper, we propose a novel time-aware ranking model based on learning-to-rank techniques. We employ two classes of features for learning a ranking model, entity-based and temporal features, which are derived from annotation data. Entity-based features are aimed at capturing the semantic similarity between a query and a document, whereas temporal features measure the temporal similarity. Through extensive experiments we show that our ranking model significantly improves the retrieval effectiveness over existing time-aware ranking models.
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)Nattiya Kanhabua
News prediction retrieval has recently emerged as the task of retrieving predictions related to a given news story (or a query). Predictions are defined as sentences containing time references to future events. Such future-related information is crucially important for understanding the temporal development of news stories, as well as strategies planning and risk management. The aforementioned work has been shown to retrieve a significant number of relevant predictions. However, only a certain news topics achieve good retrieval effectiveness. In this paper, we study how to determine the difficulty in retrieving predictions for a given news story. More precisely, we address the query difficulty estimation problem for news prediction retrieval. We propose different entity-based predictors used for classifying queries into two classes, namely, Easy and Difficult. Our prediction model is based on a machine learning approach. Through experiments on real-world data, we show that our proposed approach can predict query difficulty with high accuracy.
1. Ranking Related News Predictions
Nattiya Kanhabua (1), Roi Blanco (2) and Michael Matthews (2)
(1) Norwegian University of Science and Technology, Norway
(2) Yahoo! Research, Barcelona, Spain
SIGIR’2011, Beijing
2. Ranking Related News Predictions
Outline
Introduction
Problem Statement
Related Work
Contributions
Task Definition
System Architecture
Models
Approach
Features
Ranking Method
Evaluation
Experiment Setting
Experimental Results
7. Ranking Related News Predictions
Introduction
Problem Statement
Problem statement
People are naturally curious about the future.
◮ How long will a war in the Middle East last?
◮ What is the latest health care plan?
◮ What will happen to EU economies in the next 5 years?
◮ What will be the potential effects of climate change?
Over 32% of 2.5M documents from Yahoo! News (July 2009 to July 2010) contain at least one prediction.
We introduce a new task called ranking related news predictions:
◮ Retrieve predictions related to a news story in news archives.
◮ Rank them according to their relevance to the news story.
12. Ranking Related News Predictions
Introduction
Problem Statement
Related News Predictions
Query = <gas, emission, percent, european, global, climate>
14. Ranking Related News Predictions
Introduction
Related Work
Future-related Information Analyzing Tools
Recorded Future
Difference: a user must specify a query in advance using “predefined” entities.
15. Ranking Related News Predictions
Introduction
Related Work
Future-related Information Analyzing Tools
Yahoo’s Time Explorer
Difference: No ranking or performance evaluation is done.
16. Ranking Related News Predictions
Introduction
Related Work
Previous Work on Future Retrieval
R. Baeza-Yates. Searching the future. SIGIR’2005 Workshop
on Mathematical/Formal Methods in IR.
◮ Extract temporal expressions from news articles.
◮ Retrieve future information using a probabilistic model, i.e.,
multiplying term similarity and a time confidence.
◮ Only a small data set and a year granularity are used.
17. Ranking Related News Predictions
Introduction
Related Work
Previous Work on Future Retrieval
A. Jatowt et al. Supporting analysis of future-related
information in news archives and the web. JCDL’2009.
◮ Extract future mentions from news snippets obtained from
search engines.
◮ Summarize and aggregate results using clustering methods.
◮ Does not focus on relevance and ranking of future information.
19. Ranking Related News Predictions
Introduction
Contributions
Contributions
I. Formally define ranking related news predictions.
II. Four classes of features: term similarity, entity-based
similarity, topic similarity and temporal similarity.
III. Extensive evaluation using dataset with over 6000
judgments from the NYT Annotated Corpus.
21. Ranking Related News Predictions
Task Definition
System Architecture
System Architecture
Step 1: Document annotation.
◮ Extract temporal expressions
using time and event recognition.
◮ Normalize them to dates so they
can be anchored on a timeline.
◮ Output: sentences annotated
with named entities and dates,
i.e., predictions.
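The annotation step above can be illustrated with a deliberately small sketch. It assumes that a bare four-digit year is the only kind of temporal expression and uses a regular expression as a stand-in for the time and event recognizers named later in the deck; `extract_predictions` and its arguments are hypothetical names, not part of the described system.

```python
import re
from datetime import date

# Stand-in for a real temporal tagger: matches four-digit years only.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def extract_predictions(sentences, pub_date):
    """Return (sentence, future_years) pairs for sentences that mention
    at least one year later than the publication year."""
    predictions = []
    for sentence in sentences:
        years = {int(match.group(0)) for match in YEAR.finditer(sentence)}
        future_years = sorted(y for y in years if y > pub_date.year)
        if future_years:
            predictions.append((sentence, future_years))
    return predictions

sentences = [
    "Al Gore proposed to guarantee health insurance for all children by 2005.",
    "The program was enacted two years ago.",
]
found = extract_predictions(sentences, date(1999, 9, 8))
```

Only the first sentence is kept, because it is the only one with a date after the publication date.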
22. Ranking Related News Predictions
Task Definition
System Architecture
System Architecture
Step 2: Retrieving predictions.
◮ Automatically generate a query
from a news article being read.
◮ Retrieve predictions that match
the query.
◮ Rank predictions by relevance. A
prediction is “relevant” if it is
about the topics of the article.
24. Ranking Related News Predictions
Task Definition
Models
Annotated Document Model
Collection C = {d1, . . . , dn}.
Document d = {{w1, . . . , wn} , time(d)}.
◮ time(d) gives the publication date of d.
Annotated document ˆd is composed of:
◮ Named entities ˆde = {e1, . . . , en}
◮ Temporal expressions ˆdt = {t1, . . . , tm}
◮ Sentences ˆds = {s1, . . . , sz}
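One possible in-memory representation of the annotated document model is sketched below. The class and attribute names are illustrative assumptions; only the components (words, publication date, named entities, temporal expressions, sentences) come from the slide.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class AnnotatedDocument:
    words: list[str]                      # {w1, ..., wn}
    pub_date: date                        # time(d)
    entities: list[str] = field(default_factory=list)              # named entities
    temporal_expressions: list[str] = field(default_factory=list)  # t1, ..., tm
    sentences: list[str] = field(default_factory=list)             # s1, ..., sz

def time_of(document: AnnotatedDocument) -> date:
    """time(d): the publication date of a document."""
    return document.pub_date

doc = AnnotatedDocument(
    words=["gore", "health", "plan"],
    pub_date=date(1999, 9, 8),
    entities=["Al Gore"],
    temporal_expressions=["2005"],
    sentences=["Al Gore proposed a health plan for every child by 2005."],
)
```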
26. Ranking Related News Predictions
Task Definition
Models
Prediction Model
Let dp be the parent document of a prediction p.
p is a sentence containing field/value pairs:
Field        Value
ID           1136243_1
PARENT_ID    1136243
TITLE        Gore Pledges A Health Plan For Every Child
TEXT         Vice President Al Gore proposed today to guarantee access to affordable health insurance for all children by 2005, expanding on a program enacted two years ago that he conceded had had limited success so far.
CONTEXT      Mr. Gore acknowledged that the number of Americans without health coverage had increased steadily since he and President Clinton took office.
ENTITY       Al Gore
FUTURE_DATE  2005
PUB_DATE     1999/09/08
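The field/value pairs above map naturally onto a record type. The sketch below assumes that ENTITY and FUTURE_DATE may hold several values, that future dates are stored at year granularity, and that PUB_DATE uses the `YYYY/MM/DD` format shown in the example; `Prediction` and `parse_prediction` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass(frozen=True)
class Prediction:
    id: str
    parent_id: str
    title: str
    text: str
    context: str
    entities: tuple
    future_dates: tuple   # assumed: years as integers
    pub_date: date

def parse_prediction(fields):
    """Build a Prediction from a dict keyed by the field names on the slide."""
    return Prediction(
        id=fields["ID"],
        parent_id=fields["PARENT_ID"],
        title=fields["TITLE"],
        text=fields["TEXT"],
        context=fields["CONTEXT"],
        entities=tuple(fields["ENTITY"]),
        future_dates=tuple(int(year) for year in fields["FUTURE_DATE"]),
        pub_date=datetime.strptime(fields["PUB_DATE"], "%Y/%m/%d").date(),
    )

example = parse_prediction({
    "ID": "1136243_1",
    "PARENT_ID": "1136243",
    "TITLE": "Gore Pledges A Health Plan For Every Child",
    "TEXT": "Vice President Al Gore proposed today to guarantee access to "
            "affordable health insurance for all children by 2005.",
    "CONTEXT": "Mr. Gore acknowledged that the number of Americans without "
               "health coverage had increased steadily.",
    "ENTITY": ["Al Gore"],
    "FUTURE_DATE": ["2005"],
    "PUB_DATE": "1999/09/08",
})
```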
27. Ranking Related News Predictions
Task Definition
Models
Query Model
Query q is extracted from a news article being read dq.
1. Keywords qtext
2. Time constraints qtime
28. Ranking Related News Predictions
Task Definition
Models
Query Keywords
[Diagram: query keyword extraction turns the news article being read into three queries, numbered (1) to (3): an entity query QE, a term query QT and a combined query QC. They are issued against a prediction, whose fields are ID, PARENT_ID, TITLE, TEXT, ENTITY, CONTEXT, FUTURE_DATE and PUB_DATE.]
QE = {e1, . . . , em}
E.g., Barack Obama, Iraq, America
29. Ranking Related News Predictions
Task Definition
Models
Query Keywords
[Diagram: same query keyword extraction diagram, highlighting the term query QT.]
QT = {w1, . . . , wn}
E.g., troop, war, withdraw
30. Ranking Related News Predictions
Task Definition
Models
Query Keywords
[Diagram: same query keyword extraction diagram, highlighting the combined query QC.]
QC = {e1, . . . , em} ∪ {w1, . . . , wn}
E.g., Barack Obama, Iraq, America, troop, war, withdraw
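The three query types can be sketched as follows. The slides do not say how entities and terms are ranked before the top ones are kept, so ranking by raw frequency is an assumption made here for illustration; `build_queries` is a hypothetical helper, and the defaults m = 11 and n = 10 are the values chosen in the evaluation part of the deck.

```python
from collections import Counter

def build_queries(entities, terms, m=11, n=10):
    """Return (QE, QT, QC) from the entities and terms of the article being read.

    Assumption: "top" means most frequent in the article.
    """
    q_e = [entity for entity, _ in Counter(entities).most_common(m)]
    q_t = [term for term, _ in Counter(terms).most_common(n)]
    # QC is the union of QE and QT, keeping order and dropping repeats.
    q_c = q_e + [term for term in q_t if term not in q_e]
    return q_e, q_t, q_c

q_e, q_t, q_c = build_queries(
    entities=["Barack Obama", "Iraq", "Barack Obama", "America"],
    terms=["troop", "war", "troop", "withdraw"],
)
```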
31. Ranking Related News Predictions
Task Definition
Models
Query Time
Time constraints qtime
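1. Only predictions whose future dates lie after time(dq), i.e., in (time(dq), tmax].
2. Only predictions from articles published up to time(dq), i.e., in [tmin, time(dq)].
[Figure: timeline running from past through now to future, with the query placed at now, year marks at 1999, 2002, 2006, 2016, 2018 and 2033, and three predictions marked P.]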
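The two time constraints can be written as a single filter. This is a minimal sketch: the function name is hypothetical, and using `date.min` and `date.max` as defaults for tmin and tmax is an assumption.

```python
from datetime import date

def satisfies_time_constraints(future_date, pub_date, query_time,
                               t_min=date.min, t_max=date.max):
    """True if the prediction refers to the future of the query article
    and comes from an article published no later than the query article."""
    refers_to_future = query_time < future_date <= t_max    # (time(dq), tmax]
    published_before = t_min <= pub_date <= query_time      # [tmin, time(dq)]
    return refers_to_future and published_before

query_time = date(2000, 1, 1)
keep = satisfies_time_constraints(date(2005, 1, 1), date(1999, 9, 8), query_time)
```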
33. Ranking Related News Predictions
Approach
Features
Term Similarity
Capture the term-similarity between q and p.
1. retScore(q,p) Lucene’s TF-IDF scoring function
◮ Problem: keyword matching, short texts
◮ Predictions not containing query terms are not retrieved.
2. bm25f(q,p) field-aware ranking function
◮ Extend a sentence structure by surrounding sentences.
◮ Search CONTEXT in addition to TEXT [Blanco et al. 2010].
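A simplified field-weighted BM25F score is sketched below to show how CONTEXT and TITLE can contribute alongside TEXT. The boosts, b and k1 are the values listed on the parameter-setting slide; using one b for all fields, the particular IDF variant and whitespace tokenization are assumptions, so this is not the exact function used in the experiments.

```python
import math

# Field boosts as reported on the "Parameter setting" slide.
BOOSTS = {"TEXT": 5.0, "CONTEXT": 1.0, "TITLE": 2.0}

def bm25f(query_terms, prediction, avg_len, doc_freq, n_docs, k1=1.2, b=0.75):
    """prediction: dict of field name -> text; avg_len: average field lengths;
    doc_freq: term -> number of predictions containing it."""
    score = 0.0
    for term in set(query_terms):
        pseudo_tf = 0.0
        for name, boost in BOOSTS.items():
            tokens = prediction.get(name, "").lower().split()
            if not tokens:
                continue
            # Length-normalized, boosted term frequency for this field.
            norm = 1.0 - b + b * len(tokens) / avg_len[name]
            pseudo_tf += boost * tokens.count(term) / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))  # assumed variant
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

Because the term frequencies of all fields are combined before saturation is applied, a match in TEXT counts more than the same match in CONTEXT, but a prediction that matches only in its surrounding sentences can still be scored.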
36. Ranking Related News Predictions
Approach
Features
Entity-based Similarity
Measure the similarity between q and p by exploiting annotated entities in dp, p, q.
◮ Only applicable for QE and QC.
◮ Features commonly employed in entity ranking tasks.
◮ Time distance captures the relationship of term and time.
ID  Feature
1   entitySim(q, p)
2   title(e, dp)
3   titleSim(e, dp)
4   senPos(e, dp)
5   senLen(e, dp)
6   cntSenSubj(e, dp)
7   cntEvent(e, dp)
8   cntFuture(e, dp)
9   cntEventSubj(e, dp)
10  cntFutureSubj(e, dp)
11  timeDistEvent(e, dp)
12  timeDistFuture(e, dp)
13  tagSim(e, dp)
14  isSubj(e, p)
15  timeDist(e, p)
37. Ranking Related News Predictions
Approach
Features
Topic Similarity
Compute the similarity between q and p on a topic level.
◮ Latent Dirichlet allocation [Blei et al. 2003] for modeling topics.
1. Train a topic model
2. Infer topics
3. Compute topic similarity
38. Ranking Related News Predictions
Approach
Features
Topic Similarity
Step 1: Learn a topic model.
◮ Partition DN into sub-collections, called document snapshots Dtrain,tk.
◮ For each Dtrain,tk, randomly select documents for training a topic model.
◮ Output: topic models at different time snapshots, e.g., φtk at tk.
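The partition-and-sample step can be sketched as below, assuming yearly snapshots and the 4% sampling rate given on the parameter-setting slide. The function name, the fixed seed and the rule of keeping at least one document per snapshot are illustrative assumptions; training the topic model itself is left out.

```python
import random
from collections import defaultdict

def sample_training_snapshots(documents, fraction=0.04, seed=0):
    """documents: iterable of (doc_id, year) pairs.
    Returns {year: sampled doc_ids} to train one topic model per snapshot."""
    by_year = defaultdict(list)
    for doc_id, year in documents:
        by_year[year].append(doc_id)

    rng = random.Random(seed)
    snapshots = {}
    for year, doc_ids in sorted(by_year.items()):
        sample_size = max(1, round(fraction * len(doc_ids)))
        snapshots[year] = rng.sample(doc_ids, sample_size)
    return snapshots

# Toy collection: 100 documents in each of two years.
toy_docs = [(i, 1987 + i % 2) for i in range(200)]
snapshots = sample_training_snapshots(toy_docs)
```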
39. Ranking Related News Predictions
Approach
Features
Topic Similarity
Step 2: Infer topics.
◮ Determine topics for q and p using
their contents, called topic inference.
◮ Both q and p are represented by a
probability distribution of topics.
◮ pφ = p(z1), . . . , p(zn), where p(z) is
a probability of a topic z.
40. Ranking Related News Predictions
Approach
Features
Topic Similarity
I. Which model snapshot should be used for inference?
Select a topic model φtk for inference in 2 ways:
◮ tk = time(dq)
◮ tk = time(dp)
II. Which contents should be used for inference?
For a query q, the parent document dq is used. For a
prediction p, the contents can be:
◮ Only text ptxt
◮ Both text ptxt and context pctx
◮ Parent document dp
44. Ranking Related News Predictions
Approach
Features
Topic Similarity
Step 3: Measuring topic similarity.
◮ q and p are represented by topic
distributions.
◮ qφ = p(z1), . . . , p(zn)
◮ pφ = p(z1), . . . , p(zn)
◮ Compute the topic similarity using
cosine similarity.
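topicSim(q, p) = \frac{q_\phi \cdot p_\phi}{\|q_\phi\| \cdot \|p_\phi\|} = \frac{\sum_{z \in Z} q_{\phi_z} \cdot p_{\phi_z}}{\sqrt{\sum_{z \in Z} q_{\phi_z}^2} \cdot \sqrt{\sum_{z \in Z} p_{\phi_z}^2}}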
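The cosine similarity between two topic distributions can be computed directly from its definition. In this sketch the distributions are plain lists of topic probabilities in the same topic order; returning 0.0 for an all-zero vector is an assumption added to avoid division by zero.

```python
import math

def topic_sim(q_phi, p_phi):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(q * p for q, p in zip(q_phi, p_phi))
    norm = math.sqrt(sum(q * q for q in q_phi)) * math.sqrt(sum(p * p for p in p_phi))
    return dot / norm if norm else 0.0

same = topic_sim([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
disjoint = topic_sim([1.0, 0.0], [0.0, 1.0])
```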
45. Ranking Related News Predictions
Approach
Features
Temporal Similarity
Hypothesis I. Predictions that are more recent to the query are
more relevant.
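[Figure: timeline running from past through now to future, with the query placed at now, year marks at 1999, 2002, 2006, 2016, 2018 and 2033, three predictions marked P, and the time distance between query and prediction indicated.]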
46. Ranking Related News Predictions
Approach
Features
Temporal Similarity
Hypothesis II. Predictions extracted from more recent
documents are more relevant.
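[Figure: timeline running from past through now to future, with the query placed at now, year marks at 1999, 2002, 2006, 2016, 2018 and 2033, three predictions marked P, and the time distance between query and prediction indicated.]
◮ Timestamp-based Uncertainty (TSU) [Kanhabua and Nørvåg 2010]
◮ FuzzySet (FS) [Kalczynski and Chou 2005]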
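Both hypotheses say that similarity should fall as the time distance grows. The sketch below assumes that TSU takes the exponential-decay form DecayRate^(λ · |t1 − t2| / μ); the slides only name the function, so this form is an assumption here. The defaults are the DecayRate, λ and μ values from the parameter-setting slide, with time measured in years.

```python
def tsu(t1, t2, decay_rate=0.5, lam=0.5, mu=2.0):
    """Time-distance similarity in (0, 1]; equals 1 when t1 == t2.

    Assumed form: decay_rate ** (lam * |t1 - t2| / mu), times in years.
    """
    return decay_rate ** (lam * abs(t1 - t2) / mu)

near = tsu(2000, 2001)
far = tsu(2000, 2010)
```

With these parameters, a distance of four years halves the score.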
48. Ranking Related News Predictions
Approach
Ranking Method
Ranking Method
Learning-to-rank: Given an unseen (q, p), p is ranked using a
model trained over a set of labeled query/prediction pairs.
score(q, p) = \sum_{i=1}^{N} w_i \times f_i
◮ SVMMAP [Yue et al. 2007]
◮ RankSVM [Joachims 2002]
◮ SGD-SVM [Zhang 2004]
◮ PegasosSVM [Shalev-Shwartz et al. 2007]
◮ PA-Perceptron [Crammer et al. 2006]
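Once a learner has produced the weights, scoring and ranking reduce to a weighted sum over feature values. The sketch below uses made-up weights and feature vectors purely for illustration; learning the weights with any of the listed algorithms is not shown.

```python
def score(weights, features):
    """score(q, p) = sum over i of w_i * f_i."""
    return sum(w * f for w, f in zip(weights, features))

def rank(weights, candidates):
    """candidates: dict of prediction id -> feature vector for a fixed query.
    Returns prediction ids sorted by descending score."""
    return sorted(candidates, key=lambda pid: score(weights, candidates[pid]),
                  reverse=True)

weights = [1.0, 0.5]            # hypothetical learned weights
candidates = {                  # hypothetical feature values
    "p1": [0.2, 0.4],
    "p2": [0.9, 0.0],
    "p3": [0.1, 0.2],
}
ranking = rank(weights, candidates)
```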
50. Ranking Related News Predictions
Evaluation
Experiment Setting
Document collection
NYT Annotated Corpus: 1.8M articles from 1987 to 2007.
◮ More than 25% contain at least one prediction
Annotation process uses several language processing tools.
◮ OpenNLP for tokenizing, sentence splitting, part-of-speech
tagging, shallow parsing
◮ SuperSense tagger for named entity recognition
◮ TARSQI for extracting temporal expressions
Apache Lucene for indexing and retrieving.
◮ 44,335,519 sentences and 548,491 predictions
◮ 939,455 future dates (on average 1.7 future dates per prediction)
51. Ranking Related News Predictions
Evaluation
Experiment Setting
Relevance judgments
42 future-related topics
POLITICS             ENVIRONMENT          SPACE
president election   global warming       Mars
Iraq war             energy efficiency    Moon

SCIENCE              PHYSICS              HEALTH
earthquake           particle physics     bird flu
tsunami              Big Bang             influenza

BUSINESS             SPORT                TECHNOLOGY
subprime             Olympics             Internet
financial crisis     World Cup            search engine
52. Ranking Related News Predictions
Evaluation
Experiment Setting
Relevance judgments
Human assessors gave a relevance score Grade(q, p, t).
◮ 4 (very relevant), 3 (relevant), 2 (related), 1 (non-relevant), and 0 (incorrectly tagged date)
◮ relevant if Grade(q, p, t) ≥ 3 and non-relevant if 1 ≤ Grade(q, p, t) ≤ 2
In total, assessors judged 52 queries.
◮ On average 94 predictions were retrieved per query
◮ 4,888 query/prediction pairs (approximately 6,032 triples)
Available for download at:
www.idi.ntnu.no/~nattiya/data/sigir2011/futurepredictions.zip
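The mapping from graded judgments to binary relevance can be expressed as a small function. The slide assigns grade 0 (incorrectly tagged date) to neither class, so returning `None` for it is an assumption made here; the function name is hypothetical.

```python
def to_binary_label(grade):
    """Map Grade(q, p, t) to 1 (relevant), 0 (non-relevant) or None."""
    if grade >= 3:
        return 1
    if 1 <= grade <= 2:
        return 0
    return None  # grade 0: date tagged incorrectly, left unlabeled (assumption)

labels = [to_binary_label(g) for g in (4, 3, 2, 1, 0)]
```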
53. Ranking Related News Predictions
Evaluation
Experiment Setting
Parameter setting
BM25F: b = 0.75, k1 = 1.2 [Robertson et al. 1994]
◮ boost(TEXT) = 5.0
◮ boost(CONTEXT) = 1.0
◮ boost(TITLE) = 2.0
LDA: Stanford Topic Modeling Toolbox
◮ randomly select 4% of documents in each year for training
◮ filter out the 100 most common terms and terms occurring in fewer than 15 documents
◮ number of topics Nz is 500
◮ collapsed variational Bayes approximation algorithm
Temporal features:
◮ DecayRate = 0.5, λ = 0.5, µ = 2y
◮ n = 2, m = 2, smin = 4y, smax = 2y
◮ α1 = time(dq) − 4y, α2 = time(dq) + 2y
55. Ranking Related News Predictions
Evaluation
Experimental Results
Methods for comparison
Baseline: QE , QT , QC
◮ Rank using Lucene’s default ranking function.
Our approach: Re-QE , Re-QT , Re-QC
◮ Re-rank the baseline results using learning-to-rank.
Metrics: P@1, P@3, MRR
◮ Typically, a user is interested in a few top predictions.
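The three metrics can be computed from a ranked list of binary relevance labels per query. This is a generic sketch of precision at k and mean reciprocal rank, not the evaluation script used for the reported results.

```python
def precision_at_k(labels, k):
    """Fraction of the top-k results that are relevant (labels are 0/1)."""
    return sum(labels[:k]) / k

def reciprocal_rank(labels):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for position, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over the ranked lists of several queries."""
    return sum(reciprocal_rank(labels) for labels in runs) / len(runs)

p_at_3 = precision_at_k([1, 0, 1, 0], 3)
mrr = mean_reciprocal_rank([[1, 0], [0, 1]])
```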
56. Ranking Related News Predictions
Evaluation
Experimental Results
Selecting top-m entities and top-n terms
Select the m and n that give a reasonable improvement on a hold-out set.
◮ Using QE to retrieve predictions, choose m = 11.
◮ Observing the performance of QC when m = 11, choose n = 10.
[Figure: two plots of P10 and MAP (y-axis, 0 to 0.6) against the number of top-m entities and the number of top-n terms (x-axis, 1 to 20).]
57. Ranking Related News Predictions
Evaluation
Experimental Results
Compare all other methods against QE
[Figure: three bar charts showing P1, P3 and MRR (y-axis, 0 to 1) for QE, QT, QC, Re-QE, Re-QT and Re-QC.]
Results:
◮ QE performs worst among the baselines, while QC is superior to QT .
◮ Re-QC gains the highest effectiveness followed by Re-QT .
◮ The re-ranking approach improves over the baselines, except for Re-QE.
58. Ranking Related News Predictions
Evaluation
Experimental Results
Compare all other methods against QE
[Figure: three bar charts showing P1, P3 and MRR (y-axis, 0 to 1) for QE, QT, QC, Re-QE, Re-QT and Re-QC.]
Analysis:
◮ QE did not retrieve any relevant result in the judged pool, which makes re-ranking difficult.
◮ Entity-based features perform well for some topics.
59. Ranking Related News Predictions
Evaluation
Experimental Results
Feature analysis
Top-5 features with highest weights and lowest weights for each query type.
QE                     | QT                       | QC
Feature         Wi     | Feature           Wi     | Feature           Wi
(five highest weights)
tagSim          1.00   | bm25f             1.00   | LDA1,parent,k     1.00
FS1             0.97   | retScore          0.60   | retScore          0.99
TSU2            0.88   | LDA1,parent,k     0.55   | LDA1,parent,all   0.96
LDA1,txt,k      0.87   | LDA2,parent,k     0.51   | bm25f             0.93
LDA1,txt,all    0.82   | LDA1,parent,all   0.49   | isSubj            0.87
(five lowest weights)
cntSenSubj      0.01   | timeDistEvent    -0.03   | cntEventSen      -0.02
cntEventSubj    0.01   | timeDistFuture   -0.11   | querySim         -0.05
isInTitle       0.00   | cntEventSen      -0.12   | cntFutureSen     -0.10
cntEventSen     0.00   | cntFutureSen     -0.12   | timeDistFuture   -0.14
querySim       -0.01   | senLen           -0.16   | senLen           -0.18
◮ Topic-based features play an important role in the re-ranking model.
◮ Although relying on terms, retScore and bm25f help to re-rank predictions.
◮ The five features with the lowest weights are from the entity-based class.
61. Ranking Related News Predictions
Conclusions
Conclusions and Future Work
Conclusions and future work
◮ Define the task of ranking related future predictions.
◮ Employ learning-to-rank incorporating 4 feature classes.
◮ Conduct extensive experiments and create an evaluation
dataset with over 6000 relevance judgments.
◮ Future work:
◮ Combining multiple sources (Wikipedia, blogs, home
pages, etc.) of future-related information.
◮ Sentiment analysis for future-related information.
62. Ranking Related News Predictions
Conclusions
Conclusions and Future Work
Acknowledgment: We thank Hugo Zaragoza for his help at the early stages of this work.
Thank you!