In this talk, we present an event-based Epidemic Intelligence (EI) system framework leveraging social media data, e.g., Twitter messages (or tweets), to provide public health officials with the tools needed to survey and sift through relevant information, namely disease outbreak events. There exist three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) signal generation producing reliable warnings based on observed term frequency changes in the filtered messages, and 3) providing search and recommendation functionalities to domain experts for better assessment of the potential outbreak threats associated with the generated signals. We outline possible approaches to these challenges and discuss areas where further research is required. The objective is to provide guidance for similar endeavors, and to give prospective event-based Epidemic Intelligence system builders a more realistic view of the benefits and issues of social media stream analysis.
Tackling mosquitoes together - Preparing to respond to the threat of exotic m... (Dr Cameron Webb)
Authorities need to be better prepared to respond to the introduction of exotic mosquitoes, such as Aedes aegypti and Aedes albopictus, to NSW. This presentation summarizes a capacity-building exercise, including workshops, field exercises and community surveys, undertaken to better understand how local authorities are prepared to respond to these exotic mosquito threats. The presentation was made to the 13th Mosquito Control Association of Australia at Kingscliff, 2-5 September 2018.
Summertime Analytics: Predicting E. coli and West Nile Virus (Domino Data Lab)
Lake Michigan and outdoor recreation are enjoyable aspects of summers in Chicago, but they can come with the risk of E. coli in Lake Michigan or West Nile Virus from mosquitoes. This summer, the City of Chicago launched two new predictive analytics projects to forecast these risks and to proactively limit them. Members of the research team, Gene Leynes and Nick Lucius, discuss the projects and how they're being used as part of city operations.
Estimating Query Difficulty for News Prediction Retrieval (poster presentation, Nattiya Kanhabua)
News prediction retrieval has recently emerged as the task of retrieving predictions related to a given news story (or a query). Predictions are defined as sentences containing time references to future events. Such future-related information is crucially important for understanding the temporal development of news stories, as well as for strategic planning and risk management. Previous work has been shown to retrieve a significant number of relevant predictions; however, only certain news topics achieve good retrieval effectiveness. In this paper, we study how to determine the difficulty of retrieving predictions for a given news story. More precisely, we address the query difficulty estimation problem for news prediction retrieval. We propose different entity-based predictors used for classifying queries into two classes, namely Easy and Difficult. Our prediction model is based on a machine learning approach. Through experiments on real-world data, we show that our proposed approach can predict query difficulty with high accuracy.
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari... (Nattiya Kanhabua)
With the tremendous amount of news published on the Web every day, helping users explore news events on a given topic of interest is an acute problem. Timeline summaries have recently emerged as a simple and effective solution for users to navigate through temporally related news events. In this paper, we propose an optimization framework and demonstrate the use of Learning To Rank (LTR) to automatically construct timeline summaries from Web news articles. Experimental evaluations show that our approach outperforms existing solutions in producing high quality timeline summaries.
Identifying Relevant Temporal Expressions for Real-world Events (Nattiya Kanhabua)
Event detection is an interesting task for many applications, for instance surveillance, scientific discovery, and Topic Detection and Tracking. Numerous works have focused on detecting events from unstructured text and determining what features constitute an event, e.g., key terms or named entities. Although most works are able to find interesting times associated with an event, there is a lack of research on determining the relevance of time for an event. In this paper, we propose a method for automatically extracting real-world events from unstructured text documents. In addition, we propose a machine learning approach to identifying relevant time (i.e., temporal expressions) for the extracted events using three classes of features: sentence-based, document-based and corpus-specific features. Through experiments using real-world data and 3,500 manually judged relevance pairs, we show that our proposed approach is able to identify the relevant time of events with good accuracy.
On the Value of Temporal Anchor Texts in Wikipedia (Nattiya Kanhabua)
Wikipedia has become a widely accepted reference point for information of all kinds: real-world events (e.g., natural disasters, man-made incidents, and political events) as well as specific entities like politicians, celebrities, and entities involved in an event. Due to its open construction and negotiation, Wikipedia is an important new cultural and societal phenomenon, and the content of Wikipedia articles is a valuable source for different applications. For instance, the edit history and view logs of Wikipedia can be leveraged for detecting an event and its associated entities. In this study, we analyze temporal anchor texts extracted from the edit history. We propose a model for Wikipedia and anchor texts viewed as a temporal resource, and a probabilistic method for ranking temporal anchor texts. Our preliminary results show that relevant anchor texts are composed of evolving information (e.g., changes of names and semantic roles, as well as evolving context) that reflects societal trends and perceptions, making them candidates for capturing entity evolution.
Humans are very effective at remembering by abstraction, pattern exploitation, or contextualization. At the same time, humans are also capable of forgetting irrelevant details, a capability that plays an important role in the human brain, helping us to focus on relevant things instead of drowning in details by remembering everything. The research question that we address in this paper is: can we learn from human remembering and forgetting in order to develop more advanced preservation technology? In particular, we aim to study how a managed or controlled form of forgetting can play a role in digital preservation, including personal and organizational archives as well as collective memories. Our research goal is twofold: 1) to establish effective preservation for more concise and accessible digital memories, and 2) to enable the easier and wider adoption of preservation technology. The concept of managed forgetting is discussed in more detail in the research work of the European project ForgetIT, which investigates the proposed concept by means of an integrated information and preservation management approach.
Dynamics of Web: Analysis and Implications from Search Perspective (Nattiya Kanhabua)
The dynamics of the Web and their implications for various components of search systems have attracted considerable attention in the last decade. This course first aims to introduce students to the general and wide topic of Web evolution, and then pinpoints a number of issues related to temporal aspects of search and IR. We plan to start with an overview of seminal works that shed light on the evolution of the Web over time. Next, we will focus on the impacts of this evolution on search, concentrating in particular on the indexing of versioned document collections and time-aware retrieval and ranking. We will discuss the evolution of search results and its effects on caching, and wrap up the course with a review of some recent approaches that aim to predict and search the future.
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ... (Nattiya Kanhabua)
In human memory, forgetting plays a crucial role in focusing on important things and neglecting irrelevant details. In digital memories, the idea of systematic forgetting has found little attention so far. At first glance, forgetting seems to contradict the purpose of archival and preservation. However, we are currently facing a tremendous growth in volumes of digital content. Thus, it becomes ever more important to focus, while forgetting irrelevant details, redundancies and noise. This holds true for better organizing the information space as well as in preservation management for making and revisiting decisions on what to keep. Therefore, we propose the introduction of the concept of managed forgetting as part of a joint information management and preservation management process in digital memories. Managed forgetting models resource selection as a function of attention and significance dynamics. Based on dynamic, multidimensional information value assessment, it identifies information objects (e.g., documents or images) of decreasing importance and/or topicality and triggers forgetting actions. Those actions include a variety of options, namely aggregation and summarization, revised search and ranking behavior, elimination of redundancy, and finally also deletion. In this paper, we present our vision for managed forgetting, discuss the challenges as well as our first ideas for its introduction, and present a case study for its motivation.
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification (Nattiya Kanhabua)
Search result diversification is a common technique for tackling ambiguous and multi-faceted queries by maximizing the coverage of query aspects or subtopics in a result list. In some special cases, the subtopics associated with such queries can be temporally ambiguous; for instance, the query US Open is more likely to target the tennis tournament in September and the golf tournament in June. More precisely, users' search intent can be identified by the popularity of a subtopic with respect to the time when the query is issued. In this paper, we study search result diversification for time-sensitive queries, where the temporal dynamics of query subtopics are explicitly determined and modeled into result diversification. Unlike previous work that, in general, considered only static subtopics, we leverage dynamic subtopics by analyzing two data sources (i.e., query logs and a document collection), which provide insights from different perspectives into how query subtopics change over time. Moreover, we propose novel time-aware diversification methods that leverage the identified dynamic subtopics. A key idea is to re-rank search results based on the freshness and popularity of subtopics. Our experimental results show that the proposed methods can significantly improve diversity and relevance effectiveness for time-sensitive queries in comparison with state-of-the-art methods.
We estimate that nearly one third of news articles contain references to future events. While this information can prove crucial to understanding news stories and how events will develop for a given topic, there is currently no easy way to access it. We propose a new task to address the problem of retrieving and ranking sentences that contain mentions of future events, which we call ranking related news predictions. In this paper, we formally define this task and propose a learning-to-rank approach based on four classes of features: term similarity, entity-based similarity, topic similarity, and temporal similarity. Through extensive evaluations using a corpus of 1.8 million news articles and 6,000 manually judged relevance pairs, we show that our approach is able to retrieve a significant number of relevant predictions related to a given topic.
Concise Preservation by Combining Managed Forgetting and Contextualized Remem... (Nattiya Kanhabua)
With the growing volumes of and reliance on digital content, there is a clear need for better information access solutions that keep relevant information accessible and usable in the long term. Inspired by the role of forgetting in the human brain, we envision a concept of managed forgetting for systematically dealing with information that progressively ceases to be important as well as with redundant information. Although inspired by human memory, managed forgetting is meant to complement rather than copy human remembering and forgetting. It can be regarded as a function of attention and significance dynamics relying on multi-faceted information assessment. This talk introduces our vision for managed forgetting on the conceptual level as part of an Integrated Cognitive Framework for Time-aware Information Access. We discuss relevant research and application aspects for managed forgetting. Finally, we present our first results and point out issues where further research is required.
Wikipedia is a free multilingual online encyclopedia covering a wide range of general and specific knowledge. Its content is continuously kept up-to-date and extended by a supporting community. In many cases, real-world events influence the collaborative editing of the Wikipedia articles of the involved or affected entities. In this paper, we present Wikipedia Event Reporter, a web-based system that supports the entity-centric, temporal analytics of event-related information in Wikipedia by analyzing the whole history of article updates. For a given entity, the system first identifies peaks of update activity for the entity using burst detection and automatically extracts event-related updates using a machine-learning approach. Further, the system determines distinct events through the clustering of updates by exploiting different types of information such as update time, textual similarity, and the position of the updates within an article. Finally, the system generates a meaningful temporal summarization of event-related updates and automatically annotates the identified events in a timeline.
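The burst-detection step for update activity can be illustrated with a minimal moving-average sketch. This is a simplification under assumptions: the window size, threshold, and daily edit counts below are hypothetical, and the actual system likely uses a more refined burst-detection algorithm.

```python
def detect_bursts(counts, window=3, threshold=2.0):
    """Flag time points whose count exceeds `threshold` times the mean
    of the preceding `window` points (a moving-average baseline)."""
    bursts = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        if baseline > 0 and counts[i] > threshold * baseline:
            bursts.append(i)
    return bursts

# Hypothetical daily update counts for one entity's article;
# the spike on day 4 is flagged as a burst:
edits = [2, 3, 2, 3, 15, 4, 2]
detect_bursts(edits)
```

The flagged indices would then seed the clustering of nearby event-related updates into distinct events.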
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst... (Nattiya Kanhabua)
Going beyond its role as an encyclopedia, Wikipedia has become a global memory place for high-impact events, such as natural disasters and man-made incidents, thus influencing collective memory, i.e., the way we remember the past. Due to the importance of collective memory for framing the assessment of new situations, our actions and our value systems, its open construction and negotiation in Wikipedia is an important new cultural and societal phenomenon. The analysis of this phenomenon not only promises new insights into collective memory; it is also an important foundation for technology that more effectively complements the processes of human forgetting and remembering and better enables us to learn from the past. In this paper, we analyse the long-term dynamics of Wikipedia as a global memory place for high-impact events. This complements existing work analysing the collective memory negotiation and construction process in Wikipedia directly following an event. In more detail, we are interested in catalysts for reviving memories, i.e., in the fuel that keeps memories of past events alive, interrupting the general trend towards fast forgetting. For this purpose, we study the triggers of revisiting behavior for a large set of event pages by exploiting page views and time series analysis, and identify the most important catalyst features.
Understanding the Diversity of Tweets in the Time of Outbreaks (Nattiya Kanhabua)
A microblogging service like Twitter continues to surge in importance as a means of sharing information in social networks. In the medical domain, several works have shown the potential of detecting public health events (i.e., infectious disease outbreaks) using Twitter messages or tweets. Given its real-time nature, Twitter can enhance early outbreak warning for public health authorities so that a rapid response can take place. Most previous works on detecting outbreaks in Twitter simply analyze tweets matching disease names and/or locations of interest. However, the effectiveness of such methods is limited for two main reasons. First, disease names are highly ambiguous, i.e., they may refer to slang or non-health-related contexts. Second, the characteristics of infectious diseases are highly dynamic in time and place: they are strongly time-dependent and vary greatly among different regions. In this paper, we propose to analyze the temporal diversity of tweets during the known periods of real-world outbreaks in order to gain insight into a temporary focus on specific events. More precisely, our objective is to understand whether, and to what extent, the temporal diversity of tweets can be used as an indicator of outbreak events. We employ an efficient algorithm based on sampling to compute the diversity statistics of tweets at a particular time. We then conduct experiments correlating temporal diversity with the estimated event magnitude of 14 real-world outbreak events manually created as ground truth. Our analysis shows that correlation results vary among different outbreaks, which can reflect the characteristics (severity and duration) of the outbreaks.
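A sampling-based diversity statistic of the kind described above can be sketched as follows, here using the average pairwise Jaccard distance between tweets (as term sets) estimated from a sample of pairs rather than all O(n^2) comparisons. The distance measure, tokenization, and sample size are illustrative assumptions, not the paper's exact algorithm.

```python
import random

def sampled_diversity(tweets, pairs=1000, seed=0):
    """Estimate average pairwise Jaccard distance between tweets
    in a time window by sampling `pairs` random tweet pairs."""
    rng = random.Random(seed)
    sets = [set(t.lower().split()) for t in tweets]
    total = 0.0
    for _ in range(pairs):
        a, b = rng.sample(range(len(sets)), 2)  # two distinct tweets
        total += 1.0 - len(sets[a] & sets[b]) / len(sets[a] | sets[b])
    return total / pairs

# Identical tweets give zero diversity; completely disjoint tweets
# give a diversity of one:
sampled_diversity(["flu outbreak in town", "new flu cases reported"])
```

A drop in this statistic during a time window would indicate a temporary focus on a specific event, which is the signal correlated with outbreak magnitude in the analysis above.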
Improving Temporal Language Models For Determining Time of Non-Timestamped Do... (Nattiya Kanhabua)
Taking the temporal dimension into account in searching, i.e., using the time of content creation as part of the search condition, is gaining increasing interest. However, in the case of web search and web warehousing, the timestamps (i.e., the time of creation of the contents) of web pages and documents found on the web are in general not known or cannot be trusted, and must be determined otherwise. In this paper, we describe approaches that enhance and increase the quality of existing techniques for determining timestamps based on a temporal language model. Through a number of experiments on temporal document collections, we show how our new methods improve the accuracy of timestamping compared to the previous models.
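The basic idea behind temporal language models, assigning a document to the time partition whose language model best explains it, can be sketched as follows. The additive smoothing, the toy corpora, and the vocabulary size are simplifications for illustration; the enhancements described above refine this basic scheme.

```python
from collections import Counter
from math import log

def timestamp_score(doc_terms, partition_counts, vocab_size, mu=1.0):
    """Log-likelihood of a document under one time partition's unigram
    language model, with additive (Laplace-style) smoothing."""
    total = sum(partition_counts.values())
    return sum(
        log((partition_counts.get(t, 0) + mu) / (total + mu * vocab_size))
        for t in doc_terms
    )

def guess_timestamp(doc_terms, partitions, vocab_size):
    """Pick the time partition whose model best explains the document."""
    return max(partitions,
               key=lambda p: timestamp_score(doc_terms, partitions[p], vocab_size))

# Toy term counts for two time partitions (hypothetical data):
partitions = {
    "2004": Counter(["tsunami", "aid", "relief", "tsunami"]),
    "2010": Counter(["earthquake", "haiti", "relief", "earthquake"]),
}
guess_timestamp(["earthquake", "relief"], partitions, vocab_size=6)
```

A non-timestamped document is thus dated to the partition whose vocabulary usage it most resembles.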
Determining Time of Queries for Re-ranking Search Results (Nattiya Kanhabua)
Recent work on analyzing query logs shows that a significant fraction of queries are temporal, i.e., their relevance is dependent on time, and temporal queries play an important role in many domains, e.g., digital libraries and document archives. Temporal queries can be divided into two types: 1) those with temporal criteria explicitly provided by users, and 2) those with no temporal criteria provided. In this paper, we deal with the latter type, i.e., queries that comprise only keywords, whose relevant documents are associated with particular time periods not given by the queries. We propose a number of methods to determine the time of queries using temporal language models. We then show how to increase retrieval effectiveness by using the determined time of queries to re-rank the search results. Through extensive experiments, we show that our proposed approaches improve retrieval effectiveness.
Exploiting Time-based Synonyms in Searching Document Archives (Nattiya Kanhabua)
Query expansion of named entities can be employed to increase retrieval effectiveness. A peculiarity of named entities compared to other vocabulary terms is that they are very dynamic in appearance, and synonym relationships between terms change with time. In this paper, we present an approach to extracting synonyms of named entities over time from the whole history of Wikipedia. In addition, we use their temporal patterns as a feature in ranking and classifying them into two types, i.e., time-independent or time-dependent. Time-independent synonyms are invariant to time, while time-dependent synonyms are relevant to a particular time period, i.e., the synonym relationships change over time. Further, we describe how to make use of both types of synonyms to increase retrieval effectiveness, i.e., query expansion with time-independent synonyms for an ordinary search, and query expansion with time-dependent synonyms for a search with respect to temporal criteria. Finally, through an evaluation based on TREC collections, we demonstrate how the retrieval performance of queries consisting of named entities can be improved using our approach.
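Using the two synonym types at query time can be sketched as below. The data structures, the validity periods, and the example entries are purely illustrative assumptions; the paper derives such synonyms and their time periods from Wikipedia's edit history.

```python
def expand_query(terms, synonyms, query_time=None):
    """Expand entity terms: time-independent synonyms are always added;
    a time-dependent synonym is added only if the query time falls
    within its validity period."""
    expanded = list(terms)
    for t in terms:
        for syn, period in synonyms.get(t, []):
            if period is None:  # time-independent synonym
                expanded.append(syn)
            elif query_time is not None and period[0] <= query_time <= period[1]:
                expanded.append(syn)
    return expanded

# Hypothetical synonym table; None marks a time-independent synonym,
# a (start, end) tuple marks a time-dependent one:
synonyms = {"Pope Benedict XVI": [("Joseph Ratzinger", None),
                                  ("the Pope", (2005, 2013))]}
expand_query(["Pope Benedict XVI"], synonyms, query_time=2007)
```

A query issued with temporal criteria outside a synonym's validity period would simply skip that expansion, which is the distinction the abstract draws between ordinary and temporal search.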
Searching the Temporal Web: Challenges and Current Approaches (Nattiya Kanhabua)
This talk gives a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching alone can give unsatisfactory results when searching the temporal web. The reason for this is multifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely: 1) How to understand temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? and 3) How to explicitly model temporal information in retrieval and ranking models? Finally, we will present current approaches to the addressed problems as well as outline directions for future research.
We address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching alone can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events that happened at particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query. Our contributions are different time-aware approaches within three topics in IR: content analysis, query analysis, and retrieval and ranking models. In particular, we aim at improving retrieval effectiveness by 1) analyzing the contents of temporal document collections, 2) performing an analysis of temporal queries, and 3) explicitly modeling the time dimension in retrieval and ranking.
Leveraging the time dimension in ranking can improve the retrieval effectiveness if information about the creation or publication time of documents is available. We analyze the contents of documents in order to determine the time of non-timestamped documents using temporal language models. We subsequently employ the temporal language models for determining the time of implicit temporal queries, and the determined time is used for re-ranking search results in order to improve the retrieval effectiveness. We study the effect of terminology changes over time and propose an approach to handling terminology changes using time-based synonyms.
In addition, we propose different methods for predicting the effectiveness of temporal queries, so that a particular query enhancement technique can be performed to improve the overall performance. When the time dimension is incorporated into ranking, documents will be ranked according to both textual and temporal similarity. In this case, time uncertainty should also be taken into account. Thus, we propose a ranking model that considers the time uncertainty, and improve ranking by combining multiple features using learning-to-rank techniques. Through extensive evaluation, we show that our proposed time-aware approaches outperform traditional retrieval methods and improve the retrieval effectiveness in searching temporal document collections.
Why Is It Difficult to Detect Outbreaks in Twitter? (Nattiya Kanhabua)
In this paper, we present an event-based Epidemic Intelligence (EI) system framework leveraging social media data, e.g., Twitter messages (or tweets), to provide public health officials with the tools needed to survey and sift through relevant information, namely disease outbreak events. There exist three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) signal generation producing reliable warnings based on observed term frequency changes in the filtered messages, and 3) providing search and recommendation functionalities to domain experts for better assessment of the potential outbreak threats associated with the generated signals. We outline possible approaches to these challenges and discuss areas where further research is required. The aim of this paper is to provide guidance for similar endeavors, and to give prospective event-based Epidemic Intelligence system builders a more realistic view of the benefits and issues of social media stream analysis.
Exploiting temporal information in retrieval of archived documents (doctoral ...) (Nattiya Kanhabua)
In the text retrieval community, many researchers have demonstrated high-quality search over a current snapshot of the Web. However, only a small number have demonstrated high-quality search in a long-term archival domain, where documents are preserved for a long time, i.e., ten years or more. In such a domain, a search application is relevant not only for archivists or historians, but also in the context of national library and enterprise search (searching document repositories, emails, etc.). In the rest of this paper, we explain three problems of searching document archives and propose possible approaches to solving them. Our main research question is: how can we improve the quality of search in a document archive using temporal information?
Temporal Web Dynamics and Implications for Information Retrieval (Nattiya Kanhabua)
In this talk, we will give a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching alone can give unsatisfactory results when searching the temporal web. The reason for this is multifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely: 1) How to understand temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? and 3) How to explicitly model temporal information in retrieval and ranking models? Finally, we will present current approaches to the addressed problems as well as outline directions for future research.
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation) – Nattiya Kanhabua
Retrieval effectiveness for temporal queries can be improved by taking the time dimension into account. Existing temporal ranking models follow one of two main approaches: 1) a mixture model linearly combining textual similarity and temporal similarity, and 2) a probabilistic model generating a query from the textual and temporal parts of a document independently. In this paper, we propose a novel time-aware ranking model based on learning-to-rank techniques. We employ two classes of features for learning a ranking model, entity-based and temporal features, which are derived from annotation data. Entity-based features aim at capturing the semantic similarity between a query and a document, whereas temporal features measure their temporal similarity. Through extensive experiments we show that our ranking model significantly improves retrieval effectiveness over existing time-aware ranking models.
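To make the first of the two existing approaches concrete, here is a minimal sketch of a mixture model that linearly combines textual and temporal similarity. The λ weight and the per-document scores are invented for illustration; the paper's own model is learned, not hand-weighted.

```python
def mixture_score(text_sim, time_sim, lam=0.5):
    """Linear mixture of textual and temporal similarity."""
    return lam * text_sim + (1 - lam) * time_sim

def rank(docs, lam=0.5):
    """docs: list of (doc_id, text_sim, time_sim); returns ids, best first."""
    return [d for d, _, _ in
            sorted(docs, key=lambda x: -mixture_score(x[1], x[2], lam))]

# Hypothetical scores: d2 wins because it balances both similarities.
docs = [("d1", 0.9, 0.1), ("d2", 0.6, 0.8), ("d3", 0.4, 0.4)]
print(rank(docs))  # ['d2', 'd1', 'd3']
```

A learning-to-rank model, as proposed in the paper, would instead learn how to combine many such features from training data.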
Dr. Bryan Lewis and Dr. Madhav Marathe (both at Virginia Tech) will present a data-driven multi-scale approach for modeling the Ebola epidemic in West Africa. They will discuss how the models and tools were used to study a number of important analytical questions, such as: (i) computing weekly forecasts, (ii) optimally placing emergency treatment units and, more generally, health care facilities, and (iii) carrying out a comprehensive counter-factual analysis related to the allocation of scarce pharmaceutical and non-pharmaceutical resources. The role of big data and behavioral adaptation in developing the computational models will be highlighted.
Natural language processing (NLP) for mining online health-related data, a talk given at the International Society of Pharmacovigilance tutorial on Pharmacovigilance and Social Media in 2017.
Listen to this recording by IFLA's ENSULIB standing committee to learn how libraries are working at the forefront of citizen science; the connection between NASA climate change science, citizen science observations, and mosquito-borne disease; how the international GLOBE Mission Mosquito citizen science campaign is providing a common language and approach for meeting the global challenge of ensuring good health for all from mosquito-borne diseases; and examples of resources and partnerships that public, academic, and research libraries can leverage.
Humans are very effective at remembering through abstraction, pattern exploitation, or contextualization. On the other hand, humans are also capable of forgetting irrelevant details, an important function of the human brain that helps us focus on relevant things instead of drowning in detail by remembering everything. The research question we address in this paper is: Can we learn from human remembering and forgetting in order to develop more advanced preservation technology? In particular, we aim to study how a managed or controlled form of forgetting can play a role in digital preservation, including personal and organizational archives as well as collective memories. Our research goal is twofold: 1) to establish effective preservation for more concise and accessible digital memories, and 2) to enable the easier and wider adoption of preservation technology. The concept of managed forgetting is discussed in more detail in the research work of the European project ForgetIT, which investigates the proposed concept by means of an integrated information and preservation management approach.
Dynamics of Web: Analysis and Implications from Search Perspective – Nattiya Kanhabua
The dynamicity of the Web and its implications for various components of search systems have attracted considerable attention in the last decade. This course aims, in the first place, to introduce students to the general and wide topic of Web evolution, and then to pinpoint a number of issues related to temporal aspects of search and IR. We plan to start with an overview of seminal works that shed light on the evolution of the Web over time. Next, we will focus on the impact of this evolution on search, concentrating on the indexing of versioned document collections and on time-aware retrieval and ranking. We will discuss the evolution of search results and its effects on caching, and wrap up the course with a review of some recent approaches that aim to predict and search the future!
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ... – Nattiya Kanhabua
In human memory, forgetting plays a crucial role in focusing on important things and neglecting irrelevant details. In digital memories, the idea of systematic forgetting has so far received little attention. At first glance, forgetting seems to contradict the purpose of archival and preservation. However, we are currently facing a tremendous growth in volumes of digital content. Thus, it becomes ever more important to focus, while forgetting irrelevant details, redundancies, and noise. This holds true for better organizing the information space as well as for preservation management, for making and revisiting decisions on what to keep. Therefore, we propose the introduction of the concept of managed forgetting as part of a joint information management and preservation management process in digital memories. Managed forgetting models resource selection as a function of attention and significance dynamics. Based on dynamic, multidimensional information value assessment, it identifies information objects, e.g., documents or images, of decreasing importance and/or topicality and triggers forgetting actions. Those actions include a variety of options, namely aggregation and summarization, revised search and ranking behavior, elimination of redundancy, and finally, also deletion. In this paper, we present our vision for managed forgetting, discuss the challenges as well as our first ideas for its introduction, and present a case study for its motivation.
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification – Nattiya Kanhabua
Search result diversification is a common technique for tackling ambiguous and multi-faceted queries by maximizing the coverage of query aspects or subtopics in a result list. In some cases, the subtopics associated with such queries are temporally ambiguous; for instance, for the query US Open, users are more likely to be targeting the tennis open in September and the golf tournament in June. More precisely, users' search intent can be identified by the popularity of a subtopic with respect to the time when the query is issued. In this paper, we study search result diversification for time-sensitive queries, where the temporal dynamics of query subtopics are explicitly determined and modeled into result diversification. Unlike the aforementioned work, which in general considered only static subtopics, we leverage dynamic subtopics by analyzing two data sources (i.e., query logs and a document collection), which provide insights from different perspectives into how query subtopics change over time. Moreover, we propose novel time-aware diversification methods that leverage the identified dynamic subtopics. A key idea is to re-rank search results based on the freshness and popularity of subtopics. Our experimental results show that the proposed methods can significantly improve diversity and relevance effectiveness for time-sensitive queries in comparison with state-of-the-art methods.
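In the spirit of the re-ranking idea above, the following sketch greedily selects results whose subtopics add the most popularity-weighted novel coverage. The subtopic weights standing in for "popularity at query time" are hypothetical, and the paper's actual methods are more elaborate.

```python
def diversify(results, subtopic_weight, k=3):
    """Greedily pick results whose subtopics add the most
    popularity-weighted novel coverage to the ranked list.

    results: list of (doc_id, relevance, set_of_subtopics).
    subtopic_weight: subtopic -> temporal popularity at query time."""
    selected, covered, pool = [], set(), list(results)
    while pool and len(selected) < k:
        best = max(pool, key=lambda r: r[1] + sum(
            subtopic_weight.get(s, 0.0) for s in r[2] - covered))
        pool.remove(best)
        selected.append(best[0])
        covered |= best[2]
    return selected

# For "US Open" issued in September, the tennis subtopic dominates.
weights = {"tennis": 0.9, "golf": 0.2}
results = [("a", 0.5, {"golf"}), ("b", 0.4, {"tennis"}), ("c", 0.6, {"golf"})]
print(diversify(results, weights))  # ['b', 'c', 'a']
```

Note how "b" is promoted above the more relevant "c" solely because its subtopic is popular at query time and not yet covered.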
We estimate that nearly one third of news articles contain references to future events. While this information can prove crucial to understanding news stories and how events will develop for a given topic, there is currently no easy way to access it. We propose a new task to address the problem of retrieving and ranking sentences that contain mentions of future events, which we call ranking related news predictions. In this paper, we formally define this task and propose a learning-to-rank approach based on four classes of features: term similarity, entity-based similarity, topic similarity, and temporal similarity. Through extensive evaluations using a corpus of 1.8 million news articles and 6,000 manually judged relevance pairs, we show that our approach is able to retrieve a significant number of relevant predictions related to a given topic.
Concise Preservation by Combining Managed Forgetting and Contextualized Remem... – Nattiya Kanhabua
With the growing volumes of and reliance on digital content, there is a clear need for better information access solutions that keep relevant information accessible and usable in the long term. Inspired by the role of forgetting in the human brain, we envision a concept of managed forgetting for systematically dealing with information that progressively ceases in importance as well as with redundant information. Although inspired by human memory, managed forgetting is meant to complement rather than copy human remembering and forgetting. It can be regarded as a function of attention and significance dynamics relying on multi-faceted information assessment. This talk introduces our vision for managed forgetting on the conceptual level as part of an Integrated Cognitive Framework for Time-aware Information Access. We discuss relevant research and application aspects for managed forgetting. To this end, we present our first results and point out issues where further research is required.
Wikipedia is a free multilingual online encyclopedia covering a wide range of general and specific knowledge. Its content is continuously kept up-to-date and extended by a supporting community. In many cases, real-world events influence the collaborative editing of Wikipedia articles of the involved or affected entities. In this paper, we present Wikipedia Event Reporter, a web-based system that supports the entity-centric, temporal analytics of event-related information in Wikipedia by analyzing the whole history of article updates. For a given entity, the system first identifies peaks of update activity using burst detection and automatically extracts event-related updates using a machine-learning approach. Further, the system determines distinct events through the clustering of updates by exploiting different types of information such as update time, textual similarity, and the position of the updates within an article. Finally, the system generates a meaningful temporal summarization of event-related updates and automatically annotates the identified events in a timeline.
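The burst-detection step can be illustrated with a deliberately simple stand-in: flag a day whose update count far exceeds the mean of the preceding window. The window length and factor below are arbitrary choices; the system's actual burst-detection algorithm is not specified here.

```python
def detect_bursts(updates_per_day, window=7, factor=3.0):
    """Return indices of days whose update count exceeds `factor`
    times the mean of the preceding `window` days."""
    bursts = []
    for i in range(window, len(updates_per_day)):
        baseline = sum(updates_per_day[i - window:i]) / window
        if updates_per_day[i] > factor * max(baseline, 1.0):
            bursts.append(i)
    return bursts

# A quiet article that suddenly receives 40 edits in one day.
print(detect_bursts([2, 1, 3, 2, 1, 2, 3, 40, 2]))  # [7]
```

The updates inside each flagged window would then be clustered into distinct events, as the abstract describes.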
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst... – Nattiya Kanhabua
Going beyond its role as an encyclopedia, Wikipedia has become a global memory place for high-impact events, such as natural disasters and man-made incidents, thus influencing collective memory, i.e., the way we remember the past. Due to the importance of collective memory for framing the assessment of new situations, our actions, and our value systems, its open construction and negotiation in Wikipedia is an important new cultural and societal phenomenon. The analysis of this phenomenon does not only promise new insights into collective memory. It is also an important foundation for technology that more effectively complements the processes of human forgetting and remembering and better enables us to learn from the past. In this paper, we analyse the long-term dynamics of Wikipedia as a global memory place for high-impact events. This complements existing work analysing the collective memory negotiation and construction process in Wikipedia directly following an event. In more detail, we are interested in catalysts for reviving memories, i.e., in the fuel that keeps memories of past events alive, interrupting the general trend of fast forgetting. For this purpose, we study the triggers of revisiting behavior for a large set of event pages by exploiting page views and time-series analysis, and identify the most important catalyst features.
Understanding the Diversity of Tweets in the Time of Outbreaks – Nattiya Kanhabua
A microblogging service like Twitter continues to surge in importance as a means of sharing information in social networks. In the medical domain, several works have shown the potential of detecting public health events (i.e., infectious disease outbreaks) using Twitter messages or tweets. Given its real-time nature, Twitter can enhance early outbreak warning for public health authorities so that a rapid response can take place. Most previous work on detecting outbreaks in Twitter simply analyzes tweets matching disease names and/or locations of interest. However, the effectiveness of such methods is limited for two main reasons. First, disease names are highly ambiguous, i.e., they can refer to slang or non-health-related contexts. Second, the characteristics of infectious diseases are highly dynamic in time and place, namely, strongly time-dependent and varying greatly among different regions. In this paper, we propose to analyze the temporal diversity of tweets during the known periods of real-world outbreaks in order to gain insight into a temporary focus on specific events. More precisely, our objective is to understand whether, and to what extent, the temporal diversity of tweets can be used as an indicator of outbreak events. We employ an efficient sampling-based algorithm to compute the diversity statistics of tweets at a particular time. To this end, we conduct experiments correlating temporal diversity with the estimated event magnitude of 14 real-world outbreak events manually created as ground truth. Our analysis shows that correlation results differ among outbreaks, which can reflect the characteristics (severity and duration) of the outbreaks.
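One very simple diversity statistic that can be estimated by sampling is the fraction of distinct messages in a time window. The sketch below only conveys the sampling idea; the specific statistics and algorithm used in the paper may differ.

```python
import random

def sampled_diversity(tweets, sample_size=1000, seed=0):
    """Estimate a window's diversity as the fraction of distinct
    messages in a bounded random sample, so the cost stays constant
    even on high-volume windows."""
    rng = random.Random(seed)
    if len(tweets) > sample_size:
        tweets = rng.sample(tweets, sample_size)
    return len(set(tweets)) / len(tweets)

# During an outbreak, repeated phrasing lowers the diversity score.
window = ["flu outbreak"] * 8 + ["got the flu", "hospital is full"]
print(sampled_diversity(window))  # 0.3
```

A sudden drop in such a diversity score over time could then be correlated with the magnitude of a known outbreak event.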
Improving Temporal Language Models For Determining Time of Non-Timestamped Do... – Nattiya Kanhabua
Taking the temporal dimension into account in searching, i.e., using the time of content creation as part of the search condition, is now gaining increasing interest. However, in the case of web search and web warehousing, the timestamps (time of creation or publication) of web pages and documents found on the web are in general not known or cannot be trusted, and must be determined otherwise. In this paper, we describe approaches that enhance and increase the quality of existing techniques for determining timestamps based on a temporal language model. Through a number of experiments on temporal document collections, we show how our new methods improve the accuracy of timestamping compared to the previous models.
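The core idea of timestamping with a temporal language model can be sketched as follows: partition a reference corpus into time periods, build a unigram model per period, and assign a document to the period under which it is most likely. The toy vocabulary and smoothing constant below are illustrative; the paper's enhanced techniques go beyond this baseline.

```python
import math
from collections import Counter

def date_document(doc_terms, period_models, smoothing=1e-6):
    """Assign a non-timestamped document to the time partition whose
    unigram language model gives it the highest log-likelihood."""
    best, best_ll = None, float("-inf")
    for period, counts in period_models.items():
        total = sum(counts.values())
        ll = sum(math.log(counts.get(t, 0) / total + smoothing)
                 for t in doc_terms)
        if ll > best_ll:
            best, best_ll = period, ll
    return best

# Toy period models built from hypothetical era-specific vocabulary.
models = {
    "1990s": Counter({"walkman": 5, "fax": 4, "modem": 3}),
    "2010s": Counter({"smartphone": 6, "streaming": 4, "app": 3}),
}
print(date_document(["smartphone", "app", "fax"], models))  # 2010s
```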
Determining Time of Queries for Re-ranking Search Results – Nattiya Kanhabua
Recent work on analyzing query logs shows that a significant fraction of queries are temporal, i.e., their relevance depends on time, and that temporal queries play an important role in many domains, e.g., digital libraries and document archives. Temporal queries can be divided into two types: 1) those with temporal criteria explicitly provided by users, and 2) those with no temporal criteria provided. In this paper, we deal with the latter type, i.e., queries that comprise only keywords, whose relevant documents are associated with particular time periods not given by the queries. We propose a number of methods to determine the time of queries using temporal language models. We then show how to increase retrieval effectiveness by using the determined time of queries to re-rank the search results. Through extensive experiments we show that our proposed approaches improve retrieval effectiveness.
Exploiting Time-based Synonyms in Searching Document Archives – Nattiya Kanhabua
Query expansion of named entities can be employed to increase retrieval effectiveness. A peculiarity of named entities compared to other vocabulary terms is that they are very dynamic in appearance, and synonym relationships between terms change with time. In this paper, we present an approach to extracting synonyms of named entities over time from the whole history of Wikipedia. In addition, we use their temporal patterns as a feature in ranking and classifying them into two types: time-independent and time-dependent. Time-independent synonyms are invariant to time, while time-dependent synonyms are relevant to a particular time period, i.e., the synonym relationships change over time. Further, we describe how to make use of both types of synonyms to increase retrieval effectiveness, i.e., query expansion with time-independent synonyms for an ordinary search, and query expansion with time-dependent synonyms for a search w.r.t. temporal criteria. Finally, through an evaluation based on TREC collections, we demonstrate how the retrieval performance of queries consisting of named entities can be improved using our approach.
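A sketch of how time-dependent synonyms might be applied at query time follows. The synonym entry and its validity interval are hypothetical examples, not data produced by the paper's Wikipedia-based extraction.

```python
def expand_query(terms, synonyms, query_time=None):
    """Expand query terms with time-independent synonyms always, and
    with time-dependent synonyms only when valid at `query_time`.

    synonyms: term -> list of (synonym, valid_from, valid_to);
    valid_from=None marks a time-independent synonym."""
    expanded = list(terms)
    for t in terms:
        for syn, start, end in synonyms.get(t, []):
            if start is None or (query_time is not None
                                 and start <= query_time <= end):
                expanded.append(syn)
    return expanded

# Hypothetical entry: "Leningrad" as a name valid only from 1924 to 1991.
syns = {"st. petersburg": [("leningrad", 1924, 1991)]}
print(expand_query(["st. petersburg"], syns, query_time=1980))
```

Issued with a temporal criterion of 1980, the query is expanded with the period-appropriate name; issued for 2005, it is not.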
Searching the Temporal Web: Challenges and Current Approaches – Nattiya Kanhabua
This talk gives a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails, and enterprise documents. Unfortunately, traditional IR approaches based on term matching alone can give unsatisfactory results when searching the temporal web. The reasons are manifold: 1) the collection is strongly time-dependent, i.e., it contains multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely: 1) How can we understand the temporal search intent represented by time-sensitive queries? 2) How can we handle the temporal dynamics of queries and documents? 3) How can we explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to these problems as well as outline directions for future research.
We address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails, and enterprise documents. Unfortunately, traditional IR approaches based on term matching alone can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events that happened at particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query. Our contributions are different time-aware approaches within three topics in IR: content analysis, query analysis, and retrieval and ranking models. In particular, we aim at improving retrieval effectiveness by 1) analyzing the contents of temporal document collections, 2) performing an analysis of temporal queries, and 3) explicitly modeling the time dimension in retrieval and ranking.
Leveraging the time dimension in ranking can improve the retrieval effectiveness if information about the creation or publication time of documents is available. We analyze the contents of documents in order to determine the time of non-timestamped documents using temporal language models. We subsequently employ the temporal language models for determining the time of implicit temporal queries, and the determined time is used for re-ranking search results in order to improve the retrieval effectiveness. We study the effect of terminology changes over time and propose an approach to handling terminology changes using time-based synonyms.
In addition, we propose different methods for predicting the effectiveness of temporal queries, so that a particular query enhancement technique can be performed to improve the overall performance. When the time dimension is incorporated into ranking, documents will be ranked according to both textual and temporal similarity. In this case, time uncertainty should also be taken into account. Thus, we propose a ranking model that considers the time uncertainty, and improve ranking by combining multiple features using learning-to-rank techniques. Through extensive evaluation, we show that our proposed time-aware approaches outperform traditional retrieval methods and improve the retrieval effectiveness in searching temporal document collections.
Why Is It Difficult to Detect Outbreaks in Twitter? – Nattiya Kanhabua
Listen to this recording by IFLA's ENSULIB standing committee to learn how libraries are working at the forefront of citizen science; the connection between NASA climate change science, citizen science observations, and mosquito-borne disease; how the international GLOBE Mission Mosquito citizen science campaign is providing a common language and approach for meeting the global challenge of ensuring good health for all from mosquito-borne diseases; and examples of resources and partnerships that public, academic, and research libraries can leverage.
APHA Presentation: Using Predictive Analytics for West Nile Disease Prevention – Raed Mansour
Presentation at the 2015 American Public Health Association Annual Meeting in Chicago.
Since 2004, the City of Chicago has had a comprehensive surveillance and control program to address West Nile virus (WNV). Environmental surveillance has included: the collection of mosquitoes from traps located throughout the city; the identification and sorting of mosquitoes collected from these traps; and the testing of specific species of mosquitoes for WNV. Environmental control measures have included targeted adulticiding efforts.
This project will identify factors associated with the presence of West Nile virus (WNV) in mosquitoes and determine the effectiveness of mosquito control measures. The information gained will help the City of Chicago better target its surveillance, prevention, and control efforts.
An open competition to determine the best model is being planned by Kaggle, which will host the competition in partnership with the Robert Wood Johnson Foundation and CDPH. CDPH will provide data and technical support. Eight years of public health data will be incorporated into the model, which will be tested and potentially incorporated into business practice.
Full Abstract: https://apha.confex.com/apha/143am/webprogram/Paper335111.html
Informatics for Disease Surveillance – New Technologies – Dr Wasim Ahmed
A guest lecture on informatics for disease surveillance, looking at a number of new technologies. Delivered at the School of Health and Related Research.
Kno.e.sis Approach to Impactful Research & Training for Exceptional Careers – Amit Sheth
Abstract
Kno.e.sis (http://knoesis.org) is a world-class research center that uses semantic, cognitive, and perceptual computing for gathering insights from physical/IoT, cyber/Web, and social and enterprise (e.g., clinical) big data. We innovate and employ semantic web, machine learning, NLP/IR, data mining, network science and highly scalable computing techniques. Our highly interdisciplinary research impacts health and clinical applications, biomedical and translational research, epidemiology, cognitive science, social good, policy, development, etc. A majority of our $12+ million in active funds come from the NSF and NIH. In this talk, I will provide an overview of some of our major research projects.
Kno.e.sis is highly successful in its primary mission of exceptional student outcomes: our students have exceptional publication and real-world impact and our PhDs compete with their counterparts from top 10 schools for initial jobs in research universities, top industry research labs, and highly competitive companies. A key reason for Kno.e.sis' success is its unique work culture involving teamwork to solve complex problems. Practically all our work involves real-world challenges, real-world data, interdisciplinary collaborators, path-breaking research to solve challenges, real-world deployments, real-world use, and measurable real-world impact.
In this talk, I will also seek to discuss our choice of research topics and our unique ecosystem that prepares our students for exceptional careers.
Harnessing the Power of Infectious Disease Information with a Relational Data... – Jay Brown
IDdx is a relational database of 249 infectious diseases and a decision-support software tool. Physicians can use IDdx to build differential diagnosis lists that match the criteria in a query. Available criteria include 99 signs & symptoms and 39 epidemiological factors.
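As a toy analogue of the relational matching IDdx performs, one can score diseases by how many queried signs & symptoms and epidemiological factors they match. The two diseases and their attributes below are invented for illustration, and IDdx's actual matching logic may differ.

```python
def differential(diseases, signs, epi_factors):
    """Rank diseases by the number of matched signs & symptoms and
    epidemiological factors, best match first.

    diseases: name -> (set_of_signs, set_of_epi_factors)."""
    scored = []
    for name, (d_signs, d_epi) in diseases.items():
        score = len(d_signs & signs) + len(d_epi & epi_factors)
        if score:
            scored.append((score, name))
    return [name for _, name in sorted(scored, reverse=True)]

# Invented mini knowledge base with two diseases.
db = {
    "dengue": ({"fever", "rash"}, {"mosquito exposure", "tropics"}),
    "influenza": ({"fever", "cough"}, {"winter season"}),
}
print(differential(db, {"fever", "rash"}, {"mosquito exposure"}))
```

Here "dengue" ranks first because it matches both queried signs and the epidemiological factor, while "influenza" matches only the fever.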
"...On 29 September 2006, Eric Noji (Stanford, 1977) delivered a lecture on the public health consequences of disasters, at the University of Pittsburgh’s main campus. However, this wasn't an ordinary lecture delivered to a packed auditorium of scholars and students. Eric’s lecture was Webcast around the world. It was expected to reach more than 1.5 million viewers, the largest academic lecture in history. Instead they had more than 3 million! Unfortunately, this exceeded the number of global access portals the university and its 12 global telecommunication partners had anticipated. Internet pioneer Vint Cerf (Stanford, 1965), was at Eric’s lecture and managed to wirelessly contact several friends around the world who opened up enough additional access points to allow another 50,000 viewers to log on—just 10 minutes late..."
- Stanford Magazine, JULY/AUGUST 2007
Related Video: https://www.youtube.com/watch?v=WbjSiPT8scI
Thank You for referencing this work, if you find it useful!
Citation of a related scientific paper:
Berrocal, A., Manea, V., De Masi, A., Wac, K., mQoL Lab: Step-By-Step Creation of a Flexible Platform to Conduct Studies Using Interactive, Mobile, Wearable And Ubiquitous Devices, 17th International Conference On Mobile Systems And Pervasive Computing (MobiSPC), August 2020.
The talk details:
Alexandre De Masi, Katarzyna Wac, Getting Most out of your SENSORS: Mixed-Methods Research Methodology Enabling Identification, Modelling and Predicting Human Aspects of Mobile Sensing “In the Wild”, 19th IEEE Conference on Sensors (IEEE SENSORS’20), October 2020.
Digital Scholarship: Enlightenment or Devastated Landscape? TheContentMine
Published on Dec 17, 2015 by PMR
Every year 500 Billion USD of public funding is spent on research, but much of this lies hidden in papers that are never read. I describe how machines can help us to read the literature. However there is massive opposition from publishers who are trying to prevent open scholarship and who build walled gardens that they control
Every year 500 Billion USD of public funding is spent on research, but much of this lies hidden in papers that are never read. I describe how machines can help us to read the literature. However there is massive opposition from publishers who are trying to prevent open scholarship and who build walled gardens that they control
Changing the World in Healthcare, Education, and Energy through Science, Tech...Mohamed Labadi
Changing the World in Healthcare, Education, and Energy through Science, Technology, and Social Entrepreneurship & Innovation!
“Global health and global education problems & challenges are a single-point failure for humanity.”
Acorn Recovery: Restore IT infra within minutesIP ServerOne
Introducing Acorn Recovery as a Service, a simple, fast, and secure managed disaster recovery (DRaaS) by IP ServerOne. A DR solution that helps restore your IT infra within minutes.
0x01 - Newton's Third Law: Static vs. Dynamic AbusersOWASP Beja
f you offer a service on the web, odds are that someone will abuse it. Be it an API, a SaaS, a PaaS, or even a static website, someone somewhere will try to figure out a way to use it to their own needs. In this talk we'll compare measures that are effective against static attackers and how to battle a dynamic attacker who adapts to your counter-measures.
About the Speaker
===============
Diogo Sousa, Engineering Manager @ Canonical
An opinionated individual with an interest in cryptography and its intersection with secure software development.
This presentation, created by Syed Faiz ul Hassan, explores the profound influence of media on public perception and behavior. It delves into the evolution of media from oral traditions to modern digital and social media platforms. Key topics include the role of media in information propagation, socialization, crisis awareness, globalization, and education. The presentation also examines media influence through agenda setting, propaganda, and manipulative techniques used by advertisers and marketers. Furthermore, it highlights the impact of surveillance enabled by media technologies on personal behavior and preferences. Through this comprehensive overview, the presentation aims to shed light on how media shapes collective consciousness and public opinion.
This presentation by Morris Kleiner (University of Minnesota), was made during the discussion “Competition and Regulation in Professions and Occupations” held at the Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found out at oe.cd/crps.
This presentation was uploaded with the author’s consent.
María Carolina Martínez - eCommerce Day Colombia 2024
Can Twitter & Co. Save Lives?
1. Can Twitter & Co. Save Lives?
Nattiya Kanhabua, Avaré Stewart, Sara Romano
Ernesto Diaz-Aviles, Wolf Siberski, and Wolfgang Nejdl
L3S Research Center / Leibniz Universität Hannover, Germany
Research Seminar @MPII, Saarbrücken
22 October 2013
2. Motivation
• Numerous works use Twitter to infer the existence
and magnitude of real-world events in real-time
– Earthquake [Sakaki et al., 2010]
– Predicting financial time series [Ruiz et al., 2012]
– Influenza epidemics [Culotta, 2010; Lampos et al.,
2011; Paul et al., 2011]
4. Health related tweets
• User status updates or news related to
public health are common on Twitter
– I have the mumps...am I alone?
– my baby girl has a Gastroenteritis so great!! Please
do not give it to meee
– #Cholera breaks out in #Dadaab refugee camp in
#Kenya http://t.co/....
– As many as 16 people have been found infected with
Anthrax in Shahjadpur upazila of the Sirajganj district
in Bangladesh.
9. Data Collection
• Official outbreak reports
– ~3,000 ProMED-mail reports from 2011
– WHO reports have very limited coverage
• Twitter data
– ~1,200 health-related terms (i.e., infectious
diseases, their synonyms, pathogens and symptoms)
– Over 112 million tweets from 2011
• Series of NLP tools including
– OpenNLP (tokenization, sentence splitting, POS
tagging)
– OpenCalais (named entity recognition)
– HeidelTime (temporal expression extraction)
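The first-pass term filtering described above can be sketched in a few lines. This is a toy illustration only: the term list here is a tiny stand-in for the ~1,200 expert-curated expressions, and the real pipeline uses full NLP tooling (OpenNLP, OpenCalais, HeidelTime) rather than simple token matching.

```python
# Illustrative first-pass filter: keep only tweets mentioning at least one
# health-related term (diseases, synonyms, pathogens, symptoms).
HEALTH_TERMS = {"cholera", "anthrax", "mumps", "ebola", "influenza",
                "gastroenteritis"}

def filter_tweets(tweets, terms=HEALTH_TERMS):
    """Return the tweets that contain at least one health-related term."""
    kept = []
    for tweet in tweets:
        # Crude normalization: strip hash signs and punctuation, lowercase.
        tokens = {t.strip("#.,!?").lower() for t in tweet.split()}
        if tokens & terms:
            kept.append(tweet)
    return kept

tweets = [
    "#Cholera breaks out in #Dadaab refugee camp in #Kenya",
    "Lovely weather in Hannover today",
]
print(filter_tweets(tweets))  # only the cholera tweet survives
```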
11. Event Extraction
• An event is a sentence containing two entities
– (1) medical condition and (2) geographic expression
– A minimum requirement by domain experts
• A victim and the time of an event can be identified
from the sentence itself, or its surrounding context
• Output: a set of event candidates
Reported by World Health Organization (WHO) on
29 July 2012 about an ongoing Ebola outbreak
in Uganda since the beginning of July 2012
[Kanhabua et al., TAIA’ 12]
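The minimum requirement stated above (a sentence mentioning both a medical condition and a geographic expression) can be sketched with toy gazetteers; in the actual system these entities come from named-entity recognition (OpenCalais), not from fixed lists.

```python
# Toy gazetteers standing in for named-entity recognition output.
MEDICAL = {"ebola", "cholera", "anthrax"}
LOCATIONS = {"uganda", "kenya", "bangladesh"}

def event_candidates(sentences):
    """A sentence is an event candidate iff it mentions both a medical
    condition and a geographic expression."""
    out = []
    for s in sentences:
        tokens = {t.strip(".,#").lower() for t in s.split()}
        med = tokens & MEDICAL
        loc = tokens & LOCATIONS
        if med and loc:
            out.append((s, sorted(med), sorted(loc)))
    return out

sents = [
    "An ongoing Ebola outbreak in Uganda since July 2012.",
    "Ebola is a filovirus.",  # no location -> not an event candidate
]
for cand in event_candidates(sents):
    print(cand)
```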
12. Message Filtering: Challenges
• Ambiguity
– having several meanings
– used in different contexts
• Incompleteness
– missing or under-reported events
– data processing errors
Category / Example tweet
– Literature: A two hour train journey, Love In the Time of Cholera ...
– Music: Dengue Fever’s “Uku,” Mixed by Paul Dreux Smith Universal Audio...
– Marketing: Exclusive distributor of high quality #HIV/AIDS Blood & Urine and #Hepatitis #Self-testers.
– General: Identification of genotype 4 Hepatitis E virus binding proteins on swine liver cells: Hepatitis E virus...
– Negative: i dont have sniffles and no real coughing..well its coughing but not like an influenza cough.
– Joke: Thought I had Bieber Fever. Ends up I just had a combo of the mumps, mono, measles & the hershey squ...
16. Approach for Noisy Data
• MedISys1
– providing a list of negative keywords created
by medical experts
• Urban Dictionary2
– a Web-based dictionary of slang, ethnic
culture words or phrases
1 http://medusa.jrc.it/medisys/homeedition/en/home.html
2 http://www.urbandictionary.com/
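The negative-keyword idea can be sketched as a simple string filter. The keyword list below is hypothetical, standing in for the expert-curated MedISys list and slang harvested from Urban Dictionary.

```python
# Hypothetical negative-keyword list (jokes, fiction titles, marketing terms).
# Tweets matching any of these patterns are treated as noise.
NEGATIVE_KEYWORDS = {"bieber fever", "love in the time of cholera",
                     "distributor"}

def is_noise(tweet):
    """Flag a tweet as noise if it contains any negative keyword."""
    text = tweet.lower()
    return any(kw in text for kw in NEGATIVE_KEYWORDS)

tweets = [
    "Thought I had Bieber Fever. Ends up I just had the mumps.",
    "#Cholera breaks out in #Dadaab refugee camp in #Kenya",
]
print([t for t in tweets if not is_noise(t)])  # keeps only the real report
```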
21. Signal Generation: Challenges
• Temporal Dynamics
– seasonal infectious diseases
– rare and spontaneous outbreaks
• Location Dynamics
– frequency and duration
– levels of prevalence or severity
[Rortais et al., 2010 in Journal of Food Research International]
[Emch et al., 2008 in International Journal of Health Geographics]
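One common way to turn filtered term counts into warning signals is an anomaly detector over the daily time series. The sketch below uses an exponentially weighted moving average baseline; the smoothing factor and threshold are illustrative choices, not the system's actual configuration.

```python
# Sketch of an EWMA-style anomaly detector over daily disease-mention counts.
def ewma_signals(counts, alpha=0.3, threshold=3.0):
    """Flag days whose count exceeds the EWMA baseline by `threshold`
    standard deviations of the past residuals."""
    signals = []
    baseline = counts[0]
    residuals = []
    for day, c in enumerate(counts[1:], start=1):
        resid = c - baseline
        if len(residuals) >= 3:  # need a few residuals to estimate spread
            mean = sum(residuals) / len(residuals)
            var = sum((r - mean) ** 2 for r in residuals) / len(residuals)
            std = var ** 0.5 or 1.0
            if resid > mean + threshold * std:
                signals.append(day)
        residuals.append(resid)
        baseline = alpha * c + (1 - alpha) * baseline  # EWMA update
    return signals

# A flat series with one outbreak-like spike on day 8:
counts = [5, 6, 5, 7, 6, 5, 6, 5, 60, 7]
print(ewma_signals(counts))  # [8]
```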
31. Approach
• Personalized Tweet Ranking for Epidemic
Intelligence
– Learning to rank and recommender systems
– User's context as implicit criteria for recommendation
[Diaz-Aviles et al., ICWSM’12; Diaz-Aviles et al., WWW’12]
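The idea of using the user's context as implicit recommendation criteria can be sketched as a fixed-weight overlap score. PTR4EI itself learns a ranking model from data; the weights and token matching below are only illustrative.

```python
# Minimal sketch of context-based tweet ranking: score each tweet by its
# overlap with the user's context (medical conditions and locations).
def rank_tweets(tweets, medical_context, location_context,
                w_med=2.0, w_loc=1.0):
    """Rank tweets by weighted overlap with the user context; a
    learning-to-rank model would learn these weights from feedback."""
    def score(tweet):
        tokens = {t.strip("#.,!?").lower() for t in tweet.split()}
        return (w_med * len(tokens & medical_context)
                + w_loc * len(tokens & location_context))
    return sorted(tweets, key=score, reverse=True)

tweets = [
    "Traffic jam in Hamburg again",
    "#EHEC cases rising in Hamburg hospitals",
]
ranked = rank_tweets(tweets, {"ehec"}, {"hamburg", "germany"})
print(ranked[0])  # the EHEC tweet ranks first
```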
34. Conclusion
• Can Twitter & Co. Save Lives?
– On a global level, we were able to generate
signals earlier than official reporting
mechanisms.
– The ultimate answer depends on how health
organizations will use and react to the
information provided by our system.
35. Future Work
• Real-Time Analysis of Big and Fast
Social Web Streams
– Scalable, efficient methods for filtering and
generating signals in real-time
– Effective methods for aggregating and
visualizing information in a meaningful way
37. References
• [Culotta, 2010] A. Culotta. Towards detecting influenza epidemics by analyzing twitter
messages. In Proceedings of the First Workshop on Social Media Analytics (SOMA’2010), 2010.
• [Diaz-Aviles et al., 2012a] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl.
Towards personalized learning to rank for epidemic intelligence based on social media streams.
In Proceedings of the 21st World Wide Web Conference (WWW ‘2012), 2012.
• [Diaz-Aviles et al., 2012b] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl.
Epidemic intelligence for the crowd, by the crowd. In Proceedings of International AAAI
Conference on Weblogs and Social Media (ICWSM’2012), 2012.
• [Kanhabua et al., 2012a] N. Kanhabua, S. Romano, and A. Stewart. Identifying Relevant
Temporal Expressions for Real-world Events. In SIGIR 2012 Workshop on Time-aware
Information Access (TAIA'2012), 2012.
• [Kanhabua et al., 2012b] N. Kanhabua, S. Romano, A. Stewart, and W. Nejdl. Supporting
Temporal Analytics for Health Related Events in Microblogs. In Proceedings of CIKM'2012, 2012.
• [Kanhabua and Nejdl 2013] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets
in the Time of Outbreaks. In Proceedings of the First International Web Observatory Workshop
(WOW'2013) at WWW'2013, 2013.
• [Lampos et al., 2011] V. Lampos and N. Cristianini. Nowcasting events from the social web with
statistical learning. ACM TIST, 3, 2011.
• [Paul et al., 2011] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public
health. In Proceedings of International AAAI Conference on Weblogs and Social Media
(ICWSM’2011), 2011.
• [Ruiz et al., 2012] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating
financial time series with micro-blogging activity. In Proceedings of WSDM’2012, 2012.
• [Sakaki et al., 2010] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users:
real-time event detection by social sensors. In Proceedings of WWW’2010, 2010.
Editor's Notes
To exploit this timeliness potential, we present an event-based Epidemic Intelligence (EI) system. EI has emerged as a type of intelligence gathering aimed at detecting events of interest to public health from unstructured text on the Web.
In the medical domain, there has been a surge in detecting health related tweets for early warning
Allow a rapid response from authorities [Diaz-Aviles et al., 2012]
Note that there are existing EI systems, such as the BioCaster Global Health Monitor or HealthMap. However, they differ from our proposed system in the level of analysis, information sources, language coverage and visualization.
Frequencies of cases reported to RKI and number of tweets mentioning the name of the disease: EHEC. Pearson correlation coefficient = 0.864. The monitor of Twitter allowed M-Eco to generate the first signals on Friday, May 20th, 2011.
We study and propose solutions to three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) producing reliable warning signals (temporal anomalies) based on observed term frequency changes in these messages, using biosurveillance algorithms, and 3) providing suitable information and recommendations to domain experts, for better assessment of the potential outbreak threats associated with the generated signals.
Part I. Ground truth creation
Official outbreak reports
World Health Organization1
ProMED-mail2
Part II. Creating Twitter time series
medical condition
disease name, synonyms, pathogens, symptoms
location
geographic expressions, geo-location, or user profile
3 levels: country, continent, latitude
M-Eco strives to detect a large variety of infectious diseases, so we make use of a list of 1,258 terms consisting of infectious diseases, their synonyms, pathogens and symptoms, which are provided by the domain experts in two languages, namely English and German, for an initial filtering step. All documents and tweets are annotated with locations, medical conditions and temporal expressions using a series of language processing tools, including OpenNLP for tokenization, sentence splitting and part-of-speech tagging, and HeidelTime [34] for temporal expression extraction.
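A tiny stand-in for a temporal tagger such as HeidelTime can illustrate the temporal-expression annotation step. This toy regex only recognises explicit "day Month year" and "Month year" forms; the real tagger handles far richer expressions ("last week", relative dates, etc.).

```python
import re

# Toy temporal-expression extractor (stand-in for HeidelTime).
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
TEMPORAL = re.compile(r"\b(?:\d{1,2} )?(?:%s) \d{4}\b" % MONTHS)

def temporal_expressions(text):
    """Return explicit date expressions found in the text."""
    return TEMPORAL.findall(text)

text = "Reported by WHO on 29 July 2012 about an outbreak since July 2012."
print(temporal_expressions(text))  # ['29 July 2012', 'July 2012']
```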
Hash-tags co-occurring with #EHEC during May 23 and June 19, 2011, the main period of the
outbreak. The hash-tags are classified as entities of type Medical Condition, Location, or Complementary
Context, hash-tags out of these categories are discarded.
Our approach builds upon [18] and extends it by: 1) incorporating the use of an orthogonal vector, which is learned by a Support Vector Machine (SVM), as a description of the feature change; and 2) computing a novelty score that lets the system identify
those tweets that contribute to the feature change, so that their true labels can be obtained.
In order to detect outbreak events for early warning, we exploit different state-of-the-art biosurveillance algorithms as anomaly detectors in disease-related Twitter messages: C1, C2, C3, F-Statistic (FS), Exponentially Weighted Moving Average (EWMA) and Farrington (FA) [Basseville and Nikiforov, 1993; Farrington et al., 1996]. Traditional biosurveillance systems usually exploit information from official sources, e.g., laboratory results, mortality rates, or the number of reported patients suffering from a disease outbreak. In recent years, researchers in the medical domain have begun to leverage real-time social Web data, such as tweets.
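The C-series detectors mentioned here can be sketched as a simple control chart: compare today's count against a sliding baseline window separated from today by a short guard band, and flag counts that exceed the baseline mean by a few standard deviations. Window sizes and the threshold below follow common conventions but are illustrative only.

```python
# Sketch of a C2-style detector over daily disease-mention counts.
def c2_signals(counts, baseline=7, lag=2, k=3.0):
    """Flag day i when counts[i] > mean + k*std of a 7-day baseline
    window ending `lag` days earlier (the guard band)."""
    signals = []
    for day in range(baseline + lag, len(counts)):
        window = counts[day - lag - baseline:day - lag]
        mean = sum(window) / baseline
        var = sum((x - mean) ** 2 for x in window) / baseline
        std = var ** 0.5 or 1.0
        if counts[day] > mean + k * std:
            signals.append(day)
    return signals

# A flat series with one outbreak-like spike on day 9:
counts = [4, 5, 6, 5, 4, 5, 6, 5, 4, 30, 5, 4]
print(c2_signals(counts))  # [9]
```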
Identified topics show similar trends during the known time periods of real-world outbreaks
Diversity reflects how the language (i.e., terms and locations) are used differently
Div(entity) highly correlates with topic dynamics for some diseases, i.e., mumps, ebola, botulism and ehec
Div(term) shows correlation with topic dynamics for cholera, anthrax and rubella
Algorithms: SampleDJ, TrackDJ (claims and proofs in [Deng et al., 2012])
(1) Rely upon abundant user interactions and/or the availability of explicit feedback (e.g., ratings, likes, dislikes)
(2) Within M-Eco, we use the tweets from signals in developing techniques to provide a personalized short list of tweets that meets the context of the investigation. In this section, we review one of them; namely Personalized Tweet Ranking for Epidemic Intelligence (PTR4EI) [13, 14] and discuss the evaluation conducted during a major EHEC outbreak in Germany.
User context: Cu = (t, MCu, Lu), where t is a discrete time interval, MCu the set of Medical Conditions, and Lu the set of Locations of user interest.
More precisely, we expand the user's context, Cu, using latent topics computed with LDA [5] on: 1) an indexed collection of tweets for epidemic intelligence; and 2) the hash-tags that co-occur with this context.
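The hash-tag side of this context expansion can be sketched as a co-occurrence count: collect tags that frequently appear alongside the seed context tags. The LDA-based topic expansion mentioned above is omitted; this is only the co-occurrence half of the idea, on illustrative data.

```python
from collections import Counter

# Expand a seed context with hash-tags that co-occur with it in the stream.
def expand_context(tweets, seed_tags, top_k=2):
    """Return the top_k hash-tags most often co-occurring with seed_tags."""
    cooc = Counter()
    for tweet in tweets:
        tags = {t.lower() for t in tweet.split() if t.startswith("#")}
        if tags & seed_tags:            # tweet matches the seed context
            cooc.update(tags - seed_tags)
    return [tag for tag, _ in cooc.most_common(top_k)]

tweets = [
    "#ehec outbreak confirmed in #hamburg",
    "#ehec linked to #sprouts says lab in #hamburg",
    "#football tonight in #hamburg",   # no seed tag -> ignored
]
print(expand_context(tweets, {"#ehec"}))  # ['#hamburg', '#sprouts']
```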