Abstract: With the rapid rise in social media, alternative news sources, and blogs, ordinary citizens have become information producers as much as information consumers. Highly charged prose, images, and videos spread virally, and stoke the embers of social unrest by alerting fellow citizens to relevant happenings and spurring them into action. We are interested in using Big Data approaches to generate forecasts of civil unrest from open source indicators. The heterogeneous nature of the data, coupled with the rich and diverse origins of civil unrest, calls for a multi-model approach to such forecasting. We present a modular approach wherein a collection of models use overlapping sources of data to independently forecast protests. Fusion of alerts into a single alert stream becomes a key system informatics problem, and we present a statistical framework to accomplish such fusion. Given an alert from one of the numerous models, the decision space for fusion has two possibilities: (i) release the alert or (ii) suppress the alert. Using a Bayesian decision-theoretic framework, we present a fusion approach for releasing or suppressing alerts. The resulting system enables real-time decisions and, more importantly, tuning of precision and recall. Supplementary materials for this article are available online.
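As an illustration of the release-or-suppress rule described in this abstract, here is a minimal sketch of a Bayesian expected-loss decision in Python. The cost values, the `should_release` helper, and the toy alert stream are illustrative assumptions, not the actual EMBERS fusion engine.

```python
# A minimal sketch of a release/suppress decision under expected loss.
# Costs and probabilities are illustrative assumptions.

def should_release(p_event: float, cost_fp: float = 1.0, cost_fn: float = 4.0) -> bool:
    """Release an alert iff the expected loss of releasing is lower
    than the expected loss of suppressing.

    Expected loss of releasing   = (1 - p_event) * cost_fp  (false alarm)
    Expected loss of suppressing = p_event * cost_fn        (missed event)
    """
    return (1.0 - p_event) * cost_fp < p_event * cost_fn

# Raising cost_fn relative to cost_fp lowers the release threshold
# cost_fp / (cost_fp + cost_fn), which favors recall over precision.
alerts = [("model_A", 0.35), ("model_B", 0.12), ("model_C", 0.70)]
released = [(m, p) for m, p in alerts if should_release(p)]
print(released)  # threshold here is 0.2, so model_A and model_C are released
```

Tuning the two costs is what gives the system its precision/recall knob: a high cost of a missed event releases more alerts, a high cost of a false alarm suppresses more.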
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
EMBERS at 4 years: Experiences operating an Open Source Indicators Forecastin...Parang Saraf
Abstract: EMBERS is an anticipatory intelligence system forecasting population-level events in multiple countries of Latin America. Deployed since 2012, EMBERS has been generating alerts 24x7 by ingesting a broad range of data sources, including news, blogs, tweets, machine-coded events, currency rates, and food prices. In this paper, we describe our experiences operating EMBERS continuously for nearly 4 years, with specific attention to the discoveries it has enabled, correct as well as missed forecasts, and lessons learnt from participating in a forecasting tournament, including our perspectives on the limits of forecasting and ethical considerations.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
‘Beating the News’ with EMBERS: Forecasting Civil Unrest using Open Source In...Parang Saraf
Abstract: We describe the design, implementation, and evaluation of EMBERS, an automated, 24x7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources. Unlike retrospective studies, EMBERS has been making forecasts into the future since Nov 2012, which have been (and continue to be) evaluated by an independent T&E team (MITRE). Of note, EMBERS successfully forecast the June 2013 protests in Brazil and the Feb 2014 violent protests in Venezuela. We outline the system architecture of EMBERS, individual models that leverage specific data sources, and a fusion and suppression engine that supports trading off specific evaluation criteria. EMBERS also provides an audit trail interface that enables the investigation of why specific predictions were made, along with the data utilized for forecasting. Through numerous evaluations, we demonstrate the superiority of EMBERS over base-rate methods and its capability to forecast significant societal happenings.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
EMBERS AutoGSR: Automated Coding of Civil Unrest EventsParang Saraf
Abstract: We describe the EMBERS AutoGSR system that conducts automated coding of civil unrest events from news articles published in multiple languages. The nuts and bolts of the AutoGSR system constitute an ecosystem of filtering, ranking, and recommendation models that determine whether an article reports a civil unrest event and, if so, identify and encode specific characteristics of the event, such as the when, where, who, and why of the protest. AutoGSR has been deployed for the past 6 months, continually processing data 24x7 in languages such as Spanish, Portuguese, and English, and encoding civil unrest events in 10 countries of Latin America: Argentina, Brazil, Chile, Colombia, Ecuador, El Salvador, Mexico, Paraguay, Uruguay, and Venezuela. We demonstrate the superiority of AutoGSR over both manual approaches and other state-of-the-art encoding systems for civil unrest.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Classifying Crises-Information Relevancy with SemanticsCOMRADES project
Prashant Khare, Gregoire Burel, and Harith Alani
Knowledge Media Institute, The Open University, United Kingdom
{prashant.khare,g.burel,h.alani}@open.ac.uk
This paper presents a framework to automatically detect disinformation campaigns on social media. The framework integrates natural language processing, machine learning, graph analysis, and causal inference. It collects social media posts, identifies narratives, classifies accounts as real or influence operations, maps the network of accounts spreading each narrative, and estimates the causal impact of each account in spreading the narrative. The framework was tested on real Twitter data from the 2017 French election and detected influence operation accounts with 96% precision and 79% recall. It also identified communities and high impact accounts that were corroborated by other sources.
Memo for the Danish Emergency Management Agency by Anna Boye Koldaas, MSc student in Security Risk Management at Copenhagen University.
IRJET- Fake Message Deduction using Machine LearningIRJET Journal
This document summarizes research on detecting fake news using machine learning. It proposes building a dataset of both fake and real news articles and using a Naive Bayes classifier to create a model that can classify new articles as fake or real based on their words and phrases. The goal is to address the growing problems of fake news and lack of trust in the media by automatically identifying misleading stories. It discusses related work on information diffusion and media bias, and outlines the objectives, scope and literature review of the proposed research.
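A minimal sketch of the kind of Naive Bayes text classifier the summary describes, using scikit-learn; the four toy articles and their labels are placeholders, not the dataset the paper proposes to build.

```python
# A minimal Naive Bayes fake/real classifier over word and phrase features.
# The corpus below is an illustrative placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

articles = [
    "Government announces new infrastructure budget",
    "Celebrity secretly replaced by clone, insiders say",
    "Central bank holds interest rates steady",
    "Miracle cure doctors don't want you to know about",
]
labels = ["real", "fake", "real", "fake"]

# Unigram + bigram counts feed a multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(articles, labels)
print(model.predict(["Insiders say this miracle budget is a secret clone"]))
```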
Epidemiological modeling of online social network dynamicsDario Caliendo
Facebook has spread like a viral epidemic, one to which users are slowly becoming immune. So claim John Cannarella and Joshua A. Spechler, two Princeton University researchers who studied the social network by applying a model used to study the development of epidemics. They determined that the spread of Zuckerberg's platform has now passed its peak and will soon begin to implode, losing 80% of its users by 2017.
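For intuition, a small sketch of modeling adoption and abandonment with epidemic dynamics. It integrates a plain SIR system; the researchers' actual irSIR variant modifies the recovery term, and all parameters here are illustrative.

```python
# A toy SIR integration treating platform adoption as contagion:
# S = susceptible non-users, I = active users, R = former users.
# Parameters are illustrative, not fitted to Facebook data.
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma):
    S, I, R = y
    dS = -beta * S * I          # non-users "infected" by contact with users
    dI = beta * S * I - gamma * I
    dR = gamma * I              # users abandon the platform
    return [dS, dI, dR]

t = np.linspace(0, 20, 200)
traj = odeint(sir, [0.99, 0.01, 0.0], t, args=(0.9, 0.2))
peak = t[traj[:, 1].argmax()]
print(f"active-user share peaks at t={peak:.1f}, then declines")
```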
Dealing with Information Overload When Using Social Media for Emergency Manag...Mirjam-Mona
Presentation by Starr Roxanne Hiltz and Linda P. Plotnick on the topic "Dealing with Information Overload When Using Social Media for Emergency Management: Emerging Solutions" at ISCRAM 2013.
Expelling Information of Events from Critical Public Space using Social Senso...ijtsrd
Infrastructure systems provide many of the services that are critical to the health, functioning, and security of society. Many of these systems, however, lack continuous physical sensor monitoring that would allow failure events or damage to be detected. We propose the use of social sensor big data to detect these events. We focus on two main infrastructure systems, transportation and energy, and use data from Twitter streams to detect damage to bridges, highways, gas lines, and power infrastructure. Through a three-step filtering approach and assignment to geographical cells, we are able to filter out noise in this data and produce relevant geolocated tweets identifying failure events. Applying the strategy to real-world data, we demonstrate the ability of our approach to use social sensor big data to detect damage and failure events in these critical public infrastructures. Samatha P. K | Dr. Mohamed Rafi, "Expelling Information of Events from Critical Public Space using Social Sensor Big Data," International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd25350.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/25350/expelling-information-of-events-from-critical-public-space-using-social-sensor-big-data/samatha-p-k
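A minimal sketch of the filter-then-bin idea summarized above: keyword filtering of tweets followed by assignment to geographic grid cells. The keyword list, cell size, and tweet format are assumptions for illustration, not the authors' exact three-step pipeline.

```python
# Keyword-filter tweets, then bin survivors into coarse geographic cells;
# cells with multiple independent reports suggest a real failure event.
from collections import Counter

DAMAGE_TERMS = {"bridge", "highway", "gas leak", "power outage", "collapsed"}

def is_relevant(text: str) -> bool:
    t = text.lower()
    return any(term in t for term in DAMAGE_TERMS)

def geo_cell(lat: float, lon: float, size_deg: float = 0.1) -> tuple:
    # Snap coordinates to a grid cell roughly 10 km on a side.
    return (round(lat / size_deg), round(lon / size_deg))

tweets = [
    {"text": "Power outage across downtown", "lat": 40.71, "lon": -74.01},
    {"text": "Nice sunset tonight",          "lat": 40.72, "lon": -74.00},
    {"text": "The bridge on 5th collapsed!", "lat": 40.71, "lon": -74.02},
]
hits = Counter(geo_cell(tw["lat"], tw["lon"])
               for tw in tweets if is_relevant(tw["text"]))
print(hits.most_common())  # the two relevant tweets land in the same cell
```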
Massively Parallel Simulations of Spread of Infectious Diseases over Realisti...Subhajit Sahu
Highlighted notes while studying for project work:
Massively Parallel Simulations of Spread of Infectious Diseases over Realistic Social Networks
Abhinav Bhatele†
Jae-Seung Yeom†
Nikhil Jain†
Chris J. Kuhlman∗
Yarden Livnat‡
Keith R. Bisset∗
Laxmikant V. Kale§
Madhav V. Marathe∗
†Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California 94551 USA
∗Biocomplexity Institute & Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061 USA
‡Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, Utah 84112 USA
§Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801 USA
E-mail: †{bhatele, yeom2, nikhil}@llnl.gov, ∗{ckuhlman, kbisset, mmarathe}@vbi.vt.edu
Abstract—Controlling the spread of infectious diseases in large populations is an important societal challenge. Mathematically, the problem is best captured as a certain class of reaction-diffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been successfully used in the recent past to study such contagion processes. We describe EpiSimdemics, a highly scalable, parallel code written in Charm++ that uses agent-based modeling to simulate disease spread over large, realistic, co-evolving interaction networks. We present a new parallel implementation of EpiSimdemics that achieves unprecedented strong and weak scaling on different architectures: Blue Waters, Cori, and Mira. EpiSimdemics achieves five times greater speedup than the second fastest parallel code in this field. This unprecedented scaling is an important step toward supporting the long-term vision of real-time epidemic science. Finally, we demonstrate the capabilities of EpiSimdemics by simulating the spread of influenza over a realistic synthetic social contact network spanning the continental United States (∼280 million nodes and 5.8 billion social contacts).
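For scale intuition, a toy single-process version of agent-based contagion over a contact network. networkx and the parameters below are stand-ins; the real EpiSimdemics is a distributed Charm++ code running on networks with hundreds of millions of nodes.

```python
# A toy agent-based S/I/R contagion simulation on a random contact network.
# All parameters are illustrative.
import random
import networkx as nx

random.seed(7)
G = nx.erdos_renyi_graph(n=500, p=0.02)      # toy contact network
state = {v: "S" for v in G}                   # S/I/R state per agent
for seed in random.sample(list(G), 5):
    state[seed] = "I"

P_TRANSMIT, P_RECOVER = 0.1, 0.2
for day in range(30):
    for v in [u for u in G if state[u] == "I"]:   # snapshot of the infected
        for nbr in G.neighbors(v):
            if state[nbr] == "S" and random.random() < P_TRANSMIT:
                state[nbr] = "I"
        if random.random() < P_RECOVER:
            state[v] = "R"

touched = sum(1 for s in state.values() if s != "S")
print(f"{touched} of {len(G)} agents were reached by the contagion")
```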
Semantic Wide and Deep Learning for Detecting Crisis-Information Categories o...Gregoire Burel
When crises hit, many flock to social media to share or consume information related to the event. Social media posts during crises tend to provide valuable reports on affected people, donation offers, help requests, advice provision, etc. Automatically identifying the category of information (e.g., reports on affected individuals, donations, and volunteers) contained in these posts is vital for their efficient handling and consumption by affected communities and concerned organisations. In this paper, we introduce Sem-CNN, a wide and deep Convolutional Neural Network (CNN) model designed for identifying the category of information contained in crisis-related social media content. Unlike previous models, which mainly rely on the lexical representations of words in the text, the proposed model integrates an additional layer of semantics, representing the named entities in the text, into a wide and deep CNN network. Results show that the Sem-CNN model consistently outperforms the baselines, which consist of statistical and non-semantic deep learning models.
Paper access: http://oro.open.ac.uk/51726/
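A minimal sketch of a wide-and-deep text classifier in the spirit of Sem-CNN: a convolutional branch over word indices combined with a wide input of named-entity features. The dimensions, vocabulary size, and class count are illustrative assumptions, not the paper's configuration.

```python
# A wide-and-deep text classifier: CNN over word ids (deep, lexical)
# concatenated with a vector of semantic/entity features (wide).
# All sizes below are illustrative assumptions.
from tensorflow.keras import layers, Model

VOCAB, SEQ_LEN, ENTITY_FEATS, N_CLASSES = 20000, 50, 32, 6

words = layers.Input(shape=(SEQ_LEN,), name="word_ids")
x = layers.Embedding(VOCAB, 128)(words)
x = layers.Conv1D(64, kernel_size=3, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)              # deep lexical branch

entities = layers.Input(shape=(ENTITY_FEATS,), name="entity_features")
merged = layers.concatenate([x, entities])      # wide semantic branch
out = layers.Dense(N_CLASSES, activation="softmax")(merged)

model = Model(inputs=[words, entities], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```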
IRJET- Sentiment Analysis using Machine LearningIRJET Journal
This document discusses sentiment analysis of social media posts using machine learning. The authors aim to classify social media posts as having either a political or non-political sentiment. A dictionary of keywords and their sentiments is created to analyze posts. Users can make posts that are then classified, and an admin can hide or delete posts containing harmful keywords. Graphs are generated to analyze the classification of posts as political versus non-political and by sentiment. The accuracy of classification depends on the training data and dictionary.
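A minimal sketch of the dictionary-driven classification the summary describes; the keyword sets are illustrative placeholders for the dictionary and the admin blocklist.

```python
# Dictionary-based post classification: flag posts containing blocked
# keywords for the admin, otherwise label political vs non-political.
# Keyword sets are illustrative placeholders.
POLITICAL = {"election", "minister", "parliament", "policy", "vote"}
BLOCKED = {"scam", "spam"}  # admin-maintained blocklist (placeholder)

def classify(post: str) -> str:
    words = set(post.lower().split())
    if words & BLOCKED:
        return "flagged_for_admin"
    return "political" if words & POLITICAL else "non-political"

print(classify("The minister promised a new policy before the vote"))
print(classify("Had a great lunch today"))
```

As the summary notes, accuracy of this approach is bounded by the coverage of the dictionary itself.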
TERRORIST WATCHER: AN INTERACTIVE WEB-BASED VISUAL ANALYTICAL TOOL OF TERRORIS...IJDKP
Terrorism is a phenomenon that has risen to a peak in recent years. Counter-terrorism analysts work with large sets of documents related to different terrorist groups and attack types to extract useful information about these groups' motives and tactics. It is evident that terrorism has become a global threat and can exist anywhere. To face this phenomenon, there is a need to understand the characteristics of terrorists and to determine whether there are general characteristics common to all of them. However, as the number of collected documents increases, deducing results and making decisions becomes more and more difficult for analysts. Information visualization tools can help analysts visualize terrorist characteristics. However, most current information visualization tools focus only on representing and analyzing terrorist organizations, with little emphasis on terrorists' personal characteristics. Therefore, this paper presents a visualization tool that can be used to analyze terrorists' personal characteristics in order to understand the production life cycle of a terrorist and how to confront it.
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...IRJET Journal
This document discusses controlling the spread of fake or misleading information on social media. It proposes a system to analyze information diffusion on social networks, identify diffused data, and control the spread of fake diffused data. The system would extract data from social media, perform sentiment analysis to determine the veracity of information, and discard fake or untrustworthy information from the database to prevent further propagation. A variety of machine learning techniques could be used for the sentiment analysis, including naive Bayes classification, linear regression, and gradient boosted trees. The goal is to curb the spread of misinformation while still allowing the diffusion of real or truthful information.
On Semantics and Deep Learning for Event Detection in Crisis SituationsCOMRADES project
In this paper, we introduce Dual-CNN, a semantically-enhanced deep learning model that targets the problem of event detection in crisis situations from social media data. A layer of semantics is added to a traditional Convolutional Neural Network (CNN) model to capture the contextual information that is generally scarce in short, ill-formed social media messages. Our results show that our methods are able to successfully identify the existence of events and event types (hurricane, floods, etc.) accurately (> 79% F-measure), but the performance of the model drops significantly (61% F-measure) when identifying fine-grained event-related information (affected individuals, damaged infrastructure, etc.). These results are competitive with more traditional machine learning models, such as SVM.
http://oro.open.ac.uk/49639/1/event_detection.pdf
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNINGijcsit
With the advent of the Internet and social media, while hundreds of people have benefitted from the vast sources of information available, there has been an enormous increase in cyber-crimes, particularly those targeted towards women. According to a 2019 report in the [4] Economic Times, India witnessed a 457% rise in cybercrime in the five-year span between 2011 and 2016. Most speculate that this is due to the impact of social media such as Facebook, Instagram, and Twitter on our daily lives. While these definitely help in creating a sound social network, creating a user account on these sites usually requires just an email id. A real-life person can create multiple fake IDs, so impostor accounts are easily made. Unlike the real world, where multiple rules and regulations are imposed to identify oneself in a unique manner (for example, while issuing one's passport or driver's license), in the virtual world of social media admission does not require any such checks. In this paper, we study accounts on Instagram in particular and try to assess an account as fake or real using machine learning techniques, namely Logistic Regression and the Random Forest algorithm.
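A minimal sketch of the account-level classification described above: Logistic Regression and a Random Forest over simple profile features. The feature set and the synthetic labels are illustrative assumptions, not the paper's dataset.

```python
# Classify accounts as fake/real from tabular profile features using the
# two model families named in the paper. Data below is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
# columns: has_profile_pic, followers, following, num_posts (assumed features)
X = np.column_stack([
    rng.integers(0, 2, n),
    rng.poisson(300, n),
    rng.poisson(500, n),
    rng.poisson(80, n),
])
# toy labeling rule: accounts with no picture and few posts look fake
y = ((X[:, 0] == 0) & (X[:, 3] < 60)).astype(int)

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(type(clf).__name__, f"accuracy={score:.2f}")
```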
This document discusses the importance of data visualization and provides examples of effective data visualizations. It begins by defining data visualization and explaining why it is important for communicating analytical results and insights to diverse audiences. It then provides examples of pioneering data visualizations from Florence Nightingale and William Playfair. Finally, it showcases various modern data visualization tools and interactive examples using datasets on marriage trends, time use, music popularity, air quality, and sports analytics.
The document describes a proposed system for authenticating and summarizing news articles. The system prioritizes authenticating the news over summarization. If the news is determined to be at least 55% authentic based on analysis of keywords and content from authorized news sites, then it will generate a summary by breaking the news down into shorter, meaningful snippets. The system uses machine learning and artificial intelligence algorithms like Syntaxnet to extract phrases and analyze the news. It is implemented as a user-friendly web application using technologies like Python, Django, HTML, CSS and Bootstrap.
CATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERINGijaia
Unsupervised machine learning techniques such as clustering are gaining wide use with the recent growth of social communication platforms like Twitter and Facebook. Clustering enables the discovery of patterns in these unstructured datasets. We collected tweets matching hashtags linked to COVID-19 from a Kaggle dataset. We compared the performance of nine clustering algorithms on this dataset and evaluated their generalizability using a supervised learning model. Finally, using a selected unsupervised learning algorithm, we categorized the clusters. The top five categories are Safety, Crime, Products, Countries, and Health. This can prove helpful for organizations working with large amounts of Twitter data that need to quickly find key points in the data before proceeding to further classification.
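A minimal sketch of one of the clustering setups the study compares (k-means over TF-IDF vectors), with top terms printed per cluster; the sample tweets are placeholders for the Kaggle dataset.

```python
# Cluster tweets with k-means over TF-IDF features and inspect the top
# terms per cluster to assign category labels. Sample tweets are toy data.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "stay home stay safe wash your hands",
    "new covid cases reported in italy today",
    "masks and sanitizer sold out at every store",
    "lockdown extended as health officials warn of spread",
    "grocery stores limit purchases of sanitizer and masks",
    "travel restrictions announced for several countries",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(tweets)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for c in range(3):
    top = km.cluster_centers_[c].argsort()[::-1][:3]
    print(f"cluster {c}:", [terms[i] for i in top])
```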
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGijnlc
Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake news in online articles through the use of semantic features and various machine learning techniques. In this research, we investigated recurrent neural networks versus the Naive Bayes classifier and random forest classifiers using five groups of linguistic features. Evaluated on the real-or-fake dataset from kaggle.com, the best performing model achieved an accuracy of 95.66% using bigram features with the random forest classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams shows that word pairs, as opposed to single words or phrases, best indicate the authenticity of news.
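A minimal sketch of the best-performing configuration described above, bigram features feeding a random forest; the four-document corpus is a placeholder for the kaggle.com dataset.

```python
# Bigram TF-IDF features + random forest, the configuration the paper
# found strongest. The corpus and labels below are toy placeholders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "officials confirm the budget was approved by parliament",
    "you will not believe what this one secret trick reveals",
    "the committee released its annual audit on schedule",
    "shocking truth they are hiding from you revealed at last",
]
labels = [1, 0, 1, 0]  # 1 = real, 0 = fake

pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(2, 2)),   # bigrams only, as in the paper
    RandomForestClassifier(random_state=0),
)
pipe.fit(docs, labels)
print(pipe.predict(["shocking secret the committee is hiding from you"]))
```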
This document summarizes research on detecting fake news using text analysis techniques. It discusses how social media consumption of news has increased and the challenges of identifying trustworthy sources. Various types of fake news are described based on visual/text content or the targeted audience. Methods for detection include clustering similar news reports and using predictive models to analyze linguistic features like punctuation, semantic levels, and readability. The proposed approach uses text summarization, web crawling to find related articles, latent semantic analysis to compare articles, and fuzzy logic to determine the authenticity score of a target news article. The goal is to develop a system to help users identify fake news on social media platforms.
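A minimal sketch of the article-comparison step: latent semantic analysis over a target article and related articles, with mean cosine similarity standing in for the fuzzy-logic authenticity score. The corpus and the 0.5 threshold are illustrative assumptions.

```python
# Compare a target article against related articles in a latent semantic
# space; a higher mean similarity suggests corroboration. Toy corpus.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

target = "city council approves funding for the new hospital wing"
related = [
    "council votes to fund hospital expansion project",
    "new hospital wing gets city funding approval",
    "local bakery wins regional pastry competition",
]
X = TfidfVectorizer().fit_transform([target] + related)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

sims = cosine_similarity(Z[:1], Z[1:])[0]
score = sims.mean()   # stand-in for the fuzzy authenticity score
print(f"authenticity score: {score:.2f} ->", "real" if score > 0.5 else "suspect")
```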
1) Values in Computational Models Revalued (Computational mode.docx)monicafrancis71118
1) Values in Computational Models Revalued
Computational models are mathematical representations that are designed to study the behaviour of complex systems. Systems under study are usually nonlinear and complex to the extent that conventional analytics cannot be used. Scholars have tried to establish the role played by trust and values in the use of such models in the analysis of public administration.
Public decision-making is itself a complex endeavour that involves the input of multiple stakeholders. Usually, there are a lot of conflicting interests that influence the final outcome of such decision-making processes (Klabunde & Willekens, 2016). In a computational model, a number of factors equally influence the outcome of the process. One of them is the number of actors involved: the presence of more actors normally implies increased mistrust. Another factor is the amount of trust that already exists among the decision makers. In cases where the group is homogeneous, there is likely to be more trust and thus less concern about the number of actors involved.
Given the importance of these two factors, the designer of any such model bears the largest burden in assuring the value of the model. He or she can choose to implement agency by humans or by technology depending on the number of actors and the trust among them. Also, the model designer determines the margins of error from each scenario while modelling (Gershman, Markman & Otto, 2014). Since in conventional decision-making processes different actors have different roles, the model designer may decide to accord different levels of authority to different actors. Nevertheless, they must ensure that such a decision does not affect the trust of the system. Overall, what values are sought from a computational model in a public decision-making context?
References
Gershman, S. J., Markman, A. B., & Otto, A. R. (2014). Retrospective revaluation in sequential decision making: A tale of two systems. Journal of Experimental Psychology: General, 143(1), 182-194.
Klabunde, A., & Willekens, F. (2016). Decision-making in agent-based models of migration: State of the art and challenges. European Journal of Population, 32(1), 73-97.
2) Active and Passive Crowdsourcing in Government
The authors of the article “Active and Passive Crowdsourcing in Government” discuss the application of the idea of crowdsourcing by public agencies. It leverages Web-based platforms to gather information from a large number of individuals for solving intricate problems (Loukis and Charalabidis 284). The scholars revealed that the concept of crowdsourcing was first adopted by organizations in the private sector, especially creative and design firms. Later on, state agencies began to determine how to leverage crowdsourcing to obtain “collective wisdom” from citizens aimed at informing the formulation and implementation of public policies.
Active and passive approaches to crowdsourcing are similar as they are both.
1. The document presents a study on automated detection of fake news from social media.
2. It proposes using account features like number of hashtags, URLs, mentions, followers etc. and a random forest classifier to detect fake news with 99.8% precision.
3. The random forest approach is compared to other classifiers like decision trees, Naive Bayes and neural networks, which achieve lower precision of 98.4%, 92.6% and 62.7% respectively.
Clustering analysis on news from health OSINT data regarding CORONAVIRUS-COVI...ALexandruDaia1
Our primary goal was to detect clusters via gensim libraries in news data consisting of information regarding health and threats. We identified clusters for the periods corresponding to: i) January 2006 until the end of 2019, as December 2019 is considered the first month in which information about CORONAVIRUS COVID-19 was made public; ii) between the 1st of January 2019 and the 31st of December 2019; and iii) between the 31st of December 2019 and the 14th of April 2020. We conducted experiments using natural language on open source intelligence data offered generously by brica.de, a provider specialized in Business Risk Intelligence & Cyberthreat Awareness.
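A minimal sketch of topic clustering with gensim, as the study does, here via LDA; the three headlines are placeholders for the brica.de OSINT feed.

```python
# Cluster news into latent topics with gensim LDA. The headlines below
# are illustrative placeholders for the OSINT news corpus.
from gensim import corpora, models

docs = [
    "new virus cases reported hospital outbreak",
    "markets fall amid virus outbreak fears",
    "cyber attack targets hospital records",
]
tokenized = [d.split() for d in docs]
dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(t) for t in tokenized]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```

Running the same fit separately on documents from each time period, as the study does, lets the resulting topic clusters be compared across periods.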
Identifying Structures in Social Conversations in NSCLC Patients through the ...IJERA Editor
The exploration of social conversations for addressing patients' needs is an important analytical task to which many scholarly publications are contributing, filling the knowledge gap in this area. The main difficulty remains the inability to turn such contributions into pragmatic processes the pharmaceutical industry can leverage in order to generate insight from social media data, which can be considered one of the most challenging sources of information available today due to its sheer volume and noise. This study is based on the work by Scott Spangler and Jeffrey Kreulen and applies it to identify structure in social media through the extraction of a topical taxonomy able to capture the latent knowledge in social conversations on health-related sites. The mechanism for automatically identifying and generating a taxonomy from social conversations is developed and pressure tested using public data from media sites focused on the needs of cancer patients and their families. Moreover, a novel method for generating category labels and determining an optimal number of categories is presented, which extends Spangler and Kreulen's research in a meaningful way. We assume the reader is familiar with taxonomies, what they are, and how they are used.
Research paper impact evaluation for collaborative information supply chainKenny Meesters
Emerging technologies provide opportunities for the humanitarian responders' community to enhance the effectiveness of their response to crisis situations. Part of this development can be attributed to a new type of information supply chain, driven by collaboration with digital, online communities, enabling organizations to make better-informed decisions. However, how exactly and to what extent this collaboration impacts the decision-making process is unknown. To improve these new information exchanges and the corresponding systems, an evaluation method is needed to assess the performance of these processes and systems. This paper builds on existing evaluation methods for information systems and design principles to propose such an impact evaluation framework. The proposed framework has been applied in a case study to demonstrate its potential to identify areas for further improvement in the (online) collaboration between information suppliers and users.
IRJET- Fake News Detection and Rumour Source IdentificationIRJET Journal
This document discusses methods for detecting fake news and identifying the source of rumors on social media. It proposes using Bayesian classification to classify information into real or fake categories based on the outputs. If the combined outputs from the classes do not match, then the information is considered fake. It also discusses using a reverse dissemination strategy to identify a group of suspects for the original rumor source, rather than examining each individual. This addresses issues with identifying sources. The method aims to identify the source node based on which nodes have accepted the rumor. Machine learning and natural language processing techniques are used to detect fake news from article content.
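A minimal sketch of narrowing down source suspects from the set of nodes that accepted the rumor, using a Jordan-center heuristic (the infected node closest, in the worst case, to all other infected nodes). This is a common stand-in for the reverse-dissemination idea; the paper's exact strategy may differ.

```python
# Rank rumor-source suspects: among nodes that accepted the rumor, prefer
# those with the smallest worst-case distance to all other infected nodes.
# Graph and infected set are illustrative.
import networkx as nx

G = nx.karate_club_graph()
infected = [0, 1, 2, 3, 7, 13]   # nodes observed to have accepted the rumor

def eccentricity_to_infected(g, v, infected):
    lengths = nx.single_source_shortest_path_length(g, v)
    return max(lengths[u] for u in infected)

suspects = sorted(infected,
                  key=lambda v: eccentricity_to_infected(G, v, infected))
print("most likely sources first:", suspects[:3])
```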
Dealing with Information Overload When Using Social Media for Emergency Manag...Mirjam-Mona
Presentation of Starr Roxanne Hiltz and Linda P. Plotnick on the topic "Dealing with Information Overload When Using Social Media for Emergency Management: Emerging Solutions" at ISCRAM2013
Expelling Information of Events from Critical Public Space using Social Senso...ijtsrd
Open foundation frameworks give a significant number of the administrations that are basic to the wellbeing, working, and security of society. A considerable lot of these frameworks, in any case, need persistent physical sensor checking to have the option to recognize disappointment occasions or harm that has struck these frameworks. We propose the utilization of social sensor enormous information to recognize these occasions. We center around two primary framework frameworks, transportation and vitality, and use information from Twitter streams to identify harm to spans, expressways, gas lines, and power foundation. Through a three step filtering approach and assignment to geographical cells, we are able to filter out noise in this data to produce relevant geo located tweets identifying failure events. Applying the strategy to real world data, we demonstrate the ability of our approach to utilize social sensor big data to detect damage and failure events in these critical public infrastructures. Samatha P. K | Dr. Mohamed Rafi "Expelling Information of Events from Critical Public Space using Social Sensor Big Data" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-5 , August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd25350.pdfPaper URL: https://www.ijtsrd.com/engineering/computer-engineering/25350/expelling-information-of-events-from-critical-public-space-using-social-sensor-big-data/samatha-p-k
Massively Parallel Simulations of Spread of Infectious Diseases over Realisti...Subhajit Sahu
Highlighted notes while studying for project work:
Massively Parallel Simulations of Spread of Infectious Diseases over Realistic Social Networks
Abhinav Bhatele†
Jae-Seung Yeom†
Nikhil Jain†
Chris J. Kuhlman∗
Yarden Livnat‡
Keith R. Bisset∗
Laxmikant V. Kale§
Madhav V. Marathe∗
†Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California 94551 USA
∗Biocomplexity Institute & Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061 USA
‡Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, Utah 84112 USA
§Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801 USA
E-mail: †{bhatele, yeom2, nikhil}@llnl.gov, ∗{ckuhlman, kbisset, mmarathe}@vbi.vt.edu
Abstract—Controlling the spread of infectious diseases in large populations is an important societal challenge. Mathematically, the problem is best captured as a certain class of reactiondiffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been successfully used in the recent past to study such contagion processes. We describe EpiSimdemics, a highly scalable, parallel code written in Charm++ that uses agent-based modeling to simulate disease spreads over large, realistic, co-evolving interaction networks. We present a new parallel implementation of EpiSimdemics that achieves unprecedented strong and weak scaling on different architectures — Blue Waters, Cori and Mira. EpiSimdemics achieves five times greater speedup than the second fastest parallel code in this field. This unprecedented scaling is an important step to support the long term vision of realtime epidemic science. Finally, we demonstrate the capabilities of EpiSimdemics by simulating the spread of influenza over a realistic synthetic social contact network spanning the continental United States (∼280 million nodes and 5.8 billion social contacts).
Semantic Wide and Deep Learning for Detecting Crisis-Information Categories o...Gregoire Burel
When crises hit, many flog to social media to share or consume information related to the event. Social media posts during crises tend to provide valuable reports on affected people, donation offers, help requests, advice provision, etc. Automatically identifying the category of information (e.g., reports on affected individuals, donations and volunteers) contained in these posts is vital for their efficient handling and consumption by effected communities and concerned organisations. In this paper, we introduce Sem-CNN; a wide and deep Convolutional Neural Network (CNN) model designed for identifying the category of information contained in crisis-related social media content. Unlike previous models, which mainly rely on the lexical representations of words in the text, the proposed model integrates an additional layer of semantics that represents the named entities in the text, into a wide and deep CNN network. Results show that the Sem-CNN model consistently outperforms the baselines which consist of statistical and non-semantic deep learning models.
Paper access: http://oro.open.ac.uk/51726/
IRJET- Sentiment Analysis using Machine LearningIRJET Journal
This document discusses sentiment analysis of social media posts using machine learning. The authors aim to classify social media posts as having either a political or non-political sentiment. A dictionary of keywords and their sentiments is created to analyze posts. Users can make posts that are then classified, and an admin can hide or delete posts containing harmful keywords. Graphs are generated to analyze the classification of posts as political versus non-political and by sentiment. The accuracy of classification depends on the training data and dictionary.
TERRORIST WATCHER: AN INTERACTIVE WEBBASED VISUAL ANALYTICAL TOOL OF TERRORIS...IJDKP
Terrorism is a phenomenon that rose to its peak nowadays. Counter terrorism analysts work with a large
set of documents related to different terrorist groups and attack types to extract useful information about
these groups’ motive and tactics. It is evident that terrorism became a global threat and can exist
anywhere. In order to face this phenomenon, there is a need to understand the characteristics of the
terrorists in order to find if there are general characteristics among all of them or not. However, as the
number of the collected documents increase, deducing results and making decisions became more and
more difficult to the analysts. The use of information visualization tools can help the analysts to visualize
the terrorist characteristics. However, most of the current information visualization tools focus only on
representing and analyzing the terrorist organizations, with little emphasis on terrorist’s personal
characteristics. Therefore, the current paper presents a visualization tool that can be used to analyze the
terrorist’s personal characteristics in order to understand the production life cycle of a terrorist and how
to face it.
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...IRJET Journal
This document discusses controlling the spread of fake or misleading information on social media. It proposes a system to analyze information diffusion on social networks, identify diffused data, and control the spread of fake diffused data. The system would extract data from social media, perform sentiment analysis to determine the veracity of information, and discard fake or untrustworthy information from the database to prevent further propagation. A variety of machine learning techniques could be used for the sentiment analysis, including naive Bayes classification, linear regression, and gradient boosted trees. The goal is to curb the spread of misinformation while still allowing the diffusion of real or truthful information.
On Semantics and Deep Learning for Event Detection in Crisis SituationsCOMRADES project
In this paper, we introduce Dual-CNN, a semantically-enhanced deep learning model to target the problem of event detection in crisis situations from
social media data. A layer of semantics is added to a traditional Convolutional Neural Network (CNN) model to capture the contextual information that is generally scarce in short, ill-formed social media messages. Our results show that
our methods are able to successfully identify the existence of events, and event types (hurricane, floods, etc.) accurately (> 79% F-measure), but the performance of the model significantly drops (61% F-measure) when identifying fine-grained event-related information (affected individuals, damaged infrastructures, etc.).
These results are competitive with more traditional Machine Learning models, such as SVM.
http://oro.open.ac.uk/49639/1/event_detection.pdf
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNINGijcsit
With the advent of the Internet and social media, while hundreds of people have benefitted from the vast sources of information available, there has been an enormous increase in the rise of cyber-crimes, particularly targeted towards women. According to a 2019 report in the [4] Economics Times, India has witnessed a 457% rise in cybercrime in the five year span between 2011 and 2016. Most speculate that this is due to impact of social media such as Facebook, Instagram and Twitter on our daily lives. While these definitely help in creating a sound social network, creation of user accounts in these sites usually needs just an email-id. A real life person can create multiple fake IDs and hence impostors can easily be made. Unlike the real world scenario where multiple rules and regulations are imposed to identify oneself in a unique manner (for example while issuing one’s passport or driver’s license), in the virtual world of social media, admission does not require any such checks. In this paper, we study the different accounts of Instagram, in particular and try to assess an account as fake or real using Machine Learning techniques namely Logistic Regression and Random Forest Algorithm.
This document discusses the importance of data visualization and provides examples of effective data visualizations. It begins by defining data visualization and explaining why it is important for communicating analytical results and insights to diverse audiences. It then provides examples of pioneering data visualizations from Florence Nightingale and William Playford. Finally, it showcases various modern data visualization tools and interactive examples using datasets on marriage trends, time use, music popularity, air quality, and sports analytics.
The document describes a proposed system for authenticating and summarizing news articles. The system prioritizes authenticating the news over summarization. If the news is determined to be at least 55% authentic based on analysis of keywords and content from authorized news sites, then it will generate a summary by breaking the news down into shorter, meaningful snippets. The system uses machine learning and artificial intelligence algorithms like Syntaxnet to extract phrases and analyze the news. It is implemented as a user-friendly web application using technologies like Python, Django, HTML, CSS and Bootstrap.
CATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERINGijaia
Unsupervised machine learning techniques such as clustering are widely gaining use with the recent increase in social communication platforms like Twitter and Facebook. Clustering enables the finding of patterns in these unstructured datasets. We collected tweets matching hashtags linked to COVID-19 from a Kaggle dataset. We compared the performance of nine clustering algorithms using this dataset. We evaluated the generalizability of these algorithms using a supervised learning model. Finally, using a selected unsupervised learning algorithm we categorized the clusters. The top five categories are Safety,
Crime, Products, Countries and Health. This can prove helpful for bodies using large amount of Twitter data needing to quickly find key points in the data before going into further classification.
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGijnlc
Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake news in online articles through the use of semantic features and various machine learning techniques. In this research, we investigated recurrent neural networks vs. the naive bayes classifier and random forest classifiers using five groups of linguistic features. Evaluated with real or fake dataset from kaggle.com, the best performing model achieved an accuracy of 95.66% using bigram features with the random forest classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams show that word pairs as opposed to single words or phrases best indicate the authenticity of news.
This document summarizes research on detecting fake news using text analysis techniques. It discusses how social media consumption of news has increased and the challenges of identifying trustworthy sources. Various types of fake news are described based on visual/text content or the targeted audience. Methods for detection include clustering similar news reports and using predictive models to analyze linguistic features like punctuation, semantic levels, and readability. The proposed approach uses text summarization, web crawling to find related articles, latent semantic analysis to compare articles, and fuzzy logic to determine the authenticity score of a target news article. The goal is to develop a system to help users identify fake news on social media platforms.
1) Values in Computational Models RevaluedComputational mode.docxmonicafrancis71118
1) Values in Computational Models Revalued
Computational models are mathematical representations that are designed to study the behaviour of complex systems. Systems under study are usually nonlinear and complex to the extent that conventional analytics cannot be used. Scholars have tried to establish the role played by trust and values in the use of such models in the analysis of public administration.
Public decision-making is itself a complex endeavour that involves the input of multiple stakeholders. Usually, there are a lot of conflicting interests that influence the final outcome of such decision-making processes (Klabunde & Willekens, 2016). In a computational model, a number of factors equally influence the outcome of the process. One of them is the number of actors involved –the presence of more actors normally implies increased mistrust. Another factor is the amount of trust that already exists among the decision makers. In cases where the group is homogenous, there is likely to be more trust and thus, less concern about the number of actors involved.
Given the importance of these two factors, the designer of any such model bears the largest burden in assuring the value of the model. He or she can choose to implement agency by humans or by technology depending on the number of actors and trust among them. Also, model designer determines the margins of error from each scenario while modelling (Gershman, Markman & Otto, 2014). Since in conventional decision-making processes different actors have different roles, the model designer may decide to accord different levels of authority to different actors. Nevertheless, they must ensure that such a decision does not affect the trust of the system. Overall, what values are sought from a computational model in a public decision-making context?
References
Gershman, S. J., Markman, A. B., & Otto, A. R. (2014). Retrospective revaluation in sequential decision making: A tale of two systems.
Journal of Experimental Psychology: General
,
143
(1), 182-194.
Klabunde, A., & Willekens, F. (2016). Decision-making in agent-based models of migration: state of the art and challenges.
European Journal of Population
,
32
(1), 73-97.
2) Active and Passive Crowdsourcing in Government
The authors of the article “Active and Passive Crowdsourcing in Government” discuss the application of the idea of crowdsourcing by public agencies. It leverages Web-based platforms to gather information from a large number of individuals for solving intricate problems (Loukis and Charalabidis 284). The scholars revealed that the concept of crowdsourcing was first adopted by organizations in the private sector, especially creative and design firms. Later on, state agencies began to determine how to leverage crowdsourcing to obtain “collective wisdom” from citizens aimed at informing the formulation and implementation of public policies.
Active and passive approaches to crowdsourcing are similar as they are both.
1. The document presents a study on automated detection of fake news from social media.
2. It proposes using account features like number of hashtags, URLs, mentions, followers etc. and a random forest classifier to detect fake news with 99.8% precision.
3. The random forest approach is compared to other classifiers like decision trees, Naive Bayes and neural networks, which achieve lower precision of 98.4%, 92.6% and 62.7% respectively.
Clustering analysis on news from health OSINT data regarding CORONAVIRUS-COVI...ALexandruDaia1
Our primarly goal was to detect clusters via gensim libraries in news data consisting ofinformation regarding health and threats. We identified clusters for the periodscorresponding: i) Jannuary 2006 until the end of 2019, as December 2019 is considered thefirst month in which information about CORONVIRUS COVID-19 was made public; ii)between the 1st of Jannuary 2019 and 31st December 2019; and iii) between the 31st ofDecember 2019 and the 14th of April 2020. We conducted experiments using naturallanguage on open source intelligence data offered generously by brica.de, a providerspecialized in Business Risk Intelligence & Cyberthreat Awareness.
Identifying Structures in Social Conversations in NSCLC Patients through the ...IJERA Editor
The exploration of social conversations for addressing patient’s needs is an important analytical task in which
many scholarly publications are contributing to fill the knowledge gap in this area. The main difficulty remains
the inability to turn such contributions into pragmatic processes the pharmaceutical industry can leverage in
order to generate insight from social media data, which can be considered as one of the most challenging source
of information available today due to its sheer volume and noise. This study is based on the work by Scott
Spangler and Jeffrey Kreulen and applies it to identify structure in social media through the extraction of a
topical taxonomy able to capture the latent knowledge in social conversations in health-related sites. The
mechanism for automatically identifying and generating a taxonomy from social conversations is developed and
pressured tested using public data from media sites focused on the needs of cancer patients and their families.
Moreover, a novel method for generating the category’s label and the determination of an optimal number of
categories is presented which extends Scott and Jeffrey’s research in a meaningful way. We assume the reader is
familiar with taxonomies, what they are and how they are used.
Identifying Structures in Social Conversations in NSCLC Patients through the ...IJERA Editor
The exploration of social conversations for addressing patient’s needs is an important analytical task in which
many scholarly publications are contributing to fill the knowledge gap in this area. The main difficulty remains
the inability to turn such contributions into pragmatic processes the pharmaceutical industry can leverage in
order to generate insight from social media data, which can be considered as one of the most challenging source
of information available today due to its sheer volume and noise. This study is based on the work by Scott
Spangler and Jeffrey Kreulen and applies it to identify structure in social media through the extraction of a
topical taxonomy able to capture the latent knowledge in social conversations in health-related sites. The
mechanism for automatically identifying and generating a taxonomy from social conversations is developed and
pressured tested using public data from media sites focused on the needs of cancer patients and their families.
Moreover, a novel method for generating the category’s label and the determination of an optimal number of
categories is presented which extends Scott and Jeffrey’s research in a meaningful way. We assume the reader is
familiar with taxonomies, what they are and how they are used.
Research paper impact evaluation for collaborative information supply chainKenny Meesters
Emerging technologies provide opportunities for the humanitarian responders’ community to enhance the
effectiveness of their response to crisissituations. A part of this development can be contributed to a new type of
information supply chains -driven by collaboration with digital, online communities- enabling organizations to
make better informed decisions. However, how exactly and to what extend this collaboration impacts the
decision making process is unknown. To improve these new information exchanges and the corresponding
systems, an evaluation method is needed to assess the performance of these processes and systems. This paper
builds on existing evaluation methods for information systems and design principles to propose such an impact
evaluation framework. The proposed framework has been applied in a case study to demonstrate its potential to
identify areas for further improvement in the (online) collaboration between information suppliers and users.
IRJET- Fake News Detection and Rumour Source IdentificationIRJET Journal
This document discusses methods for detecting fake news and identifying the sources of rumors on social media. It proposes using Bayesian classification to classify information as real or fake based on the classifier outputs; if the combined outputs from the classes do not match, the information is considered fake. It also discusses using a reverse dissemination strategy to identify a small group of suspects for the original rumor source, rather than examining every individual node, which addresses scalability issues in source identification. The method aims to identify the source node based on which nodes have accepted the rumor. Machine learning and natural language processing techniques are used to detect fake news from article content.
Census microdata from different countries and time periods is currently difficult to access, combine, and analyze due to differences in format and granularity. The authors propose applying Linked Open Data principles and semantic web technologies to publish census microdata in order to address these issues. They present a process for converting census microdata to Linked Open Data and apply it to two case studies: the 2001 Spanish census and the Integrated Public Use Microdata Series International framework. The results show census microdata can be published as Linked Open Data while preserving original structures, and this approach enables harmonization and integration across data sources.
FirstReview these assigned readings; they will serve as your .docxclydes2
First:
Review these assigned readings; they will serve as your scientific sources of accurate information:
http://www.closerlookatstemcells.org/Top_10_Stem_Cell_Treatment_Facts.html
http://www.closerlookatstemcells.org/How_Science_Becomes_Medicine.html
http://www.newvision.co.ug/news/649266-fighting-ageing-using-stem-cell-therapy.html
http://www.nature.com/news/stem-cells-in-texas-cowboy-culture-1.12404
http://www.cbc.ca/radio/whitecoat/blog/stem-cell-hype-and-risk-1.3654515
http://stm.sciencemag.org/content/7/278/278ps4.full
Next:
Use a standard Google search for this phrase: “stem cell therapy.” Do not go to Google Scholar. Select one of the websites, blogs, or other locations that offer stem cell therapies.
Save the link for your selected site.
Read the materials provided on your selected site and find out who the authors and sponsors of the site are by going to their “home” or “about us” pages.
Finally, submit your responses to the following in an essay of 500-750 words (2-3 pages of text—use a separate page for a title and for your references):
You are going to prepare a critique of the site you located and compare it to the scientific information available on this therapy.
Give the full title of the website, web blog, or other site that you selected, along with the link.
Describe the therapy that is being offered and what conditions it is designed to treat.
Who are the authors and sponsors of the site you selected?
Compare the claims about the therapy offered to what is said in the assigned readings about this type of therapy. You may have to use our library, as well, to determine what scientists and researchers have to say about the use of stem cells to treat this condition.
Would you say that the therapy you found is a well-established, proven technique for humans, or more of an experimental, unproven approach?
What about the type of language discussed in the Goldman article? Is the therapy you found using sensationalist claims and terminology that are not supported by the scientific research?
Would you recommend that a patient with this condition go ahead and participate in this treatment? Why or why not?
Literature review on how Information Technology has impacted governing bodies’ ability to align public policy with stakeholder needs
Nowadays, governing bodies in both the public and private sectors deal with complex systems in their day-to-day operations. These systems are made up of different components with varying interactions and interrelationships with and/or among each other, making their management difficult and challenging. Indeed, Ruiz, Zabaleta & Elorza (2016) highlighted that public policymakers have to deal with complex systems involving heterogeneous agents that act in non-linear ways, making their management difficult. Neziraj & Shaqiri (2018) also stated that policymakers face problems that are complex and non-uniform due to many uncertainties and risk situations.
Collusion-resistant multiparty data sharing in social networksIJECEIAES
The number of users on online social networks (OSNs) has grown tremendously over the past few years, with sites like Facebook amassing over a billion users. With the popularity of OSNs, the increase in privacy risk from the large volume of sensitive and private data is inevitable. While there are many features for access control for an individual user, most OSNs still need concrete mechanisms to preserve the privacy of data shared between multiple users. The proposed method uses metrics such as identity leakage (IL) and strength of interaction (SoI) to fine-tune the scenarios that use privacy risk and sharing loss to identify and resolve conflicts. In addition to conflict resolution, bot detection is also done to mitigate collusion attacks. The final decision to share the data item is then ascertained based on whether it passes the threshold condition for the above metrics.
This document provides a review of data science initiatives for social good. It discusses how data-driven approaches have been used to address challenges in healthcare accessibility, poverty alleviation, and environmental sustainability through initiatives leveraging machine learning, predictive modeling, and other techniques. The review highlights both the positive impacts and potential challenges of applying data science to effect social change.
Assessment of the main features of the model of dissemination of information ...IJECEIAES
Social networks provide a fairly wide range of data that allows one way or another to evaluate the effect of the dissemination of information. This article presents the results of a study that describes methods for determining the key parameters of the model needed to analyze and predict the dissemination of information in social networks. An approach based on the analysis of statistical data on user behavior in social networks is proposed. The process of evaluating the main features of the model is described, including the mathematical methods used for data analysis and information dissemination modeling. The study aims to understand the processes of information dissemination in social networks and develop recommendations for the effective use of social networks as a communication and brand promotion tool, as well as to consider the analytical properties of the classical susceptible-infected-removed (SIR) model and evaluate its applicability to the problem of information dissemination. The results of the study can be used to create algorithms and techniques that will effectively manage the process of information dissemination in social networks.
Algorithmic Transparency via Quantitative Input InfluenceFrenchWeb.fr
The document proposes a framework called Quantitative Input Influence (QII) to measure the influence of inputs on the outputs of algorithmic decision-making systems. QII defines causal measures that account for correlations between inputs to identify the true influence of each input. It supports answering transparency queries about individual and group-level decisions. The paper empirically tests QII on real-world machine learning models and datasets, finding that QII provides useful explanations and can be computed efficiently while preserving privacy.
This document summarizes a student's dissertation on classifying and detecting disaster tweets based on machine learning. The student collected tweets related to disasters like floods and earthquakes to build a dataset. Various machine learning classification algorithms were tested on the dataset, with logistic regression achieving the highest accuracy of 78%. By combining results from all algorithms, an overall accuracy of 79% was achieved in identifying actual disaster tweets. The research aims to help disaster management by providing real-time social media analysis during emergencies.
Protecting elections from social media manipulationscience.sjanekahananbw
science.sciencemag.org/content/365/6456/858
Policy Forum: Science and Democracy
Sinan Aral and Dean Eckles
To what extent are democratic elections vulnerable to social media manipulation? The fractured state of research and evidence on this most important question facing democracy is reflected in the range of disagreement among experts. Facebook chief executive officer Mark Zuckerberg has repeatedly called on the U.S. government to regulate election manipulation through social media. But we cannot manage what we do not measure. Without an organized research agenda that informs policy, democracies will remain vulnerable to foreign and domestic attacks. Thankfully, social media's effects are, in our view, eminently measurable. Here, we advocate a research agenda for measuring social media manipulation of elections, highlight underutilized approaches to rigorous causal inference, and discuss political, legal, and ethical implications of undertaking such analysis. Consideration of this research agenda illuminates the need to overcome important trade-offs for public and corporate policy (for example, between election integrity and privacy). We have promising research tools, but they have not been applied to election manipulation, mainly because of a lack of access to data and lack of cooperation from the platforms (driven in part by public policy and political constraints).
Two recent studies commissioned by the U.S. Senate Intelligence Committee detail Russian misinformation campaigns targeting hundreds of millions of U.S. citizens during the 2016 presidential election. The reports highlight, but do not answer, whether social media manipulation may have influenced the outcome. Some experts argue that Russia-sponsored content on social media likely did not decide the election because Russian-linked spending and exposure to fake news (1, 2) were small-scale. Others contend that a combination of Russian trolls and hacking likely tipped the election for Donald Trump (3). Similar disagreements exist about the UK referendum on leaving the European Union and recent elections in Brazil, Sweden, and India.
Such disagreement is understandable, given the distinctive challenges of studying social media manipulation of elections. For example, unlike the majority of linear television advertising, social media can be personally targeted; assessing its reach requires analysis of paid and unpaid media, ranking algorithms, and advertising auctions; and causal analysis is necessary to understand how social media changes opinions and voting.
Luckily, much of the necessary methodology has already been developed. A growing body of literature illuminates how social media influences behavior. Analysis of misinformation on Twitter and Facebook (4, 5), and randomized and natural experime ...
Categorizing 2019-n-CoV Twitter Hashtag Data by Clusteringgerogepatton
Unsupervised machine learning techniques such as clustering are widely gaining use with the recent increase in social communication platforms like Twitter and Facebook. Clustering enables the finding of patterns in these unstructured datasets. We collected tweets matching hashtags linked to COVID-19 from a Kaggle dataset. We compared the performance of nine clustering algorithms using this dataset. We evaluated the generalizability of these algorithms using a supervised learning model. Finally, using a selected unsupervised learning algorithm we categorized the clusters. The top five categories are Safety, Crime, Products, Countries and Health. This can prove helpful for bodies using large amount of Twitter data needing to quickly find key points in the data before going into further classification.
Helping Crisis Responders Find the Informative Needle in the Tweet HaystackCOMRADES project
Leon Derczynski - University of Sheffield, Kenny Meesters - TU Delft, Kalina Bontcheva - University of Sheffield, Diana Maynard - University of Sheffield
WiPe Paper – Social Media Studies
Proceedings of the 15th ISCRAM Conference – Rochester, NY, USA May 2018
Epidemiological Modeling of News and Rumors on TwitterParang Saraf
Abstract: Characterizing information diffusion on social platforms like Twitter enables us to understand the properties of the underlying media and model communication patterns. As Twitter gains in popularity, it has also become a venue to broadcast rumors and misinformation. We use epidemiological models to characterize information cascades on Twitter resulting from both news and rumors. Specifically, we use the SEIZ enhanced epidemic model that explicitly recognizes skeptics to characterize eight events across the world spanning a range of event types. We demonstrate that our approach is accurate at capturing diffusion in these events. Our approach can be fruitfully combined with other strategies that use content modeling and graph-theoretic features to detect (and possibly disrupt) rumors.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Big Data Risks and Rewards (good length and at least 3-4 references .docxtangyechloe
Big Data Risks and Rewards (good length and at least 3-4 references everything in APA 7 format)
When you wake in the morning, you may reach for your cell phone to reply to a few text or email messages that you missed overnight. On your drive to work, you may stop to refuel your car. Upon your arrival, you might swipe a key card at the door to gain entrance to the facility. And before finally reaching your workstation, you may stop by the cafeteria to purchase a coffee.
From the moment you wake, you are in fact a data-generation machine. Each use of your phone, every transaction you make using a debit or credit card, even your entrance to your place of work, creates data. This raises the question: How much data do you generate each day? Many studies have been conducted on this, and the numbers are staggering: estimates suggest that nearly 1 million bytes of data are generated every second for every person on earth.
As the volume of data increases, information professionals have looked for ways to use big data—large, complex sets of data that require specialized approaches to use effectively. Big data has the potential for significant rewards—and significant risks—to healthcare. In this Discussion, you will consider these risks and rewards.
To Prepare:
Review the Resources and reflect on the web article "Big Data Means Big Potential, Challenges for Nurse Execs."
Reflect on your own experience with complex health information access and management and consider potential challenges and risks you may have experienced or observed.
By Day 3 of Week 5: Post a description of at least one potential benefit of using big data as part of a clinical system and explain why. Then, describe at least one potential challenge or risk of using big data as part of a clinical system and explain why. Propose at least one strategy you have experienced, observed, or researched that may effectively mitigate the challenges or risks of using big data you described. Be specific and provide examples.
By Day 6 of Week 5: Respond to at least two of your colleagues, on two different days, by offering one or more additional mitigation strategies or further insight into your colleagues' assessment of big data opportunities and risks.
*Note: Throughout this program, your fellow students are referred to as colleagues.
Michea Discussion (in APA 7 format and at least 2-3 references)
With the fast-growing pace of technological advancement in the health care sector, the daily operations of an institution generate millions of data points that over time need proper channels of transmission, storage, processing, assimilation, and utilization. Following from the vast amount of data generated, some of its benefits include, but are not limited to, functioning as a pattern-discovery aid with relation to the amount of variance or similarity in .
Similar to Bayesian Model Fusion for Forecasting Civil Unrest (20)
The Email Analyzer interface helps an analyst visualize the email network and identify local groups of people who frequently exchange emails amongst themselves. This interface was developed as an entry for VAST Challenge 2014.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Slides: Safeguarding Abila through Multiple Data PerspectivesParang Saraf
Abstract: This paper introduces a system for visual analysis of news articles, emails, GPS tracking data, financial transactions and streaming micro-blog data. The system was developed in response to the 2014 VAST Grand Challenge and comprises several interfaces for mining textual, network, spatio-temporal, financial, and streaming data.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Abstract: This paper introduces a system for visualization and analysis of geo-spatial and temporal data from call center and microblog sources. The system provides a streaming client for interacting with the data in real time. The system presents the data in a partitioned format using coordinated visualizations that allow the analyst to view the data in multiple dimensions simultaneously. This allows the user to see patterns that occur in space and time. This project was developed in response to the VAST 2014 Mini-Challenge 3.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Abstract: This paper introduces a system for visual analysis of GPS tracking and financial data. The system was developed in response to VAST Mini-Challenge 2 and comprises different interfaces for mining spatio-temporal and financial data.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
This document summarizes two interfaces presented as part of a VAST challenge solution: a News Analyzer and Email Analyzer. The News Analyzer allows exploration of news articles to identify events using search, entity recognition, and visualization tools. The Email Analyzer visualizes email networks through a co-occurrence matrix, radial graph, and word cloud to examine relationships. Both interfaces were designed specifically for the challenge and utilize techniques like search engines, named entity recognition, spectral co-clustering, and D3 visualization.
The News Analyzer interface provides novel ways to visualize and analyze news articles collected from different sources. An analyst can identify trending entities such as names of people, organizations, and locations in the news articles, and can also see how the corresponding trends have evolved over time. This was developed as an entry for VAST Challenge 2014.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Slides: Forex-Foreteller: Currency Trend Modeling using News ArticlesParang Saraf
Abstract: Financial markets are quite sensitive to unanticipated news and events. Identifying the effect of news on the market is a challenging task. In this demo, we present Forex-foreteller (FF) which mines news articles and makes forecasts about the movement of foreign currency markets. The system uses a combination of language models, topic clustering, and sentiment analysis to identify relevant news articles. These articles along with the historical stock index and currency exchange values are used in a linear regression model to make forecasts. The system has an interactive visualizer designed specifically for touch-sensitive devices which depicts forecasts along with the chronological news events and financial data used for making the forecasts.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Slides: Epidemiological Modeling of News and Rumors on TwitterParang Saraf
This document presents research on modeling the spread of news and rumors on Twitter using epidemiological models. It finds that a SEIZ (Susceptible-Exposed-Infected-Skeptics) model more accurately describes the spread of information on Twitter compared to a simpler SIS (Susceptible-Infected-Susceptible) model, especially at the initial stages. The researchers analyzed several real-world events and found the SEIZ model produced lower errors between the model and actual tweet volumes. Parameters extracted from fitting the SEIZ model to data could help identify rumors versus factual news on Twitter. Limitations include not incorporating information about followers or population characteristics.
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
EMBERS AutoGSR: Automated Coding of Civil Unrest EventsParang Saraf
EMBERS AutoGSR is a novel, web-based framework that generates a comprehensive database of validated civil unrest events with minimal human effort. AutoGSR has been deployed for the past 6 months, continually processing data 24x7 in an automated fashion. The system extracts civil unrest events of the type "who protested where, when, and why?" from news articles published in over 7 languages and collected from 16 countries.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
DMAP: Data Aggregation and Presentation FrameworkParang Saraf
DMAP (Data Mining and Automation for Platforms) is an online framework that presents a wide variety of official data, news, and information about companies. It was developed to act as a one-stop platform for displaying everything official related to a company and its competitors. It aggregates data from the following sources: Bing News, companies' official blogs, RSS sources, Facebook, Twitter, Google Trends, Crunchbase, financial data sources, and Alexa Web Analytics.
The aggregated information is shown in an intuitive fashion that allows a user to perform exploratory analysis of a particular company. Specifically, the interface makes it easy to do comparative analysis of a company with respect to its competitors. The user can use filters to select a given set of companies, data sources, and date ranges. All the information is presented within the context of the framework so that the user doesn't have to go to different domains. The fetched news articles are cleaned, enriched, and presented in a manner that allows an analyst to navigate through news articles using named entities. For each news article, named entities (people, locations, organizations, and company names) are extracted. For a given search query, the interface returns matching news articles along with associated entities, which are shown using word clouds. This allows for easy discovery of connections between entities.
The framework was developed as part of a 2015 summer internship with The Center for Global Enterprise (CGE). Conceptualization, design, and development of the framework were done by me during the three-month period.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Safeguarding Abila through Multiple Data PerspectivesParang Saraf
This document describes a visual analytics system developed to analyze multiple datasets related to the disappearance of employees from an organization. The system allows analysis of unstructured news articles, email headers, GPS tracking data, financial transactions, and streaming microblog data through different interactive interfaces. It was created as a solution to the 2014 VAST Grand Challenge to help law enforcement investigate the case by combining insights from all the different data sources.
Abstract: This paper introduces a system for visualization and analysis of geo-spatial and temporal data from call center and microblog sources. The system provides a streaming client for interacting with the data in real time. The system presents the data in a partitioned format using coordinated visualizations that allow the analyst to view the data in multiple dimensions simultaneously. This allows the user to see patterns that occur in space and time. This project was developed in response to the VAST 2014 Mini-Challenge 3.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Award: This system won the "Honorable Mention for Effective Presentation" award.
Abstract: This paper introduces a system for visual analysis of GPS tracking and financial data. The system was developed in response to VAST Mini-Challenge 2 and comprises different interfaces for mining spatio-temporal and financial data.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Abstract: This paper introduces a system for visual analysis of news articles and emails. The system was developed in response to VAST Mini-Challenge 1 and comprises different interfaces for mining textual data and network data.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Forex-Foreteller: Currency Trend Modeling using News ArticlesParang Saraf
Abstract: Financial markets are quite sensitive to unanticipated news and events. Identifying the effect of news on the market is a challenging task. In this demo, we present Forex-foreteller (FF) which mines news articles and makes forecasts about the movement of foreign currency markets. The system uses a combination of language models, topic clustering, and sentiment analysis to identify relevant news articles. These articles along with the historical stock index and currency exchange values are used in a linear regression model to make forecasts. The system has an interactive visualizer designed specifically for touch-sensitive devices which depicts forecasts along with the chronological news events and financial data used for making the forecasts.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Abstract: Crime prediction is a topic of significant research across the fields of criminology, data mining, city planning, law enforcement, and political science. Crime patterns exist on a spatial level; these patterns can be grouped geographically by physical location and analyzed contextually based on the region in which crime occurs. This paper proposes a mechanism to parameterize street-level crime, localize crime hotspots, identify correlations between spatiotemporal crime patterns and social trends, and analyze the resulting data for the purposes of knowledge discovery and anomaly detection. The subject of this study is the county of Merseyside in the United Kingdom, over a span of 21 months beginning in December 2010 (monthly) through August 2012. Several types of crime are analyzed in this dataset, including burglary and antisocial behavior. Through this analysis, several interesting findings are drawn about crime in Merseyside, including: hotspots with steadily increasing crime levels, hotspots with unstable crime levels, synchronous changes in crime trends throughout Merseyside as a whole, individual months in which certain hotspots behaved anomalously, and a strong correlation between crime hotspot locations and borough/postal code locations. We believe that this type of statistical and correlative analysis of crime patterns will help law enforcement agencies predict criminal activity, allocate resources, and promote community awareness to reduce overall crime rates.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Virtual time round-robin scheduler presented by Parang Saraf (CS4204 VT)Parang Saraf
Presentation on the paper: "Virtual time round-robin O(1) proportional share scheduler"
If you need a copy of this presentation, feel free to msg me at parang[DOT]saraf[AT]gmail
Global Situational Awareness of A.I. and where it's headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the past year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on end-to-end pipeline change agility indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
BAYESIAN MODEL FUSION FOR FORECASTING CIVIL UNREST
Figure 1. System Informatics framework for Bayesian model fusion, where DQE, PP, and KV are underlying models detailed in Section 2.1.
Three important measures for evaluating the performance of predicting civil unrest events are precision, recall, and quality score. Precision is the proportion of issued alerts that correspond to actual events, while recall is the proportion of events that are predicted. Quality score is a measure of similarity between alerts and events. An implicit fourth measure is lead time, that is, the amount of advance time by which a forecast "beats the news." We will not explicitly focus on modeling/optimizing lead time here, but our framework requires that alerts precede the corresponding news events.
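To make these measures concrete, here is a minimal sketch (ours, not the authors' evaluation code) that computes precision, recall, and mean quality score from a precomputed list of alert-event matches; the match triples and the alert/event counts are assumed inputs.

def evaluate(matches, n_alerts, n_events):
    # matches: list of (alert, event, qs) triples produced by some
    # alert-event matching step (see Section 3.2)
    n_matched = len(matches)
    precision = n_matched / n_alerts if n_alerts else 0.0  # matched alerts / issued alerts
    recall = n_matched / n_events if n_events else 0.0     # matched events / actual events
    mean_qs = sum(qs for _, _, qs in matches) / n_matched if n_matched else 0.0
    return precision, recall, mean_qs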
A visual depiction of the entire SI framework can be seen in Figure 1. Each model processes source data independently to release alerts corresponding to future protests. The source data vary across models, and while there may be some overlapping datasets (e.g., Twitter is used by four models), data are aggregated and used in a different manner. A parameter η is invoked to control whether fused alerts favor precision or recall. Releasing more alerts from the various models will cause drops in precision while recall will improve; thus, the statistical research question in fusion is focused on balancing this tradeoff, specifically: should a given alert be issued or suppressed?
Given a set of alerts, the first step in the fusion framework is to cluster similar alerts from different models. Then, a classifier is developed using a latent multivariate Gaussian representation (Albert and Chib 1993; Chib and Greenberg 1998) of the model alerts in each cluster. The benefit of the latent Gaussian specification is that the association structure between the models can be modeled and accounted for. Finally, through a decision theoretical framework, a loss function that controls the precision-recall tradeoff determines whether to issue or suppress alerts.
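As a runnable toy of this decision logic (not the paper's actual model), one can cluster alerts by exact date and city agreement, replace the latent-Gaussian posterior with a crude probability that grows with the number of agreeing models, and let η simply shift the release threshold; every number and rule below is an invented stand-in.

from collections import defaultdict

def cluster_alerts(alerts):
    # step 1 (toy): alerts agreeing on forecast date and city form a cluster
    groups = defaultdict(list)
    for a in alerts:
        groups[(a["date"], a["city"])].append(a)
    return list(groups.values())

def release_or_suppress(alerts, eta=0.5, per_model_precision=0.4):
    decisions = []
    for cluster in cluster_alerts(alerts):
        # step 2 (toy stand-in for the latent Gaussian classifier): more
        # independent models agreeing -> higher chance the protest is real
        models = {a["model"] for a in cluster}
        p_event = 1 - (1 - per_model_precision) ** len(models)
        # step 3: eta near 1 favors recall (release more); near 0, precision
        decisions.append(("release" if p_event > 1 - eta else "suppress", cluster))
    return decisions

alerts = [
    {"model": "PP", "date": "2013-06-17", "city": "Sao Paulo"},
    {"model": "KV", "date": "2013-06-17", "city": "Sao Paulo"},
    {"model": "SS", "date": "2013-06-20", "city": "Recife"},
]
print(release_or_suppress(alerts))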
The complexity of the system and the uniqueness of the problem are such that novel extensions and combinations of existing methods are required. In particular, the proposed methods contain similarities to meta-analysis, boosting, and quadratic discriminant analysis (QDA); however, these methods are not directly applicable to the research problem at hand.
Statistical meta-analysis (Hedges and Olkin 1985) encompasses a broad class of techniques for combining multiple data sources. Originally, such methods were developed for integrating results from multiple studies to create more statistically powerful inferences, or to contrast discordance in differing results. Under similar experimental conditions, Bayesian meta-analyses (Gelman and Hill 2006) are in spirit straightforward methods for fusing multiple data sources.
Another related class of techniques often applied to predictive fusion is boosting (Schapire 1990), wherein an ensemble of weak classification models forms a strong unified learner/predictor. In typical cases where boosting can be applied, several models from a common model class are fit, yielding a set of predictors. By weighting and averaging each of these predictors, a single more accurate and reliable predictor is found. Such methods have been employed with great success throughout the disciplines of machine and statistical learning (Schapire and Freund 2006).
Because the constituent models we are attempting to fuse are not conceptually similar and do not share a common parameter space, straightforward boosting algorithms and/or meta-analysis techniques cannot be applied. In subsequent sections of the article, we develop a predictive fusion framework that is completely agnostic to the models used for prediction. Specifically, these models need not share any parameters or come from any common model class; only model predictions are required.
The classification component in our fusion framework also has the flavor of QDA (Hastie, Tibshirani, and Friedman 2009), which we have extended for binary variables. Rather than creating a classifier on the continuous space like QDA, we integrate out the latent variables to determine optimal rules for a binary set of model inputs; this will be further illustrated in Section 4.4.
Our contributions include an explicit way to handle problems with massive and often unstructured data (Feldman and Sanger 2007). These rich, complicated data streams can be farmed out and analyzed separately, providing a means for incorporating segmented domain knowledge or expertise. Collectively, these models extract sufficient information to capture the dynamics driving the system. Then, using our Bayesian model fusion framework, these separate models are combined in a principled manner, allowing enhanced control of the system. We are able to increase the precision and quality of our forecasts of civil unrest. This article focuses on predicting civil unrest; however, the general framework can be applied to a variety of problems. The fusion process is agnostic to the underlying models, learning the dependencies and redundancies of such models to issue predictions.
The remainder of this article is organized into five sections. Section 2 compares fusion with a single super-model and details the underlying models that are inputs to the fusion mechanism. Section 3 defines civil unrest for the purposes of this analysis and details the structure of the data, including the alert-event matching scheme. Section 4 describes the Bayesian model fusion algorithm implemented to model civil unrest. Section 5 provides an overview of the case study for predicting civil unrest in Argentina, Brazil, and Mexico. Section 6 concludes with a discussion.
2. MODULAR FUSION VERSUS SUPER-MODELS
There are two schools of thought in how to design an approach to generate alerts from massive data sources. By a "super-model" approach, we refer to an integrated algorithm for forecasting unrest that considers all possible input data sources simultaneously. One benefit of a super-model is that the joint structure between data sources can be captured, at least theoretically. For instance, Facebook and Twitter could offer mutually reinforcing signals for a given country where they both have penetration among the relevant population, but can offer complementary evidence in another country where they cater to different segments of the populace. Such dependencies can be captured in a super-model approach. The obvious drawback of a super-model is the significant computational cost involved in bundling all relevant data for processing.
In contrast, the modular fusion approach is targeted at distributed processing of heterogeneous data sources, either singly or in distinct combinations. Such an approach allows segmented expertise to be naturally captured; for example, one model could use Facebook to detect organized protests, and another could use Twitter to hunt for signals about spontaneous events (and a third model could use both). The modular fusion approach also permits easy addition (and removal) of models as new data sources are considered. Our Bayesian model fusion strategy for SI adopts this approach.
2.1 Models Considered in This Article
Adopting the modular fusion approach, we next detail how information is extracted from unstructured social media data sources to produce discrete predictions of upcoming protests. The underlying models use open source information primarily from social media as inputs, as well as a GSR (gold standard report) of historical protests organized by a third party, to produce alerts corresponding to upcoming civil unrest events. The alert structure consists of four elements: (i) date of protest (day, month, year), (ii) location of the protest (country, state, city), (iii) event type (i.e., the reason for protest, and whether it was violent/nonviolent), and (iv) population (that will protest). Hence, for any particular alert, the structure is $A = \{\text{date}, \text{location}, \text{event type}, \text{population}\}$. Thus on day $t$, the models generate a set of alerts $\mathcal{A}_t = \{A_{t11}, \ldots, A_{t1n_1}, A_{t21}, \ldots, A_{t5n_5}\}$, where $A_{t11}$ is the first alert issued by model 1 on day $t$ and $A_{t1n_1}$ is the $n_1$th alert issued by model 1 on day $t$. The fusion process takes $\mathcal{A}_t$ as an input and determines which alerts to issue and suppress.
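For illustration, one plausible encoding of this alert structure (our field names, not prescribed by the paper) is a small record type:

from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    date: str          # forecast protest date, e.g. "2014-02-12"
    country: str       # location, broken into the three GSR levels
    state: str
    city: str
    event_type: str    # reason for protest plus violent/nonviolent flag
    population: str    # group expected to protest
    model: str         # which of the five models issued the alert

a = Alert("2014-02-12", "Venezuela", "Miranda", "Caracas",
          "other government policies / violent", "general population", "DQE")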
Our modular fusion approach is intended to be agnostic to the details of the specific models, only requiring the alerts from the models and providing seamless integration of additional models. In this article, we consider five models: (i) historical protests, (ii) planned protest, (iii) spatial scan, (iv) keyword volume, and (v) dynamic query expansion. Next we detail how these five models extract information from datasets to produce alerts.
The historical protest (HP) model uses newspaper information about past reported protests to identify commonly recurring protests and forecasts that such protests will happen in the future as well. From the GSR, the HP model identifies frequently occurring four-tuples (day, location, event type, population) using the last 3 months of protest data. Location, event type, and population are all categorical variables whose values are determined by consulting the GSR. The day is modeled as the day of the week, and in issuing forecasts a fixed lead time (of 2 weeks) is used in choosing the forecast protest date. Frequently recurring event types above a specified threshold (e.g., twice per month) are issued as alerts. For instance, "farmers protesting government taxes in Rio de Janeiro, Brazil on an upcoming Wednesday" is the type of alert issued by the HP model.
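A sketch of the HP counting step under stated assumptions: the GSR records and field names are ours, and the "twice per month over a 3-month window" rule is translated into a count threshold of 6.

from collections import Counter

def hp_alerts(gsr_last_3_months, threshold=6):
    # count recurring (weekday, location, event type, population) tuples
    counts = Counter(
        (e["weekday"], e["country"], e["state"], e["city"],
         e["event_type"], e["population"])
        for e in gsr_last_3_months
    )
    # each sufficiently frequent tuple becomes a forecast, to be dated to the
    # matching weekday with the fixed 2-week lead time (date math omitted)
    return [key for key, n in counts.items() if n >= threshold]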
The planned protest (PP) model extracts information from news, blogs, Twitter, mailing lists, and Facebook in an attempt to capture organized protests, that is, by organizations that use social media as a way to recruit volunteers and galvanize support. This model works by extracting key phrases from text signaling intent to protest, detecting dates and locations when such protests are intended to happen. Phrases are drawn from English, Spanish, and Portuguese; examples are "preparación huelga" and "llamó a acudir a dicha movilización." If an article or posting containing such a phrase is detected, the algorithm aims to identify a mention of a future date and a location, both proximal to the detected phrase. If a future date cannot be found (e.g., only past dates are identified), then the article is discarded. If a date mention is relative (e.g., the use of the phrase "next Saturday"), the TIMEN (Llorens et al. 1995) enrichment engine is used to convert the mention into an absolute date. As an example, an alert generated by the PP model could be the result of scraping information from a Facebook event page on an upcoming labor strike.
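Greatly reduced, the PP extraction might look like the toy below, where the phrase list is abbreviated, only ISO-formatted absolute dates are parsed, and a one-line stand-in replaces the TIMEN engine for relative dates:

import re
from datetime import date, timedelta

PHRASES = ["preparación huelga", "llamó a acudir a dicha movilización",
           "call to protest"]  # abbreviated; the real multilingual list is larger

def pp_candidate(text, today):
    # require a protest-intent phrase somewhere in the posting
    if not any(p in text.lower() for p in PHRASES):
        return None
    m = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)  # absolute dates only
    if m:
        d = date(*map(int, m.groups()))
        return d if d > today else None              # discard past dates
    if "next saturday" in text.lower():              # toy TIMEN stand-in
        return today + timedelta(days=(5 - today.weekday()) % 7 or 7)
    return None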
Like the PP model, the keyword volume (KV) model also uses a variety of news sources, but tracks the daily aggregated volume of protest-related keywords and uses such volumes in a logistic LASSO (least absolute shrinkage and selection operator) regression model to forecast protests. Additional economic variables and Internet access statistics are also used in the model. As outlined by Ramakrishnan et al. (2014), using subject matter experts in political science and Latin American studies, we organized a dictionary of nearly 800 words and phrases that are useful indicators to track in social media. Examples of words and phrases are "protest" and "right to work." The KV model uses features, that is, volumes of these keywords, from one day to forecast if a protest will occur on the next day. Once an affirmative prediction is made, the tweets are analyzed in greater detail to determine the location, event type, and population. Location is determined using geocoding algorithms; event type and population are predicted using a naive Bayes classifier. For more details, see Ramakrishnan et al. (2014).
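The forecasting step maps naturally onto an off-the-shelf L1-penalized logistic regression. The sketch below uses synthetic keyword-volume data and scikit-learn, which is our substitution rather than necessarily the authors' implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(200, 800)).astype(float)  # daily volumes of ~800 keywords
y = rng.integers(0, 2, size=200)                     # protest on the next day? (synthetic)

kv = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # logistic LASSO
kv.fit(X, y)
print("keywords retained by the L1 penalty:", int(np.sum(kv.coef_ != 0)))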
The spatial scan (SS) algorithm uses Twitter to track clusters of geotagged tweets enriched with protest-related words, and organizes chains of such clusters in space and time. Clusters that grow over overlapping time windows are used as indicators of upcoming protests. A fast subset scan algorithm (Neill 2012) is applied over a grid of geolocated cells to identify a relevant cluster that is enriched with the usage of words from our protest dictionary. The cluster is then tracked over multiple time slices to determine if it persists in size (or grows or decays). A growing cluster over three consecutive slices is used as an affirmative determination of a protest. The location, event type, and population are predicted by analyzing and/or classifying the tweets constituting the cluster. The date is forecast using a simple linear regression model trained against the GSR.
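The "growing cluster over three consecutive slices" rule is easy to state in code; the toy below assumes per-cell tweet counts are already available and does not reproduce the fast subset scan itself:

def growing_clusters(cell_counts, slices=3):
    # cell_counts: {cell id: protest-tweet counts over successive time slices}
    hits = []
    for cell, counts in cell_counts.items():
        tail = counts[-slices:]
        if len(tail) == slices and all(a < b for a, b in zip(tail, tail[1:])):
            hits.append(cell)  # strictly growing over the last three slices
    return hits

print(growing_clusters({"cell_42": [3, 5, 9], "cell_7": [4, 4, 2]}))  # -> ['cell_42']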
The dynamic query expansion (DQE) model also uses tweets, but is intended to be an unsupervised learning system that can detect new forms of protests hitherto not encountered (e.g., recently there were protests in Venezuela regarding toilet paper shortages, and the keywords used in KV and PP do not involve any relevant phrases). This model begins with a seed set of keywords and expands it to learn a dynamic set of terms that reflect ongoing chatter on social media. The expanded set is then used to construct a network of tweet nodes from which anomalous subgraphs are mined to forecast specific events. For additional details on the DQE model, readers are referred to Ramakrishnan et al. (2014).
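A toy version of the expansion loop, with a bare co-occurrence count standing in for DQE's actual term scoring and anomaly mining:

from collections import Counter

def expand(seeds, tweets, rounds=2, top_k=5):
    terms = set(seeds)
    for _ in range(rounds):
        cooc = Counter()
        for tw in tweets:
            words = set(tw.lower().split())
            if words & terms:                 # tweet mentions a current term
                cooc.update(words - terms)    # credit its other words
        terms |= {w for w, _ in cooc.most_common(top_k)}
    return terms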
Note that some models use just one source of data (e.g., DQE) but others (e.g., PP) use multiple sources of data. Further, note that the sources of data overlap across the models. While there may be overlap in the data sources, the models individually have selective superiorities that collectively provide a means for capturing a variety of civil unrest situations. As the models described here are a necessary component in predicting civil unrest, details and references are given so the reader can understand the underlying mechanism and implement similar models. The remainder of the article focuses primarily on the fusion process, which was designed to be modular, giving ample opportunity for segmented groups to predict civil unrest using their expertise. Hence, it is necessary for the fusion component to be flexible in integrating models by identifying dependencies and redundancies across models to issue a streamlined set of alerts.
3. PRELIMINARIES
3.1 Defining Protests
An obvious question is: what constitutes an actual protest? Many events with a large number of people gathering to support a common cause (e.g., a farmer's market or a victory parade) are not civil unrest events. The origin of the event definition lies with the integrated data for events analysis (IDEA) framework outlined by Bond et al. (2003), but is modified for our purposes. Rather than exactly identifying what constitutes a protest, it is easier to define exclusion criteria. Entertainment performances and sporting events, natural disasters, terrorist or criminal activities, general instability, and scandals are not considered civil unrest events. However, strikes by soccer players or mass protests following a natural disaster would be valid instances of protest. Another component of identifying civil unrest revolves around whether the event reaches a relevance threshold. Rather than identifying specific criteria, the determination is made as to whether the event warrants a mention in major newspapers in the country of interest. The event need not be the feature of the article, but can be discussed in the context of another story (e.g., a story about traffic mentions a protest inhibiting traffic flows). For this application, the occurrence of civil unrest events is identified by subject matter experts (disjoint from the authors) reading newspapers of record in each country and recording protests.
Once an instance of civil unrest has been identified, the event needs to be categorized. For each case, five components are recorded: (i) date of the protest (day, month, year), (ii) date of reporting of the protest (day, month, year), (iii) location of the protest (country, state, city), (iv) event type (reason for protest, whether it was violent/nonviolent), and (v) population (that protested). The protest type includes the following categories: employment and wages, housing, energy and resources, other economic policies, other government policies, and other. The population protesting can be one of the following groups: business, ethnic, legal, education, religious, medical, media, labor, refugees/displaced, agricultural, or general population (if they do not fall into any of the other groups). The set of protests recorded by subject matter experts is referred to as the GSR. The GSR format matches that of alerts issued by our models, with the exception that the GSR contains two dates, one pertaining to the protest date and another to when the protest is reported. Similar to the alert structure, the events are characterized as $\mathcal{E}_t = \{E_{t1}, \ldots, E_{tn_t}\}$, where the first event on day $t$ is $E_{t1} = \{\text{date(protest)}, \text{date(reported)}, \text{location}, \text{event type}, \text{population}\}$.
The distinction between date of the protest and date of reporting is crucial. Ideally, we need a forecast before the date of reporting, with a forecasted date as close as possible to the date of the protest. Furthermore, alerts are generated daily, and decisions regarding issuance or suppression also need to be made daily. However, the GSR, and thus the record of events, is only updated monthly. Therefore, an evaluation of the alerts for the previous month is made once that month's GSR is finalized. The next section describes this process, in which the date of the protest, location, event type, and population together help define the quality in accurately forecasting events.
3.2 Alert-Event Matching
Because models (even a single one) issue multiple alerts and because there are multiple protest events, a necessary component of evaluating performance is a strategy for matching alert-event pairs. The first required component is a similarity measure between individual alerts and events, known as the quality score (QS) and defined by a piecewise formula comparing every element of the event against the corresponding element of the alert. QS is defined to be zero under certain obvious conditions: (i) if the alert country and event country do not match, (ii) if the lead time is < 0 (i.e., the alert is issued after the event is reported), and (iii) if the forecasted date of the event and the actual date of the event are more than 7 days apart. If these exclusion conditions are not triggered, QS is defined as
$$\mathrm{QS}(A, E) = \text{date.score} + \text{location.score} + \text{type.score} + \text{population.score},$$

where

$$\begin{aligned}
\text{date.score} &= 1 - \frac{|\text{date}_A - \text{date}_E|}{7},\\
\text{location.score} &= \tfrac{1}{3}\,\delta(\text{country}_A = \text{country}_E) + \tfrac{1}{3}\,\delta(\text{state}_A = \text{state}_E) + \tfrac{1}{3}\,\delta(\text{city}_A = \text{city}_E),\\
\text{type.score} &= \tfrac{2}{3}\,\delta(\text{reason}_A = \text{reason}_E) + \tfrac{1}{3}\,\delta(\text{violent}_A = \text{violent}_E),\\
\text{population.score} &= \delta(\text{population}_A = \text{population}_E).
\end{aligned}$$
Here, $\delta(\cdot)$ is an indicator function. Examples of issued alerts and
the corresponding matched events with the resultant QS are
shown in Figure 2. For instance, in the second alert-event pair
the resultant QS is 3.67 as the only difference between alert and
event is the city. Note that QS is scaled to be in [0, 4].
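For concreteness, the QS computation can be sketched in a few lines of R (the language of the article's supplementary code). The field names used below (issued, reported, date, and so on) are our own illustrative schema, not necessarily the one used in the supplementary materials.

```r
# Sketch of the quality score QS(A, E) between an alert A and an event E,
# each a list with hypothetical fields: issued/reported (dates), date,
# country, state, city, reason, violent, population.
quality_score <- function(A, E) {
  if (A$country != E$country) return(0)                 # (i) country mismatch
  if (as.numeric(E$reported - A$issued) < 0) return(0)  # (ii) negative lead time
  date_diff <- abs(as.numeric(A$date - E$date))
  if (date_diff > 7) return(0)                          # (iii) dates > 7 days apart
  date_score <- 1 - date_diff / 7
  location_score <- (A$country == E$country) / 3 +
    (A$state == E$state) / 3 + (A$city == E$city) / 3
  type_score <- (2 / 3) * (A$reason == E$reason) +
    (1 / 3) * (A$violent == E$violent)
  population_score <- as.numeric(A$population == E$population)
  date_score + location_score + type_score + population_score
}

# Mirroring the structure of the second pair in Figure 2 (the concrete
# field values here are invented): everything matches except the city,
# so QS = 1 + 2/3 + 1 + 1 = 3.67.
A <- list(issued = as.Date("2013-06-10"), date = as.Date("2013-06-17"),
          country = "Brazil", state = "Sao Paulo", city = "Campinas",
          reason = "other government policies", violent = FALSE,
          population = "general population")
E <- modifyList(A, list(city = "Sao Paulo", reported = as.Date("2013-06-18")))
quality_score(A, E)  # 3.67
```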
We can think of the QS formula as helping define the space of
possible matchings between alerts and events. We annotate every
matchable edge (QS > 0) between alerts and events with its
QS, and compute a maximum bipartite matching to evaluate the
overall performance of the set of alerts issued. We enforce the
added constraint that an alert can be matched to at most one event
(and vice versa). The overall quality score is then assessed based
on the maximum quality score achievable in a bipartite matching
between alerts and events, as defined in Munkres (1957).

Figure 2. Matched alert-event pairs.
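The matching step itself can be delegated to an off-the-shelf assignment solver. As a sketch, the snippet below uses solve_LSAP from the clue R package as a stand-in for the Munkres (1957) algorithm, together with quality_score() from the sketch above; this is our choice of solver and padding scheme, not necessarily the one used in the actual evaluation pipeline.

```r
library(clue)  # solve_LSAP: Hungarian/assignment algorithm

# Maximum-QS bipartite matching between alerts and events.
# Returns one row per matched pair with QS > 0.
match_alerts_events <- function(alerts, events) {
  qs <- outer(seq_along(alerts), seq_along(events),
              Vectorize(function(i, j) quality_score(alerts[[i]], events[[j]])))
  # Pad to a square matrix so unmatched alerts/events map to
  # zero-benefit dummy rows/columns.
  n <- max(dim(qs))
  padded <- matrix(0, n, n)
  padded[seq_len(nrow(qs)), seq_len(ncol(qs))] <- qs
  assignment <- as.integer(solve_LSAP(padded, maximum = TRUE))
  pairs <- data.frame(alert = seq_len(n), event = assignment,
                      qs = padded[cbind(seq_len(n), assignment)])
  subset(pairs, alert <= length(alerts) & event <= length(events) & qs > 0)
}
```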
3.2.1 Alert-Event Matching Illustration. Consider the vi-
sual depiction of the process as shown in Figure 3. As seen in
Panel (a), each model issues multiple alerts, some of which can
correspond to realized events. In particular, note that an event
can potentially be forecast by multiple models, but perfect iden-
tification of the event will be rare. So as Panel (b) shows, the QS
that each alert achieves can be different. Additionally, alerts are
also generated for events that do not occur (e.g., A33). At this
point, two distinct scenarios are depicted in Panels (c) and (d),
a case where all alerts from the models are released and another
with alerts suppressed by fusion. Whereas Panel (a) depicted
potential for models to forecast events, Panel (c) presents a con-
crete assignment of alerts to events when all alerts are released.
Dashed lines in Panel (c) are used to denote either alerts that did
not match to events, or that matched fortuitously to events other
than the target. This panel illustrates the two possible negative
effects of issuing extraneous alerts: either alerts go unmatched,
lowering precision, or fortuitously match to “incorrect” events,
diminishing the mean QS. Such rogue alerts can be a result of
near duplicate alerts or alerts not corresponding to an event,
but the net effect is the same. Finally, in Panel (d) consider
a case where fusion clusters similar alerts (shown as dotted
lines), while also suppressing poor alerts (no arrows). This re-
sults in lower recall, but higher precision, and higher quality
scores.
Figure 3. Overview of alert-event matching. (a) Alert generation.
(b) Quality scores. (c) Matching without fusion. (d) Matching with
fusion.
In general, without fusion, issuing more alerts (accomplished
by incorporating more models) is not uniformly beneficial, even
if the alerts accurately predict civil unrest; in fact, too many
alerts are detrimental. Issuing excess alerts has two negative
effects: alerts go unmatched, causing precision to diminish, or
they match events with low quality scores, reducing the average
quality score. Next we detail the implementation of our fusion
process.
4. BAYESIAN MODEL FUSION
On a given day, a set of alerts, $A_t$, is generated from the
collection of underlying models. Each individual alert contains
information pertaining to location, date of protest, population
protesting, and the type of protest. Given $A_t$, our fusion model
consists of two stages. First, a scheme for clustering similar
alerts from different models is invoked that reduces $A_t$ into
a set of clusters $M_t$. Each cluster contains up to five alerts,
with no more than a single alert from each model; its cluster type
is defined as $m = (m_1, m_2, \ldots, m_5)$, where $m_i \in \{0, 1\}$ denotes whether an
alert from model $i$ is included in the cluster. This results in
$(2^5 - 1) = 31$ unique cluster types. Second, for each cluster
type we estimate the probability of an event occurring using
data from the previous month. Based on a loss function, an alert
from each instance of cluster type is emitted if the probability of
a match is sufficiently high, otherwise alerts in the cluster type
are suppressed. The following subsections detail our clustering
procedure, decision rule, and implementation of Bayesian model
fusion.
4.1 Alert-Alert Clustering
The first step is to reduce extraneous alerts by cluster-
ing similar alerts. Specifically, clusters of similar alerts are
formed so that a given cluster contains either zero or one alert
from each model. This is done by combining alerts such that
$QS(A_{tij}, A_{ti'j'}) > 3.0$, where $QS(\cdot, \cdot)$ is the previously defined
similarity score, now applied to two alerts, and $A_{tij}$ denotes the
$j$th alert issued from the $i$th model on day $t$. A threshold of three
is used in this analysis, but this threshold can be viewed as
a tuning parameter. The clustering effectively removes redundant
alerts, as the complete set of alerts $A_t$ is mapped to a set
of clusters $M_t$. Now rather than considering the complete set
of alerts, a smaller subset is evaluated as only a single alert
in each cluster can be released. The meta-information of each
alert is retained and, given the similarity of the alerts, a single
alert from each cluster is randomly selected to be considered for
release.
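The paper specifies the QS > 3 criterion and the one-alert-per-model constraint but not the clustering algorithm itself, so the greedy single-pass sketch below is only one plausible reading, comparing each incoming alert against the first member of each existing cluster.

```r
# QS between two alerts: the Section 3.2 formula, minus the lead-time
# exclusion, which does not apply between a pair of alerts.
alert_qs <- function(a, b) {
  if (a$country != b$country) return(0)
  d <- abs(as.numeric(a$date - b$date))
  if (d > 7) return(0)
  (1 - d / 7) + (a$country == b$country) / 3 + (a$state == b$state) / 3 +
    (a$city == b$city) / 3 + (2 / 3) * (a$reason == b$reason) +
    (1 / 3) * (a$violent == b$violent) + (a$population == b$population)
}

# Greedy one-pass clustering: an alert joins the first cluster it matches
# (QS above the threshold) that lacks an alert from its model; otherwise
# it starts a new cluster. A numeric $model field (1..5) is assumed.
cluster_alerts <- function(alerts, threshold = 3.0) {
  clusters <- list()
  for (a in alerts) {
    placed <- FALSE
    for (k in seq_along(clusters)) {
      models_in <- vapply(clusters[[k]], `[[`, numeric(1), "model")
      if (!(a$model %in% models_in) &&
          alert_qs(clusters[[k]][[1]], a) > threshold) {
        clusters[[k]] <- c(clusters[[k]], list(a))
        placed <- TRUE
        break
      }
    }
    if (!placed) clusters <- c(clusters, list(list(a)))
  }
  clusters
}

# The cluster type m of cluster k is then
# as.integer(1:5 %in% vapply(clusters[[k]], `[[`, numeric(1), "model")).
```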
4.2 Fusion Concepts
In this problem, multiple users view the alerts, with utilities
that vary across individuals. To create a flexible
procedure that accounts for individual utilities, a loss function is
introduced that associates a cost with each type of decision.
Given that this analysis is confined to predicting the presence
or absence of events, loss functions can be expressed via Table
1, in which $L(E, F(m))$ denotes the loss for decision $F(m)$ and
event outcome $E$, with $E = 1$ denoting that an event occurred.
Table 1. Loss function matrix: L(E, F(m))

                      Predicted
                F(m) = 1    F(m) = 0
Actual  E = 1     c11         c10
        E = 0     c01         c00
The framework for tuning precision and recall is established
by setting $c_{11} = c_{00} = 0$ and allowing $c_{10}$ and $c_{01}$ to vary. Then
the parameter $\eta$ is defined as

$$\eta = \frac{c_{01}}{c_{10} + c_{01}}. \qquad (1)$$
Values of η near one heavily penalize alerts that do not cor-
respond to an event (or false alarms). The result is that only the
predictions with the highest probability of matching an event
are issued; thus, precision is emphasized. Similarly, with $\eta$ near
zero, failing to predict events would be more costly than issu-
ing alerts not corresponding to an event. Hence, small values of
η favor recall.
For a given cluster type, there are two possible decisions: issue
an alert, $F(m) = 1$, or suppress the alert, $F(m) = 0$. The following is a standard result in Bayesian decision theory (Berger 1985).

Proposition 1. The set of decisions satisfying $F(m) = 1$ if
$P[E = 1|m] > \eta$ and $F(m) = 0$ if $P[E = 1|m] \le \eta$ is optimal.

To see why, note that issuing incurs expected loss $c_{01}\,P[E = 0|m]$ while suppressing incurs $c_{10}\,P[E = 1|m]$; issuing is therefore preferred exactly when $c_{01}(1 - P[E = 1|m]) < c_{10}\,P[E = 1|m]$, that is, when $P[E = 1|m] > c_{01}/(c_{10} + c_{01}) = \eta$.
Unfortunately, practical constraints make an optimal solution
untenable. In particular, the real-time nature of this prediction
exercise requires an algorithm capable of executing in a short
amount of time. The computational bottleneck is the alert-event
matching process described in Section 3.2. Computing $P[E = 1|m]$
depends upon which cluster types are issuing alerts. For instance,
if a single cluster type, $m^* = (1, 0, 0, 0, 0)$, is suppressed
because $P[E = 1|m^*] < \eta$, then the matching algorithm needs
to be rerun on the training data, as $P[E = 1|m]$ changes for
$m \neq m^*$. This is because events that previously matched alerts
from $m^*$ in the training data are now free to match with alerts
issued from other cluster types. Hence, the computed probabilities
change for the other cluster types. An optimal solution would
require a massive model selection exercise to calculate $P[E = 1|m]$ for
all $m$ that considers the $2^{31}$ possibilities in which alerts from
cluster types are issued or suppressed. The time alone required to
run the $2^{31}$ alert-event matching procedures (let alone any
statistical machinery) is considerably longer than 24 hr. Hence,
we adopt a practical solution, estimating $P[E = 1|m]$ when all
cluster types issue alerts. This enables users to treat $\eta$ as a
tuning parameter, viewing results from past months to calibrate
$\eta$ to their respective utilities.
4.3 Data Generation Mechanism
To suppress lower quality alerts, it is necessary to learn
P[E = 1|m]. A Bayesian classifier is developed using the un-
derlying data generation mechanism for P[m|E]. That is, the
process by which models issue alerts in the presence or ab-
sence of an event is modeled and flipped into a classifier via
Bayes' rule. To capture model dependencies, latent multivariate
Gaussian variables $\{Z_1, Z_0\}$ are introduced as

$$E = 1: \quad Z_1 \sim N(\mu_1, R_1), \qquad (2)$$
$$E = 0: \quad Z_0 \sim N(\mu_0, R_0), \qquad (3)$$

where the latent variables $Z_1 = (z_{11}, z_{12}, \ldots, z_{1M})$ and $Z_0 = (z_{01}, z_{02}, \ldots, z_{0M})$, with $M$ the number of models, are truncated such that $z_{ij} > 0$ if $m_j = 1$ and
$z_{ij} < 0$ if $m_j = 0$. The covariance matrices are constrained to
correlation matrices due to identifiability issues, as in Albert
and Chib (1993). Given $p[Z|E = i, \mu_i, R_i]$ we can obtain
$P[m|E = i, \mu_i, R_i]$ by integrating out the latent variables,

$$P[m|E = i, \mu_i, R_i] = \int_{Z^*} (2\pi)^{-M/2}\, |R_i|^{-1/2} \exp\left(-\tfrac{1}{2}(Z - \mu_i)' R_i^{-1} (Z - \mu_i)\right) dZ, \qquad (4)$$

where $Z^*$ denotes the orthant such that $z^*_j > 0$ if $m_j = 1$ and
$z^*_j < 0$ if $m_j = 0$. The integration in (4) can be computed using a simple accept–reject sampling method (Robert and Casella
2005). This integration is similar to quadratic discriminant analysis
(QDA), but with binary rather than continuous data. QDA establishes a
decision boundary between the two distributions, whereas our procedure
computes a probability for each orthant corresponding to the discrete
model inputs.
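As a minimal sketch of both directions of this construction (assuming the mvtnorm package; the supplementary code may proceed differently), one can simulate $m$ from the truncated latent Gaussian of (2)-(3) and estimate the orthant probability in (4) by the fraction of unconstrained draws whose sign pattern matches $m$, which is exactly the accept-reject idea:

```r
library(mvtnorm)  # rmvnorm: multivariate normal sampler

# Generative direction of (2)-(3): one draw of the cluster type m given E.
draw_m <- function(mu, R) {
  z <- drop(rmvnorm(1, mean = mu, sigma = R))
  as.integer(z > 0)
}

# Monte Carlo estimate of the orthant probability P[m | E, mu, R] in (4):
# the proportion of N(mu, R) draws whose sign pattern matches m.
orthant_prob <- function(m, mu, R, n_draws = 1e5) {
  z <- rmvnorm(n_draws, mean = mu, sigma = R)
  mean(apply(z > 0, 1, function(s) all(s == (m == 1))))
}

# Illustration with the mu_1 posterior means later reported in Table 2
# and, purely for simplicity, an identity correlation matrix:
mu1 <- c(-0.05, -1.34, 0.22, -0.03, -0.98)
orthant_prob(c(1, 1, 0, 0, 0), mu1, diag(5))
```

With only $M = 5$ models the orthants are coarse enough for this simple estimator; mvtnorm::pmvnorm would yield the same quantity by numerical integration.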
4.4 Fusion Decision
The goal is to learn P[E = 1|m], which enables tuning of
precision and recall by issuing or suppressing alerts according
to the loss defined by the parameter η in (1). This posterior
distribution can be computed as

$$P[E = 1|m] = \int \frac{P[m|E = 1, \mu_1, R_1]\, P[E = 1]}{P[m|E = 1, \mu_1, R_1]\, P[E = 1] + P[m|E = 0, \mu_0, R_0]\, P[E = 0]}\; p(\Theta)\, d\Theta, \qquad (5)$$

where $\Theta = \{R_1, \mu_1, R_0, \mu_0\}$. Priors for $\Theta$ are drawn from Talhouk, Doucet, and Murphy (2012) to use the parameter expansion and data augmentation (PXDA) approach detailed therein:

$$p(\mu|R) \sim N(0, R) \quad \text{and} \quad p(R) \propto |R|^{\frac{M(M-1)}{2} - 1} \prod_{i=1}^{M} |R_{ii}|^{-(M+1)/2},$$
where $R_{ii}$ is the $i$th principal submatrix of $R$. We use a naive Bayes-type prior in which $P[E = 1] = \sum_{j=1}^{n} \delta(E_j = 1)/n$, where $n$ is the number of points in the training dataset and $\delta(E_j = 1)$ is an indicator that the $j$th data point was a realized event.
For general details on PXDA, readers are referred to Liu and
Wu (1999) and van Dyk and Meng (2001). As conjugate priors
do not exist on correlation matrices, PXDA allows an augmenta-
tion converting the correlation matrix R into a valid covariance
matrix. This allows Gibbs sampling of both a covariance matrix
from an inverse Wishart distribution and a transformed mean
from a normal distribution. Finally, these are converted back to
the original mean and correlation scale. Tuning proposal densi-
ties for sampling matrices can be a cumbersome procedure, so
circumventing that via a Gibbs sampler improves the efficiency
of the algorithm. The Markov chain Monte Carlo (MCMC) pro-
cedure implemented for inference can be found in the Appendix.
Table 2. Posterior means and 95% credible intervals
Model μ1 95% CI μ0 95% CI
Planned protest −0.05 (−0.20,0.09) −1.82 (−1.91,−1.74)
Spatial scan −1.34 (−1.55,−1.13) −2.46 (−2.61,−2.31)
Historical protest 0.22 (0.02,0.43) −2.23 (−2.48,−2.02)
Dynamic query extraction −0.03 (−0.23,0.21) −2.14 (−2.29,−1.97)
Keyword volume −0.98 (−1.24,−0.68) −2.14 (−2.36,−1.94)
5. BAYESIAN MODEL FUSION FOR CIVIL UNREST
5.1 Overview
Using data from May 2013 to train the Bayesian model fu-
sion framework, alerts are issued and evaluated for June 2013.
Predictions are made for the countries of Argentina, Brazil, and
Mexico. As mentioned previously, June 2013 was a month in
which numerous protests broke out in Brazil. Historical trends
alone would not be sufficient to predict the sheer number of
protests, so the performance during this month shows the effi-
cacy of mining social media data to predict protests. For ref-
erence, in Brazil alone 74 events took place during May 2013.
This increased in June to 425 events. Data and R code for the
algorithms in this article can be found in the supplementary files.
5.2 Model Estimation
Convergence occurs very rapidly with a burn-in period of
less than 50 iterations. This implementation uses 5000 itera-
tions for the MCMC and 100,000 Monte Carlo samples for the
accept/reject sampler built into each iteration. The algorithm
runs on a MacBook Pro (2GHz Intel Core i7 processor, 8 GB
SDRAM) in about 300 min. The algorithm only need be recali-
brated when a new event history is added to the GSR (monthly)
or when an underlying model is changed. However, should
model changes be made, the fusion implementation needs to
be retrained and capable of producing updated predictions in a
single day.
Marginal posterior means and credible intervals for the mean
parameters of the latent representation, (2) and (3), can be seen
in Table 2. As there are 20 correlation parameters, these are
omitted for brevity’s sake.
Collectively, the probability of matching an alert for a given
cluster type, $P[m|E = i, \mu, R]$, is computed via (4). The
output of the MCMC procedure necessary for fusion deci-
sions is the posterior probabilities P[E = 1|m]. Table 3 shows
the posterior mean of these distributions for select model
combinations, which is used to determine which alerts to
issue.
For instance, the cluster type containing only a PP alert,
$m = (1, 0, 0, 0, 0)$, and the cluster type containing only an SS alert,
$m = (0, 1, 0, 0, 0)$, both carry relatively weak evidence of an
event without agreement from other models, with
$P[E = 1|m = (1, 0, 0, 0, 0)] = 0.21$ and
$P[E = 1|m = (0, 1, 0, 0, 0)] = 0.24$. However, when both
models jointly issue an alert, the evidence of an event increases
to $P[E = 1|m = (1, 1, 0, 0, 0)] = 0.51$. Depending
on the $\eta$ value specified by the loss function, a likely scenario
is that the alert in the PP cluster would initially be suppressed, but
once the SS model issued a similar alert that was clustered with
the PP alert, the probability would be upgraded and an alert
would be issued.

Table 3. Posterior mean for P[E = 1|m]

PP  SS  H   DQE KV    P[E = 1|m]
 1   0  0    0   0       0.21
 0   1  0    0   0       0.24
 0   0  1    0   0       0.75
 0   0  0    1   0       0.53
 0   0  0    0   1       0.06
 1   1  0    0   0       0.51
 0   0  0    1   1       0.59
 1   1  0    1   0       0.74
 1   0  1    1   0       0.90
 1   0  1    0   1       0.70
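The decision rule itself then reduces to thresholding these posterior probabilities. A sketch, with a few of the Table 3 posterior means hard-coded and $\eta = 0.21$, reproduces the scenario just described:

```r
# Posterior means P[E = 1 | m] for selected cluster types, hard-coded
# from Table 3 and keyed by the pattern (PP, SS, H, DQE, KV).
p_event <- c("10000" = 0.21, "01000" = 0.24, "11000" = 0.51, "10110" = 0.90)

# Bayes decision rule of Proposition 1: issue iff P[E = 1 | m] > eta.
issue_alert <- function(m, eta) {
  unname(p_event[paste(m, collapse = "")] > eta)
}

issue_alert(c(1, 0, 0, 0, 0), eta = 0.21)  # FALSE: lone PP cluster suppressed
issue_alert(c(1, 1, 0, 0, 0), eta = 0.21)  # TRUE: joint PP + SS cluster issued
```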
5.3 Results
Realistically, evaluating predictions requires considering precision,
recall, and quality score jointly. The alert-event matching
algorithm is a necessary evil of the process, in which poor, un-
calibrated alerts are often “matched” albeit with low quality
scores. This phenomenon is shown in Panel (c) of Figure 3,
where a duplicate alert from model 2, A22, matches event 4
and a rogue alert from model 3, A33, matches event 6. Sim-
ilarly, the first alert-event pair in Figure 2 represents a poor
match with QS = 1.52. In both cases the alerts should have been
suppressed.
To further illustrate the value of this tuning approach, six
different η values are selected that span the range of computed
probabilities. These levels, as well as the unconstrained instance
in which all alerts are issued, are compared. A description of
these levels can be seen in Table 4. Using the seven levels,
different numbers of alerts are issued for each level. Plots dis-
playing the precision and recall for each of the seven tuning
levels can be seen in Figure 4. With a large η, fusion increases
precision; moreover, loss functions favoring precision will also
tend to have higher QS. Intuitively by suppressing alerts with
a lower probability of matching an event, alerts that result in
low-quality matches are also suppressed. As we have shown,
extra alerts either match with low QS or are not matched at all.
While this is not explicitly a component of the loss function,
it is an important feature of our fusion algorithm. The mean
quality score of matched alerts for each of the tuning levels can
be seen in Figure 5. By suppressing the lowest quality alerts at
each tuning level, the overall mean QS increases well above the
target value of 3.0 for each country.
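Given a month's matching (e.g., the output of match_alerts_events sketched earlier), the metrics compared across tuning levels reduce to a few lines:

```r
# Precision, recall, and mean QS for one month of issued alerts, given the
# matched alert-event pairs (a data frame with a qs column, as above).
evaluate <- function(matches, n_alerts_issued, n_events) {
  list(precision = nrow(matches) / n_alerts_issued,
       recall    = nrow(matches) / n_events,
       mean_qs   = mean(matches$qs))
}

# The QS distributions of Figure 6 are then, e.g.,
# plot(density(matches$qs, from = 0, to = 4))
```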
Another way to consider quality score is to look at the dis-
tribution of the quality scores among the matched alert-event
pairs rather than the overall mean QS. Kernel density estimates
(KDE) for the mean QS combined across the three countries
can be seen in Figure 6. In this figure, notice the distribution
shifts considerably toward higher quality alerts at higher tuning
levels. The fusion mechanism effectively suppresses the lower
quality alerts, particularly those with a QS less than 2.
Figure 4. Precision-recall tradeoff by the tuning levels defined in Table 4.

Figure 5. Mean QS (solid line); the dashed line marks the target QS of 3.0.

Figure 6. KDE of mean QS, combined across all three countries, for three specified tuning levels.
Table 4. Precision recall tuning levels
Level Criteria
1 Issue all alerts
2 η = 0: cluster similar alerts
3 η = 0.21
4 η = 0.51
5 η = 0.65
6 η = 0.74
7 η = 0.80
6. DISCUSSION
In this article, we have presented a modular fusion approach to
predicting civil unrest. The framework allows separate models to
be constructed with differing selective superiorities, which is ex-
tremely beneficial when dealing with big unstructured datasets,
such as the set of publicly available social media information in
a country for a given month. This also allows data to be aggre-
gated in different ways, such that the signal of civil unrest can
be captured from social media. However, this modular approach
also presents some challenges. Creating additional models, or
more specifically releasing extra alerts, even strong-performing
ones, results in degraded precision and quality score. Our fusion
framework presents a solution to combining model alerts and
issuing predictions such that precision and recall tradeoffs are
handled in an optimal way. While not explicitly optimizing QS,
our fusion framework has been shown to retain the strongest
alerts resulting in elevated overall QS.
A key feature of our approach is its tunability: an analyst
can choose to focus on precision or recall.
Recall is more appropriate (and precision less important)
if an analyst uses our framework as a step in an analytic triage,
that is, to determine potential hotspots of civil unrest and then
use domain knowledge to narrow them down to areas of inter-
est. Precision is more appropriate (and conversely, recall is less
important) if the analyst is focused on exploring a specific so-
cial science hypothesis, and thus obtaining accurate (but fewer)
forecasts is desirable.
One reason that fusion is necessary is that models release
similar alerts. It may be the case that the models pick up on the
same signal from the same or a different dataset. Regardless
of how this arises, the net result of the similar alerts is
degraded QS and lower precision. The problem can be under-
stood through the idea of correlated models. This is expressed
through the correlation matrices in (2) and (3). An alternative
to sampling the correlation matrices would be to use Gaussian
graphical models to explicitly capture the association structure
between the models. Graphs could be sampled during each iter-
ation of the MCMC sampler, providing a means to visualize and
estimate model dependencies. However, the association struc-
ture of the models is only tangentially related to our research
question. The graph structures would be integrated out, to obtain
the probabilities of identifying an event. Therefore, we sample
unconstrained correlation matrices rather than the graphs.
Moving forward, a direct approach to consider the joint struc-
ture of precision, recall, and QS is being developed. Specifi-
cally, a loss function can be invoked that not only penalizes
unmatched alerts and events, but also penalizes matches with
QS less than some specified value. A related approach is to solve
a constrained optimization for QS (e.g., maximize QS subject
to precision and recall greater than some threshold). To achieve
these aims, a few modifications to the work presented here are
planned: (i) a more sophisticated model-based approach for matching
alerts via a Dirichlet process that introduces clusters of alerts;
(ii) ensemble methods to combine information between alerts in a
cluster, that is, matched alerts are combined rather than the extra
alerts suppressed; and (iii) given the flexibility in matching alerts,
multiple alerts may be emitted from a cluster of alerts.
APPENDIX: MCMC Pseudocode
Estimation is conducted using an MCMC procedure as
follows:
1. Sample $(Z_1|E = 1, \mu_1, R_1)$ and $(Z_0|E = 0, \mu_0, R_0)$ from the truncated $N(\mu_i, R_i)$. For computational efficiency, the multivariate densities can be decomposed into univariate normal densities; sampling then proceeds using importance sampling as in Robert (1995). Standard accept–reject samplers, or importance sampling with a multivariate (or univariate) normal proposal, are unfeasibly slow because the truncated regions have extremely low probabilities.

2. Sample $(\mu_1, R_1|Z_1)$ and $(\mu_0, R_0|Z_0)$ using PXDA as in Talhouk, Doucet, and Murphy (2012).

   a. Sample $(d_{ii}|R) \sim IG((M + 1)/2,\; r^{ii}/2)$, where $M$ is the number of models and $r^{ii}$ are the diagonal elements of $R^{-1}$. Then let $D$ be a diagonal matrix with diagonal elements $d_{ii}$ and compute $W = ZD$. Then $\Sigma = DRD$ and $\gamma = \mu D$.

   b. Sample $(\Sigma|W)$ from $IW(2 + n,\; W'W + I - W'J_nW/(n + 1))$, where
   $$\Sigma \sim IW(\nu, \Lambda) \propto |\Lambda|^{\frac{\nu + M - 1}{2}}\, |\Sigma|^{-\frac{\nu + 2M}{2}} \exp\left(-\tfrac{1}{2}\,\mathrm{tr}[\Lambda\Sigma^{-1}]\right),$$
   and $J_n$ is an $n \times n$ matrix of 1s.

   c. Sample $(\gamma|W, \Sigma) \sim N(1_n'W/(n + 1),\; \Sigma/(n + 1))$.

   d. Compute $R = Q\Sigma Q$ and $\mu = \gamma Q$, where $Q$ is a diagonal matrix with diagonal elements $q_{ii} = \sigma_{ii}^{-1/2}$, and $\sigma_{ii}$ are the diagonal elements of $\Sigma$.

3. Compute $P[m|E = 1, \mu_1, R_1]$ and $P[m|E = 0, \mu_0, R_0]$ as in (4) for each of the cluster types $m$.

4. Compute $P[E = 1|m, \mu_1, R_1, \mu_0, R_0]$ via (5), where each iteration of the MCMC sampler integrates over $\{\mu_1, R_1, \mu_0, R_0\}$ to obtain $P[E = 1|m]$ for $m \in M$.
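A minimal R sketch of the PXDA update in step 2, assuming the MCMCpack and mvtnorm packages; the degrees-of-freedom and scale conventions follow the reconstruction above and should be checked against Talhouk, Doucet, and Murphy (2012):

```r
library(MCMCpack)  # riwish: inverse Wishart sampler
library(mvtnorm)   # rmvnorm: multivariate normal sampler

# One PXDA update of (mu, R) given the n x M truncated latent matrix Z,
# following steps 2(a)-2(d).
pxda_update <- function(Z, R) {
  n <- nrow(Z); M <- ncol(Z)
  # (a) Expansion: d_ii ~ IG((M + 1)/2, r^ii/2), with r^ii the diagonal of
  # R^{-1}; move to the covariance scale via W = Z D.
  d <- 1 / rgamma(M, shape = (M + 1) / 2, rate = diag(solve(R)) / 2)
  W <- Z %*% diag(d)
  # (b) Sigma | W ~ IW(2 + n, W'W + I - W' J_n W / (n + 1)).
  S <- crossprod(W) + diag(M) - t(W) %*% matrix(1, n, n) %*% W / (n + 1)
  Sigma <- riwish(2 + n, S)
  # (c) gamma | W, Sigma ~ N(1_n' W / (n + 1), Sigma / (n + 1)).
  gamma <- drop(rmvnorm(1, colSums(W) / (n + 1), Sigma / (n + 1)))
  # (d) Reduction: rescale back to a correlation matrix and mean.
  Q <- diag(1 / sqrt(diag(Sigma)))
  list(mu = drop(gamma %*% Q), R = Q %*% Sigma %*% Q)
}
```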
SUPPLEMENTARY MATERIALS
Supplementary materials include the data used in the paper,
both the civil unrest events and the alerts from the predictive
algorithms. R code is also available to reproduce the figures in
the article.
[Received September 2013. Revised December 2014.]
REFERENCES
Albert, J. H., and Chib, S. (1993), "Bayesian Analysis of Binary and Polychotomous Response Data," Journal of the American Statistical Association, 88, 669–679.

Anderson, E. G., Jr. (2006), "A Preliminary System Dynamics Model of Insurgency Management: The Anglo-Irish War of 1916–1921 as a Case Study," paper presented at the 2006 International System Dynamics Conference.

Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, New York: Springer.

Bond, D., Bond, J., Oh, C., Jenkins, J. C., and Taylor, C. L. (2003), "Integrated Data for Events Analysis (IDEA): An Event Typology for Automated Events Data Development," Journal of Peace Research, 40, 733–745.

Braha, D. (2012), "Global Civil Unrest: Contagion, Self-Organization, and Prediction," PLoS ONE, 7, e48596.

Carley, K. (2006), "Destabilization of Covert Networks," Computational & Mathematical Organization Theory, 12, 51–66.

Chib, S., and Greenberg, E. (1998), "Analysis of Multivariate Probit Models," Biometrika, 85, 347–361.

Feldman, R., and Sanger, J. (2007), The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, New York: Cambridge University Press.

Garson, G. D. (2009), "Computerized Simulation in the Social Sciences: A Survey and Evaluation," Simulation & Gaming, 40, 267–279.

Gelman, A., and Hill, J. (2006), Data Analysis Using Regression and Hierarchical/Multilevel Models, New York: Cambridge University Press.

Hastie, T., Tibshirani, R., and Friedman, J. H. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.

Hedges, L. V., and Olkin, I. (1985), Statistical Methods for Meta-Analysis, Boston, MA: Academic Press.

Hua, T., Lu, C.-T., Ramakrishnan, N., Chen, F., Arredondo, J., Mares, D., and Summers, K. (2013), "Analyzing Civil Unrest Through Social Media," Computer, 46, 80–84.

Liu, J. S., and Wu, Y. N. (1999), "Parameter Expansion for Data Augmentation," Journal of the American Statistical Association, 94, 1264–1274.

Llorens, H., Derczynski, L., Gaizauskas, R., and Saquete, E. (2012), "TIMEN: An Open Temporal Expression Normalisation Resource," in Proceedings of LREC 2012, pp. 3044–3051.

Munkres, J. (1957), "Algorithms for the Assignment and Transportation Problems," Journal of the Society for Industrial and Applied Mathematics, 5, 32–38.

Neill, D. B. (2012), "Fast Subset Scan for Spatial Pattern Detection," Journal of the Royal Statistical Society, Series B, 74, 337–360.

Ramakrishnan, N., Butler, P., Muthiah, S., Self, N., Khandpur, R., Wang, W., Cadena, J., Vullikanti, A., Korkmaz, G., Kuhlman, C., Marathe, A., Zhao, L., Hua, T., Chen, F., Lu, C., Huang, B., Srinivasan, A., Trinh, K., Getoor, L., Katz, G., Doyle, A., Ackermann, C., Zavorin, I., Ford, J., Summers, K., Fayed, Y., Arredondo, J., Gupta, D., and Mares, D. (2014), "Beating the News With EMBERS: Forecasting Civil Unrest Using Open Source Indicators," in Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1799–1808.

Robert, C. (1995), "Simulation of Truncated Normal Variables," Statistics and Computing, 5, 121–125.

Robert, C. P., and Casella, G. (2005), Monte Carlo Statistical Methods (Springer Texts in Statistics), Secaucus, NJ: Springer-Verlag.

Schapire, R. E. (1990), "The Strength of Weak Learnability," Machine Learning, 5, 197–227.

Schapire, R. E., and Freund, Y. (2012), Boosting: Foundations and Algorithms, Adaptive Computation and Machine Learning Series, Cambridge, MA: MIT Press.

Stark, O., Hyll, W., and Behrens, D. (2010), "Gauging the Potential for Social Unrest," Public Choice, 143, 229–236.

Talhouk, A., Doucet, A., and Murphy, K. (2012), "Efficient Bayesian Inference for Multivariate Probit Models With Sparse Inverse Correlation Matrices," Journal of Computational and Graphical Statistics, 21, 739–757.

Thron, C., Salerno, J., Kwiat, A., Dexter, P., and Smith, J. (2012), "Modeling South African Service Protests Using the National Operational Environment Model," in Social Computing, Behavioral-Cultural Modeling and Prediction, eds. S. J. Yang, A. M. Greenberg, and M. Endsley, Heidelberg, Germany: Springer, pp. 298–305.

van Dyk, D. A., and Meng, X.-L. (2001), "The Art of Data Augmentation," Journal of Computational and Graphical Statistics, 10, 1–50.