A talk at the Social Media Lab, Ryerson University, on April 25, 2018, discussing the 2016 U.S. election Twitter dataset collected at George Washington University Libraries.
Pavan Kapanipathi's talk at IBM's Frontiers of Cloud Computing and Big Data Workshop 2014. http://researcher.ibm.com/researcher/view_group_subpage.php?id=5565
With the increased adoption of the social web, users, and Twitter users in particular, face information overload. Unless a user is willing to restrict their sources (e.g., the number of accounts they follow), important information relevant to their interests often goes unnoticed. The reasons include: (1) postings may appear at a time when the user is not looking; (2) the user may be unaware of, and hence not following, the information source; and (3) information may arrive at a rate faster than the user can consume it. Furthermore, temporally relevant information that is discovered late may be of no use.
My research addresses these challenges by
(1) Generating user profiles of interests from Twitter using Wikipedia. The interests gleaned from users' Twitter data can be leveraged by personalization and recommendation systems to reduce the information overload (volume) users face.
(2) Filtering Twitter data relevant to dynamically evolving entities. Together with volume, this addresses the velocity challenge of delivering relevant information in real time. The approach is deployed on Twitris to crawl for dynamic, event-relevant tweets for analysis. A prominent aspect of both approaches is the use of a crowd-sourced knowledge base such as Wikipedia.
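As a rough illustration of the first approach, here is a minimal sketch (not the author's actual method) of building an interest profile by counting Wikipedia concepts linked from a user's tweets; the entity-to-Wikipedia linking step is assumed to have already happened, and the concept names below are hypothetical examples:

```python
from collections import Counter

def interest_profile(tweet_entities):
    """Aggregate Wikipedia concepts linked from a user's tweets
    into a weighted interest profile (concept -> normalized score)."""
    counts = Counter(c for entities in tweet_entities for c in entities)
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}

# Hypothetical input: each inner list holds the Wikipedia concepts
# linked from one tweet.
tweets = [
    ["Association_football", "FIFA_World_Cup"],
    ["Association_football", "Lionel_Messi"],
    ["Machine_learning"],
]
profile = interest_profile(tweets)
# "Association_football" dominates the profile with score 0.4
```

A recommender could then rank candidate content by its overlap with this weighted profile.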
Social media is now the place where people are gathering en masse to discuss the news with their friends, neighbors and complete strangers. This change in news consumers’ behavior is proving to be a challenge for local news, but it is also an opportunity. Users and system generated data from social media can also be a boon for content creators. This presentation will feature a case study showing how publishers can use social media analytics to gain insights into their audience and how to use this information to foster a stronger sense of community around their brand of journalism. The case study will focus on how to use Netlytic, a cloud-based social media analytics tool, to mine the public Facebook interactions of the readers of BlogTO, a regional, Canadian-based media outlet, to find out what their readers are interested in and what engages them.
EMBERS AutoGSR: Automated Coding of Civil Unrest Events (Parang Saraf)
EMBERS AutoGSR is a novel, web-based framework that generates a comprehensive database of validated civil unrest events with minimal human effort. AutoGSR has been deployed for the past six months, continually processing data 24x7 in an automated fashion. The system extracts civil unrest events of the form "who protested where, when, and why?" from news articles published in over 7 languages and collected from 16 countries.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
DMAP: Data Aggregation and Presentation Framework (Parang Saraf)
DMAP (Data Mining and Automation for Platforms) is an online framework that presents a wide variety of official data, news, and information about companies. It was developed to act as a one-stop platform for displaying everything official related to a company and its competitors. It aggregates data from the following sources: Bing News, companies' official blogs, RSS feeds, Facebook, Twitter, Google Trends, Crunchbase, financial data sources, and Alexa Web Analytics.
The aggregated information is shown in an intuitive fashion that allows a user to perform exploratory analysis of a particular company. Specifically, the interface makes it easy to compare a company against its competitors. The user can apply filters to select a given set of companies, data sources, and date ranges. All the information is presented within the framework, so the user doesn't have to visit different domains. Fetched news articles are cleaned, enriched, and presented in a manner that allows an analyst to navigate them using named entities. For each news article, named entities (people, locations, organizations, and company names) are extracted. For a given search query, the interface returns matching news articles along with the associated entities, shown as word clouds. This allows for easy discovery of connections between entities.
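The entity-driven discovery described above can be approximated with a simple co-occurrence count; a hedged sketch (the entity extraction itself, e.g. via an NER library, is assumed to have happened upstream, and the sample articles are invented):

```python
from collections import Counter
from itertools import combinations

def entity_cooccurrence(articles):
    """Count how often pairs of named entities appear in the same
    article; such counts can drive word-cloud sizing and reveal
    connections between entities."""
    pairs = Counter()
    for entities in articles:
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical articles, each reduced to its extracted entities.
articles = [
    {"Apple", "Tim Cook", "Cupertino"},
    {"Apple", "Tim Cook"},
    {"Google", "Sundar Pichai"},
]
pairs = entity_cooccurrence(articles)
# ("Apple", "Tim Cook") co-occurs in two articles
```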
The framework was developed as part of a 2015 summer internship with The Center for Global Enterprise (CGE). I conceptualized, designed, and developed the framework during that three-month period.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh... (CSCJournals)
In the era of technology and the internet, people use online social media services such as Twitter, Instagram, Facebook, and Reddit to express their emotions. The idea behind this paper is to understand people's emotions on Twitter and their opinions on the 2020 Presidential Election. We collected 1.2 million tweets in total with keywords like "RealDonaldTrump", "JoeBiden", "Election2020", and other election-related keywords using the Twitter API, then processed them with a natural language processing toolkit. A Bidirectional Long Short-Term Memory (BiLSTM) model was trained, achieving 93.45% accuracy on our test dataset. We then used the trained model to perform sentiment analysis on the rest of the dataset. Using the sentiment analysis results and a comparison with the 2016 Presidential Election, we predict who could win the 2020 US Presidential Election from pre-election Twitter data. We also analyzed the impact of COVID-19 on people's sentiment about the election.
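The paper's preprocessing step is not specified in detail; below is a minimal sketch of the kind of tweet cleaning typically done before feeding text to a sentiment model. The regexes and normalization choices are my assumptions, not the authors':

```python
import re

def clean_tweet(text):
    """Normalize a raw tweet: strip URLs and @mentions, keep hashtag
    words (minus the '#'), lowercase, and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove mentions
    text = text.replace("#", "")               # keep the hashtag's word
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())

print(clean_tweet("RT @user: #Election2020 results!! https://t.co/x"))
# -> "rt election2020 results"
```

The cleaned tokens would then be mapped to indices and padded before training the BiLSTM.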
Panel presented as part of the 2017 Data Power Conference (Ottawa, ON, June 23, 2017)
Anatoliy Gruzd (@gruzd), Jenna Jacobson (@jacobsonjenna), Priya Kumar (@link_priya), Philip Mai (@phmai)
This paper introduces how ClaimBuster, a fact-checking platform, uses natural language processing and supervised learning to detect important factual claims in political discourse. The claim-spotting model is built on a human-labeled dataset of check-worthy factual claims from U.S. general election debate transcripts. The paper explains the architecture and components of the system and the evaluation of the model. It presents a case study of how ClaimBuster covered the 2016 U.S. presidential election debates live and monitors social media and the Australian Hansard for factual claims. It also describes the current status and long-term goals of ClaimBuster as we continue to develop and expand it.
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ... (Rich Heimann)
Big Social Data Analysis: Using location & Twitter to explore the tragic aftermath of the Sandy Hook Elementary School shooting.
The growth of social media over the last decade has revolutionized the way individuals interact and industries conduct business. Individuals produce data at an unprecedented rate by interacting, sharing, and consuming content through social media. However, analyzing this ever-growing volume of data is tricky and, if done erroneously, can lead to wrong inferences.
In this webinar you will gain, by example, insights into mining social media data and exposing underlying latent structures relating to ideology and sentiment, as well as space and time.
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
Some elementary principles and procedures for Facebook data mining: a combination of the Graph API and the OpenRefine software for parsing the JSON output. Two beer brands are analyzed with respect to their active fans and engagement.
The second part is dedicated to the interest-positioning technique (as pioneered by PerfectCrowd) and what OutWit Hub can do as a substitute for more sophisticated techniques and apps.
Twitter is now an established and widely popular news medium. Be it normal banter or discussion of high-impact events like the Boston Marathon blasts or the February 2014 US ice storm, people use Twitter to get updates and to broadcast their thoughts and views. Twitter bots have become very common and accepted. People use them to get updates about emergencies such as natural disasters and terrorist strikes, as well as about different places and events, both local and global. Twitter bots give these users a means to perform tasks on Twitter that are simple and structurally repetitive, at a much higher rate than would be possible for a human alone. During high-impact events, these bots tend to provide a time-critical and comprehensive information source, aggregating information from various sources. In this study, we present how these bots participate in and augment discussions during high-impact events. We identify bots in five high-impact events: the Boston blasts, the February 2014 US ice storm, the Washington Navy Yard shooting, the Oklahoma tornado, and Cyclone Phailin. We identify bots among top tweeters by having all such accounts manually annotated. We then study their activity and present several important insights. We determine the impact bots have on information diffusion during these events and how they tend to aggregate and broker information from various sources to different users. We also analyze their tweets and list important features that differentiate bots from non-bots (normal or human accounts) during high-impact events. We also show how bots are slowly moving away from traditional API-based posts towards web-automation platforms like IFTTT, dlvr.it, etc. Using standard machine learning, we propose a methodology to identify bots and non-bots in real time during high-impact events.
This study also looks at how the bot scenario has changed by comparing data from high-impact events in 2013 against data from similar events in 2011. Bots active in high-impact events generally don't spread malicious content. Lastly, we present an in-depth analysis of Twitter bots that were active during the 2013 Boston Marathon blasts. We show that, because of their programming structure, bots don't pick up rumors easily during these events, and even when they do, they do so after a long delay.
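The study's actual differentiating features are not reproduced here, but the general shape of feature-based bot detection can be sketched. The feature names, thresholds, and weights below are illustrative assumptions only:

```python
def bot_score(account):
    """Toy heuristic, scored out of 10, combining the kinds of signals
    the study discusses: posting volume, posting client, and reply
    behavior. All thresholds here are made up for illustration."""
    score = 0
    if account["tweets_per_day"] > 50:              # unusually high volume
        score += 4
    if account["source"] in {"IFTTT", "dlvr.it"}:   # web-automation client
        score += 4
    if account["reply_fraction"] < 0.05:            # broadcasts, rarely converses
        score += 2
    return score

# Hypothetical accounts.
acct = {"tweets_per_day": 120, "source": "dlvr.it", "reply_fraction": 0.01}
print(bot_score(acct))  # 10 -> very likely automated
```

A real classifier would learn such weights from the manually annotated accounts rather than hand-tune them.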
This project analyses the relations on Twitter between politicians and journalists in the triangle of political communication in a hybrid media system (Chadwick, 2013).
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re... (Farida Vis)
Keynote delivered at the SRA Social Media in Social Research conference, London, 24 June, 2013. The presentation highlights some thoughts on sampling, tools, data, ethics and user requirements for Twitter analytics, including an overview of a series of recent tools.
Abstract:
Social media data is a rich source of behavioural data that can reveal how we connect and interact with each other online, in real time and over time, and what that might mean for our society as we continue to speed towards an increasingly computer-mediated future. As more and more Canadians join and contribute to various social media websites, their automatically recorded data are rapidly becoming available to third parties to mine for both commercial and academic purposes. As a result, questions around why and how data consumers use social media data are becoming pertinent. This talk will review different approaches to Social Media Data Stewardship (the collection, storage, use, reuse, analysis, and preservation of social media data) and discuss some ethical implications of working with such data.
For my final year project I used data analysis techniques to investigate user behaviour patterns, comparing similarity of interests and culture with offline geographical location. I selected this out-of-the-box topic because of my love of data analysis, and of social network analysis in the internet era in particular.
Data augmented ethnography: using big data and ethnography to explore candi... (Salla-Maaria Laaksonen)
In this paper we propose data-augmented ethnography as a novel mixed-methods approach that combines ethnographic, qualitative observations with social media data collection and computational analysis. Using two brief studies of online interaction as examples, we discuss the benefits and challenges of combining these two perspectives. We posit that observations made in the qualitative phase can be quantified and tested as hypotheses against the data collected later, during the analysis stage. Through our case studies we aim to shed light on the differences apparent at the party level and seek to understand how candidates differ in terms of interactivity based on their party's political standing. We ask: what insights does a mixed-methods approach, combining ethnographic observations with computational social science, offer to the study of interactivity and its many forms? To answer this question, we use a large dataset collected from different social media platforms before and during the 2015 parliamentary election in Finland. The data consist of textual data, including all candidate updates and the conversations they elicited, as well as field notes written and collected during the ethnographic fieldwork period before the elections.
In the age of social media communication, it is easy to modulate the minds of users and, in some cases, to instigate violent actions by them. There is a need for a system that can analyze the threat level of tweets from influential users and rank their Twitter handles, so that dangerous tweets, which can hurt people's sentiments and take the shape of violence, can be kept from going public on Twitter before fact-checking. The study aims to analyze and rank Twitter users according to their influence and the extremism of their tweets, to help prevent major protests and violent events. We scraped top trending topics and fetched tweets using those hashtags. We propose a custom ranking algorithm that considers source-based and content-based features, along with a knowledge graph, to generate a score and rank Twitter users accordingly. Our aim with this study is to identify and rank extremist Twitter users with regard to their impact and influence. We use a technique that takes into consideration both source-based and content-based features of tweets to generate a ranking of the extremist Twitter users having a high impact factor.
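The paper's actual scoring function is not given here; the following is a hedged sketch of how source-based and content-based features might be combined into a single rank. The weights and feature names are invented for illustration:

```python
def user_score(user, w_source=0.5, w_content=0.5):
    """Combine a source-based score (influence of the account) with a
    content-based score (extremism of its tweets) into one number.
    All inputs are assumed normalized to [0, 1]."""
    source = 0.6 * user["followers_norm"] + 0.4 * user["retweet_rate_norm"]
    content = user["extremism_norm"]  # e.g. output of a text classifier
    return w_source * source + w_content * content

# Hypothetical users.
users = [
    {"handle": "@a", "followers_norm": 0.9, "retweet_rate_norm": 0.8, "extremism_norm": 0.9},
    {"handle": "@b", "followers_norm": 0.2, "retweet_rate_norm": 0.1, "extremism_norm": 0.3},
]
ranked = sorted(users, key=user_score, reverse=True)
print([u["handle"] for u in ranked])  # @a ranks above @b
```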
Presentation at the Workshop on "Small Data and Big Data Controversies and Alternatives: Perspectives from The Sage Handbook of Social Media Research Methods" with Anabel Quan-Haase, Luke Sloan, Diane Rasmussen Pennington, et al.
LINK: http://sched.co/7G5N
Today, most personalization and recommendation services are built around interest-extraction models, but the outputs of these algorithms are ambiguous in nature. This makes it difficult to understand what users are personally interested in and, more importantly, how they feel about these interests and how their interests transition through time. By studying users' interests and emotions simultaneously, one can further investigate the motivation behind these interests. Such findings can be useful for building better interest-extraction models and algorithms that power personalization and recommendation services (e.g., ad targeting, e-commerce, and dating sites). In this paper, we propose the demonstration of a web visualization tool, EmoViz, which facilitates the exploration of users' interests and their emotions at a global scale. Through its various visual components, the tool aims to alleviate the problem of understanding what users of the world are interested in and the motivations behind their interests and feelings.
Accompanying paper for this work: http://ieeexplore.ieee.org/document/7403627/
User Classification of Organization and Organization Affiliated Users during ... (Hemant Purohit)
Understanding who participates, and in what activities, in social media conversations after crisis events can be helpful for response coordination agencies, other organizations, and their affiliates. See the paper:
Hemant Purohit, & Jennifer Chan. (2017). Classifying User Types on Social Media to inform Who-What-Where Coordination during Crisis Response. In ISCRAM-17.
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY (cscpconf)
This research work discusses how an integrated open-source-intelligence framework can help law enforcement and government entities investigating crimes, based on statistical and graph analysis of Twitter data. The solution supports both real-time and offline analysis of tweet collections. The framework employs tools with big data processing capabilities to collect, process, and analyze a huge amount of data. The outlined solution supports content-based and textual analysis, helping investigators dig into a person, and the community linked to that person, based on a tweet. Our solution supports an investigative process composed of the following phases: (i) find suspicious tweets and individuals based on hashtag analysis; (ii) classify the user profile based on Twitter features; (iii) identify influencers in the FOAF networks of the senders; (iv) analyze these influencers' background and history to find hints of past or current criminal activity.
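Phase (iii), identifying influencers in a sender's FOAF network, can be sketched with plain degree centrality on an adjacency list. This is a pure-Python stand-in for a graph library, and the sample network is hypothetical:

```python
def degree_centrality(edges):
    """Degree centrality over an undirected edge list:
    degree(v) / (n - 1), the classic normalization."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

# Hypothetical FOAF network around a suspicious sender.
edges = [("sender", "u1"), ("sender", "u2"), ("u1", "u2"),
         ("u2", "u3"), ("u2", "u4")]
cent = degree_centrality(edges)
top = max(cent, key=cent.get)
print(top)  # "u2" touches the most accounts in this network
```

In practice one would likely use richer measures (PageRank, betweenness) from a graph library, but the idea is the same: rank accounts by their structural position.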
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024 (APNIC)
Ellisha Heppner, Grant Management Lead, presented an update on APNIC Foundation to the PNG DNS Forum held from 6 to 10 May, 2024 in Port Moresby, Papua New Guinea.
Multi-cluster Kubernetes Networking: Patterns, Projects and Guidelines (Sanjeev Rampal)
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of multi-cluster Kubernetes networking architectures, with a focus on four key topics:
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/ CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/ deploying these solutions.
1. Wireless Communication System (JeyaPerumal1)
Wireless communication involves the transmission of information over a distance without the help of wires, cables or any other forms of electrical conductors.
Wireless communication is a broad term that incorporates all procedures and forms of connecting and communicating between two or more devices using a wireless signal through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements with its effective features.
The transmitted distance can be anywhere between a few meters (for example, a television's remote control) and thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
# Internet Security: Safeguarding Your Digital World
In the contemporary digital age, the internet is a cornerstone of our daily lives. It connects us to vast amounts of information, provides platforms for communication, enables commerce, and offers endless entertainment. However, with these conveniences come significant security challenges. Internet security is essential to protect our digital identities, sensitive data, and overall online experience. This comprehensive guide explores the multifaceted world of internet security, providing insights into its importance, common threats, and effective strategies to safeguard your digital world.
## Understanding Internet Security
Internet security encompasses the measures and protocols used to protect information, devices, and networks from unauthorized access, attacks, and damage. It involves a wide range of practices designed to safeguard data confidentiality, integrity, and availability. Effective internet security is crucial for individuals, businesses, and governments alike, as cyber threats continue to evolve in complexity and scale.
### Key Components of Internet Security
1. **Confidentiality**: Ensuring that information is accessible only to those authorized to access it.
2. **Integrity**: Protecting information from being altered or tampered with by unauthorized parties.
3. **Availability**: Ensuring that authorized users have reliable access to information and resources when needed.
## Common Internet Security Threats
Cyber threats are numerous and constantly evolving. Understanding these threats is the first step in protecting against them. Some of the most common internet security threats include:
### Malware
Malware, or malicious software, is designed to harm, exploit, or otherwise compromise a device, network, or service. Common types of malware include:
- **Viruses**: Programs that attach themselves to legitimate software and replicate, spreading to other programs and files.
- **Worms**: Standalone malware that replicates itself to spread to other computers.
- **Trojan Horses**: Malicious software disguised as legitimate software.
- **Ransomware**: Malware that encrypts a user's files and demands a ransom for the decryption key.
- **Spyware**: Software that secretly monitors and collects user information.
### Phishing
Phishing is a social engineering attack that aims to steal sensitive information such as usernames, passwords, and credit card details. Attackers often masquerade as trusted entities in email or other communication channels, tricking victims into providing their information.
### Man-in-the-Middle (MitM) Attacks
MitM attacks occur when an attacker intercepts and potentially alters communication between two parties without their knowledge. This can lead to the unauthorized acquisition of sensitive information.
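The main client-side defense against MitM attacks is an authenticated, encrypted channel, most commonly TLS with certificate and hostname verification. The sketch below is purely illustrative: it contrasts Python's properly verifying default TLS context with the insecure configuration that makes interception possible.

```python
# Illustrative sketch: TLS certificate verification as a MitM defense.
# Python's ssl module verifies server certificates and hostnames by default.
import ssl

# A properly configured client context: verifies the server's certificate
# chain against trusted CAs and checks that the hostname matches.
secure = ssl.create_default_context()
assert secure.verify_mode == ssl.CERT_REQUIRED
assert secure.check_hostname is True

# What NOT to do: disabling verification silently accepts any certificate,
# including one presented by a man in the middle.
insecure = ssl.create_default_context()
insecure.check_hostname = False          # must be disabled before verify_mode
insecure.verify_mode = ssl.CERT_NONE
```

Any client option that disables these checks (e.g. blanket "ignore certificate errors" flags) reopens the door to interception.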
### Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) Attacks
A DoS attack attempts to make a service unavailable by overwhelming it with traffic or resource requests. A DDoS attack does the same using many compromised machines (a botnet) at once, which makes the malicious traffic much harder to filter at its source.
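A common first line of defense against DoS-style traffic floods is rate limiting. The token-bucket sketch below is a minimal illustration of the idea only; real DoS mitigation lives in network infrastructure, not application code, and the `rate`/`capacity` values are arbitrary.

```python
# Minimal token-bucket rate limiter (illustrative sketch, not a
# production DoS defense). Tokens refill continuously at `rate` per
# second up to `capacity`; each allowed request spends one token.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Return True if a request may proceed, spending one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per client (e.g. per IP address) and reject or delay requests once `allow()` returns False.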
1. What do you do with 280 million tweets from the 2016 U.S. election?
Justin Littman
April 25, 2018
2. Overview
● Outline of the dataset
● Collecting the dataset
○ Social Feed Manager
● Sharing the dataset
○ TweetSets
● Uses of the dataset
● Plans for 2018 U.S. election
15. Social Feed Manager (SFM)
● Open source software by GW Libraries.
● User interface for collecting, managing & exporting social media data.
● Goal: Lower the technical barriers for collecting social media data for academic research and archiving.
● Supports Twitter, Tumblr, Flickr & Sina Weibo.
● Intended for organizations to run for their users.
go.gwu.edu/sfm
27. Sharing the dataset
● Twitter’s developer policies require sharing tweet ids only.
● Complete tweets can be “hydrated” from Twitter API.
○ Hydrating complete dataset takes about a month.
● Tweets that are deleted or from accounts that are protected, deleted, or suspended are not available.
● Provides a “right to be forgotten” but also:
○ Complicates reproducible research
○ Difficult to hold politicians accountable, research bots.
● However, share complete tweets within university.
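The hydration step described above amounts to batching tweet ids into Twitter's v1.1 `statuses/lookup` endpoint, which accepts up to 100 ids per request. The bearer token and input handling below are illustrative assumptions, not SFM's actual code:

```python
# Hedged sketch of hydrating shared tweet ids via Twitter's v1.1
# statuses/lookup endpoint (documented limit: 100 ids per call).
# BEARER_TOKEN is an assumed placeholder for your own credential.
import json
import urllib.request

LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"
BEARER_TOKEN = "..."  # assumption: supply your own API credential

def chunk(ids, size=100):
    """Split a list of tweet ids into lookup-sized batches."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def hydrate(ids):
    """Yield full tweet objects for a list of tweet id strings."""
    for batch in chunk(ids):
        req = urllib.request.Request(
            LOOKUP_URL + "?id=" + ",".join(batch),
            headers={"Authorization": "Bearer " + BEARER_TOKEN},
        )
        with urllib.request.urlopen(req) as resp:
            yield from json.load(resp)
```

In practice a tool such as twarc (`twarc hydrate ids.txt`) handles the batching, authentication, and rate limiting, which is why hydrating the full dataset still takes about a month.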
29. Sharing the dataset: Harvard’s Dataverse
● Almost 3,000 downloads (as of mid-2018).
● Each collection has a README.
→ Interested in collaborating on best practices for sharing datasets.
30. Sharing the dataset: TweetSets
● Open source software by GW Libraries.
● Basic idea: Reuse existing datasets, but allow filtering / querying for only the tweets that are needed.
● Conforms with Twitter policies.
○ Within university: Complete tweets
○ Public: Tweet ids only
tweetsets.library.gwu.edu
33. TweetSets step 2a: Query the tweets in datasets
● Tweet text
● Hashtags
● Mentions
● Posted by
● In reply to
Also, query by:
● Tweet type
● Created at
● URL
● Has image
● Is geotagged
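These filters can be approximated locally over a JSONL export of tweets. The predicate below is a hypothetical sketch using Twitter's v1.1 tweet JSON fields; it is not TweetSets' actual query engine, whose internals the slides don't detail.

```python
# Hedged sketch: filtering a local file of tweets (one JSON object per
# line, Twitter v1.1 format) by TweetSets-style criteria. The keyword
# names are illustrative, not TweetSets' API.
import json

def matches(tweet, text=None, hashtag=None, mention=None, posted_by=None):
    """Return True if the tweet satisfies every criterion that was given."""
    ents = tweet.get("entities", {})
    if text and text.lower() not in tweet.get("text", "").lower():
        return False
    if hashtag and hashtag.lower() not in [
        h["text"].lower() for h in ents.get("hashtags", [])
    ]:
        return False
    if mention and mention.lower() not in [
        m["screen_name"].lower() for m in ents.get("user_mentions", [])
    ]:
        return False
    if posted_by and \
            tweet.get("user", {}).get("screen_name", "").lower() != posted_by.lower():
        return False
    return True

def filter_dataset(path, **criteria):
    """Yield tweets from a JSONL file that match all given criteria."""
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            if matches(tweet, **criteria):
                yield tweet
```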
39. Academic research
● Clare H. Liu, “Applications of Twitter Emotion Detection for Stock Market Prediction.” Masters thesis at MIT.
● David Anuta, Josh Churchin & Jiebo Luo, “Election Bias: Comparing Polls and Twitter in the 2016 U.S. Election.”
● Sicheng Zhao, Yue Gao, Guiguang Ding & Tat-Seng Chua, “Real-Time Multimedia Social Event Detection in Microblog.” IEEE Transactions on Cybernetics.
● Ahsen J. Uppal & H. Howie Huang, “Event Prediction from Dynamic Communities in Social Networks.”
40. Journalists
● Significant interest in dataset after release of list of IRA accounts by Senate Intelligence Committee.
● We identified 36,210 tweets from these accounts.
● Sharing these deleted tweets violates Twitter policy.
● University weighed public interest vs. risk of losing access to Twitter API for GW researchers.
● See nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731
41. Deleted tweets research
● With Catie Bailard (School of Media & Public Affairs, GWU) & Andy Hoagland (data scientist)
● Possible research questions:
○ What is the substantive content of deleted vs. extant tweets about the candidate(s)?
○ What was the relative distribution of deleted / extant tweets in terms of the proportion that were pro- / anti- Hillary / Trump?
○ Were tweets with a certain type of content more likely to be deleted than those with other types of content?
42. Deleted tweets research
● Possible research questions:
○ What portion of tweets deleted by Twitter were likely-bots vs. likely-humans? Were there differences in the substantive content of deleted tweets generated by likely-humans versus likely-bots?
43. Deleted tweets research
● 92 million tweets from October 8th to November 8th, 2016 which contain “Clinton,” “Trump,” “Donald,” “Hillary,” “@realDonaldTrump” or “@HillaryClinton”.
● Split deleted tweets from extant tweets.
○ 22 million tweets (24%) were deleted
● Created 10% sample of deleted tweets & 1.5% sample of extant tweets.
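Samples like these can be drawn reproducibly with a seeded random number generator, so the same tweet ids are selected on every run. This is an illustrative sketch; the seed and rates below are arbitrary choices, not the study's.

```python
# Hedged sketch: drawing reproducible random samples of tweet ids,
# in the spirit of the 10% deleted / 1.5% extant samples above.
# The seed value is an arbitrary illustration.
import random

def sample_ids(ids, rate, seed=2016):
    """Return a deterministic random sample of round(rate * len(ids)) ids."""
    rng = random.Random(seed)       # fixed seed => same sample every run
    k = round(rate * len(ids))
    return rng.sample(ids, k)

deleted_sample = sample_ids([str(i) for i in range(1000)], rate=0.10)
```

Fixing the seed matters for reproducible research: anyone rerunning the pipeline on the same id list recovers the identical sample.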
44. Deleted tweets research
● For each tweet in deleted tweets sample, determined reason for deletion.
○ For example: user suspended, original user suspended, tweet deleted
● For each user in each of the samples, ran bot detection.
○ Botometer, using API.
○ Used tweets from full dataset, rather than live Twitter.
○ Not all users had enough tweets.
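The "reason for deletion" step can be sketched as mapping the error codes Twitter's v1.1 API returns when a tweet can no longer be fetched. The codes below (63, 144, 179) are documented v1.1 error codes; the function itself is an illustrative assumption, not the study's actual code.

```python
# Hedged sketch: classifying why a tweet id failed to hydrate, using
# documented Twitter v1.1 error codes. Illustrative only.
REASONS = {
    63: "user suspended",        # "User has been suspended"
    144: "tweet deleted",        # "No status found with that ID"
    179: "account protected",    # "Not authorized to see this status"
}

def deletion_reason(error_response):
    """Map a failed tweet-lookup response body to a human-readable reason."""
    for err in error_response.get("errors", []):
        if err.get("code") in REASONS:
            return REASONS[err["code"]]
    return "unknown"
```

Cases such as "original user suspended" (a retweet whose source account was suspended) need an extra lookup of the original tweet's author, which this sketch omits.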
45. Deleted tweets research
● Performing content analysis of 3000 tweets.
○ Coding for overall “gist” (anti-Trump, anti-Hillary, pro-Trump, and/or pro-Hillary), specific subject matter (e.g., criticizes candidate’s personal qualities or past actions, calls-to-action), identity (e.g., race, gender), more.
○ Three humans code each tweet using DiscoverText.
○ Average Krippendorff’s Alpha score 0.73.
● Will use neural network machine learning to generalize to larger dataset.
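The agreement statistic cited above, Krippendorff's alpha, can be computed from scratch for nominal categories. This is a from-first-principles illustration (items × coders, `None` for missing ratings), not the DiscoverText implementation.

```python
# Hedged sketch: Krippendorff's alpha for nominal data.
# alpha = 1 - D_o / D_e, where D_o is observed disagreement from the
# coincidence matrix and D_e is disagreement expected by chance.
from collections import Counter
from itertools import permutations

def krippendorff_alpha(ratings):
    """Alpha over a list of per-item rating lists (None = missing rating)."""
    coincidences = Counter()
    for item in ratings:
        values = [v for v in item if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than 2 ratings are not pairable
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), w in coincidences.items():
        n_c[c] += w
    n = sum(n_c.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c, k in permutations(n_c, 2)) / (n - 1)
    return 1 - observed / expected

# Three coders labeling four tweets with perfect agreement -> alpha = 1.0
perfect = [["pro"] * 3, ["anti"] * 3, ["pro"] * 3, ["anti"] * 3]
assert krippendorff_alpha(perfect) == 1.0
```

A value of 0.73, as reported on the slide, indicates substantial (though not perfect) agreement among the three coders.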
50. Plans for #election2018: Currently collecting
● Top accounts
○ 5,000+ accounts extracted from neutral collection because a top tweeter, retweeted account, or mentioned account.
○ Add new accounts every week from rolling 2 weeks of tweets.
○ Already seeing significant churn as accounts are suspended.
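The weekly extraction of top tweeters, retweeted accounts, and mentioned accounts can be sketched as a frequency count over a rolling window of v1.1 tweet JSON. The function below is an illustration of the idea, not the actual SFM pipeline; the threshold of 5,000 matches the slide's figure.

```python
# Hedged sketch of the "top accounts" step: count who tweets, is
# retweeted, or is mentioned in a window of tweets (v1.1 JSON), then
# keep the most frequent screen names. Illustrative only.
from collections import Counter

def top_accounts(tweets, n=5000):
    """Return the n most frequently seen screen names in a batch of tweets."""
    counts = Counter()
    for t in tweets:
        counts[t["user"]["screen_name"]] += 1           # top tweeters
        rt = t.get("retweeted_status")
        if rt:
            counts[rt["user"]["screen_name"]] += 1      # retweeted accounts
        for m in t.get("entities", {}).get("user_mentions", []):
            counts[m["screen_name"]] += 1               # mentioned accounts
    return [name for name, _ in counts.most_common(n)]
```

Rerunning this over each rolling two-week window, and diffing the result against the current follow list, yields the weekly additions the slide describes.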
51. Plans for #election2018:
● Individual candidates
● Local parties
● Local hashtags
→ Currently in discussions with a news organization to collaborate on identifying these accounts / hashtags.
→ Thinking about how to “cut through noise” to collect tweets from citizens.
→ Working on contemporaneous web archiving of linked web resources and media.