DATA-MINING TWITTER FOR POLITICAL SCIENCE: A PROJECT BASED
METHODOLOGICAL APPROACH
by
ALFREDO HICKMAN JR
THESIS
Presented to the Faculty of the
Honors College
The University of Texas at San Antonio
In Partial Fulfillment
Of the Requirements
For the Degree of
BACHELOR OF ARTS IN POLITICAL SCIENCE
WITH HIGHEST HONORS IN THE HONORS COLLEGE
THE UNIVERSITY OF TEXAS AT SAN ANTONIO
College of Liberal and Fine Arts
Department of Political Science and Geography
May 2015
DATA-MINING TWITTER FOR POLITICAL SCIENCE: A PROJECT BASED
METHODOLOGICAL APPROACH
PREPARED BY:
________________________________________
Alfredo Hickman Jr
APPROVED BY:
________________________________________
Bryan Gervais, Ph.D., Thesis Advisor
________________________________________
Ritu Mathur, Ph.D., Thesis Reader
________________________________________
Walter Wilson, Ph.D., Thesis Reader
Accepted: _________________________________________
Richard Diem, Ph.D., Dean of the Honors College
Received by the Honors College:
______________________
ACKNOWLEDGEMENTS
First and foremost, I would like to acknowledge and thank God. I would like to
acknowledge my parents. Had it not been for the sacrifice and efforts of my parents, I would not
exist or be the man that I am today. I would like to thank and acknowledge my wife, Crystal. My
wife’s support throughout this project has been a blessing. I would like to thank and
acknowledge the faculty and staff at the University of Texas at San Antonio, its Honors College,
and its College of Liberal and Fine Arts. Dr. Bryan T. Gervais, Ph.D., has been a great source of
knowledge, experience, and wisdom, and is a trusted mentor and advisor. Dr. Ann Eisenberg,
Ph.D., has also been a great source of encouragement and support, in not only the development
of this Thesis and supporting research, but also my all-around educational development while at
the University of Texas at San Antonio. In addition, I would like to thank my thesis readers, Dr.
Ritu Mathur, Ph.D., and Dr. Walter Wilson, Ph.D.
Ultimately, I would like to thank and acknowledge the academics, researchers, and
software developers that have contributed to the base of knowledge, information, and software
that exist in the realms of Political Science, Data Science, and Information Systems. In
particular, I would like to thank JetBrains for the development software, Ubuntu and the Linux
community for the platform and support, MongoDB for the database, Robomongo for the
database administration software, GitHub for hosting the open-source code repositories, and
Guillermo Del Fresno, on GitHub, for developing twitterstream-to-mongodb. The work that I
present in this Thesis is an amalgam of the fields and technologies mentioned, and it builds
on the effort, intellect, and sacrifice of those that have come before me; they are truly the giants
on whose shoulders I stand.
May 2015
ABSTRACT
DATA-MINING TWITTER FOR POLITICAL SCIENCE: A PROJECT BASED
METHODOLOGICAL APPROACH
Alfredo Hickman Jr, B.A.
The University of Texas at San Antonio, 2015
Supervising Professor: Bryan Gervais, Ph.D.
This thesis will examine the creation and use of a data-mining system to extract, process,
and analyze Twitter “tweets” for Political Science. By providing a free and open platform for
rapidly sharing and exchanging ideas, Twitter has become the most popular microblogging site
and system in the world. Twitter allows its users to disclose their actual names, or post tweets
anonymously; this has fostered an environment that allows people to discuss and comment on
politics with a scope, liberty, and candor that has never before existed. Twitter can be an
invaluable tool for political scientists that wish to better understand the motives, thoughts,
sentiments, and social networks of people as it pertains to politics and social phenomena.
During the course of my research, I have built and maintained an information system that
collects and processes selected Twitter data live. In conjunction with ps_proj, an authenticated
application I created on Twitter’s Developers Site, I use Twitter’s Streaming Application
Programming Interface (API) to collect streaming data on a randomly selected list of 279
Members of Congress (MCs). Once the tweet data set is captured, I will analyze the messages,
and the accompanying metadata and data. I expect the data, once analyzed, will produce insights
into the American political being, and allow political scientists to create information products
critical to understanding social and political behavior.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
ACRONYMS AND DEFINITIONS
CHAPTER 1: INTENT AND ETHICAL CONSIDERATIONS
CHAPTER 2: INTRODUCTION
CHAPTER 3: THESIS STATEMENT
CHAPTER 4: METHODS AND APPROACH
CHAPTER 5: DATA PROCESSING AND ANALYSIS
CHAPTER 6: POTENTIAL APPLICATIONS
CHAPTER 7: CONCLUSION
REFERENCES
APPENDICES
ACRONYMS AND DEFINITIONS
API: Application Programming Interface – A programmatic specification and mechanism for
interfacing with software components.
Back-end: The mechanism that allows data to be collected and stored in a distributed
computational system.
Client: Software or hardware system that requires services from another platform.
Cloud computing: Computational services hosted on remote, networked, and distributed
information systems that are consumed like a commodity.
Front-End: The mechanism that allows a distributed computational system to input, process,
and transmit data.
Host: Software or hardware system that provides a platform for other systems.
IP Address: The identifying value assigned to a device participating in an Internet Protocol
network.
iSCSI: Internet Small Computer Systems Interface – A protocol used to facilitate the use and connection
of storage resources on computer networks.
JSON: JavaScript Object Notation – A language-independent standard used for transmitting
human readable text between computer systems.
Linux: A free and open source operating system base.
LUN: Logical Unit Number – The identification mechanism used to identify a networked storage
resource in an iSCSI storage model.
MC(s): Member(s) of Congress
MongoDB: A NoSQL document oriented database that uses JSON to provide flexible schemas.
NoSQL: The concept of non-structured storage and retrieval in non-relational databases.
Operating System: The suite of software that provides functionality to client computer software
and host hardware.
Python: A popular, multi-purpose, high-level computer programming language.
Server: A networked computer whose function is to provide services to client computers.
Ubuntu: A Linux based operating system.
Vagrant: A configurable, portable, and reproducible computational work environment.
VirtualBox: A software platform for virtualizing computer operating systems.
CHAPTER 1: INTENT AND ETHICAL CONSIDERATIONS
The intent of this thesis is not to delve into a theoretically normative discourse on the
pros, cons, or applications of data-mining and analytics in general. Rather, my goal is to
display and share the empirical and methodological development and application of a
data-mining and analytics information system for the benefit of Political Science. I will,
however, briefly explore some of the politically theoretical and normative literature that
influenced me in the development of my data-mining and analytics system, and the
accompanying research. I would also be remiss and negligent if I did not acknowledge and
share some of my concerns about the potentially harmful applications and consequences of
data-mining and analytics systems such as the one that I have created, and of the much more
sophisticated systems that are in production and under development now and will be in the future.
Before delving into the internals and potential applications for a data-mining system such
as the one I present in this thesis, I believe it is crucial to explore some of the ethical
considerations involved with mining data from the public at large. In the relatively short amount
of time since the Internet was created and made available for public use (by the American
defense and academic communities), people from all over the world have come to depend on the
technology for an ever-increasing amount of daily activity. The Internet has revolutionized the
ways in which we live, communicate, create and consume information, and generate data,
metadata, and knowledge. With the rapid development of the Internet and peripheral
technologies, humanity has not only been able to share existing knowledge and information, but
has also created and distributed more new information, data, and metadata than at any prior time
in the human experience.
With the astronomic amounts of public and private information, data, and metadata that
have been created and shared on the Internet, have come new possibilities, opportunities, and
derivative technologies. For example, the nascent industries of electronic business intelligence,
data-mining, and data-analytics have emerged in the belief that vast amounts of value can be
generated from the information and data that the public creates and shares on the Internet.
Technologists, by collecting vast amounts of public and private data and metadata, can track,
analyze, and predict human behavior and generate potentially valuable information, products,
and services. With this information, public and private interests can create and construct products
and services that leverage, and potentially manipulate, human behavior in manners never before
possible. With the ability to track, monitor, and potentially manipulate human individuals and
populations at large have come many concerns about how electronic information and data are
used and abused.
Massive data breaches of the world’s largest and most powerful private and public
institutions and businesses, mostly driven by organized criminal and state actors, have rattled
many individuals, firms, and governments into questioning how and why electronic data is being
collected, processed, stored, and secured (Rosenzweig, 2013). Revelations of governments
demanding data and metadata from Internet service and data providers legally, illegally, or
otherwise unethically have alarmed many people in the human rights and civil liberty
communities. Instances such as Yahoo’s complicity in China’s persecution of political dissidents
have alarmed many state and non-state actors into demanding reforms and regulations for how,
and for what purposes, data and metadata on people are collected, used, and consumed (Ruggie,
2013).
Issues of ownership over the data and metadata that the public creates and consumes have
also been raised. At the time of this writing, the status quo and de facto standard is the
assumption that public data and metadata are mostly commodities to be processed,
sold, bought, and consumed, so long as the provider’s “general terms and conditions” apply
and have been accepted.
In addition, at the time of this writing, the revelations by the former National Security
Agency (NSA) contractor Edward Snowden are still fresh on many minds. Edward Snowden
alleged that the United States and other governments are collecting massive amounts of public
and private data and metadata, sometimes illegally, in the name of national security and other
interests (Greenwald, 2014). Since the Snowden revelations, many of the allegations made
against the United States were publicly and officially substantiated, and some reforms were
initiated.
With all the potential applications of data-mining and analytics, one must question and
query the potential public and private benefits and harms that can arise in the age of instant
communications and “big-data.” With the enormous amounts of data and metadata that are being
created and consumed daily, we, as a society, can choose to use the information, products, and
services they yield for the benefit or harm of our fellow man, and our shared environment and
communities.
CHAPTER 2: INTRODUCTION
Since Twitter’s creation and website launch in 2006, it has become the largest and fastest
growing micro-blogging site and system on the planet (Farhi, 2009). Twitter’s ability to cater to
people’s innate curiosity and need for information and interaction has resulted in close to 1
billion registered users and 271 million monthly active users as of October 2014. Due to the
roughly 140-character limit per tweet, the format of the communication forces people to
construct their messages succinctly and to the point. Contributing to the success of the Twitter
platform is the ability to post messages anonymously or not, follow other users, retweet other
users’ tweets, allow yourself to be followed, and a myriad of other features
that allow people to communicate, associate, and express themselves in ways never before
possible.
Because of Twitter’s popularity, use, and innate features, the site has fostered a
community of opinion and dialog unlike any system that has existed before it. The results of
Twitter’s system and operations are more extensive social networks, contexts, and information,
some of which are new to humanity at large and the social sciences in particular. Because of
Twitter’s success and proliferation, many social and political scientists have researched the
communications posted on the site in an effort to understand the intent, motivation, sentiment,
behavior, and other sociological factors of the Twitter users that create them. In the context of
the social sciences, the vast amount of scholarly work done in the realm of Internet-based social
networking has come in the form of direct collection and analysis of social networking messages
and data. While the conventional methods used in the social sciences for collecting and
analyzing the data are valid, I believe the methodology leaves a crucial factor out of the equation:
the metadata.
However, before delving into the world of Twitter architecture, metadata, data, and
information, and their potential value to the social sciences, I will define some key terms. I will
then briefly explore some of the relevant work that has come before, and how that helped frame
this research and its intent.
In the course of this thesis, I will use the following terms in this manner:
1. Metadata (um): The underlying information about the data being referenced that can
serve to provide enriched functionality, context, network, and potential meaning to
the information and data generated. In essence, the metadata is the glue and pointers
that bind and direct the individual message into the larger social network and
information ecosystem.
2. Data (um): The qualitative or quantitative dynamic values or value that make up
information, and which are structured or unstructured (raw) in a manner conducive to
mechanical and/or biological processing means and methods.
3. Information: The qualitative or quantitative product of a causal relationship between
data components in a system and its environment. Information can be transmitted and
consumed via message, observation, perception, or other biological or mechanical
processes. Information is what we want, and what is often, but not always, of value from a
data-mining and analytics system.
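To make the distinction between data and metadata concrete, the following minimal sketch splits a simplified tweet document into its message text and everything else. The sample tweet is invented for illustration, the field names (text, created_at, user, entities) are assumed to follow the general shape of tweet JSON, and split_tweet is a hypothetical helper rather than part of any library.

```python
import json

# A simplified, invented tweet; real tweets carry many more fields.
RAW_TWEET = """
{
  "text": "Proud to vote for the bill today. #example",
  "created_at": "Mon Mar 02 15:04:05 +0000 2015",
  "user": {"screen_name": "example_mc", "followers_count": 1200},
  "entities": {"hashtags": [{"text": "example"}], "user_mentions": []}
}
"""


def split_tweet(tweet):
    """Return (data, metadata): the message text versus every other field."""
    data = tweet.get("text", "")
    metadata = {key: value for key, value in tweet.items() if key != "text"}
    return data, metadata


tweet = json.loads(RAW_TWEET)
data, metadata = split_tweet(tweet)

print("Data:", data)
print("Metadata fields:", sorted(metadata))
```

Even in this reduced example, the message itself is one field among several; the remaining fields are the "glue and pointers" described in the definition of metadata.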
When shaping the idea for this project, I wanted to not only describe how to build a
functional social media collection and processing system, but also to explore how new
technologies like the Internet and social media can provide insights into the way people create
and consume political data and information. The spark that ignited my interest in the potential
value of social media with regard to political science was my interpretation of Diana Mutz's
Hearing the Other Side: Deliberative versus Participatory Democracy. Mutz (2006) argues that
exposure to multiple political views decreases participation in political activities and highlights
the potential conflict between deliberative and participatory democracy. Furthermore, Mutz
argues that the context and network in which political discussion takes place does matter, and
that they can serve to either facilitate or hamper political learning and action. Due to the social
norms that govern interpersonal communication and association, people often self-censor their
public political opinions and views in order to avoid conflict, ridicule, rejection, or a wide variety
of other social consequences.
Because tight social fabrics can stifle public political expression of dissenting opinions
and views, the observation of political expression in a medium as open as the Internet can be of
value to the political scientist and psychologist. Since much of the interpersonal communication
that can occur on the Internet and social media is free from the social norms and consequences of
live political expression and association, the observation and analysis of such behavior can
render valuable insights into the uncensored political mind (Gervais, 2014).
Research on political discourse and deliberation can be greatly enriched by using data
and metadata driven analysis of political discussion in the context of social networks on the
Internet. By capturing, collecting, processing, and analyzing tweets and their corresponding
metadata, researchers can understand how people create, consume, and share data and
information on the Internet. From these observations, researchers can better understand what
political topics are important to people, and where these topics are important in both the physical
and virtual worlds. By collecting and analyzing social media communications and their
corresponding metadata, researchers can identify political association and behavior as it occurs
in the context of social networks on the Internet and in real life.
Another field of study within the social sciences that can be advanced with the use of
social media data and metadata collection and analysis is the study of the communication
between government and its constituencies. In research done during the 111th Congress,
Matthew Eric Glassman and others looked into the way that government officials used Twitter to
communicate and inform people on a variety of topics of political importance. What Glassman
and his partners discovered was that MCs in the minority party tended to use social media at
higher rates than those of the majority party, and that the information was constructed to fulfill
requirements of information within functional contexts. The contexts ranged from district and
state constituencies, official political action groups, personal communications, replies to other
comments or questions, and position taking. The implications for out-groups having a larger
voice when not in power or when disenfranchised from society are something of critical value
for the political outlier. This value is even more evident when the communications occur in
contexts where speech and political dissent are commonly self-repressed, such as in certain
physical social settings and on traditional media.
What Glassman discovered was that social media allowed people to communicate and be
informed by their representatives in a more direct and unfiltered manner than was possible using
traditional media channels, such as television, radio news, and press conferences. With regard to
this type of study, data-mining and analytics could support the normative and theoretical bodies
of political science by providing new information on issues such as constituent-representative
relations, political communication and association on the Internet, and the potential for social
media and the Internet to encourage plebiscitary politics. As such, analysis into the social
networks that are created in the physical and virtual worlds when people create, consume, and
share electronic communications on the Internet could provide potential insights into the
“political being.” With this in mind, my research will explore the evolving data and metadata
trail that is created when such actions occur within Twitter. However, the possibilities span
much further than any one website or platform.
I hypothesize that if the Internet and social media provide new networks for
communication and association, along with the potential for social consequences and action, then
the data and metadata that are created and consumed when those actions occur, when analyzed,
can be of value to Political Science. However, it may prove difficult to please strict political
theorists in regards to defining what constitutes political communication, deliberation, and
participation in the context of the Internet and social media, and what that looks like. As such,
Jane Mansbridge (1999) argues that everyday political talk can be useful in promoting political
deliberation and participation if it meets certain stringent criteria. However, if all political
dialogue were held to this standard, very few discussions would ever be considered true political
deliberation. Nevertheless, if we loosen Mansbridge’s standard and apply the social norms of the
Internet then we can see that the analysis of political communications and social networks can be
of value.
While many of the political communications that take place via social media on the
Internet may not meet all of Mansbridge’s standards, the collection of the communications and
the information that can be derived from the underlying metadata can be of significant value to
the political scientist. As I have mentioned before, there is more to a tweet than just the text of
the message. The majority of what constitutes a tweet is actually a vast construct of metadata and
data structures that serve to provide enriched functionality and value to a tweet and its creators,
distributors, and consumers. I will elaborate on this in later chapters. However, what this implies
is that by collecting and analyzing tweets in their entirety, a political scientist can not only study
the content of the message field, but he or she can also construct sophisticated data models that
describe the locations, sentiments, interests, behaviors, and associations of the people that create,
consume, or share those tweets. So, when taking into account that social media is a primary
medium by which people communicate and act on matters of politics on the Internet,
data-mining and analytics can be invaluable for the political scientist.
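As one hedged illustration of such a data model, the sketch below builds a simple "who mentions whom" association table from tweet metadata alone. The sample tweets and screen names are invented, mention_edges is a hypothetical helper, and the entities.user_mentions layout is assumed to follow the general shape of tweet JSON.

```python
from collections import Counter

# Invented sample tweets, reduced to the fields this sketch needs.
SAMPLE_TWEETS = [
    {"user": {"screen_name": "rep_a"},
     "entities": {"user_mentions": [{"screen_name": "rep_b"}]}},
    {"user": {"screen_name": "rep_a"},
     "entities": {"user_mentions": [{"screen_name": "rep_b"},
                                    {"screen_name": "rep_c"}]}},
    {"user": {"screen_name": "rep_b"},
     "entities": {"user_mentions": []}},
]


def mention_edges(tweets):
    """Count (author, mentioned user) pairs across a collection of tweets."""
    edges = Counter()
    for tweet in tweets:
        author = tweet.get("user", {}).get("screen_name")
        if author is None:
            continue  # skip documents that lack author metadata
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            edges[(author, mention["screen_name"])] += 1
    return edges


edges = mention_edges(SAMPLE_TWEETS)
for (source, target), count in sorted(edges.items()):
    print(source, "->", target, count)
```

Note that the message text is never read here; the association table comes entirely from the metadata.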
To provide contrast to the research I present in this thesis, I found a research study
conducted at the University of Maryland College of Information Studies, in which Jennifer
Golbeck, Justin M. Grimes, and Anthony Rogers (2009) collected and analyzed over 6,000
tweets posted by various MCs. The conclusion of that study was that the tweets MCs created and
shared “tend not to provide new insights into government or the legislative process, or to
improve transparency, rather they are vehicles for self-promotion.”
In response to that, this thesis will not attempt to prove or disprove that information
collected from social media can serve to be the end-all-be-all of insight into the political mind.
Rather, the intent of my research is to display the development and potential applications of a
data-mining and analytics information system that can yield data and information valuable to
political science. A data-mining and analytics system, like the one I present in this thesis, can be
used to collect and process social media data and metadata, and then to create a framework for
future political studies. In essence, I will support the idea around which the entire “big-data,” and
data-mining and analytics industries have emerged: that there is potentially
significant value in the information produced when the underlying structures that are created
when people create, share, and consume information on the Internet and social media are
analyzed and operationalized.
This thesis also attempts to highlight that leaving the overwhelming majority of
what constitutes a tweet (or most other electronic messages) out of the analysis leaves a
huge factor out of the research: the metadata (reference Appendix 1: what a tweet really looks
like). Golbeck, Grimes and Rogers go on to state, “We have chosen not to study the underlying
social network (followers, following, and friends), but this is a rich space for future work.” I will
attempt to fill some of that space with my research and system. I will also support the idea that
the underlying social constructs enumerated in the metadata, of which the actual message is
only a minimal component, can be of significant value to political science.
By collecting, processing, and analyzing tweets in their entirety, metadata and all,
political scientists can develop a more robust understanding of people’s locations, sentiments,
interests, behaviors, and associations as they relate to matters of political interest and activity on
the Internet and in the “real world”. Perhaps, by exploring this new medium for electronic
communication and association, innovative methodologies can be developed to leverage the
Internet and social media, and help bridge the gap between political normative theory and
empirical quantitative analysis…even if only a bit. Enjoy!
CHAPTER 3: THESIS STATEMENT
Contemporary political science research of social media communications involves
collecting data, analyzing the data for components, creating variables, coding the variables,
operationalizing the variables, and attempting to produce an intellectual product of significance
and meaning. What I believe is left out of much of the Internet and social media based research
done in Political Science is the leveraging of information systems to facilitate a more robust
collection and analysis of electronic communications and social networks. As a result, in the
past, much collection and analysis of social media communications has left out some of the
most crucial and potentially valuable components of political communications and social
networks on the Internet: the metadata.
By using data-mining and analytics, political scientists can programmatically collect,
process, and analyze social media and other Internet communications automatically and
perpetually. By using these systems, political scientists can collect and operationalize massive
Internet-derived data sets, and craft unlimited amounts and types of queries and analytics to
create potentially valuable information products. These information products can then be
used to describe the political sentiments, interests, behaviors, and associations of practically
anyone using the Internet. With that, these information products can then be used to create new
bodies of political knowledge and information. By utilizing data-mining and analytics, political
scientists can produce information products that detail valuable information such as topics of
political interest, and overcome some of the challenges that occur when tackling complex
collection based projects on the Internet with reduced resources.
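A small example may help show what such a query could look like. The sketch below tallies hashtags across a handful of collected tweets as a rough proxy for topics of political interest. It runs on an in-memory list rather than a live database, the sample documents are invented, top_hashtags is a hypothetical helper, and the entities.hashtags layout is assumed to follow the general shape of tweet JSON.

```python
from collections import Counter

# Invented sample of collected tweet documents.
COLLECTED = [
    {"entities": {"hashtags": [{"text": "Budget"}, {"text": "jobs"}]}},
    {"entities": {"hashtags": [{"text": "budget"}]}},
    {"entities": {"hashtags": []}},
    {"text": "a document with no entities field"},
]


def top_hashtags(tweets, limit=10):
    """Return the most frequent hashtags, ignoring letter case."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts.most_common(limit)


top = top_hashtags(COLLECTED)
print(top)
```

Because the function only needs an iterable of tweet documents, the same logic could in principle be pointed at documents read from a database instead of the sample list.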
As such, I will attempt to convey the value and possibilities of employing data-mining
and analytics information systems for the benefit of Political Science by exploring the following
topics:
1. How to build a data-mining and analytics information system.
2. How to capture, transfer, and store tweets in their entirety (metadata and all).
3. What exactly is a tweet, and why is it potentially valuable (we will explore a
dissected tweet and identify and explain its composition).
4. Potential applications for data-mining and analytics systems in political science.
CHAPTER 4: METHODS AND APPROACH
Information System and Data Collection
In this section, I will detail the creation and composition of the data-mining and analytics
information system used for this project. The software and hardware used during the course of
the project are flexible and can be adapted or scaled as necessary. In addition, with the development
and proliferation of relatively inexpensive and accessible cloud computing services, the
information system I detail here can be adapted and ported over to a cloud provider and scaled as
needed.
The platform I created for this project is comprised of the following components:
1. A physical server computer to host the operating system and client software: this can be a
virtual server if running from the cloud or another networked computer. I chose to use a
dedicated PC that I loaded with a server operating system. For production
purposes, I recommend a dedicated physical, virtual, or cloud based server or a cluster of
servers if you really want to scale.
2. An operating system: I chose to use Ubuntu Linux Server as my operating system. I
chose to use Ubuntu Server because it is a free and open source, enterprise capable server
operating system. In addition, Ubuntu is well maintained, documented, and enjoys a
broad user and technical support base on the Internet.
3. Physical computer storage: I chose to create a storage area network (SAN) for my server
to utilize. For this, I used a 4-terabyte network attached storage appliance, created a
virtual disk pool, partitioned an iSCSI logical unit number (LUN) from the pool, and
assigned it as virtual storage for my Ubuntu Server via a routed virtual local area network
(VLAN).
4. A terminal to connect to your server: The terminal can be a physical monitor console, a
web browser, or a software terminal emulator. I chose to use a Secure Shell (SSH)
terminal emulator to securely connect to my server from anywhere.
5. A database: I chose to use MongoDB. MongoDB is an excellent fit for a data-mining and
analytics information system, because it stores documents in BSON, a binary form of the
JavaScript Object Notation (JSON) that is native to many social media communications.
6. Database administration software: I chose to use Robomongo because it is a free,
secure, and feature-rich database administration suite.
7. Programming Language and interpreter (if required): I chose to use the Python
Programming Language, because of its broad documentation, ease of use, clear syntax,
broad support base, rich software library pool, and open source nature.
8. The data-mining software engine: I chose to use twitterstream-to-mongodb by Guillermo
Del Fresno on GitHub (2014), because it is free, open source, licensed for general use
(GNU GPL), and it is written in my favorite programming language, Python.
Once you have acquired the necessary components, you will need to assemble, install, and
deploy your information system. 1
1
Reference the installation and deployment instructions particular to your software, hardware, and
operating system components.
Once the data-mining system is set up, the next step is to create an authenticated
application on the Twitter Developer’s Website at https://dev.twitter.com/ (this can be done
either before or after the previous step). The Twitter application you create in this step will allow
your data-mining and analytics system to make authenticated requests to Twitter’s APIs, and is
required for the data-mining portion of this system. From the Twitter developer’s website, you
can create an account, log in, and create a Twitter App (read-only access will suffice, unless you
want your system to publish information on behalf of your application). Once you have created
the Twitter app, you will need to record and safeguard the following values: “consumer key,
consumer secret, access token, and the access token secret.”2
2
Reference the screenshot in Appendix 2: Twitter App, for what the authentication and authorization variables look
like.
Now that the system is established, and the authenticated Twitter app is created, the next
step is to populate your system with the files and values it needs in order to data-mine Twitter. In
this step, you will need to login to your server and navigate to the directory in which your
twitterstream-to-mongodb script is located. Once you are in the correct directory, you will need
to create the following files and populate them with the following values.
1. Create a file named “oauth.json”: In this file, enter the following terms and values as
such:
{
"consumer_key" : "enter your consumer key here",
"consumer_secret" : "enter your consumer secret here",
"access_token" : "enter your access token here",
"access_token_secret" : "enter your access token secret here"
}
OAuth is the open standard that Twitter uses to allow for programmatic authentication
and access to their APIs.
2. Create a file named “objects.txt”: In this file, enter the objects you wish to track. Each
individual object must be separated by one space, and the list cannot exceed 400 objects
(the 400-object maximum is a Twitter API limitation). The objects can include the
following types of values: #example, @example, and example.3
3
Version 1 of the system I present in this thesis utilizes the Twitter Streaming API’s “track” feature. As such, the
system will only collect tweets that contain the values listed in the objects file. This particular API limitation will
exclude tweets that are created by a value listed in the objects file. However, the system will collect every tweet that
references a value listed in the objects file. In version 2 of this system, I will incorporate the Twitter Streaming
API’s “follow” feature, which will permit the collection of tweets that are directly created or shared by a value listed
in the objects file.
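Before starting a collection, it can help to check the two files described above. The following is a minimal sketch in Python, not part of twitterstream-to-mongodb: the function names and the sample credential values are hypothetical, and the objects file is split on any whitespace so that it accepts objects separated by spaces or by line breaks.

```python
import json

REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")
MAX_TRACK_TERMS = 400  # limit stated in the text above


def missing_oauth_keys(oauth_text):
    """Return the required credential names that are absent or empty."""
    # json.loads raises ValueError on malformed JSON, e.g. typographic quotes.
    creds = json.loads(oauth_text)
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


def parse_track_terms(objects_text):
    """Split the objects file on any whitespace and enforce the 400-term limit."""
    terms = objects_text.split()
    if len(terms) > MAX_TRACK_TERMS:
        raise ValueError("too many track terms: %d" % len(terms))
    return terms


# Hypothetical sample values, for illustration only.
sample_oauth = '{"consumer_key": "abc", "consumer_secret": "def", "access_token": "ghi"}'
print(missing_oauth_keys(sample_oauth))
print(parse_track_terms("#example @example example"))
```

In practice, the two strings would be read from oauth.json and objects.txt before the collection script is launched.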
Once the preceding files are created and populated, the next step is to initiate the script
and data collection. Initiate the script in the following manner, and from the directory that
contains the oauth, objects, and twitterstreamtomongodb.py files:
1. From the terminal, enter the following command and parameters (if on a Windows
system, disregard the “sudo”):4
sudo python twitterstreamtomongodb.py --oauth=oauth.json --server=127.0.0.1 --port=27017 --database="insert DB name here" --track=objects.txt
4
1. sudo is used on Linux-based systems to invoke the context of another account, typically one with elevated or
administrative privileges. The “python” command calls the Python interpreter to interpret the script (the file
immediately after it, ending in the .py extension).
2. --oauth is the parameter that passes your Twitter app’s credentials, stored in the file, to the program for
authentication and authorization to Twitter’s APIs.
3. --server denotes the Internet Protocol (IP) address of your server. This value can also be a host name
if you are using an externally provided hosting platform, or have otherwise resolved the IP address to
a host name. In this example, I am using the local host address of 127.0.0.1, which indicates that I am
running the program directly from the local machine. The tweets you collect will be routed or directed to
the IP address you place into this parameter. On this system, I am using a SAN, which has its own set of IP
addresses. However, the SAN is providing virtual storage that is logically attached to the host server, which
is why I am using the local host address.
4. --port denotes the software endpoint that facilitates application or protocol specific communication. In
this case, port 27017 is the default listening port for MongoDB core services.
5. --database is the parameter that references the database that will house the incoming Twitter data. The
program will automatically create the database on the server hosting MongoDB services that is referenced
in the --server parameter.
6. --track is the parameter that references the text file that contains the objects the system will track and
collect (one object per line with a maximum of 400 objects).
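To make the role of each parameter concrete, the following Python sketch mirrors the flags shown above with the standard argparse module and turns them into a MongoDB connection string. It is an illustration under stated assumptions, not the actual argument handling inside twitterstream-to-mongodb: the default values and the helper names build_parser and mongo_uri are hypothetical, and the database name DMTPS is taken from the query output later in this chapter.

```python
import argparse


def build_parser():
    """Hypothetical mirror of the collector's command-line flags."""
    parser = argparse.ArgumentParser(description="Illustrative flag parser")
    parser.add_argument("--oauth", required=True, help="path to the OAuth credentials file")
    parser.add_argument("--server", default="127.0.0.1", help="MongoDB host name or IP address")
    parser.add_argument("--port", type=int, default=27017, help="MongoDB listening port")
    parser.add_argument("--database", required=True, help="database that will hold the tweets")
    parser.add_argument("--track", required=True, help="path to the objects file")
    return parser


def mongo_uri(args):
    """Build a standard MongoDB connection string from the parsed options."""
    return "mongodb://%s:%d/%s" % (args.server, args.port, args.database)


# The defaults for --server and --port are assumptions that match the example command.
args = build_parser().parse_args(
    ["--oauth=oauth.json", "--database=DMTPS", "--track=objects.txt"]
)
print(mongo_uri(args))
```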
Once the program is initiated, the data-mining begins, and the tweets will start pouring in
as soon as they are created or distributed. From this point, how long you collect tweets is up to
you, and is limited only by the resources you allocate to the data-mining system and Twitter’s
rate limitation protocol. Once you have collected an acceptable data set, the next thing to do is to
analyze the data and generate an information product of potential value. However, before
detailing the analysis portion of this project, I believe that it is crucial to explore the composition
of a tweet, and explore why a collection of tweets can be valuable.
So what exactly is a tweet? The common conception is that a tweet is a roughly
140-character message that, on its face, is only able to communicate the most minimal of information.
However, as I have alluded to throughout this thesis, there is more to a tweet than meets the eye:
much more. At the heart of a tweet lies a rich metadata architecture that binds and directs the
individual tweet into the larger social network and information ecosystem. In essence, the tweet
metadata provides defined fields, which can then be populated by personally identifying and
descriptive data pertaining to the creator, distributor, and consumer of the tweet.
The data and metadata associated with a tweet can then be used to create information
constructs such as location mosaics, “webs-of-association,” behavior-pattern analyses, and
sentiment analyses. Information constructs derived from the underlying data and metadata
contained within a tweet can then detail how individual creators and consumers of a tweet relate
in the broader Twitter social network, and even in the real world. What this means is that an
individual, or an automated information system, can use Twitter data and metadata, or most other
metadata, to create information models that detail human behavior and association. While there
are numerous ways to depict Twitter metadata, I believe the most accessible manner is through a
visual aid with descriptions of the various components.
The tweet metadata depicted in the following screenshot is a graphical representation
and may be difficult to view. Appendix 1 depicts a tweet’s metadata in its native textual
representation, JavaScript Object Notation (JSON).
The following list details some of the potentially significant metadata fields associated
with tweets and describes their functions:
1. _id: A unique alphanumeric identifier for the stored tweet document.
2. contributors: The IDs of users who have contributed to the tweet.
3. text: The actual message field of the tweet; this is what most people usually see when
a tweet is created or consumed.
4. in_reply_to_status_id: If the tweet is a reply to another tweet, this field provides
the integer representation of the original tweet’s ID.
5. favorite_count: How many times the tweet has been “favorited” by other Twitter
users.
6. source: The generating source of the tweet (such as the Twitter for iPhone app).
7. coordinates: The longitude and latitude of the tweet’s generating source.
8. entities: This field contains the following subfields: hashtags, any hashtags
referenced in the tweet; user_mentions, any Twitter users mentioned in the tweet;
symbols, any symbols listed in the tweet; media, the resource locators for any
pictures, videos, or other media files associated with the tweet; and urls,
the uniform resource locators provided in the tweet.
9. retweet_count: The number of times the tweet has been retweeted.
10. retweeted_status: Within retweeted_status are descriptive and identifying data
fields for the original tweet that was retweeted and its creator, including:
contributors, id, favorite_count, source, retweeted, coordinates, and entities.
11. user: Within the user field exist data and metadata that identify and describe the
primary composer of the tweet, including the following fields: id, the unique
identifier of the user account that created the tweet; verified, whether or not the user’s
Twitter account is verified; friends_count, the number of friends the tweet creator
has; location, the location listed on the user’s account profile; geo_enabled, whether
the user account has geo-tracking enabled; name, the name of the Twitter account;
lang, the language setting of the user account; favourites_count, the number of tweets
that the user has marked as favorites; screen_name, the screen name of the Twitter user;
created_at, the date-time stamp of the account’s creation; contributors_enabled,
whether or not the Twitter user has permitted the use of authenticated contributors;
and time_zone, the time zone set on the user account.
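To show how these nested fields can be read programmatically, the following Python sketch flattens a handful of them into a single record. The sample values are abridged from the first tweet in Appendix 1; the function name summarize and the choice of output keys are hypothetical and only illustrate one possible shape.

```python
from datetime import datetime, timezone

# Abridged from the first tweet in Appendix 1.
tweet = {
    "lang": "en",
    "timestamp_ms": "1412832228168",
    "coordinates": None,
    "entities": {
        "user_mentions": [{"screen_name": "AustinScottGA08"}],
        "hashtags": [{"text": "MSSen"}, {"text": "RememberMississippi"}, {"text": "MakeDCListen"}],
    },
    "user": {"screen_name": "freedomsfool", "friends_count": 3389, "geo_enabled": False},
}


def summarize(tweet):
    """Flatten a few nested metadata fields into one record."""
    entities = tweet.get("entities", {})
    # timestamp_ms holds milliseconds since the Unix epoch, stored as a string.
    created = datetime.fromtimestamp(int(tweet["timestamp_ms"]) / 1000, tz=timezone.utc)
    return {
        "author": tweet["user"]["screen_name"],
        "friends": tweet["user"]["friends_count"],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "has_coordinates": tweet.get("coordinates") is not None,
        "created_utc": created.strftime("%Y-%m-%d %H:%M:%S"),
    }


record = summarize(tweet)
print(record)
```

The converted timestamp agrees with the tweet-level created_at value shown in Appendix 1 (Thu Oct 09 05:23:48 +0000 2014).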
The metadata fields I just described are only a few of the total fields available in the ever-
evolving Twitter system. As you can see, there are many more metadata fields depicted in the
graphical representation, and many more in the textual representation illustrated in Appendix 1.
The potential applications for deriving value from these metadata and data points are limited only
by the creativity, ability, resources, and access of the individual or system that captures,
processes, and analyzes them.
At this stage of the operation, the data-mining system should have collected a database
composed of collections, which will contain every tweet referencing an object listed in your
“objects” file. Now that I have detailed the creation of a data-mining system, created an
authenticated Twitter app, and dissected and explored a tweet’s metadata structure, we can move
on to the methods and approach I used to process and analyze Twitter data and metadata.5
5
Reference Appendix 3 for database backup and restore instructions.
CHAPTER 5: DATA PROCESSING AND ANALYSIS
The following examples are queries and information products I created using data
captured, collected, and processed by my data-mining and analytics system. During my
collection period, beginning on 8 October 2014 at 2000 hrs., and ending on 25 October 2014 at
2000 hrs., I collected almost every tweet referencing a randomly selected list of 279 MCs. In
total, my data-mining and analytics system automatically collected 472,395 tweets, including all
the corresponding metadata. Now, I will move on to the analysis portion of this
project.
One of the most approachable methods to analyze social media information, without
initially being too bogged down in the intricacies of metadata analysis, is to create a table
analysis. In this example, I select a sample-set of collected Twitter objects, in this case the
Twitter handles of certain MCs, and assign them variables. The variables correspond to the MC’s
name, age, party, chamber, state, district, district competitiveness (DC), and the number of
tweets associated with that MC.6
6
District competitiveness is defined with an “S” for safe, or an “N” for not safe. The district competitiveness
information was collected from Sabato’s Crystal Ball at http://www.centerforpolitics.org/crystalball/
Handle            Name                Age  Party  Chamber  State  District  DC  Tweets
@SpeakerBoehner   Boehner, John       64   R      H        OH     8         S   44590
@SteveScalise     Scalise, Steve      49   R      H        LA     1         S   4278
@WhipHoyer        Hoyer, Steny        75   D      H        MD     5         S   1963
@McConnellPress   McConnell, Mitch    72   R      S        KY     -         S   4978
@SenatorDurbin    Durbin, Richard     69   D      S        IL     -         S   1857
@SenFeinstein     Feinstein, Dianne   81   D      S        CA     -         S   2567
@JoaquinCastrotx  Castro, Joaquin     40   D      H        TX     20        S   1636
@RepCuellar       Cuellar, Henry      59   D      H        TX     28        S   403
@SenSanders       Sanders, Bernie     73   D      S        VT     -         S   10657
@SenJohnMcCain    McCain, John        78   R      S        AZ     -         S   20726
@SenTedCruz       Cruz, Ted           43   R      S        TX     -         S   32307
@SenSchumer       Schumer, Chuck      63   D      S        NY     -         S   2633
@RepBetoORourke   O’Rourke, Beto      42   D      H        TX     16        S   884
@RepWestmoreland  Westmoreland, Lynn  64   R      H        GA     3         S   1166
@RepTomPrice      Price, Tom          60   R      H        GA     6         S   777
@repjohnbarrow    Barrow, John        59   D      H        GA     12        S   368
@LEETERRYNE       Terry, Lee          52   R      H        NE     2         N   2671
@RepNickRahall    Rahall, Nick        65   D      H        VA     3         N   610
@CongMikeSimpson  Simpson, Mike       64   R      H        ID     2         N   530
@RepBera          Bera, Ami           49   D      H        CA     7         N   836
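Once the variables are in tabular form, simple aggregations become straightforward. The following Python sketch sums the tweet counts by party and by chamber for five rows copied from the table above. The tuple layout and the helper name tweets_by are assumptions made for illustration; a spreadsheet or a statistics package would serve the same purpose.

```python
from collections import defaultdict

# (handle, party, chamber, tweets) for a few rows copied from the table above.
rows = [
    ("@SpeakerBoehner", "R", "H", 44590),
    ("@SteveScalise", "R", "H", 4278),
    ("@WhipHoyer", "D", "H", 1963),
    ("@SenTedCruz", "R", "S", 32307),
    ("@SenSchumer", "D", "S", 2633),
]


def tweets_by(rows, column):
    """Sum the tweet counts for each distinct value in the chosen column index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[3]
    return dict(totals)


print(tweets_by(rows, 1))  # totals keyed by party
print(tweets_by(rows, 2))  # totals keyed by chamber
```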
Now that descriptive and identifying attributes have been associated with the MC’s
Twitter handles, the next step is to run some basic analytic queries against the collections in the
database and extract some potentially useful information. For the next step, I will query the
collection associated with each of the MCs listed and extract the number of Tweets that
referenced the MC in the “Twitterverse,” during the collection period.7
7
In order to capture the total number of documents (tweets) contained within a collection (MC) within the database,
use the Mongo shell or a GUI database management console. From the command line terminal, enter the command
“mongo” to open the Mongo shell. From the Mongo shell, enter: use “enter db name”. Then, from within the
database, enter: db['@enter-tracking-object-name-here'].stats()
Once the query executes, the system will produce statistics from the queried collection
and output them to the terminal in a JSON representation. The output will look like this:
{
"ns" : "DMTPS.@SpeakerBoehner",
"count" : 44590,
"size" : 346754336,
"avgObjSize" : 7776,
"storageSize" : 460861440,
"numExtents" : 14,
"nindexes" : 1,
"lastExtentSize" : 124993536,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 1,
"totalIndexSize" : 1455328,
"indexSizes" : {
"_id_" : 1455328
},
"ok" : 1
}
For the purpose of this query, the important value to extract is the “count” field, which is
the total number of documents, tweets in this case, that referenced a particular MC during the
collection period. In the following examples, I will construct more advanced, metadata-driven
queries that will extract identifying and associative data from the collection database.
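When many collections need to be tallied, the “count” value can be pulled out of the statistics output programmatically. The following Python sketch parses an abridged copy of the output shown above with the standard json module; the variable names are hypothetical, and it assumes the output has been saved as text.

```python
import json

# Abridged copy of the stats() output shown above.
stats_output = """
{
  "ns": "DMTPS.@SpeakerBoehner",
  "count": 44590,
  "size": 346754336,
  "avgObjSize": 7776,
  "ok": 1
}
"""

stats = json.loads(stats_output)
# The namespace is the database name and the collection name joined by a dot.
database, collection = stats["ns"].split(".", 1)

print(collection, stats["count"])
# For this output, avgObjSize equals the total size divided by the count, rounded down.
print(stats["size"] // stats["count"] == stats["avgObjSize"])
```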
In the following examples, I have anonymized any personally identifying information my
queries and analytics produced for privacy reasons. For these queries and analytics, I use the
MongoDB Aggregation Framework to query the collection associated with a Member of
Congress, and then to find the following information for every tweet that references the specified
MC:
1. The text of the tweet referencing a specific MC.
2. The Twitter users referenced in the tweet (the intended audience).
3. The user screen name and “real name” of the Twitter account holder that created the
tweet.
4. The number of friends that the Twitter user has.
5. The location and country in which the tweet was created.
6. The geographic coordinates of the location where the tweet was created.
In order to query the collection and extract the pertinent information, the following query
must be run against the collection you wish to analyze using the MongoDB Aggregation
Framework.
{
$group: {
_id: {
text: "$text",
entities_user_mentions_screen_name: "$entities.user_mentions.screen_name",
user_name: "$user.name",
user_screen_name: "$user.screen_name",
user_friends_count: "$user.friends_count",
place_full_name: "$place.full_name",
geo_coordinates: "$geo.coordinates"
}
}
}
Once the query executes, a document will be created that contains the information you
extracted from the data. The following output is a real example of an information product the
query generated:
{
"_id":{
"text":"These are #Ukraine war crimes. #ukrainevotes jail the Kiev criminals. http://t.co/uIW6KxqzG4\"\n@WhiteHouse \n@BarackObama \n@SpeakerBoehner",
"entities_user_mentions_screen_name":[
"WhiteHouse",
"BarackObama",
"SpeakerBoehner"
],
"user_name":"Pattys4Putin-USA",
"user_screen_name":"PattyDs50",
"user_friends_count":1598,
"place_full_name":"New Hampshire, US",
"geo_coordinates":[
42.908474,
-71.841744
]
}
}
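The same stage can also be submitted from Python rather than from a GUI console. The following is a minimal sketch assuming the pymongo driver; the function names are hypothetical, and the database and collection names in the usage comment are taken from the statistics output earlier in this chapter.

```python
def build_projection_pipeline():
    """Return the $group stage from the text as a one-stage pipeline."""
    return [
        {
            "$group": {
                "_id": {
                    "text": "$text",
                    "entities_user_mentions_screen_name": "$entities.user_mentions.screen_name",
                    "user_name": "$user.name",
                    "user_screen_name": "$user.screen_name",
                    "user_friends_count": "$user.friends_count",
                    "place_full_name": "$place.full_name",
                    "geo_coordinates": "$geo.coordinates",
                }
            }
        }
    ]


def run_pipeline(collection, pipeline):
    """Run the pipeline; `collection` is expected to be a pymongo Collection."""
    # allowDiskUse lets large $group stages spill to disk instead of failing.
    return list(collection.aggregate(pipeline, allowDiskUse=True))


# Usage, assuming pymongo is installed and MongoDB is listening locally:
#   from pymongo import MongoClient
#   client = MongoClient("127.0.0.1", 27017)
#   results = run_pipeline(client["DMTPS"]["@SpeakerBoehner"], build_projection_pipeline())
```

Each element of the returned list is one document of the kind shown in the output above.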
For the following example, I queried an MC’s collection in order to find all the tweets
referencing the MC that were written in Spanish during my collection period. The query also
found the screen name, real name, and the city and state where the Twitter account holder was
when he or she created the tweet, along with the geographical coordinates of the exact location
where the tweet was created. The following query also employs the MongoDB Aggregation Framework:
{
$group: {
_id: {
text: "$text",
lang: "$lang",
user_screen_name: "$user.screen_name",
user_name: "$user.name",
place_full_name: "$place.full_name",
geo_coordinates: "$geo.coordinates"
}
}
},
{
$match: {
"_id.lang": "es"
}
}
Once the query executes, a document will be created that contains the information you
extracted from the data. The following output is a real example of an information product the
query generated:
{
"_id":{
"text":"Gracias @JoaquinCastrotx por apoyar la #ReformaMigratoria. Por favor sigue luchando por #CIR. #TimeIsNow http://t.co/p5F9Y54Ac2 vía @FWD_us",
"lang":"es",
"user_screen_name":"DguezVd",
"user_name":"Vaneza Dominguez",
"place_full_name":"Dallas, TX",
"geo_coordinates":[
32.900652,
-96.871544
]
}
}
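A variation worth considering is to filter on the language before grouping, so that the grouping stage only processes the matching tweets. Because the language is part of the grouping key, this ordering returns the same groups as the query above. The following Python sketch is an illustration under that assumption, not the query I ran; the function names are hypothetical, and the in-memory filter exists only so the logic can be checked without a database server.

```python
def spanish_tweets_pipeline(language="es"):
    """Filter by the tweet-level lang field first, then group the matches."""
    return [
        {"$match": {"lang": language}},
        {
            "$group": {
                "_id": {
                    "text": "$text",
                    "lang": "$lang",
                    "user_screen_name": "$user.screen_name",
                    "user_name": "$user.name",
                    "place_full_name": "$place.full_name",
                    "geo_coordinates": "$geo.coordinates",
                }
            }
        },
    ]


def filter_by_language(tweets, language="es"):
    """In-memory equivalent of the $match stage, for checking the logic without a server."""
    return [t for t in tweets if t.get("lang") == language]


# Hypothetical two-tweet sample for illustration.
sample = [
    {"text": "Gracias @JoaquinCastrotx", "lang": "es"},
    {"text": "Thank you @JoaquinCastrotx", "lang": "en"},
]
print(len(filter_by_language(sample)))
```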
The three examples I documented here only scratch the surface of what is possible by
incorporating a metadata-driven approach to social media data-mining and analytics. By
leveraging the robust Twitter metadata architecture, I was able to collect a vast and nearly
complete dataset of almost half a million tweets referencing 279 MCs. I was then
able to query the individual collections corresponding to the MCs, and then create potentially
valuable information products. In the example queries, I was able to identify the tweet frequencies
associated with particular MCs, the Twitter handles of users referenced in a tweet, the number of
friends the tweet generator has, the intended audiences of the tweets, the physical location of the
tweet generators, and even to filter tweets by language.
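The author and mention fields returned by these queries are enough to begin a simple web-of-association. The following Python sketch counts directed links from each tweet’s author to every account the tweet mentions. The first record reuses the values from the example output earlier in this chapter; the second record and the function name mention_edges are hypothetical and included only for illustration.

```python
from collections import Counter

# Records shaped like the query results: an author plus the mentioned accounts.
results = [
    {"user_screen_name": "PattyDs50",
     "entities_user_mentions_screen_name": ["WhiteHouse", "BarackObama", "SpeakerBoehner"]},
    # Hypothetical second record, not drawn from the collected data.
    {"user_screen_name": "example_user",
     "entities_user_mentions_screen_name": ["SpeakerBoehner"]},
]


def mention_edges(results):
    """Count directed (author, mentioned account) pairs across all results."""
    edges = Counter()
    for result in results:
        author = result["user_screen_name"]
        for mentioned in result.get("entities_user_mentions_screen_name") or []:
            edges[(author, mentioned)] += 1
    return edges


edges = mention_edges(results)
for (author, mentioned), weight in sorted(edges.items()):
    print(author, "->", mentioned, weight)
```

The resulting pairs could then be loaded into a graph analysis or visualization tool of the researcher’s choice.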
However, the queries I provided here are only the beginning. Truly, the possibilities for
generating valuable information products by leveraging data-mining and analytics are limited
only by the creativity, skill, access, resources, and time of the new political data-miner. In the
following section, I will expand upon some of the possible applications of using data-mining and
analytics for the benefit of Political Science.
CHAPTER 6: POTENTIAL APPLICATIONS
While the potential applications for data-mining and analytics in Political Science are
limited only by the creativity of the data-miner, I want to provide a hypothetical example of a
political science activity that could benefit from such an approach.
In this hypothetical scenario, a research team is given the task to collect all the tweets
created by, or referencing, all congressional candidates during a particular election cycle. Once
the election cycle is over, the research team is to analyze the tweets and generate an information
product that investigates how social media campaigns affect the creation and behavior of
political associations on social media and in the real world.
In this scenario, the task would be difficult, if not impossible, to do with conventional
social media collection and analysis tools. The research team could decide to comb the web for
tweet collection websites, and to manually collect and operationalize the tweets using
spreadsheets and the like. However, this method would be very labor and time intensive, and
would only yield the message field of the tweet and some minimally identifying and descriptive
data. In this method, the research team would not be able to construct a web-of-association that
would identify the congressional districts in which the tweets were created, consumed, or shared.
However, the research team could employ another option. The research team could reach
out to a tweet vendor and purchase all the tweets created by or referencing particular
congressional candidates, and then run queries and analytics against those data sets. However,
this method is expensive and does not lend itself to dynamically adjusting the collection of
tweets, as a research team might do during their project. Nevertheless, if a research team has
access to significant funding, employing a tweet vendor could be a simple method to collect
sizable Twitter data sets. However, if you want the metadata associated with the tweet, which is
often more valuable than the message itself, it will cost significantly more.
In this scenario, employing a data-mining and analytics system like the one I created and
detailed in this project is ideal. The information system I created for this project uses all free and
open-source technologies that are readily available and well documented on the Internet. The
support communities for all the technologies required for building and operating this type of
system are highly robust, typically friendly, and usually able to help most anyone troubleshoot or
navigate a particular technology. In addition, the software required to build and operate this type
of system can be run on most commodity hardware, ranging from small desktop computers to
massive clusters of networked servers, and even on the cloud. Another benefit of this type of
system is that you can securely access, monitor, and maintain the system remotely from virtually
anywhere with an Internet connection. From your computer at home, your tablet on vacation, or
your smartphone on the road, you can update your object collection list, create new databases,
and write new analytic queries.
A further benefit of employing and maintaining a data-mining and analytics system is
that once it is established, the system can continue to collect information indefinitely. The system
can also be used or replicated by others, who then can use the system as is, or expand the system
and add new functionality and features to it. The possibilities of using open-source technologies
for data-mining and analytics to the benefit of Political Science are almost limitless.
CHAPTER 7: CONCLUSION
When I started this project, I wanted to create an information system that could pave the
way for Political Science researchers to explore new technologies and methods in order to make
their work easier, more innovative, and more productive. I had a strong background in
Information Systems and Cybersecurity, but I had never before created a data-mining and
analytics information system. I thought the process would be fun and challenging. However, I
had no idea how fun and challenging the process would actually be. I knew that there were
significant implications for data-mining and analytics in Political Science, but I was not sure how
to bridge the gap between the disciplines.
After much study, research, trial, and error, I created an information system that can mine
data from the Internet easily, automatically, and perpetually. The next challenge was in the
analytics. When I started this project, I did not have much knowledge or experience in
“data-analytics.” I had, of course, analyzed data before, but not in the context of a formal data-mining
and data-science initiative. Throughout this project, I taught myself a great deal about
data-mining, data-science, and data-analytics. As such, I was able to produce some basic information
products that I am sure will pique the interest of the more adventurous political scientists. The
data-mining and analytics system I created and detailed here is a basic system, but one that I
hope will serve as the foundation for further development and study.
With this system, I was able to capture a relatively large data set of tweets of political
interest relatively easily and automatically. I was then able to collect the tweets into a database
capable of storing unstructured data from practically anywhere in the digital world. Furthermore,
I was able to manipulate, transform, and query the tweets to produce information products with the
capacity to advance normative political theory and quantitative political analysis. In the end, I
was able to provide a roadmap for future “political data-miners” to get started in constructing
their own data-mining and analytics information systems for the benefit of Political Science.
REFERENCES
Chodorow, Kristina. 2013. MongoDB: The Definitive Guide. Sebastopol: O’Reilly.
Del Fresno, Guillermo. 2014. “twitterstream-to-mongodb” [Software]. GitHub: Retrieved from
https://github.com/gdelfresno/twitterstream-to-mongodb
Farhi, Paul. 2009. “The Twitter Explosion.” American Journalism Review 31(3): 26–31.
http://search.ebscohost.com/login.aspx?direct=true&db=ufh&AN=41877978&site=ehost-
live (February 19, 2010).
Gervais, Bryan T. 2014. “Incivility Online: Affective and Behavioral Reactions to Uncivil
Political Posts in a Web-based Experiment.” Journal of Information Technology &
Politics (Forthcoming)
Golbeck, Jennifer, Justin M. Grimes, and Anthony Rogers. 2010. “Twitter Use by the U.S.
Congress.” Journal of the American Society for Information Science and Technology
61(8): 1612–21.
Greenwald, Glenn. 2014. No Place to Hide: Edward Snowden, the NSA, and the U.S.
Surveillance State. New York: Metropolitan Books.
Mansbridge, Jane. 1999. “Everyday Talk in the Deliberative System” In Deliberative Politics:
Essays on Democracy and Disagreement, ed Stephen Macedo: Oxford University Press,
1 – 211.
McKinney, Wes. 2013. Python for Data Analysis 2nd ed. Sebastopol: O’Reilly.
Mutz, Diana C. 2006. Hearing the Other Side: Deliberative Versus Participatory Democracy.
New York: Cambridge University Press.
Provost, Foster, and Tom Fawcett. 2013. Data Science for Business: What You Need to Know
About Data Mining and Data-Analytic Thinking. Sebastopol: O’Reilly.
Rosenzweig, Paul. 2013. Cyber Warfare: How Conflicts in Cyberspace Are Challenging
America and Changing the World. Santa Barbara: Praeger.
Ruggie, John G. 2013. Just Business: Multinational Corporations and Human Rights. New
York: Norton, W. W. & Company, Inc.
Russell, Matthew A. 2014. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn,
Google+, GitHub, and More 2nd ed. Sebastopol: O’Reilly.
APPENDICES
Appendix 1: What a Tweet Really Looks Like in Its Native JSON
NOTE: I highlighted the text field, which contains the actual message portion of a tweet.
/* 0 */
{
"_id" : ObjectId("54361be43b811434f9a21da4"),
"contributors" : null,
"truncated" : false,
"text" : "✖ @AustinScottGA08 Silence Is Complicity #MSSen #RememberMississippi #MakeDCListen",
"in_reply_to_status_id" : null,
"id" : NumberLong(520082176352980993),
"favorite_count" : 0,
"source" : "<a href="http://tweetadder.com" rel="nofollow">TweetAdder v4</a>",
"retweeted" : false,
"coordinates" : null,
"timestamp_ms" : "1412832228168",
"entities" : {
"user_mentions" : [
{
"id" : 234797704,
"indices" : [
2,
18
],
"id_str" : "234797704",
"screen_name" : "AustinScottGA08",
"name" : "Rep. Austin Scott"
}
],
"symbols" : [],
"trends" : [],
"hashtags" : [
{
"indices" : [
41,
47
],
"text" : "MSSen"
},
{
"indices" : [
48,
68
],
"text" : "RememberMississippi"
},
{
"indices" : [
69,
82
],
"text" : "MakeDCListen"
}
],
"urls" : []
},
"in_reply_to_screen_name" : null,
"id_str" : "520082176352980993",
"retweet_count" : 0,
"in_reply_to_user_id" : null,
"favorited" : false,
"user" : {
"follow_request_sent" : null,
"profile_use_background_image" : true,
"default_profile_image" : false,
"id" : 265658805,
"verified" : false,
"profile_image_url_https" : "https://pbs.twimg.com/profile_images/455915260524769280/ClR7foxv_normal.png",
"profile_sidebar_fill_color" : "DDEEF6",
"profile_text_color" : "333333",
"followers_count" : 3559,
"profile_sidebar_border_color" : "000000",
"id_str" : "265658805",
"profile_background_color" : "000000",
"listed_count" : 57,
"profile_background_image_url_https" :
"https://pbs.twimg.com/profile_background_images/845237718/447b881c8b774ed9199f6bf5505beb66.jpeg",
"utc_offset" : -14400,
"statuses_count" : 97286,
"description" : "A Declaration Conservative: That 2 secure these (unalienable) rights, Govts R instituted among Men, deriving their just
powers from the consent of the governed",
"friends_count" : 3389,
"location" : "Western Pennsylvania",
"profile_link_color" : "000000",
"profile_image_url" : "http://pbs.twimg.com/profile_images/455915260524769280/ClR7foxv_normal.png",
"following" : null,
"geo_enabled" : false,
"profile_banner_url" : "https://pbs.twimg.com/profile_banners/265658805/1397533565",
"profile_background_image_url" :
"http://pbs.twimg.com/profile_background_images/845237718/447b881c8b774ed9199f6bf5505beb66.jpeg",
"name" : "Freedoms Fool",
"lang" : "en",
"profile_background_tile" : false,
"favourites_count" : 94,
"screen_name" : "freedomsfool",
"notifications" : null,
"url" : null,
"created_at" : "Sun Mar 13 23:26:25 +0000 2011",
"contributors_enabled" : false,
"time_zone" : "Eastern Time (US & Canada)",
"protected" : false,
"default_profile" : false,
"is_translator" : false
},
"geo" : null,
"in_reply_to_user_id_str" : null,
"possibly_sensitive" : false,
"lang" : "en",
"created_at" : "Thu Oct 09 05:23:48 +0000 2014",
"filter_level" : "medium",
"in_reply_to_status_id_str" : null,
"place" : null
}
/* 1 */
{
"_id" : ObjectId("5435dcae3b811434f9a1ff12"),
"contributors" : null,
"truncated" : false,
"text" : "RT @FreeTheMarine: GA @AustinScottGA08 Pls support #HRes620 assisting our #MarineHeldInMexico. He needs treatment for
PTSD ASAP #BringBackO…",
"in_reply_to_status_id" : null,
"id" : NumberLong(520014305681768449),
"favorite_count" : 0,
"source" : "<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>",
"retweeted" : false,
"coordinates" : null,
"timestamp_ms" : "1412816046572",
"entities" : {
"user_mentions" : [
{
"id" : NumberLong(2476804154),
"indices" : [
3,
17
],
"id_str" : "2476804154",
"screen_name" : "FreeTheMarine",
"name" : "Free Sgt Tahmooressi"
},
{
"id" : 234797704,
"indices" : [
22,
38
],
"id_str" : "234797704",
"screen_name" : "AustinScottGA08",
"name" : "Rep. Austin Scott"
}
],
"symbols" : [],
"trends" : [],
"hashtags" : [
{
"indices" : [
51,
59
],
"text" : "HRes620"
},
{
"indices" : [
74,
93
],
"text" : "MarineHeldInMexico"
},
{
"indices" : [
128,
140
],
"text" : "BringBackOurMarine"
}
],
"urls" : []
},
"in_reply_to_screen_name" : null,
"id_str" : "520014305681768449",
"retweet_count" : 0,
"in_reply_to_user_id" : null,
"favorited" : false,
"retweeted_status" : {
"contributors" : null,
"truncated" : false,
"text" : "GA @AustinScottGA08 Pls support #HRes620 assisting our #MarineHeldInMexico. He needs
treatment for PTSD ASAP #BringBackOurMarine",
"in_reply_to_status_id" : null,
"id" : NumberLong(519365396366110720),
"favorite_count" : 12,
"source" : "<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>",
"retweeted" : false,
"coordinates" : null,
"entities" : {
"user_mentions" : [
{
"id" : 234797704,
"indices" : [
3,
19
],
"id_str" : "234797704",
"screen_name" : "AustinScottGA08",
"name" : "Rep. Austin Scott"
}
],
"symbols" : [],
"trends" : [],
"hashtags" : [
{
"indices" : [
32,
40
],
"text" : "HRes620"
},
{
"indices" : [
55,
74
],
"text" : "MarineHeldInMexico"
},
{
"indices" : [
109,
128
],
"text" : "BringBackOurMarine"
}
],
"urls" : []
},
"in_reply_to_screen_name" : null,
"id_str" : "519365396366110720",
"retweet_count" : 29,
"in_reply_to_user_id" : null,
"favorited" : false,
"user" : {
"follow_request_sent" : null,
"profile_use_background_image" : false,
"default_profile_image" : false,
"id" : NumberLong(2476804154),
"verified" : false,
"profile_image_url_https" : "https://pbs.twimg.com/profile_images/509417608936845312/OX6Pm-8B_normal.jpeg",
"profile_sidebar_fill_color" : "DDEEF6",
"profile_text_color" : "333333",
"followers_count" : 3542,
"profile_sidebar_border_color" : "000000",
"id_str" : "2476804154",
"profile_background_color" : "000000",
"listed_count" : 59,
"profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png",
"utc_offset" : -25200,
"statuses_count" : 3706,
"description" : "OFFICIAL Tahmooressi Family Account. Please visit: http://t.co/8PyH5q0uWE | #MarineHeldInMexico #HRes620 | Media
Requests: jonathan@lucidpublicrelations.com",
"friends_count" : 243,
"location" : "www.andrewfreedomfund.com",
"profile_link_color" : "134673",
"profile_image_url" : "http://pbs.twimg.com/profile_images/509417608936845312/OX6Pm-8B_normal.jpeg",
"following" : null,
"geo_enabled" : false,
"profile_banner_url" : "https://pbs.twimg.com/profile_banners/2476804154/1403332310",
"profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png",
"name" : "Free Sgt Tahmooressi",
"lang" : "en",
"profile_background_tile" : false,
"favourites_count" : 12637,
"screen_name" : "FreeTheMarine",
"notifications" : null,
"url" : "http://www.facebook.com/freethemarine",
"created_at" : "Sun May 04 12:12:23 +0000 2014",
"contributors_enabled" : false,
"time_zone" : "Arizona",
"protected" : false,
"default_profile" : false,
"is_translator" : false
},
"geo" : null,
"in_reply_to_user_id_str" : null,
"possibly_sensitive" : false,
"lang" : "en",
"created_at" : "Tue Oct 07 05:55:34 +0000 2014",
"filter_level" : "low",
"in_reply_to_status_id_str" : null,
"place" : null
},
"user" : {
"follow_request_sent" : null,
"profile_use_background_image" : true,
"default_profile_image" : false,
"id" : 959017200,
"verified" : false,
"profile_image_url_https" : "https://pbs.twimg.com/profile_images/509521711671558144/oqRiNGin_normal.jpeg",
"profile_sidebar_fill_color" : "DDEEF6",
"profile_text_color" : "333333",
"followers_count" : 145,
"profile_sidebar_border_color" : "C0DEED",
"id_str" : "959017200",
"profile_background_color" : "C0DEED",
"listed_count" : 2,
"profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png",
"utc_offset" : null,
"statuses_count" : 4451,
"description" : null,
"friends_count" : 263,
"location" : "",
"profile_link_color" : "0084B4",
"profile_image_url" : "http://pbs.twimg.com/profile_images/509521711671558144/oqRiNGin_normal.jpeg",
"following" : null,
"geo_enabled" : false,
"profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png",
"name" : "MomOrWhatever",
"lang" : "en",
"profile_background_tile" : false,
"favourites_count" : 2426,
"screen_name" : "MomOrWhatever",
"notifications" : null,
"url" : null,
"created_at" : "Tue Nov 20 00:24:19 +0000 2012",
"contributors_enabled" : false,
"time_zone" : null,
"protected" : false,
"default_profile" : true,
"is_translator" : false
},
"geo" : null,
"in_reply_to_user_id_str" : null,
"possibly_sensitive" : false,
"lang" : "en",
"created_at" : "Thu Oct 09 00:54:06 +0000 2014",
"filter_level" : "medium",
"in_reply_to_status_id_str" : null,
"place" : null
}
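The entity metadata in the document above can be read programmatically. The following is a minimal Python sketch, not part of the collection system itself: it assumes the status has already been loaded into a plain dictionary, and it uses a trimmed copy of the original (non-retweet) status from document /* 1 */. The helper names `hashtag_spans` and `parse_created_at` are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone

# Trimmed copy of the "retweeted_status" object from document /* 1 */ above.
tweet = {
    "text": ("GA @AustinScottGA08 Pls support #HRes620 assisting our "
             "#MarineHeldInMexico. He needs treatment for PTSD ASAP "
             "#BringBackOurMarine"),
    "created_at": "Tue Oct 07 05:55:34 +0000 2014",
    "entities": {
        "hashtags": [
            {"indices": [32, 40], "text": "HRes620"},
            {"indices": [55, 74], "text": "MarineHeldInMexico"},
            {"indices": [109, 128], "text": "BringBackOurMarine"},
        ]
    },
}


def hashtag_spans(status):
    """Return the slice of `text` that each hashtag's `indices` pair covers."""
    text = status["text"]
    spans = []
    for tag in status["entities"]["hashtags"]:
        start, end = tag["indices"]  # start is inclusive, end is exclusive
        spans.append(text[start:end])
    return spans


def parse_created_at(status):
    """Parse the `created_at` string into a timezone-aware datetime."""
    # Assumes an English locale, since %a and %b match English day/month names.
    return datetime.strptime(status["created_at"], "%a %b %d %H:%M:%S %z %Y")


if __name__ == "__main__":
    print(hashtag_spans(tweet))
    print(parse_created_at(tweet).astimezone(timezone.utc).isoformat())
```

Each `indices` pair marks where the entity sits in `text`, including the leading `#`, which is why the slices differ from the bare `text` values stored with each hashtag. In the truncated retweet text the last hashtag runs past the visible characters, so slicing is most reliable on the `retweeted_status` text.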
/* 2 */
{
"_id" : ObjectId("5435e9183b811434f9a204d8"),
"contributors" : null,
"truncated" : false,
"text" : "RT @fenolj: @AustinScottGA08 When physical abuse #Tahmooressi endured comes 2 light YOU will be accountable. Co-sponsor
#HRes620 #BringBa…",
"in_reply_to_status_id" : null,
"id" : NumberLong(520027633044946944),
"favorite_count" : 0,
"source" : "<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>",
"retweeted" : false,
"coordinates" : null,
"timestamp_ms" : "1412819224039",
"entities" : {
"user_mentions" : [
{
"id" : NumberLong(2680673774),
"indices" : [
3,
10
],
"id_str" : "2680673774",
"screen_name" : "fenolj",
"name" : "Jackie Fenolio"
},
{
"id" : 234797704,
"indices" : [
12,
28
],
"id_str" : "234797704",
"screen_name" : "AustinScottGA08",
"name" : "Rep. Austin Scott"
}
],
"symbols" : [],
"trends" : [],
"hashtags" : [
{
"indices" : [
49,
61
],
"text" : "Tahmooressi"
},
{
"indices" : [
122,
130
],
"text" : "HRes620"
},
{
"indices" : [
131,
140
],
"text" : "BringBackOurMarine"
}
],
"urls" : []
},
"in_reply_to_screen_name" : null,
"id_str" : "520027633044946944",
"retweet_count" : 0,
"in_reply_to_user_id" : null,
"favorited" : false,
"retweeted_status" : {
"contributors" : null,
"truncated" : false,
"text" : "@AustinScottGA08 When physical abuse #Tahmooressi endured comes 2 light YOU will be accountable. Co-sponsor #HRes620
#BringBackOurMarine.",
"in_reply_to_status_id" : null,
"id" : NumberLong(519999120233467904),
"favorite_count" : 1,
"source" : "<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>",
"retweeted" : false,
"coordinates" : null,
"entities" : {
"user_mentions" : [
{
"id" : 234797704,
"indices" : [
0,
16
],
"id_str" : "234797704",
"screen_name" : "AustinScottGA08",
"name" : "Rep. Austin Scott"
}
],
"symbols" : [],
"trends" : [],
"hashtags" : [
{
"indices" : [
37,
49
],
"text" : "Tahmooressi"
},
{
"indices" : [
110,
118
],
"text" : "HRes620"
},
{
"indices" : [
119,
138
],
"text" : "BringBackOurMarine"
}
],
"urls" : []
},
"in_reply_to_screen_name" : "AustinScottGA08",
"id_str" : "519999120233467904",
"retweet_count" : 2,
"in_reply_to_user_id" : 234797704,
"favorited" : false,
"user" : {
"follow_request_sent" : null,
"profile_use_background_image" : true,
"default_profile_image" : false,
"id" : NumberLong(2680673774),
"verified" : false,
"profile_image_url_https" : "https://pbs.twimg.com/profile_images/519337504214753280/xZ6DzFeB_normal.jpeg",
"profile_sidebar_fill_color" : "DDEEF6",
"profile_text_color" : "333333",
"followers_count" : 125,
"profile_sidebar_border_color" : "C0DEED",
"id_str" : "2680673774",
"profile_background_color" : "C0DEED",
"listed_count" : 1,
"profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png",
"utc_offset" : null,
"statuses_count" : 9624,
"description" : null,
"friends_count" : 58,
"location" : "",
"profile_link_color" : "0084B4",
"profile_image_url" : "http://pbs.twimg.com/profile_images/519337504214753280/xZ6DzFeB_normal.jpeg",
"following" : null,
"geo_enabled" : false,
"profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png",
"name" : "Jackie Fenolio",
"lang" : "en",
"profile_background_tile" : false,
"favourites_count" : 39,
"screen_name" : "fenolj",
"notifications" : null,
"url" : null,
"created_at" : "Fri Jul 25 22:56:37 +0000 2014",
"contributors_enabled" : false,
"time_zone" : null,
"protected" : false,
"default_profile" : true,
"is_translator" : false
},
"geo" : null,
"in_reply_to_user_id_str" : "234797704",
"possibly_sensitive" : false,
"lang" : "en",
"created_at" : "Wed Oct 08 23:53:46 +0000 2014",
"filter_level" : "low",
"in_reply_to_status_id_str" : null,
"place" : null
},
"user" : {
"follow_request_sent" : null,
"profile_use_background_image" : true,
"default_profile_image" : false,
"id" : 981285295,
"verified" : false,
"profile_image_url_https" : "https://pbs.twimg.com/profile_images/517386303441481728/RVa6gyU1_normal.jpeg",
"profile_sidebar_fill_color" : "DDEEF6",
Appendix 2: Twitter App
Appendix 3: MongoDB Database Backup, Restore, and Initialization
1. Log on to your data-mining system.
2. From the terminal, suspend the data collection by pressing “Ctrl + Z”.
3. Enter the MongoDB administrative shell by entering the command “mongo”.
4. From the MongoDB administrative shell, enter the command “use admin”. This switches you to the administrative database.
5. From the administrative database, enter the command “db.shutdownServer()”. This shuts down the MongoDB service.
6. Navigate to a folder designated to hold the database backups (you can create the folder locally or on any other storage medium that is logically connected).
7. From the backup directory, enter the command “sudo mongodump --dbpath /path/to/your/mongodb”. The backup may take some time to complete, depending on the size of the databases. The backup will contain every collection in your databases, with each collection's documents stored in BSON (Binary JSON) and its corresponding metadata stored in JSON.
8. Once the backup is done, enter the following command to start the MongoDB service and resume logging: “sudo mongod --dbpath /path/to/your/mongodb --fork --logpath /var/log/mongodb.log”
Once the MongoDB service is initiated, enter the following command to resume collection, if need be:
“sudo python twitterstreamtomongodb.py --oauth=oauth.json --server=127.0.0.1 --port=27017 --database="insert DB name here" --track=objects.txt”
Insert the values appropriate to your system in the server, port, and database fields.
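The backup and restart commands in steps 7 and 8 can also be assembled in a script so that the same paths are used every time. The following is a minimal Python sketch under stated assumptions: the function name `backup_commands` is hypothetical, the paths are the placeholders used above, and the sketch only builds and prints the commands rather than running them. It assumes Python 3.8 or later for `shlex.join`.

```python
import shlex


def backup_commands(dbpath, logpath="/var/log/mongodb.log"):
    """Build the commands from steps 7 and 8 as argument lists.

    `dbpath` is the MongoDB data directory; `logpath` is where mongod
    writes its log once it is restarted with --fork.
    """
    dump = ["sudo", "mongodump", "--dbpath", dbpath]
    restart = ["sudo", "mongod", "--dbpath", dbpath,
               "--fork", "--logpath", logpath]
    return dump, restart


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for command in backup_commands("/path/to/your/mongodb"):
        print(shlex.join(command))
```

Keeping each command as a list of arguments avoids shell quoting problems if a path contains spaces. To execute a command rather than print it, each list could be passed to `subprocess.run(command, check=True)` from the backup directory, after the MongoDB service has been shut down as in step 5.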

  • 4. iv ABSTRACT DATA-MINING TWITTER FOR POLITICAL SCIENCE: A PROJECT BASED METHODOLOGICAL APPROACH Alfredo Hickman Jr, B.A. The University of Texas at San Antonio, 2015 Supervising Professor: Bryan Gervais, Ph.D. This thesis will examine the creation and use of a data-mining system to extract, process, and analyze Twitter “tweets” for Political Science. By providing a free and open platform for rapidly sharing and exchanging ideas, Twitter has become the most popular microblogging site and system in the world. Twitter allows its users to disclose their actual names, or post tweets anonymously; this has fostered an environment that allows people to discus and comment on politics with a scope, liberty, and, candor that has never before existed. Twitter can be an invaluable tool for political scientists that wish to better understand the motives, thoughts, sentiments, and social networks of people as it pertains to politics and social phenomena. During the course of my research, I have built and maintained an information system that collects and process selective Twitter data live. In conjunction with ps_proj, an authenticated application I created on Twitter’s Developers Site, I use Twitter’s Streaming Application Programming Interface (API) to collect streaming data on a randomly selected list of 279 Members of Congress (MCs). Once the tweet data set is captured, I will analyze the messages, and the accompanying metadata and data. I expect the data, once analyzed, will produce insights into the American political being, and allow the political scientists to create information products critical to understanding social and political behavior.
  • 5. v TABLE OF CONTENTS ACKNOWLEDGEMENTS.................................................................................................................................. IVII ABSTRACT .............................................................................................................................................................. IV ACRONYMS AND DEFINITIONS..........................................................ERROR! BOOKMARK NOT DEFINED. CHAPTER 1: INTENT AND ETHICAL CONSIDERATIONS.............................................................................1 CHAPTER 2: INTRODUCTION ..............................................................................................................................4 CHAPTER 3: THESIS STATEMENT....................................................................................................................11 CHAPTER 4: METHODS AND APPROACH.......................................................................................................13 CHAPTER 5: DATA PROCESSING AND ANALYSIS .......................................................................................23 CHAPTER 6: POTENTIAL APPLICATIONS......................................................................................................30 CHAPTER 7: CONCLUSION .................................................................................................................................32 REFERENCES ..........................................................................................................................................................34 APPENDICES ...........................................................................................................................................................35
  • 6. vi ACRONYMS AND DEFINITIONS
    API (Application Programming Interface): A programmatic specification and mechanism for interfacing with software components.
    Back-end: The mechanism that allows data to be collected and stored in a distributed computational system.
    Client: Software or hardware system that requires services from another platform.
    Cloud computing: Computational services hosted on remote, networked, and distributed information systems that are consumed like a commodity.
    Front-end: The mechanism that allows a distributed computational system to input, process, and transmit data.
    Host: Software or hardware system that provides a platform for other systems.
    IP Address: The identifying value assigned to a device participating in an Internet Protocol network.
    iSCSI (Internet Small Computer System Interface): A protocol used to facilitate the use and connection of storage resources on computer networks.
    JSON (JavaScript Object Notation): A language-independent standard used for transmitting human-readable text between computer systems.
    Linux: A free and open source operating system base.
    LUN (Logical Unit Number): The mechanism used to identify a networked storage resource in an iSCSI storage model.
    MC(s): Member(s) of Congress.
    MongoDB: A NoSQL document-oriented database that uses JSON to provide flexible schemas.
    NoSQL: The concept of non-structured storage and retrieval in non-relational databases.
    Operating System: The suite of software that provides functionality to client computer software and host hardware.
    Python: A popular, multi-purpose, high-level computer programming language.
    Server: A networked computer whose function is to provide services to client computers.
    Ubuntu: A Linux-based operating system.
    Vagrant: A configurable, portable, and reproducible computational work environment.
    VirtualBox: A software platform for virtualizing computer operating systems.
  • 7. 1 CHAPTER 1: INTENT AND ETHICAL CONSIDERATIONS The intent of this thesis is not to delve into a theoretically normative discourse on the pros, cons, or applications of data-mining and analytics in general. However, I will briefly explore some of the politically theoretical and normative literature that influenced me in the development of my data-mining and analytics system, and the accompanying research. Rather, my goal is to display and share the empirical and methodological development and application of a data-mining and analytics information system for the benefit of Political Science. I would be remiss and negligent if I did not acknowledge and share some of my concerns for the potentially harmful applications and consequences of data-mining and analytics systems such as the one that I have created, and those, much more sophisticated systems that are in production and under development now and will be in the future. Before delving into the internals and potential applications for a data-mining system such as the one I present in this thesis, I believe it is crucial to explore some of the ethical considerations involved with mining data from the public at-large. In the relatively short amount of time since the Internet was created and made available for public use (by the American defense and academic communities), people from all over the world have come to depend on the technology for an ever increasing amount of daily activity. The Internet has revolutionized the ways in which we live, communicate, create and consume information and generate data, metadata, and knowledge. With the rapid development of the Internet and peripheral technologies, humanity has not only been able to share existing knowledge and information, but has also created and distributed more new information, data, and metadata than in any time prior in the human experience.
  • 8. 2 With the astronomic amounts of public and private information, data, and metadata that have been created and shared on the Internet, have come new possibilities, opportunities, and derivative technologies. For example, the nascent industries of electronic business intelligence, data-mining, and data-analytics have emerged in the belief that vast amounts of value can be generated from the information and data that the public creates and shares on the Internet. Technologists, by collecting vast amounts of public and private data and metadata, can track, analyze, and predict human behavior and generate potentially valuable information, products, and services. With this information, public and private interests can create and construct products and services that leverage, and potentially manipulate, human behavior in manners never before possible. With the ability to track, monitor, and potentially manipulate human individuals and populations at-large, have come many concerns about how electronic information and data are used and abused. Massive data breaches, mostly driven by organized criminal and state actors, of the world’s largest and most powerful private and public institutions and businesses have rattled many individuals, firms, and governments into questioning how and why electronic data is being collected, processed, stored, and secured (Rosenzweig, 2013). Revelations of governments demanding data, and metadata from Internet Service and Data providers legally, illegally, or otherwise unethically, has alarmed many people in the human rights and civil liberty communities. Instances such as Yahoo’s complicity in China’s persecution of political dissidents have alarmed many state and non-state actors into demanding reforms and regulations for how, and for what purposes data and metadata on people are collected, used, and consumed (Ruggie, 2013).
  • 9. 3 Issues of ownership over the data and metadata that the public creates and consumes have also been raised. At the time of this writing, the status quo and de facto standard is the assumption that public data and metadata are mostly commodities to be processed, sold, bought, and consumed, so long as the provider’s “general terms and conditions” apply and have been accepted. In addition, at the time of this writing, the revelations by the former National Security Agency (NSA) contractor Edward Snowden are still fresh on many minds. Edward Snowden alleged that the United States and other governments are collecting massive amounts of public and private data and metadata, sometimes illegally, in the name of national security and other interests (Greenwald, 2014). Since the Snowden revelations, many of the allegations made against the United States have been publicly and officially substantiated, and some reforms have been initiated. With all the potential applications of data-mining and analytics, one must question the potential public and private benefits and harms that can arise in the age of instant communications and “big-data.” With the enormous amounts of data and metadata that are being created and consumed daily, we, as a society, can choose to use the information, products, and services they yield for the benefit or harm of our fellow man and our shared environment and communities.
  • 10. 4 CHAPTER 2: INTRODUCTION Since Twitter’s creation and website launch in 2006, it has become the largest and fastest growing micro-blogging site and system on the planet (Farhi, 2009). Twitter’s ability to cater to people’s innate curiosity and need for information and interaction has resulted in close to 1 billion registered users and 271 million monthly active users as of October 2014. Because of the roughly 140-character limit per tweet, the format forces people to construct their messages succinctly. Contributing to the success of the Twitter platform is the ability to post messages anonymously or not, follow other users, be followed, retweet other users’ tweets, and use a myriad of other features that allow people to communicate, associate, and express themselves in ways never before possible. Because of Twitter’s popularity, use, and innate features, the site has fostered a community of opinion and dialog unlike any system that has existed before it. The results of Twitter’s system and operations are more extensive social networks, contexts, and information, some of which are new to humanity at large and the social sciences in particular. Because of Twitter’s success and proliferation, many social and political scientists have researched the communications posted on the site in an effort to understand the intent, motivation, sentiment, behavior, and other sociological factors of the Twitter users who create them. In the context of the social sciences, much of the scholarly work done in the realm of Internet-based social networking has taken the form of direct collection and analysis of social networking messages and data. While the conventional methods used in the social sciences for collecting and analyzing the data are valid, I believe the methodology leaves a crucial factor out of the equation: the metadata.
  • 11. 5 However, before delving into the world of Twitter architecture, metadata, data, and information, and their potential value to the social sciences, I will define some key terms. I will then briefly explore some of the relevant work that has come before, and how it helped frame this research and its intent. In the course of this thesis, I will use the following terms in this manner: 1. Metadata (um): The underlying information about the data being referenced that can serve to provide enriched functionality, context, network, and potential meaning to the information and data generated. In essence, the metadata is the glue and pointers that bind and direct the individual message into the larger social network and information ecosystem. 2. Data (um): The qualitative or quantitative dynamic value or values that make up information, and which are structured or unstructured (raw) in a manner conducive to mechanical and/or biological processing means and methods. 3. Information: The qualitative or quantitative product of a causal relationship between data components in a system and its environment. Information can be transmitted and consumed via message, observation, perception, or other biological or mechanical processes. Information is what we want from a data-mining and analytics system, and it is often, but not always, of value. When shaping the idea for this project, I wanted not only to describe how to build a functional social media collection and processing system, but also to explore how new technologies like the Internet and social media can provide insights into the way people create and consume political data and information. The spark that ignited my interest in the potential value of social media to political science was my interpretation of Diana Mutz's
  • 12. 6 Hearing the Other Side: Deliberative versus Participatory Democracy. Mutz (2006) argues that exposure to multiple political views decreases participation in political activities and highlights the potential conflict between deliberative and participatory democracy. Furthermore, Mutz argues that the context and network in which political discussion takes place matter, and that they can serve to either facilitate or hamper political learning and action. Due to the social norms that govern interpersonal communication and association, people often self-censor their public political opinions and views in order to avoid conflict, ridicule, rejection, or a wide variety of other social consequences. Because tight social fabrics can stifle public political expression of dissenting opinions and views, the observation of political expression in a medium as open as the Internet can be of value to the political scientist and psychologist. Since much of the interpersonal communication that can occur on the Internet and social media is free from the social norms and consequences of live political expression and association, the observation and analysis of such behavior can render valuable insights into the uncensored political mind (Gervais, 2014). Research on political discourse and deliberation can be greatly enriched by using data and metadata driven analysis of political discussion in the context of social networks on the Internet. By capturing, collecting, processing, and analyzing tweets and their corresponding metadata, researchers can understand how people create, consume, and share data and information on the Internet. From these observations, researchers can better understand what political topics are important to people, and where these topics are important in both the physical and virtual worlds. 
By collecting and analyzing social media communications and their corresponding metadata, researchers can identify political association and behavior as it occurs in the context of social networks on the Internet and in real life.
  • 13. 7 Another field of study within the social sciences that can be advanced with the use of social media data and metadata collection and analysis is the study of the communication between government and its constituencies. In research done during the 111th Congress, Matthew Eric Glassman and others looked into the way that government officials used Twitter to communicate and inform people on a variety of topics of political importance. What Glassman and his partners discovered was that MCs in the minority party tended to use social media at higher rates than those of the majority party, and that the information was constructed to fulfill requirements of information within functional contexts. The contexts ranged from district and state constituencies, official political action groups, personal communications, and replies to other comments or questions, to position taking. The implications for out-groups having a larger voice when not in power or when disenfranchised from society are of critical value for the political outlier. This value is even more evident when the communications occur in contexts where speech and political dissent are commonly self-repressed, such as in certain physical social settings and on traditional media. What Glassman discovered was that social media allowed people to communicate with and be informed by their representatives in a more direct and unfiltered manner than was possible using traditional media channels, such as television, radio news, and press conferences. In regard to this type of study, data-mining and analytics could support the normative and theoretical bodies of political science by providing new information on issues such as constituent-representative relations, political communication and association on the Internet, and the potential for social media and the Internet to encourage plebiscitary politics. 
As such, analysis into the social networks that are created in the physical and virtual worlds when people create, consume, and share electronic communications on the Internet could provide potential insights into the
  • 14. 8 “political being.” With this in mind, my research will explore the evolving data and metadata trail that are created when such actions occur within Twitter. However, the possibilities span much further than any one website or platform. I hypothesize that if the Internet and social media provide new networks for communication and association, along with the potential for social consequences and action, then the data and metadata that are created and consumed when those actions occur, when analyzed, can be of value to Political Science. However, it may prove difficult to please strict political theorists in regards to defining what constitutes political communication, deliberation, and participation in the context of the Internet and social media, and what that looks like. As such, Jane Mansbridge (1999) argues that everyday political talk can be useful in promoting political deliberation and participation if it meets certain and stringent criteria. However, if all political dialogue were held to this standard, very few discussions would ever be considered true political deliberation. Nevertheless, if we loosen Mansbridge’s standard and apply the social norms of the Internet then we can see that the analysis of political communications and social networks can be of value. While much of the political communications that take place via social media on the Internet may not meet all of Mansbridge’s standards, the collection of the communications and the information that can be derived from the underlying metadata can be of significant value to the political scientist. As I have mentioned before, there is more to a tweet than just the text of the message. The majority of what constitutes a tweet is actually a vast construct of metadata and data structures that serve to provide enriched functionality and value to a tweet and its creators, distributors, and consumers. I will elaborate on this in later chapters. 
However, what this implies is that by collecting and analyzing tweets in their entirety, a political scientist can not only study
  • 15. 9 the content of the message field, but he or she can also construct sophisticated data models that describe the locations, sentiments, interests, behaviors, and associations of the people that create, consume, or share those tweets. So, when taking into account that social media is a primary medium by which people communicate and act on matters of politics on the Internet, data-mining and analytics can be invaluable for the political scientist. To provide contrast to the research I present in this thesis, I found a research study conducted at The University of Maryland College of Information Studies, in which Jennifer Golbeck, Justin M. Grimes, and Anthony Rogers (2009) collected and analyzed over 6,000 tweets posted by various MCs. The conclusion of that study was that the tweets MCs created and shared “tend not to provide new insights into government or the legislative process, or to improve transparency, rather they are vehicles for self-promotion.” In response to that, this thesis will not attempt to prove or disprove that information collected from social media can serve to be the end-all-be-all of insight into the political mind. Rather, the intent of my research is to display the development and potential applications of a data-mining and analytics information system that can yield data and information valuable to political science. A data-mining and analytics system, like the one I present in this thesis, can be used to collect and process social media data and metadata, and then to create a framework for future political studies. In essence, I will support the idea around which the entire “big-data,” data-mining, and analytics industries have emerged. The idea is that there is potentially significant value in the information produced when the underlying structures that are created when people create, share, and consume information on the Internet and social media are analyzed and operationalized.
  • 16. 10 This thesis also attempts to highlight that leaving the overwhelming majority of what constitutes a tweet (or most other electronic messages) out of the analysis omits a huge factor from the research: the metadata (reference Appendix 1: what a tweet really looks like). Golbeck, Grimes and Rogers go on to state, “We have chosen not to study the underlying social network (followers, following, and friends), but this is a rich space for future work.” I will attempt to fill some of that space with my research and system. I will also support the idea that the underlying social constructs enumerated in the metadata, of which the actual message is only a minimal component, can be of significant value to political science. By collecting, processing, and analyzing tweets in their entirety, metadata and all, political scientists can develop a more robust understanding of people’s locations, sentiments, interests, behaviors, and associations as they relate to matters of political interest and activity on the Internet and in the “real world”. Perhaps, by exploring this new medium for electronic communication and association, innovative methodologies can be developed to leverage the Internet and social media, and help bridge the gap between political normative theory and empirical quantitative analysis…even if only a bit. Enjoy!
  • 17. 11 CHAPTER 3: THESIS STATEMENT Contemporary political science research of social media communications involves collecting data, analyzing the data for components, creating variables, coding the variables, operationalizing the variables, and attempting to produce an intellectual product of significance and meaning. What I believe is left out of much of the Internet and social media based research done in Political Science is the leveraging of information systems to facilitate a more robust collection and analysis of electronic communications and social networks. As a result, much past collection and analysis of social media communications has left out one of the most crucial and potentially valuable components of political communications and social networks on the Internet: the metadata. By using data-mining and analytics, political scientists can programmatically collect, process, and analyze social media and other Internet communications automatically and continuously. By using these systems, political scientists can collect and operationalize massive Internet-derived data sets, and craft a wide range of queries and analytics to create potentially valuable information products. These information products can then be used to describe the political sentiments, interests, behaviors, and associations of practically anyone using the Internet, and to create new bodies of political knowledge and information. By utilizing data-mining and analytics, political scientists can produce information products that detail valuable information such as topics of political interest, and overcome some of the challenges that occur when tackling complex collection-based projects on the Internet with reduced resources.
  • 18. 12 As such, I will attempt to convey the value and possibilities of employing data-mining and analytics information systems for the benefit of Political Science by exploring the following topics: 1. How to build a data-mining and analytics information system. 2. How to capture, transfer, and store tweets in their entirety (metadata and all). 3. What exactly a tweet is, and why it is potentially valuable (we will explore a dissected tweet and identify and explain its composition). 4. Potential applications for data-mining and analytics systems in political science.
  • 19. 13 CHAPTER 4: METHODS AND APPROACH Information System and Data Collection In this section, I will detail the creation and composition of the data-mining and analytics information system used for this project. The software and hardware used during the course of the project are flexible and can be adapted or scaled as necessary. In addition, with the development and proliferation of relatively inexpensive and accessible cloud computing services, the information system I detail here can be ported over to a cloud provider and scaled as needed. The platform I created for this project comprises the following components: 1. A physical server computer to host the operating system and client software: this can be a virtual server if running from the cloud or another networked computer. I chose to use a dedicated PC that I loaded with a server operating system. For production purposes, I recommend a dedicated physical, virtual, or cloud-based server, or a cluster of servers if you really want to scale. 2. An operating system: I chose to use Ubuntu Linux Server as my operating system because it is a free and open source, enterprise-capable server operating system. In addition, Ubuntu is well maintained, well documented, and enjoys a broad user and technical support base on the Internet. 3. Physical computer storage: I chose to create a storage area network (SAN) for my server to utilize. For this, I used a 4-terabyte network attached storage appliance, created a virtual disk pool, partitioned an iSCSI logical unit number (LUN) from the pool, and assigned it as virtual storage for my Ubuntu Server via a routed virtual local area network (VLAN).
  • 20. 14 4. A terminal to connect to your server: The terminal can be a physical monitor console, a web browser, or a software terminal emulator. I chose to use a Secure Shell (SSH) terminal emulator to securely connect to my server from anywhere. 5. A database: I chose to use MongoDB. MongoDB is an excellent fit for a data-mining and analytics information system because it stores documents in a binary form of JavaScript Object Notation (JSON), the format native to many social media communications. 6. Database administration software: I chose to use Robomongo because it is a free, secure, and feature-rich database administration suite. 7. Programming language and interpreter (if required): I chose to use the Python programming language because of its broad documentation, ease of use, clear syntax, broad support base, rich software library pool, and open source nature. 8. The data-mining software engine: I chose to use twitterstream-to-mongodb by Guillermo Del Fresno on GitHub (2014), because it is free, open source, licensed for general use (GNU GPL), and written in my favorite programming language, Python. Once you have acquired the necessary components, you will need to assemble, install, and deploy your information system. 1 1 Reference the installation and deployment instructions particular to your software, hardware, and operating system components.
  • 21. 15 Once the data-mining system is set up, the next step is to create an authenticated application on the Twitter Developer’s Website at https://dev.twitter.com/ (this can be done either before or after the previous step). The Twitter application you create in this step will allow your data-mining and analytics system to make authenticated requests to Twitter’s APIs, and is required for the data-mining portion of this system. From the Twitter developer’s website, you can create an account, log in, and create a Twitter App (read-only access will suffice, unless you want your system to publish information on behalf of your application). Once you have created the Twitter app, you will need to record and safeguard the following values: the consumer key, consumer secret, access token, and access token secret.2 2 Reference the screenshot in Appendix 2: Twitter App, for what the authentication and authorization variables look like.
  • 22. 16 Now that the system is established and the authenticated Twitter app is created, the next step is to populate your system with the files and values it needs in order to data-mine Twitter. In this step, you will need to log in to your server and navigate to the directory in which your twitterstream-to-mongodb script is located. Once you are in the correct directory, you will need to create the following files and populate them with the following values. 1. Create a file named “oauth.json”: In this file, enter the following terms and values as such: { "consumer_key" : "enter your consumer key here", "consumer_secret" : "enter your consumer secret here", "access_token" : "enter your access token here", "access_token_secret" : "enter your access token secret here" } OAuth is the open standard that Twitter uses to allow for programmatic authentication and access to their APIs. 2. Create a file named “objects.txt”: In this file, enter the objects you wish to track. Each object must be separated from the next by one space, and the file cannot exceed 400 objects (the 400-object maximum is a Twitter API limitation). The objects can include the following types of values: #example, @example, and example.3 3 Version 1 of the system I present in this thesis utilizes the Twitter Streaming API’s “track” feature. As such, the system will only collect tweets that contain the values listed in the objects file. This API limitation excludes tweets that are created by an account listed in the objects file. However, the system will collect every tweet that references a value listed in the objects file. In version 2 of this system, I will incorporate the Twitter Streaming API’s “follow” feature, which will permit the collection of tweets that are directly created or shared by an account listed in the objects file.
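The two input files described above can also be prepared and checked programmatically before the collector is started. The following is a minimal sketch, not part of twitterstream-to-mongodb itself: the helper names (build_oauth_document, missing_keys, parse_objects) are hypothetical, and the only facts taken from the text are the four credential keys and the 400-object ceiling. Splitting on any whitespace is an assumption that tolerates objects separated by spaces or by line breaks.

```python
import json

# The four credential names come from the oauth.json layout described above.
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)

# Ceiling on tracked objects, as stated in the text.
MAX_TRACK_OBJECTS = 400


def build_oauth_document(consumer_key, consumer_secret, access_token, access_token_secret):
    """Return the credentials as JSON text (json.dumps always emits straight quotes)."""
    document = {
        "consumer_key": consumer_key,
        "consumer_secret": consumer_secret,
        "access_token": access_token,
        "access_token_secret": access_token_secret,
    }
    return json.dumps(document, indent=2)


def missing_keys(oauth_text):
    """List the required credential keys that are absent or empty in the JSON text."""
    loaded = json.loads(oauth_text)
    return [key for key in REQUIRED_KEYS if not loaded.get(key)]


def parse_objects(objects_text):
    """Split the objects file on whitespace and enforce the 400-object ceiling."""
    objects = objects_text.split()
    if len(objects) > MAX_TRACK_OBJECTS:
        raise ValueError(
            "objects file lists %d objects; the maximum is %d"
            % (len(objects), MAX_TRACK_OBJECTS)
        )
    return objects


if __name__ == "__main__":
    # Placeholder values only; real credentials come from the Twitter app page.
    print(build_oauth_document("KEY", "SECRET", "TOKEN", "TOKEN_SECRET"))
    print(parse_objects("#example @example example"))
```

A check like this catches two easy mistakes early: word-processor "curly" quotation marks, which are not valid JSON, and an objects list that has quietly grown past the limit.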
  • 23. 17 Once these files are created and populated, the next step is to initiate the script and data collection. Initiate the script in the following manner, from the directory that contains the oauth, objects, and twitterstreamtomongodb.py files: 1. From the terminal, enter the following command and parameters (if on a Windows system, disregard the “sudo”): sudo python twitterstreamtomongodb.py --oauth=oauth.json --server=127.0.0.1 --port=27017 --database="insert DB name here" --track=objects.txt 4 4 The parts of this command are as follows: 1. sudo is used on Linux-based systems to invoke the context of another account, typically with elevated or administrative privileges. The “python” command calls the Python interpreter to interpret the script (the file immediately after it, ending in the .py extension). 2. --oauth is the parameter that passes your Twitter app’s credentials, stored in the file, to the program for authentication and authorization to Twitter’s APIs. 3. --server denotes the Internet Protocol (IP) address of your server. This value can also be a host name if you are using an externally provided hosting platform or have otherwise resolved the IP address to a host name. In this example, I am using the local host address of 127.0.0.1, which indicates that I am running the program directly from the local machine. The tweets you collect will be routed to the IP address you place into this parameter. On this system, I am using a SAN, which has its own set of IP addresses. However, the SAN is providing virtual storage that is logically attached to the host server, which is why I am using the local host address. 4. --port denotes the software endpoint that facilitates application or protocol specific communication. In this case, port 27017 is the default listening port for MongoDB core services. 5. --database is the parameter that references the database that will house the incoming Twitter data. 
The program will automatically create the database on the server hosting MongoDB services that is referenced in the --server parameter. 6. --track is the parameter that references the text file that contains the objects the system will track and collect (one object per line with a maximum of 400 objects).
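To make the role of each parameter concrete, the sketch below mirrors the five flags with Python's standard argparse module and shows how the server, port, and database values combine into a MongoDB connection string. This is an illustration written under assumptions, not the source of twitterstreamtomongodb.py: the function names, the defaults, and the database name "tweets" are hypothetical, and the real script may parse its arguments differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags described in the footnote above."""
    parser = argparse.ArgumentParser(
        description="Illustrative stand-in for the collector's command line."
    )
    parser.add_argument("--oauth", required=True,
                        help="path to the JSON file holding the Twitter app credentials")
    parser.add_argument("--server", default="127.0.0.1",
                        help="IP address or host name of the MongoDB server")
    parser.add_argument("--port", type=int, default=27017,
                        help="MongoDB listening port (27017 is the MongoDB default)")
    parser.add_argument("--database", required=True,
                        help="name of the database that will hold the tweets")
    parser.add_argument("--track", required=True,
                        help="path to the text file listing the objects to track")
    return parser


def mongo_uri(args):
    """Combine server, port, and database into a standard MongoDB connection string."""
    return "mongodb://%s:%d/%s" % (args.server, args.port, args.database)


if __name__ == "__main__":
    # An explicit argument list is used here so the example does not depend on sys.argv.
    example = build_parser().parse_args([
        "--oauth=oauth.json",
        "--server=127.0.0.1",
        "--port=27017",
        "--database=tweets",
        "--track=objects.txt",
    ])
    print(mongo_uri(example))
```

Declaring the port as an integer and marking the credential, database, and track arguments as required means a mistyped flag fails immediately at launch rather than partway through a collection run.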
  • 24. 18 Once the program is initiated, the data-mining begins, and the tweets will start pouring in as soon as they are created or distributed. From this point, how long you collect tweets is up to you, and is limited only by the resources you allocate to the data-mining system and by Twitter’s rate limiting. Once you have collected an acceptable data set, the next thing to do is to analyze the data and generate an information product of potential value. However, before detailing the analysis portion of this project, I believe that it is crucial to explore the composition of a tweet and why a collection of tweets can be valuable. So what exactly is a tweet? The common conception is that a tweet is a roughly 140-character message that, on its face, is only able to communicate the most minimal of information. However, as I have alluded to throughout this thesis, there is more to a tweet than meets the eye: much more. At the heart of a tweet lies a rich metadata architecture that binds and directs the individual tweet into the larger social network and information ecosystem. In essence, the tweet metadata provides defined fields, which can then be populated by personally identifying and descriptive data pertaining to the creator, distributor, and consumer of the tweet. The data and metadata associated with a tweet can then be used to create information constructs such as location mosaics, “webs-of-association”, behavior-pattern analyses, and sentiment analyses. Information constructs derived from the underlying data and metadata contained within a tweet can then detail how individual creators and consumers of a tweet relate in the broader Twitter social network, and even in the real world. What this means is that an individual, or an automated information system, can use Twitter data and metadata, or most other metadata, to create information models that detail human behavior and association. 
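As one illustration of a "web-of-association", the sketch below counts who mentions whom across a set of collected tweets. It is a simplified example written under assumptions rather than the analysis used in this project: the function name mention_edges and the sample accounts are hypothetical, and the tweets are plain Python dictionaries that use the user.screen_name and entities.user_mentions fields of a tweet's metadata.

```python
from collections import Counter


def mention_edges(tweets):
    """Count (author, mentioned account) pairs across a collection of tweets."""
    edges = Counter()
    for tweet in tweets:
        author = (tweet.get("user") or {}).get("screen_name")
        mentions = (tweet.get("entities") or {}).get("user_mentions") or []
        for mention in mentions:
            target = mention.get("screen_name")
            if author and target:
                edges[(author, target)] += 1
    return edges


# Hypothetical sample data; real documents would come from the MongoDB collection.
SAMPLE_TWEETS = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "rep_example"}]}},
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "rep_example"}]}},
    {"user": {"screen_name": "bob"},
     "entities": {"user_mentions": [{"screen_name": "rep_example"}]}},
    {"user": {"screen_name": "carol"}, "entities": {"user_mentions": []}},
]

if __name__ == "__main__":
    for (author, target), count in mention_edges(SAMPLE_TWEETS).most_common():
        print(author, "->", target, count)
```

Each counted pair is a weighted, directed edge, so the resulting table could be loaded into a graph library or a spreadsheet to visualize which accounts cluster around a given Member of Congress.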
While there are numerous ways to depict Twitter metadata, I believe the most accessible is a visual aid with descriptions of the various components.
The tweet metadata depicted in the following screenshot is a graphical representation and may be difficult to view. Appendix 1 depicts a tweet’s metadata in its native textual representation, JavaScript Object Notation (JSON).
The following list details some of the potentially significant metadata fields associated with tweets and describes their functions:
1. _id: A unique alphanumeric identifier for the individual tweet document, assigned by MongoDB when the tweet is stored.
2. contributors: The IDs of users who have contributed to the tweet.
3. text: The actual message field of the tweet; this is what most people see when a tweet is created or consumed.
4. in_reply_to_status_id: If the tweet is a reply to another tweet, this field provides the integer representation of the original tweet’s ID.
5. favorite_count: How many times the tweet has been “favorited” by other Twitter users.
6. source: The generating source of the tweet (such as the Twitter for iPhone app).
7. coordinates: The longitude and latitude of the tweet’s generating source.
8. entities: This field contains the following subfields: hashtags, any hashtags referenced in the tweet; user_mentions, any Twitter users mentioned in the tweet; symbols, any symbols listed in the tweet; media, the resource locators for any pictures, videos, or other media files associated with the tweet; and urls, the uniform resource locators provided in the tweet.
9. retweet_count: The number of times the tweet has been retweeted.
10. retweeted_status: For a retweet, this field contains the original tweet that was retweeted, including the following descriptive and identifying fields: contributors, id, favorite_count, source, retweeted, coordinates, and entities.
11. user: This field contains data and metadata that identify and describe the account that composed the tweet, including the following fields: id, the unique identifier of the user account that created the tweet; verified, whether or not the user’s Twitter account is verified; friends_count, the number of accounts the user follows (the user’s “friends”); location, the location the user lists in the account profile; geo_enabled, whether the user account has geo-tracking enabled; name, the name of the Twitter account; lang, the language set for the user’s account (the language of the tweet itself is recorded in the tweet’s top-level lang field); favourites_count, the number of tweets that the user has marked as favorites; screen_name, the screen name of the Twitter user; created_at, the date-time stamp of the account’s creation (the tweet’s own creation time is recorded in the tweet’s top-level created_at field); contributors_enabled, whether or not the Twitter user has permitted the use of authenticated contributors; and time_zone, the time zone set for the user’s account.
The metadata fields I just described are only a few of the total fields available in the ever-evolving Twitter system. As you can see, there are many more metadata fields depicted in the graphical representation, and many more in the textual representation illustrated in Appendix 1. The potential applications for deriving value from these metadata and data points are limited only by the creativity, ability, resources, and access of the individual or system that captures, processes, and analyzes them.
At this stage of the operation, the data-mining system should have collected a database composed of collections, which will contain every tweet referencing an object listed in your “objects” file. Now that I have detailed the creation of a data-mining system, created an authenticated Twitter app, and dissected and explored a tweet’s metadata structure, we can move on to the methods and approach I used to process and analyze Twitter data and metadata.5
5 Reference Appendix 3 for database backup and restore instructions.
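To make the field descriptions above more concrete, the following Python sketch pulls a handful of those fields out of a single tweet represented as a dictionary. The function name summarize_tweet and the choice of fields are assumptions made for illustration; the sample values are abridged from the first tweet in Appendix 1.

```python
def summarize_tweet(tweet):
    """Collect a few of the metadata fields described above.

    Hypothetical helper: missing or null fields are tolerated because
    many tweets carry no coordinates, place, or entities.
    """
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    coordinates = tweet.get("coordinates") or {}
    return {
        "text": tweet.get("text"),
        "author": user.get("screen_name"),
        "author_location": user.get("location"),
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "is_retweet": "retweeted_status" in tweet,
        # When present, "coordinates" is a GeoJSON point whose inner
        # "coordinates" list is ordered [longitude, latitude].
        "point": coordinates.get("coordinates"),
    }


if __name__ == "__main__":
    sample = {
        "text": "@AustinScottGA08 Silence Is Complicity #MSSen",
        "coordinates": None,
        "entities": {
            "user_mentions": [{"screen_name": "AustinScottGA08"}],
            "hashtags": [{"text": "MSSen"}],
        },
        "user": {"screen_name": "freedomsfool", "location": "Western Pennsylvania"},
    }
    print(summarize_tweet(sample))
```

A summary like this is only a convenience for reading individual tweets; the queries in the next chapter work directly against the stored documents.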
CHAPTER 5: DATA PROCESSING AND ANALYSIS
The following examples are queries and information products I created using data captured, collected, and processed by my data-mining and analytics system. During my collection period, beginning on 8 October 2014 at 2000 hrs. and ending on 25 October 2014 at 2000 hrs., I collected almost every tweet referencing a randomly selected list of 279 MCs. In total, my data-mining and analytics system automatically collected 472,395 tweets, including all the corresponding metadata. Now, I will move on to the analysis portion of this project.
One of the most approachable methods to analyze social media information, without initially getting bogged down in the intricacies of metadata analysis, is to create a table analysis. In this example, I select a sample set of collected Twitter objects, in this case the Twitter handles of certain MCs, and assign them variables. The variables correspond to the MC’s name, age, party, chamber, state, district, district competitiveness (DC), and the number of tweets associated with that MC.6
6 District competitiveness is defined with an “S” for safe, or an “N” for not safe. The district competitiveness information was collected from Sabato’s Crystal Ball at http://www.centerforpolitics.org/crystalball/
Handle            Name                Age  Party  Chamber  State  District  DC  Tweets
@SpeakerBoehner   Boehner, John       64   R      H        OH     8         S   44590
@SteveScalise     Scalise, Steve      49   R      H        LA     1         S   4278
@WhipHoyer        Hoyer, Steny        75   D      H        MD     5         S   1963
@McConnellPress   McConnell, Mitch    72   R      S        KY     -         S   4978
@SenatorDurbin    Durbin, Richard     69   D      S        IL     -         S   1857
@SenFeinstein     Feinstein, Dianne   81   D      S        CA     -         S   2567
@JoaquinCastrotx  Castro, Joaquin     40   D      H        TX     20        S   1636
@RepCuellar       Cuellar, Henry      59   D      H        TX     28        S   403
@SenSanders       Sanders, Bernie     73   D      S        VT     -         S   10657
@SenJohnMcCain    McCain, John        78   R      S        AZ     -         S   20726
@SenTedCruz       Cruz, Ted           43   R      S        TX     -         S   32307
@SenSchumer       Schumer, Chuck      63   D      S        NY     -         S   2633
@RepBetoORourke   O’Rourke, Beto      42   D      H        TX     16        S   884
@RepWestmoreland  Westmoreland, Lynn  64   R      H        GA     3         S   1166
@RepTomPrice      Price, Tom          60   R      H        GA     6         S   777
@repjohnbarrow    Barrow, John        59   D      H        GA     12        S   368
@LEETERRYNE       Terry, Lee          52   R      H        NE     2         N   2671
@RepNickRahall    Rahall, Nick        65   D      H        WV     3         N   610
@CongMikeSimpson  Simpson, Mike       64   R      H        ID     2         N   530
@RepBera          Bera, Ami           49   D      H        CA     7         N   836
Now that descriptive and identifying attributes have been associated with the MCs’ Twitter handles, the next step is to run some basic analytic queries against the collections in the database and extract some potentially useful information. For the next step, I will query the collection associated with each of the MCs listed and extract the number of tweets that referenced the MC in the “Twitterverse” during the collection period.7
7 To capture the total number of documents (tweets) contained within a collection (MC) in the database, run the following from the Mongo shell or a GUI database management console. From the command-line terminal, enter the command “mongo”. Then, from the Mongo shell, enter: use “enter db name”. Finally, from the database, enter: db['@enter-tracking-object-name-here'].stats()
Once the query executes, the system will produce statistics from the queried collection and output them to the terminal in a JSON representation. The output will look like this:
{
    "ns" : "DMTPS.@SpeakerBoehner",
    "count" : 44590,
    "size" : 346754336,
    "avgObjSize" : 7776,
    "storageSize" : 460861440,
    "numExtents" : 14,
    "nindexes" : 1,
    "lastExtentSize" : 124993536,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 1,
    "totalIndexSize" : 1455328,
    "indexSizes" : {
        "_id_" : 1455328
    },
    "ok" : 1
}
For the purpose of this query, the important value to extract is the “count” field, which is the total number of documents, tweets in this case, that referenced a particular MC during the collection period. In the following examples, I will construct more advanced, metadata-driven queries that will extract identifying and associative data from the collection database.
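As a rough illustration of reading the “count” value programmatically rather than by eye, the sketch below parses a saved copy of the statistics output with Python’s standard json module. The function name tweet_count is hypothetical, the sample string is an abridged copy of the output shown above, and the approach assumes the saved output is strict JSON (shell-specific wrappers such as NumberLong would need to be removed first).

```python
import json

# Abridged copy of the statistics output shown above.
STATS_OUTPUT = """
{ "ns" : "DMTPS.@SpeakerBoehner", "count" : 44590,
  "avgObjSize" : 7776, "nindexes" : 1, "ok" : 1 }
"""


def tweet_count(stats_json):
    """Return (collection name, number of tweets) from a stats document."""
    stats = json.loads(stats_json)
    if stats.get("ok") != 1:
        raise ValueError("the stats command did not report success")
    # "ns" is "<database>.<collection>"; split on the first dot only.
    _database, _, collection = stats["ns"].partition(".")
    return collection, stats["count"]


if __name__ == "__main__":
    handle, count = tweet_count(STATS_OUTPUT)
    print(handle, count)
```

Repeating this for each collection would yield the Tweets column of the table analysis without manual transcription.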
In the following examples, I have anonymized any personally identifying information my queries and analytics produced for privacy reasons. For these queries and analytics, I use the MongoDB Aggregation Framework to query the collection associated with a Member of Congress, and then to find the following information for every tweet that references the specified MC:
1. The text of the tweet referencing a specific MC.
2. The Twitter users referenced in the tweet (the intended audience).
3. The user screen name and “real name” of the Twitter account holder that created the tweet.
4. The number of friends that the Twitter user has.
5. The location and country where the tweet was created.
6. The geographic coordinates of the location where the tweet was created.
In order to query the collection and extract the pertinent information, the following query must be run against the collection you wish to analyze using the MongoDB Aggregation Framework.
{
    $group: {
        _id: {
            text: "$text",
            entities_user_mentions_screen_name: "$entities.user_mentions.screen_name",
            user_name: "$user.name",
            user_screen_name: "$user.screen_name",
            user_friends_count: "$user.friends_count",
            place_full_name: "$place.full_name",
            geo_coordinates: "$geo.coordinates"
        }
    }
}
Once the query executes, a document will be created that contains the information you extracted from the data. The following output is a real example of an information product the query generated:
{
    "_id" : {
        "text" : "These are #Ukraine war crimes. #ukrainevotes jail the Kiev criminals. http://t.co/uIW6KxqzG4\"\n@WhiteHouse \n@BarackObama \n@SpeakerBoehner",
        "entities_user_mentions_screen_name" : [
            "WhiteHouse",
            "BarackObama",
            "SpeakerBoehner"
        ],
        "user_name" : "Pattys4Putin-USA",
        "user_screen_name" : "PattyDs50",
        "user_friends_count" : 1598,
        "place_full_name" : "New Hampshire, US",
        "geo_coordinates" : [
            42.908474,
            -71.841744
        ]
    }
}
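For readers who want to see what the grouping stage does without a database at hand, the following is a simplified Python approximation. It is not how MongoDB implements $group; it only mimics the visible effect of grouping on a compound _id, which is one output document per distinct combination of the selected fields. The helper names resolve and group_distinct are hypothetical, and missing fields are represented as None, which is a simplification.

```python
import json

# Output field name -> dotted path, mirroring the $group stage above.
FIELDS = {
    "text": "text",
    "entities_user_mentions_screen_name": "entities.user_mentions.screen_name",
    "user_name": "user.name",
    "user_screen_name": "user.screen_name",
    "user_friends_count": "user.friends_count",
    "place_full_name": "place.full_name",
    "geo_coordinates": "geo.coordinates",
}


def resolve(value, path):
    """Follow a dotted path; a path crossing a list yields a list of values."""
    for part in path.split("."):
        if isinstance(value, list):
            value = [item.get(part) for item in value if isinstance(item, dict)]
        elif isinstance(value, dict):
            value = value.get(part)
        else:
            return None
    return value


def group_distinct(tweets, fields=FIELDS):
    """Return one {"_id": {...}} document per distinct field combination."""
    groups = {}
    for tweet in tweets:
        key = {name: resolve(tweet, path) for name, path in fields.items()}
        # A JSON string serves as a hashable stand-in for the compound key.
        groups.setdefault(json.dumps(key, sort_keys=True), {"_id": key})
    return list(groups.values())
```

Passing a list of tweet dictionaries to group_distinct returns documents shaped like the output shown above, with exact duplicates collapsed into one.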
For the following example, I queried an MC’s collection in order to find all the tweets referencing the MC that were written in Spanish during my collection period. The query also found the screen name, real name, city, and state where the Twitter account holder was when he or she created the tweet, and the geographical coordinates of the exact location where the tweet was created. The following query also employs the MongoDB Aggregation Framework:
{
    $group: {
        _id: {
            text: "$text",
            lang: "$lang",
            user_screen_name: "$user.screen_name",
            user_name: "$user.name",
            place_full_name: "$place.full_name",
            geo_coordinates: "$geo.coordinates"
        }
    }
},
{
    $match: {
        "_id.lang": "es"
    }
}
Once the query executes, a document will be created that contains the information you extracted from the data. The following output is a real example of an information product the query generated:
{
    "_id" : {
        "text" : "Gracias @JoaquinCastrotx por apoyar la #ReformaMigratoria. Por favor sigue luchando por #CIR. #TimeIsNow http://t.co/p5F9Y54Ac2 vía @FWD_us",
        "lang" : "es",
        "user_screen_name" : "DguezVd",
        "user_name" : "Vaneza Dominguez",
        "place_full_name" : "Dallas, TX",
        "geo_coordinates" : [
            32.900652,
            -96.871544
        ]
    }
}
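One possible variation on the language query, sketched here in Python, builds the same two stages as a list but places the $match stage first. Because lang is part of the grouping key, filtering before grouping should return the same documents while letting MongoDB discard non-matching tweets early. The function name language_pipeline is hypothetical, and running the pipeline from Python would require a driver such as PyMongo, which this sketch does not import or assume is installed.

```python
import json


def language_pipeline(lang="es"):
    """Build an aggregation pipeline that filters by tweet language first."""
    return [
        # Filter on the tweet's top-level "lang" field before grouping.
        {"$match": {"lang": lang}},
        {
            "$group": {
                "_id": {
                    "text": "$text",
                    "lang": "$lang",
                    "user_screen_name": "$user.screen_name",
                    "user_name": "$user.name",
                    "place_full_name": "$place.full_name",
                    "geo_coordinates": "$geo.coordinates",
                }
            }
        },
    ]


if __name__ == "__main__":
    # The printed JSON can be pasted into an aggregate() call in the shell.
    print(json.dumps(language_pipeline("es"), indent=2))
```

With PyMongo, the list returned by language_pipeline could be passed to a collection’s aggregate method; the database and collection names would be those chosen when the collector was started.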
The three examples I documented here only scratch the surface of what is possible by incorporating a metadata-driven approach to social media data-mining and analytics. By leveraging the robust Twitter metadata architecture, I was able to collect a vast and nearly complete dataset of almost half a million tweets referencing 279 MCs. I was then able to query the individual collections corresponding to the MCs and create potentially valuable information products. In the example queries, I was able to identify the tweet frequencies associated with particular MCs, the Twitter handles of users referenced in a tweet, the number of friends the tweet generator has, the intended audiences of the tweets, and the physical location of the tweet generators, and even to filter tweets by language. However, the queries I provided here are only the beginning. Truly, the possibilities for generating valuable information products by leveraging data-mining and analytics are limited only by the creativity, skill, access, resources, and time of the new political data-miner. In the following section, I will expand upon some of the possible applications of using data-mining and analytics for the benefit of Political Science.
CHAPTER 6: POTENTIAL APPLICATIONS
While the potential applications for data-mining and analytics in Political Science are limited only by the creativity of the data-miner, I wanted to provide a hypothetical example of a political science activity that could benefit from such an approach. In this hypothetical scenario, a research team is given the task of collecting all the tweets created by, or referencing, all congressional candidates during a particular election cycle. Once the election cycle is over, the research team is to analyze the tweets and generate an information product that investigates how social media campaigns affect the creation and behavior of political associations on social media and in the real world.
In this scenario, the task would be difficult, if not impossible, to do with conventional social media collection and analysis tools. The research team could decide to comb the web for tweet collection websites, and to manually collect and operationalize the tweets using spreadsheets and the like. However, this method would be very labor and time intensive, and would only yield the message field of the tweet and some minimally identifying and descriptive data. With this method, the research team would not be able to construct a web-of-association that would identify the congressional districts in which the tweets were created, consumed, or shared.
The research team could employ another option: reach out to a tweet vendor, purchase all the tweets created by or referencing particular congressional candidates, and then run queries and analytics against those data sets. However, this method is expensive and does not lend itself to dynamically adjusting the collection of tweets, as a research team might do during its project. Nevertheless, if a research team has access to significant funding, employing a tweet vendor could be a simple method to collect sizable Twitter data sets. However, if you want the metadata associated with the tweet, which is often more valuable than the message itself, it will cost significantly more.
In this scenario, employing a data-mining and analytics system like the one I created and detailed in this project is ideal. The information system I created for this project uses only free and open-source technologies that are readily available and well documented on the Internet. The support communities for all the technologies required for building and operating this type of system are highly robust, typically friendly, and usually able to help almost anyone troubleshoot or navigate a particular technology. In addition, the software required to build and operate this type of system can be run on most commodity hardware, ranging from small desktop computers to massive clusters of networked servers, and even on the cloud.
Another benefit of this type of system is that you can securely access, monitor, and maintain the system remotely from virtually anywhere with an Internet connection. From your computer at home, your tablet on vacation, or your smartphone on the road, you can update your object collection list, create new databases, and write new analytic queries. A further benefit of employing and maintaining a data-mining and analytics system is that once it is established, the system can continue to collect information indefinitely. The system can also be used or replicated by others, who can then use the system as is, or expand the system and add new functionality and features to it. The possibilities of using open-source technologies for data-mining and analytics to the benefit of Political Science are almost limitless.
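As one hedged illustration of what a simple web-of-association might look like in practice, the Python sketch below counts directed “author mentions user” links across a set of collected tweets. The function name mention_edges is hypothetical, and counting mentions is only one of many ways such a web could be defined; linking accounts to congressional districts would require additional data not shown here.

```python
from collections import Counter


def mention_edges(tweets):
    """Count directed (author, mentioned user) pairs across tweets."""
    edges = Counter()
    for tweet in tweets:
        author = (tweet.get("user") or {}).get("screen_name")
        mentions = (tweet.get("entities") or {}).get("user_mentions") or []
        for mention in mentions:
            target = mention.get("screen_name")
            # Skip incomplete records and self-mentions.
            if author and target and author != target:
                edges[(author, target)] += 1
    return edges


if __name__ == "__main__":
    # Screen names abridged from the tweets reproduced in Appendix 1.
    sample = [
        {"user": {"screen_name": "freedomsfool"},
         "entities": {"user_mentions": [{"screen_name": "AustinScottGA08"}]}},
        {"user": {"screen_name": "MomOrWhatever"},
         "entities": {"user_mentions": [{"screen_name": "FreeTheMarine"},
                                        {"screen_name": "AustinScottGA08"}]}},
    ]
    for (source, target), weight in sorted(mention_edges(sample).items()):
        print(source, "->", target, weight)
```

The resulting edge counts could be loaded into a graph or visualization tool of the researcher’s choosing.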
CHAPTER 7: CONCLUSION
When I started this project, I wanted to create an information system that could pave the way for Political Science researchers to explore new technologies and methods in order to make their work easier, more innovative, and more productive. I had a strong background in Information Systems and Cybersecurity, but I had never before created a data-mining and analytics information system. I thought the process would be fun and challenging. However, I had no idea how fun and challenging the process would actually be. I knew that there were significant implications for data-mining and analytics in Political Science, but I was not sure how to bridge the gap between the disciplines. After much study, research, trial, and error, I created an information system that can mine data from the Internet easily, automatically, and perpetually. The next challenge was the analytics.
When I started this project, I did not have much knowledge or experience in “data-analytics.” I had, of course, analyzed data before, but not in the context of a formal data-mining and data-science initiative. Throughout this project, I taught myself a great deal about data-mining, data-science, and data-analytics. As such, I was able to produce some basic information products that I am sure will pique the interest of the more adventurous political scientists.
The data-mining and analytics system I created and detailed here is a basic system, but one that I hope will serve as the foundation for further development and study. With this system, I was able to capture a relatively large data set of tweets of political interest relatively easily and automatically. I was then able to collect the tweets into a database capable of storing unstructured data from practically anywhere in the digital world. Furthermore, I was able to manipulate, transform, and query the tweets to produce information products with the capacity to advance normative political theory and quantitative political analysis. In the end, I was able to provide a roadmap for future “political data-miners” to get started in constructing their own data-mining and analytics information systems for the benefit of Political Science.
REFERENCES
Chodorow, Kristina. 2013. MongoDB: The Definitive Guide. Sebastopol: O’Reilly.
Del Fresno, Guillermo. 2014. “twitterstream-to-mongodb” [Software]. GitHub. Retrieved from https://github.com/gdelfresno/twitterstream-to-mongodb
Farhi, Paul. 2009. “The Twitter Explosion.” American Journalism Review 31(3): 26–31. http://search.ebscohost.com/login.aspx?direct=true&db=ufh&AN=41877978&site=ehost-live (February 19, 2010).
Gervais, Bryan T. 2014. “Incivility Online: Affective and Behavioral Reactions to Uncivil Political Posts in a Web-based Experiment.” Journal of Information Technology & Politics (Forthcoming).
Golbeck, Jennifer, Justin M. Grimes, and Anthony Rogers. 2010. “Twitter Use by the U.S. Congress.” Journal of the American Society for Information Science and Technology 61(8): 1612–21.
Greenwald, Glenn. 2014. No Place to Hide: Edward Snowden, the NSA, and the U.S. Surveillance State. New York: Metropolitan Books.
Mansbridge, Jane. 1999. “Everyday Talk in the Deliberative System.” In Deliberative Politics: Essays on Democracy and Disagreement, ed. Stephen Macedo. Oxford University Press, 1 – 211.
McKinney, Wes. 2013. Python for Data Analysis 2nd ed. Sebastopol: O’Reilly.
Mutz, Diana C. 2006. Hearing the Other Side: Deliberative Versus Participatory Democracy. New York: Cambridge University Press.
Provost, Foster, and Tom Fawcett. 2013. Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. Sebastopol: O’Reilly.
Rosenzweig, Paul. 2013. Cyber Warfare: How Conflicts in Cyberspace Are Challenging America and Changing the World. Santa Barbara: Praeger.
Ruggie, John G. 2013. Just Business: Multinational Corporations and Human Rights. New York: Norton, W. W. & Company, Inc.
Russell, Matthew A. 2014. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More 2nd ed. Sebastopol: O’Reilly.
APPENDICES
Appendix 1: What a Tweet Really Looks Like in Its Native JSON
NOTE: I highlighted the text field, which contains the actual message portion of a tweet.
/* 0 */ { "_id" : ObjectId("54361be43b811434f9a21da4"), "contributors" : null, "truncated" : false, "text" : "✖ @AustinScottGA08 Silence Is Complicity #MSSen #RememberMississippi #MakeDCListen", "in_reply_to_status_id" : null, "id" : NumberLong(520082176352980993), "favorite_count" : 0, "source" : "<a href="http://tweetadder.com" rel="nofollow">TweetAdder v4</a>", "retweeted" : false, "coordinates" : null, "timestamp_ms" : "1412832228168", "entities" : { "user_mentions" : [ { "id" : 234797704, "indices" : [ 2, 18 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 41, 47 ], "text" : "MSSen" }, { "indices" : [ 48, 68 ], "text" : "RememberMississippi" }, { "indices" : [ 69, 82 ], "text" : "MakeDCListen" } ], "urls" : [] }, "in_reply_to_screen_name" : null, "id_str" : "520082176352980993",
  • 42. 36 "retweet_count" : 0, "in_reply_to_user_id" : null, "favorited" : false, "user" : { "follow_request_sent" : null, "profile_use_background_image" : true, "default_profile_image" : false, "id" : 265658805, "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/455915260524769280/ClR7foxv_normal.png", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "followers_count" : 3559, "profile_sidebar_border_color" : "000000", "id_str" : "265658805", "profile_background_color" : "000000", "listed_count" : 57, "profile_background_image_url_https" : "https://pbs.twimg.com/profile_background_images/845237718/447b881c8b774ed9199f6bf5505beb66.jpeg", "utc_offset" : -14400, "statuses_count" : 97286, "description" : "A Declaration Conservative: That 2 secure these (unalienable) rights, Govts R instituted among Men, deriving their just powers from the consent of the governed", "friends_count" : 3389, "location" : "Western Pennsylvania", "profile_link_color" : "000000", "profile_image_url" : "http://pbs.twimg.com/profile_images/455915260524769280/ClR7foxv_normal.png", "following" : null, "geo_enabled" : false, "profile_banner_url" : "https://pbs.twimg.com/profile_banners/265658805/1397533565", "profile_background_image_url" : "http://pbs.twimg.com/profile_background_images/845237718/447b881c8b774ed9199f6bf5505beb66.jpeg", "name" : "Freedoms Fool", "lang" : "en", "profile_background_tile" : false, "favourites_count" : 94, "screen_name" : "freedomsfool", "notifications" : null, "url" : null, "created_at" : "Sun Mar 13 23:26:25 +0000 2011", "contributors_enabled" : false, "time_zone" : "Eastern Time (US & Canada)", "protected" : false, "default_profile" : false, "is_translator" : false }, "geo" : null, "in_reply_to_user_id_str" : null, "possibly_sensitive" : false, "lang" : "en", "created_at" : "Thu Oct 09 05:23:48 +0000 2014", "filter_level" : "medium", "in_reply_to_status_id_str" : null, "place" : null } /* 1 */ { "_id" : 
ObjectId("5435dcae3b811434f9a1ff12"), "contributors" : null, "truncated" : false, "text" : "RT @FreeTheMarine: GA @AustinScottGA08 Pls support #HRes620 assisting our #MarineHeldInMexico. He needs treatment for PTSD ASAP #BringBackO…", "in_reply_to_status_id" : null, "id" : NumberLong(520014305681768449), "favorite_count" : 0,
  • 43. 37 "source" : "<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>", "retweeted" : false, "coordinates" : null, "timestamp_ms" : "1412816046572", "entities" : { "user_mentions" : [ { "id" : NumberLong(2476804154), "indices" : [ 3, 17 ], "id_str" : "2476804154", "screen_name" : "FreeTheMarine", "name" : "Free Sgt Tahmooressi" }, { "id" : 234797704, "indices" : [ 22, 38 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 51, 59 ], "text" : "HRes620" }, { "indices" : [ 74, 93 ], "text" : "MarineHeldInMexico" }, { "indices" : [ 128, 140 ], "text" : "BringBackOurMarine" } ], "urls" : [] }, "in_reply_to_screen_name" : null, "id_str" : "520014305681768449", "retweet_count" : 0, "in_reply_to_user_id" : null, "favorited" : false, "retweeted_status" : { "contributors" : null, "truncated" : false, "text" : "GA @AustinScottGA08 Pls support #HRes620 assisting our #MarineHeldInMexico. He needs treatment for PTSD ASAP #BringBackOurMarine", "in_reply_to_status_id" : null,
  • 44. 38 "id" : NumberLong(519365396366110720), "favorite_count" : 12, "source" : "<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>", "retweeted" : false, "coordinates" : null, "entities" : { "user_mentions" : [ { "id" : 234797704, "indices" : [ 3, 19 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 32, 40 ], "text" : "HRes620" }, { "indices" : [ 55, 74 ], "text" : "MarineHeldInMexico" }, { "indices" : [ 109, 128 ], "text" : "BringBackOurMarine" } ], "urls" : [] }, "in_reply_to_screen_name" : null, "id_str" : "519365396366110720", "retweet_count" : 29, "in_reply_to_user_id" : null, "favorited" : false, "user" : { "follow_request_sent" : null, "profile_use_background_image" : false, "default_profile_image" : false, "id" : NumberLong(2476804154), "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/509417608936845312/OX6Pm-8B_normal.jpeg", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "followers_count" : 3542, "profile_sidebar_border_color" : "000000", "id_str" : "2476804154", "profile_background_color" : "000000", "listed_count" : 59, "profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png", "utc_offset" : -25200,
  • 45. 39 "statuses_count" : 3706, "description" : "OFFICIAL Tahmooressi Family Account. Please visit: http://t.co/8PyH5q0uWE | #MarineHeldInMexico #HRes620 | Media Requests: jonathan@lucidpublicrelations.com", "friends_count" : 243, "location" : "www.andrewfreedomfund.com", "profile_link_color" : "134673", "profile_image_url" : "http://pbs.twimg.com/profile_images/509417608936845312/OX6Pm-8B_normal.jpeg", "following" : null, "geo_enabled" : false, "profile_banner_url" : "https://pbs.twimg.com/profile_banners/2476804154/1403332310", "profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png", "name" : "Free Sgt Tahmooressi", "lang" : "en", "profile_background_tile" : false, "favourites_count" : 12637, "screen_name" : "FreeTheMarine", "notifications" : null, "url" : "http://www.facebook.com/freethemarine", "created_at" : "Sun May 04 12:12:23 +0000 2014", "contributors_enabled" : false, "time_zone" : "Arizona", "protected" : false, "default_profile" : false, "is_translator" : false }, "geo" : null, "in_reply_to_user_id_str" : null, "possibly_sensitive" : false, "lang" : "en", "created_at" : "Tue Oct 07 05:55:34 +0000 2014", "filter_level" : "low", "in_reply_to_status_id_str" : null, "place" : null }, "user" : { "follow_request_sent" : null, "profile_use_background_image" : true, "default_profile_image" : false, "id" : 959017200, "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/509521711671558144/oqRiNGin_normal.jpeg", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "followers_count" : 145, "profile_sidebar_border_color" : "C0DEED", "id_str" : "959017200", "profile_background_color" : "C0DEED", "listed_count" : 2, "profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png", "utc_offset" : null, "statuses_count" : 4451, "description" : null, "friends_count" : 263, "location" : "", "profile_link_color" : "0084B4", "profile_image_url" : 
"http://pbs.twimg.com/profile_images/509521711671558144/oqRiNGin_normal.jpeg", "following" : null, "geo_enabled" : false, "profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png", "name" : "MomOrWhatever", "lang" : "en", "profile_background_tile" : false, "favourites_count" : 2426, "screen_name" : "MomOrWhatever", "notifications" : null, "url" : null,
  • 46. 40 "created_at" : "Tue Nov 20 00:24:19 +0000 2012", "contributors_enabled" : false, "time_zone" : null, "protected" : false, "default_profile" : true, "is_translator" : false }, "geo" : null, "in_reply_to_user_id_str" : null, "possibly_sensitive" : false, "lang" : "en", "created_at" : "Thu Oct 09 00:54:06 +0000 2014", "filter_level" : "medium", "in_reply_to_status_id_str" : null, "place" : null } /* 2 */ { "_id" : ObjectId("5435e9183b811434f9a204d8"), "contributors" : null, "truncated" : false, "text" : "RT @fenolj: @AustinScottGA08 When physical abuse #Tahmooressi endured comes 2 light YOU will be accountable. Co-sponsor #HRes620 #BringBa…", "in_reply_to_status_id" : null, "id" : NumberLong(520027633044946944), "favorite_count" : 0, "source" : "<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>", "retweeted" : false, "coordinates" : null, "timestamp_ms" : "1412819224039", "entities" : { "user_mentions" : [ { "id" : NumberLong(2680673774), "indices" : [ 3, 10 ], "id_str" : "2680673774", "screen_name" : "fenolj", "name" : "Jackie Fenolio" }, { "id" : 234797704, "indices" : [ 12, 28 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 49, 61 ], "text" : "Tahmooressi" }, { "indices" : [
  • 47. 41 122, 130 ], "text" : "HRes620" }, { "indices" : [ 131, 140 ], "text" : "BringBackOurMarine" } ], "urls" : [] }, "in_reply_to_screen_name" : null, "id_str" : "520027633044946944", "retweet_count" : 0, "in_reply_to_user_id" : null, "favorited" : false, "retweeted_status" : { "contributors" : null, "truncated" : false, "text" : "@AustinScottGA08 When physical abuse #Tahmooressi endured comes 2 light YOU will be accountable. Co-sponsor #HRes620 #BringBackOurMarine.", "in_reply_to_status_id" : null, "id" : NumberLong(519999120233467904), "favorite_count" : 1, "source" : "<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>", "retweeted" : false, "coordinates" : null, "entities" : { "user_mentions" : [ { "id" : 234797704, "indices" : [ 0, 16 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 37, 49 ], "text" : "Tahmooressi" }, { "indices" : [ 110, 118 ], "text" : "HRes620" }, { "indices" : [ 119, 138 ],
  • 48. 42 "text" : "BringBackOurMarine" } ], "urls" : [] }, "in_reply_to_screen_name" : "AustinScottGA08", "id_str" : "519999120233467904", "retweet_count" : 2, "in_reply_to_user_id" : 234797704, "favorited" : false, "user" : { "follow_request_sent" : null, "profile_use_background_image" : true, "default_profile_image" : false, "id" : NumberLong(2680673774), "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/519337504214753280/xZ6DzFeB_normal.jpeg", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "followers_count" : 125, "profile_sidebar_border_color" : "C0DEED", "id_str" : "2680673774", "profile_background_color" : "C0DEED", "listed_count" : 1, "profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png", "utc_offset" : null, "statuses_count" : 9624, "description" : null, "friends_count" : 58, "location" : "", "profile_link_color" : "0084B4", "profile_image_url" : "http://pbs.twimg.com/profile_images/519337504214753280/xZ6DzFeB_normal.jpeg", "following" : null, "geo_enabled" : false, "profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png", "name" : "Jackie Fenolio", "lang" : "en", "profile_background_tile" : false, "favourites_count" : 39, "screen_name" : "fenolj", "notifications" : null, "url" : null, "created_at" : "Fri Jul 25 22:56:37 +0000 2014", "contributors_enabled" : false, "time_zone" : null, "protected" : false, "default_profile" : true, "is_translator" : false }, "geo" : null, "in_reply_to_user_id_str" : "234797704", "possibly_sensitive" : false, "lang" : "en", "created_at" : "Wed Oct 08 23:53:46 +0000 2014", "filter_level" : "low", "in_reply_to_status_id_str" : null, "place" : null }, "user" : { "follow_request_sent" : null, "profile_use_background_image" : true, "default_profile_image" : false, "id" : 981285295, "verified" : false, "profile_image_url_https" : 
"https://pbs.twimg.com/profile_images/517386303441481728/RVa6gyU1_normal.jpeg", "profile_sidebar_fill_color" : "DDEEF6",
Appendix 3: MongoDB Database Backup, Restore, and Initialization
1. Log on to your data-mining system.
2. From the terminal, suspend the data collection by pressing “Ctrl + z”.
3. Enter the MongoDB administrative shell by entering the command “mongo”.
4. From the MongoDB administrative shell, enter the command “use admin”. This will switch you to the administrative database.
5. From the administrative database, enter the command “db.shutdownServer()”. This will shut down the MongoDB service.
6. Navigate to a folder designated to hold the database backups (you can create the folder locally, or on any other storage medium that is logically connected).
7. From the backup directory, enter the command “sudo mongodump --dbpath /path/to/your/mongodb”. The backup may take some time to complete, depending on the size of the databases. The backup will contain all the collections in your databases, stored in BSON (Binary JSON), with their corresponding metadata stored in JSON.
8. Once the backup is done, enter the following command to initiate the MongoDB service and resume logging: “sudo mongod --dbpath /path/to/your/mongodb --fork --logpath /var/log/mongodb.log”. Once the MongoDB service is initiated, enter the following command to resume collection, if need be: “sudo python twitterstreamtomongodb.py --oauth=oauth.json --server=127.0.0.1 --port=27017 --database=“insert DB name here” --track=objects.txt”.8
8 Insert the variables appropriate to your system in the server, port, and database fields.