Using Tweets for Understanding Public Opinion During U.S. Primaries and Predicting Election Results


Abstract
Using social media for political analysis, especially during elections, has become popular in the past few years, and many researchers and media outlets now use social media to understand public opinion and current trends. In this paper, we investigate methods for using Twitter to analyze public opinion and to predict U.S. Presidential Primary Election results. We analyzed over 13 million tweets from February 2016 to April 2016, during the primary elections, looking at tweets that mentioned Hillary Clinton, Bernie Sanders, Donald Trump, or Ted Cruz. First, we use sentiment analysis, geospatial analysis, network analysis, and visualization tools to examine public opinion on Twitter. We then use the Twitter data and analysis results to propose a model for predicting primary election results. Our results highlight the feasibility of using social media to gauge public opinion and predict election results.



Using Tweets for Understanding Public Opinion During U.S. Primaries and Predicting Election Results

Monica Powell
Barnard College, Columbia University
3009 Broadway, New York, NY 10027
mmp2181@barnard.edu

Nadia Jabbar
Columbia University Graduate School of Arts and Sciences
535 West 116th Street, New York, NY 10027
nj2290@columbia.edu

General Terms: Data Visualization, Prediction Models

Keywords: Twitter, Presidential Election, sentiment analysis, geomapping, RShiny, D3.js, social media, data visualization, Hillary Clinton, Bernie Sanders, Donald Trump, Ted Cruz

1 INTRODUCTION

Microblogging platforms such as Twitter have become increasingly popular communication tools for social media users, who often use these platforms to express their opinions on a variety of topics and to discuss current issues.
As more people use Twitter and other microblogging platforms, they post not only about their personal matters but also about products and services they use, and they even discuss their political and/or religious views. As a result, these microblogging websites have become valuable sources for gathering public opinion and for sentiment analysis.

Twitter has over 310 million monthly active users, and sites with embedded Tweets receive 1 billion unique visits monthly (?). Similar to other social networking websites, Twitter allows people to share information and express themselves in real time. This immediacy makes Twitter a platform that users can utilize to express their political support for, or discontent with, particular individuals or policies. However, it is debatable whether Twitter influences election results and whether sentiments expressed on Twitter represent a random sample of a given population.

All of the 2016 presidential candidates have a presence on Twitter, and more than two-thirds of U.S. Congress members have created a Twitter account, many of whom actively use Twitter to reach their constituents (Wang et al., 2012). An individual's network and the sentiments associated with them on Twitter are unique to them and may or may not mirror their network offline.

In this paper, we analyze tweets obtained from February 2016 to April 2016 in order to examine public opinion on the 2016 U.S. Presidential Primary Elections that are currently taking place. We hypothesize that Twitter, and by extension other popular microblogging websites such as Facebook and Google+, are good sources for understanding general public opinion regarding political elections. Furthermore, we hypothesize that Twitter (as well as other popular microblogging platforms) is also useful for predicting election results.
In order to test our hypotheses, we use several different techniques to extract useful information from the tweets, including sentiment analysis of the tweets, geospatial analysis, and network analysis. We use these methods to mine the collective
tweets to examine general public opinion regarding the Democratic candidates Hillary Clinton and Bernie Sanders, as well as the Republican candidates Donald Trump and Ted Cruz.

We next use the information extracted from the tweets to build several predictive models and test them in order to analyze how well Twitter is indicative of general public opinion regarding the 2016 Primaries. In our predictive models, we also incorporated polling data from several national polls conducted by different organizations and gathered by FiveThirtyEight, a website that focuses on opinion poll analysis. Additionally, we incorporated the final results for those states where the primaries have already happened in order to test the accuracy of our model.

2 LITERATURE REVIEW

While there is some controversy regarding this topic, social media data can certainly be used for analyzing socio-political trends from the past, during the present, and for the future. Asur and Huberman (2012) effectively used Twitter to predict some real-world outcomes, such as pre-release box office revenues for movies and trends in the housing market sector. Their work suggested that Twitter data can be successfully used to predict consumer metrics. Furthermore, Varian and Choi (2009) used data from Google Trends to predict real-time events, and their work indicated that Google Trends can be used to predict retail sales for motor vehicle and parts dealers. In yet another study, Ginsberg et al. (2010) used social media data to predict flu epidemics, while Mao and Zeng (2011) used Twitter to perform sentiment analysis in order to predict stock market trends.

Social media has also been used to examine political trends. O'Connor et al. (2010) studied public opinion measured from polls along with sentiment measured from text analysis of Twitter posts.
Their results showed a strong correlation (as high as 80 percent) between Twitter data and presidential elections. Furthermore, Tumasjan et al. (2010) studied the German federal election to investigate whether Twitter messages correctly mirror offline political sentiment, and they found that tweet sentiment regarding the candidates' political stances strongly correlated with the political landscape offline.

In 2012, Wang et al. created a system for real-time Twitter sentiment analysis for the presidential election, because the nature and popularity of Twitter allow researchers to analyze sentiment in real time, as opposed to waiting for a certain period of time in order to implement more traditional methods of data collection. A Swedish election was likewise tracked in real time by researchers using data gathered from Twitter (Larsson, 2012). While the role of Twitter in election outcomes is debatable, Twitter users are definitively not apolitical, and thus it is intriguing to investigate whether there is a direct correlation between political outcomes and Twitter activity.

Yet some studies have concluded that Twitter and other social media are not strongly reflective of real-world outcomes. Gayo-Avello et al. (2012) analyzed the 2010 U.S. Congressional elections using Twitter data to test Twitter's predictive power, and were unable to find any correlation between the data analysis results and the actual electoral outcomes. However, it is important to note that the landscape of social media has dramatically changed in the last few years, and so Twitter may be a more accurate measure of public opinion today than it was a few years ago.

3 RESEARCH QUESTION

Using social media for political discourse, especially during political elections, has become common practice.
Predicting election outcomes from social media data can be feasible, and as discussed previously, positive results have often been reported. In this paper, we test the predictive power of the social media platform Twitter in the context of the 2016 U.S. Primary elections. We will use Twitter data to develop a picture of public opinion about the political candidates online, and analyze our results against the results of the primaries that have already happened. We will then create predictive models using the Twitter data analysis results, and test those models using the primary results for those states where elections have already taken place. We propose that while Twitter is a good platform for analyzing public opinion, it cannot immediately replace other measures for gathering public opinion, such as polling data.
4 DATA

Over 13 million tweets were gathered on Twitter from February 2016 to April 2016. The entire dataset, as well as random samples of tweets from the dataset, were used to analyze online sentiments towards Hillary Clinton, Bernie Sanders, Donald Trump, and Ted Cruz. We also looked specifically at dates when at least one primary election was held. These dates were February 9th, February 20th, February 23rd, February 27th, March 1st, March 5th, March 6th, March 9th, March 10th, March 12th, March 15th, March 22nd, March 26th, April 5th, April 9th, April 19th, April 26th, and May 3rd. We did not have data for February 1st, and so this was the only primary election date left out of our analysis. Below is a list of the specific elections that happened on each date, as well as the states where the elections took place.

Tuesday, February 9: New Hampshire
Saturday, February 20: Nevada Democratic caucuses; South Carolina Republican primary
Tuesday, February 23: Nevada Republican caucuses
Saturday, February 27: South Carolina Democratic primary
Tuesday, March 1: Alabama; Alaska Republican caucuses; American Samoa Democratic caucuses; Arkansas; Colorado caucuses (both parties, no preference vote for Republicans); Democrats Abroad party-run primary; Georgia; Massachusetts; Minnesota caucuses (both parties); North Dakota Republican caucuses (completed by March 1); Oklahoma; Tennessee; Texas; Vermont; Virginia; Wyoming Republican caucuses
Saturday, March 5: Kansas caucuses (both parties); Kentucky Republican caucuses; Louisiana; Maine Republican caucuses; Nebraska Democratic caucuses
Sunday, March 6: Maine Democratic caucuses; Puerto Rico (Republicans only)
Tuesday, March 8: Hawaii Republican caucuses; Idaho (Republicans only); Michigan; Mississippi
Thursday, March 10: Virgin Islands Republican caucuses
Saturday, March 12: Guam Republican convention; Northern Mariana Islands Democratic caucuses; Washington, DC Republican convention
Tuesday, March 15: Florida; Illinois; Missouri; North Carolina; Northern Mariana Islands Republican caucuses; Ohio
Tuesday, March 22: American Samoa Republican convention; Arizona; Idaho Democratic caucuses; Utah caucuses (both parties)
Saturday, March 26: Alaska Democratic caucuses; Hawaii Democratic caucuses; Washington Democratic caucuses
Friday-Sunday, April 1-3: North Dakota Republican state convention
Tuesday, April 5: Wisconsin
Saturday, April 9: Colorado Republican state convention; Wyoming Democratic caucuses

5 METHODS

By capturing tweets mentioning each presidential candidate and analyzing the sentiments behind those tweets, we could track people's opinions about each candidate and thus predict the final primary election results. A function was constructed in R to automatically collect tweets from each day of February, March, and April. The tweets, along with information such as the user's Twitter handle, the user's location, the text of the tweet, the description of the user's profile, and whether the tweet was retweeted, were encoded into JSON (JavaScript Object Notation) files. The rjson package in R was used to parse the JSON files. We extracted all tweets related to at least one of the four political candidates (Clinton, Sanders, Trump, and Cruz), and combined all extracted tweets into a .csv file for further analysis.

All of the Twitter data was analyzed using R, D3.js, and QGIS in order to determine whether certain dimensions of Twitter activity related to the presidential election correlate with primary election results. Specifically, the research methods implemented aimed to address whether more positive sentiment towards a particular candidate on Twitter significantly increases that candidate's probability of winning a primary election. The primary focus of the analysis was text mining for sentiments, geospatial analysis using GIS to look at specific states, and network analysis to evaluate the network elements of the tweets and look at useful network parameters of the mention network of all four candidates.
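The extraction step above (parse JSON, keep tweets mentioning at least one candidate) was implemented in R with the rjson package; the following is a minimal sketch of the same logic in Python for illustration. The keyword lists and the `handle`/`text` field names are assumptions, not the paper's actual schema.

```python
import json

# Hypothetical keyword lists used to decide which candidate(s) a tweet mentions.
CANDIDATES = {
    "clinton": ("clinton", "hillary"),
    "sanders": ("sanders", "bernie"),
    "trump": ("trump",),
    "cruz": ("cruz",),
}

def candidates_mentioned(text):
    """Return the set of candidates whose name appears in the tweet text."""
    lowered = text.lower()
    return {cand for cand, keys in CANDIDATES.items()
            if any(key in lowered for key in keys)}

def filter_tweets(json_lines):
    """Parse one JSON-encoded tweet per line and keep only tweets that
    mention at least one of the four candidates (cf. the rjson step in R)."""
    kept = []
    for line in json_lines:
        tweet = json.loads(line)
        mentioned = candidates_mentioned(tweet.get("text", ""))
        if mentioned:
            kept.append((tweet.get("handle", ""),
                         tweet.get("text", ""),
                         sorted(mentioned)))
    return kept
```

The kept rows could then be written out with the standard csv module, mirroring the paper's export to a .csv file for further analysis.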
Additionally, we used our selected parameters, as well as general polling results obtained from FiveThirtyEight, a website that focuses on opinion poll analysis and politics, and built several prediction models to test whether Twitter is a good indicator of offline public opinion and political election outcomes.

6 VISUALIZATIONS

We constructed three types of visualizations to test our hypotheses. We created several static visualizations in R to get an overall look at all of the tweets in relation to each of the four candidates. We then created interactive visualizations using R Shiny and D3.js to look more closely at changes in public opinion over primary days, in order to evaluate Twitter trends during primary election days.

6.1 Static Visualizations

We first looked at the entire dataset, which consisted of a total of 13,289,699 tweets for the three months of February, March, and April. These tweets were parsed and divided into four categories representing the four political candidates. So, for example, all tweets that mentioned Clinton were merged into a single data frame. This was also done for Sanders, Trump, and Cruz.

Preliminary data analysis was conducted on the over 13 million tweets that were collected in order to reveal high-level trends that would be relevant and provide context for further sentiment analysis. As the pie chart below illustrates, more than fifty percent of all of the tweets in the entire data set mentioned Trump. He is undoubtedly the most discussed candidate on Twitter. Furthermore, Sanders was the second most discussed candidate on Twitter, while Cruz and Clinton were both discussed the least.

Figure 1: Proportion of Total Tweets Mentioning Each Candidate

The next visualization (depicted below) also relates to tweet volume by candidate and by party for all of the tweets in the final data set.
The outer donut illustrates the proportion of tweets belonging to each party. It is clear that the Republican candidates had far more tweets (68 percent of all tweets) than the Democratic candidates (32 percent of all tweets). This is largely because of Trump, who was mentioned in more than 50 percent of all tweets in the data set. The inner donut shows tweet proportions for each of the four candidates. As can be seen, Trump-related tweets make up the vast majority of the final data set, with 54 percent of all tweets mentioning Trump. Sanders was the next most popular on Twitter, with 19 percent of tweets mentioning him, while Cruz had 14 percent of tweets mentioning him. Clinton is the least popular candidate on Twitter, with 12 percent of all tweets mentioning her.

Trump, of course, has won the most primary elections by a large margin in comparison to the other Republican candidates, which Twitter confirms here. If we were to go by tweet volumes alone to predict the Presidential Elections, it would seem to support the claim that Trump will win by a landslide. Likewise, the fact that Clinton is less popular on Twitter compared to Sanders would seem to indicate that Sanders will win the Democratic primaries if we only look at tweet volume. However, looking at the primary elections that have happened thus far, Clinton has won more states than Sanders. Hence, this may indicate that tweet volume is not entirely accurate for predicting real election outcomes. The fact that tweet volume alone cannot predict a candidate's popularity in the general election led us to expand the scope of measures examined.

Figure 2: Tweet Volume by Party and by Candidate

We next extracted all tweets that were geo-tagged. This considerably reduced the number of tweets, as it is estimated that only between 5 and 20 percent of all tweets are geo-tagged with a location. However, we wanted to look at the origins of our tweets, and we assume that the sub-sample of geo-tagged tweets is strongly representative of the entire data set of tweets. Figure 3 below is a world map showing the origins of all tweets that were geo-tagged in the complete data set of tweets.
The yellow dots indicate where tweets originated on the map. Unsurprisingly, the vast majority of tweets originate from inside the United States. The primaries are for the U.S. Presidency, so it is expected that the four candidates would be most talked about within America. However, it was interesting to see that the highest concentration of tweets was on the East Coast of the U.S., while the West Coast was also heavily concentrated with tweet origins. Middle America was not very concentrated with tweets. If these geo-tagged tweets are reflective of the total sample of tweets used, then there may be a bias introduced in the data set, with a greater proportion of tweets from the East Coast and very few tweets originating from Middle America.

Outside of America, northern Europe, particularly the U.K., was also heavily concentrated with tweets pertaining to the candidates. We do not know why the U.K. in general had such a high proportion of tweets. It may be because British people are highly interested in American politics, because British Twitter users have a different system for geo-tagging than other countries, or because many Americans travel abroad to northern Europe and remain engaged with politics on Twitter during their vacation. The four candidates were discussed in other regions of the world as well, but with much lower concentration. Europe showed more interest in American politics than any other region (excluding the United States of America).

Figure 3: World Origins of Tweets

We next looked at the tweet frequency by state for each candidate within the United States. We first look at Hillary Clinton's map, which is depicted below. The states with the highest number of tweets mentioning Clinton were California, Texas, Florida, Illinois, and of course, New York. Clinton
has won all of the primary elections in these states, with the exception of California, which has not yet taken place. States like North and South Dakota, Montana, Wyoming, and Nebraska had an almost nonexistent frequency of tweets mentioning Clinton. However, it is interesting that Clinton seems to be more popular in Utah on Twitter in comparison to the other three candidates, even though she lost to Sanders in the Utah primary elections. This indicates that tweet volumes on Twitter may not be entirely accurate in predicting election results.

Figure 4: Clinton's Tweet Frequencies Map

We next looked at Bernie Sanders's map of tweet frequencies by state (depicted below). It is interesting to see which states he is more popular in compared to Clinton. Surprisingly, discussions about Sanders are very popular on Twitter in Ohio, even though he lost the Ohio state primaries to Clinton by a substantial margin.

Figure 5: Sanders' Tweet Frequencies Map

We next looked at Ted Cruz's map of tweet frequencies by state (depicted below). Compared to Clinton and Sanders, Ted Cruz is more popular on the West Coast, with states like Nevada, Arizona, and Oregon showing more interest in him on Twitter. He is also mentioned more in states like Montana and Nebraska, where Clinton and Sanders had almost nonexistent mentions.

Lastly, we looked at Donald Trump's map of tweet frequencies by state (depicted below). Interestingly, he is not as popular as Cruz in Montana, Wyoming, and Nebraska, where he is rarely mentioned on Twitter.

Figure 6: Cruz's Tweet Frequencies Map

Figure 7: Trump's Tweet Frequencies Map

After looking at the maps of all four candidates' tweet frequencies, it does seem that tweet frequencies are not always a good indicator of election results when only tweet volume per candidate is used.
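The state-level maps above are driven by simple per-candidate tweet counts over the geo-tagged subset. A minimal Python sketch of that aggregation (the paper did this in R/QGIS; the `state` and `candidates` field names here are assumptions):

```python
def tweet_counts_by_state(tweets):
    """Per-candidate tweet counts for each state, using only the
    geo-tagged subset; tweets without a resolved state are skipped."""
    counts = {}
    for tweet in tweets:
        state = tweet.get("state")
        if state is None:  # not geo-tagged, or location not resolvable
            continue
        for cand in tweet["candidates"]:
            state_counts = counts.setdefault(state, {})
            state_counts[cand] = state_counts.get(cand, 0) + 1
    return counts
```

Feeding these counts to a choropleth layer then yields maps like Figures 4-7.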
As mentioned above, there were some instances (such as Clinton's very high volume of tweets in Utah and Sanders's very high volume of tweets in Ohio) where Twitter did not correlate with the real-world outcome of the elections (based on the assumption that a higher tweet volume should correlate with winning the majority of votes). Overall, however, tweet volumes for each candidate in each state correlated with outcomes more often than not. Tweet volume per candidate will be a predictor variable incorporated into the prediction model introduced later in this paper.

It seems that tweet volume performs sporadically as a predictor of election results. However, we can use an algorithm to evaluate and categorize the feelings expressed in text; this is called sentiment analysis. Hence, we next looked at textual sentiment analysis of the tweets to get a better
insight into public opinion on Twitter regarding the 2016 Primary Elections.

In order to extract sentiments for each of the tweets, the Syuzhet R package was utilized, which comes with four sentiment dictionaries and provides a method for accessing the robust, but computationally expensive, sentiment extraction tool developed by the NLP group at Stanford. The developers of this algorithm built a dictionary/lexicon containing many words with associated scores for eight different emotions and two sentiments (positive/negative). Each individual word in the lexicon has a yes (one) or no (zero) for each of the emotions and sentiments, and we can calculate the total sentiment of a sentence by adding up the individual sentiments for each word in the sentence.

It is important to note that sentiment analysis of tweets comes with its fair share of problems. For example, sentiment analysis algorithms can be more sensitive to expressions typical of men than of women. Furthermore, it can be argued that computers are not optimal at identifying emotions correctly in all cases; they are likely not great at identifying something like sarcasm. Most of these concerns won't have a large effect on our analysis here. Additionally, when using as large a dataset as the one for this study, it is likely that many more tweets will be correctly identified by sentiment, and the effects of incorrectly identified sentiments will be normalized.

The entire data set was used to derive sentiment scores for all four candidates, and bar graphs depicting aggregates of the results are shown below. The more positive the sentiment score, the more positive the overall sentiment of the tweets associated with each candidate. Sanders has the highest average sentiment score of all the candidates, while Trump has the second highest average sentiment score over all tweets.
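The lexicon summation described above (add up per-word scores, then average over tweets) can be sketched in a few lines. This is a toy Python analogue of the Syuzhet approach, not the package itself; the six-word lexicon is purely illustrative, whereas the real NRC/Syuzhet lexicons are far larger and also tag eight emotion categories.

```python
# Toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"great": 1, "win": 1, "love": 1,
           "bad": -1, "lose": -1, "scandal": -1}

def sentence_sentiment(text):
    """Sum the lexicon scores of the words in one tweet/sentence."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return sum(LEXICON.get(word, 0) for word in words)

def average_sentiment(texts):
    """Average sentiment score over a collection of tweets, as plotted
    in the per-candidate bar graphs."""
    return sum(sentence_sentiment(t) for t in texts) / len(texts)
```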
Both Clinton and Cruz have lower average sentiment scores over all the tweets.

Figure 8: Average Sentiment Score over all Tweets for Each Candidate

When we look at the average very positive sentiment scores for each of the candidates over all of the tweets, Trump has, on average, more positive sentiment scores than the other candidates, while Sanders comes in a close second place. However, it is important to note that Trump has a very large proportion of tweets compared to Sanders, and this may be skewing the average very positive sentiment scores. It may be interesting to equalize the data set to contain fewer tweets mentioning Trump, and to see how this affects the average very positive sentiment scores. It is interesting that Clinton has the lowest average positive sentiment scores over all tweets mentioning her. Lastly, we look at the average very negative sentiment scores bar graph, and the results correspond to the other two graphs. Cruz has the highest average negative scores over all tweets relating to him, while Clinton comes in second place. Sanders, on the other hand, has the lowest average negative sentiment scores over all tweets mentioning him. Hence, if we were to go by these sentiment scores to predict election outcomes, it would seem that Sanders would win the Democratic primaries while Trump would win the Republican primaries.

Figure 9: Average Very Positive Sentiment Score over all Tweets for Each Candidate

Figure 10: Average Very Negative Sentiment Score over all Tweets for Each Candidate

6.2 R Shiny Visualization: Word Frequency (Wordcloud)

An interactive visualization app using the R Shiny platform was produced to analyze the text of the tweets. Preliminary data analysis was conducted on the tweets that were collected in order to reveal trends that would be relevant for further sentiment analysis. An R Shiny application was developed to generate a different wordcloud visualization for each date that data was collected. The wordcloud visualizations represent the words that were most prevalent in tweets related to a particular candidate. Each day had slightly different words that dominated a candidate's network, and on some days in particular there was a strong theme or increased polarization.

For example, on February 27th, tweets related to Donald Trump mainly contained '#nevertrump' (Figure 12). Influencers on Twitter such as Marco Rubio, Glenn Beck, and Amanda Carpenter all published tweets containing the hashtag as a strategic move against Donald Trump prior to Super Tuesday on March 1st, which led Trump to have a 0.09 sentiment score (Figure 11) (Figure 13).

Figure 11: Marco Rubio Tweets #NeverTrump on February 27th

Backlash and outrage over Hillary Clinton commending Nancy Reagan's involvement in the H.I.V./AIDS conversation following Reagan's death dominated tweets about Clinton on March 12th (Figure 14). "The problem with Mrs. Clinton's compliment: it was the Reagans who wanted nothing to do with the disease at the time" (Source: http://www.nytimes.com/politics/first-draft/2016/03/11/hillary-clinton-lauds-reagans-on-aids-a-backlash-erupts/). It was confirmed by sentiment analysis that tweets regarding Clinton were overall negative, as she had a sentiment score of only -0.01 on March 12th, while Sanders and Cruz had higher sentiment scores (0.22 and 0.31 respectively) (Figure 15). Trump also had a relatively low sentiment score of -0.01 on March 12th, which was the same day that protesters disrupted a Trump rally in Chicago and forced the event to be canceled (Figure 15) (Figure 16).

On March 26th, a scandal broke out involving Ted Cruz: the group Anonymous alleged that Ted Cruz was involved in a sex scandal. Most tweets that mentioned Ted Cruz on March 26th involved the scandal (Figure 17).
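The wordclouds above are driven by simple word-frequency counts over each day's tweets. A minimal Python sketch of that counting step (the app itself was built in R Shiny; the stopword list is an illustrative assumption):

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "for", "rt"}

def word_frequencies(tweets, top_n=50):
    """Count words (keeping #hashtags and @handles) across a day's tweets,
    skipping stopwords; a wordcloud scales its font sizes by these counts."""
    counter = Counter()
    for text in tweets:
        for word in re.findall(r"[#@]?\w+", text.lower()):
            if word not in STOPWORDS:
                counter[word] += 1
    return counter.most_common(top_n)
```

On a day like February 27th, `'#nevertrump'` would dominate the top of this list for Trump-related tweets.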
Although he generally had the highest sentiment score out of all the candidates, on March 22nd and March 26th he had the lowest sentiment score of all the candidates (0.02 and 0.01 respectively) (Figure 18).

In general, the words that appeared most frequently (as illustrated in the wordclouds) were predictive of a candidate's sentiment score, and this consistency further reinforced the appropriateness and validity of the Syuzhet R package that was used for sentiment calculations. The sentiment score provides a concrete quantitative measure of how a network feels towards a particular candidate, whereas the wordcloud represents the qualitative feelings of individuals and provides further context for the sentiment scores.

Figure 12: February 27th, 2016 Wordcloud for Donald Trump

Figure 13: February 27th, 2016 Sentiment

6.3 D3.js Visualizations

We next created several visualizations using the D3.js platform. D3.js (D3 for Data-Driven Documents) is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It makes use of the widely implemented SVG, HTML5, and CSS standards. All of the visualizations produced using D3.js are available at: http://aboutmonica.com/final%20D3/.

A D3 force layout visualization of the mention network for all four candidates was generated. The network was constructed only from tweets on the days of primary elections. This is a very large network with over 30,000 edges, and hence when the D3 visualization is produced, the resulting layout graph is very large and takes a while to load.

Figure 14: March 12th, 2016 Wordcloud for Hillary Clinton

Figure 15: March 12th, 2016 Sentiment

In the force layout visualization at the link provided above, you can see a vast social mention network of tweets. Since this social network has directed edges, we can look at the direction of tweet mentions: where many nodes are connected to one central node, the arrows all point to the central node. This means that many Twitter users are tweeting and mentioning the central node. From the network graph, you can see that some Twitter users (represented by the nodes in the graph) have very large networks and are very densely connected by edges to other Twitter users.
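The directed edges of a mention network like this can be recovered from the '@' handles inside each tweet's text. A small Python sketch of that extraction (the paper built its network for D3/Gephi; the `(author, text)` input shape is an assumption):

```python
import re

MENTION = re.compile(r"@(\w+)")

def mention_edges(tweets):
    """Directed edges (author -> mentioned user) of the mention network,
    one edge per @handle occurring in each tweet's text."""
    edges = []
    for author, text in tweets:
        for target in MENTION.findall(text):
            edges.append((author, target.lower()))
    return edges
```

An edge list in this form is exactly what a D3 force layout or Gephi import expects.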
Edges between two Twitter users signify that one of the users mentioned or retweeted the other, so the areas of the graph that are very dense and dark likely correspond to people who were mentioned or retweeted many times. Toward the outskirts of the graph, on the other hand, there are a few nodes connected to each other by only a few edges. These Twitter users are connected because they mentioned or retweeted each other during the time the data was collected, but they are separated from the other clusters in the graph because they are not connected to any other nodes. This graph clearly depicts which Twitter users have larger networks (more dense clusters
around the nodes). Lastly, there are also nodes connected by only one or two ties (or, in some cases, none), indicating that they are neither mentioned by other users in the network nor mentioning other users in their own tweets.

Figure 16: March 12th, 2016 Wordcloud for Donald Trump

Figure 17: March 26th, 2016 Wordcloud for Ted Cruz

We next looked at each of the four candidates' networks separately and, using Gephi, derived network parameter values in order to better assess what is going on in the network in relation to each candidate. Table 1 below depicts the results of our analysis. There are some interesting results to point out. Cruz's average clustering coefficient is 0, while Trump's network is almost at zero with 0.001. Hence, it seems that Cruz's tweet mention network is very small, with very little clustering of users and most users not interconnected. In general, all of the candidates have very small clustering coefficients, with Sanders having the highest value at 0.005. This may be because the social network analyzed is a network of mention tweets, and it is unlikely that the candidates would reply to many of the tweets that mention them. Additionally, these tweets were collected in real time, so a candidate may have responded to a tweet in the network at a later time that was not captured in our dataset.

Figure 18: March 26th, 2016 Sentiment

Furthermore, Sanders' network has the highest average degree at 2.109, while Clinton's follows closely at 2.013. This implies that, on average, a node in Sanders' network is connected to 2.109 edges, meaning that users in Sanders' network are more likely to interact with and mention other nodes than users in the other candidates' networks.
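The network parameters discussed here (average degree, average clustering coefficient, network diameter, and average path length) were derived with Gephi; the snippet below is an equivalent sketch on a toy directed mention network using `networkx`, with the clustering coefficient computed on the undirected view of the graph (one common convention).

```python
import networkx as nx

# Toy directed mention network; in the paper these values come from Gephi.
g = nx.DiGraph([
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("d", "a"), ("e", "d"),
])

# Average degree: total (in + out) degree over all nodes, i.e. 2 * |E| / |N|.
avg_degree = sum(d for _, d in g.degree()) / g.number_of_nodes()

# Average clustering coefficient, computed here on the undirected view.
u = g.to_undirected()
avg_clustering = nx.average_clustering(u)

# Diameter and average path length are only defined on a connected graph, so
# compute them on the largest connected component of the undirected view.
largest = u.subgraph(max(nx.connected_components(u), key=len))
diameter = nx.diameter(largest)
avg_path_length = nx.average_shortest_path_length(largest)

print(avg_degree, avg_clustering, diameter, avg_path_length)
```

For this toy graph the average degree is 2.0 and the diameter is 3; the same quantities computed over each candidate's mention network give the values reported in Table 1.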
Sanders' network also has the largest diameter at 6, which suggests that he likely reaches a greater audience than the other candidates. Lastly, it is interesting to note that both Republican candidates have lower average path lengths than the Democratic candidates, meaning that nodes can be reached in fewer steps in the Republican candidates' networks.

Figure 19: Network Parameters for Each Candidate

7 Prediction Model

After exploring and analyzing the Twitter data, we next focused on building a prediction model. We created a panel dataset for all four candidates and looked at the primary election days as well as randomly chosen days from our Twitter dataset; in the end, the panel dataset covered a total of 70 days of Twitter data. We chose to look specifically at the states of New York, Indiana, and Nebraska. The next primaries will take
place in Nebraska on May 10th, and so we would like to evaluate our model's prediction results against the actual outcome of the Nebraska primaries. We coded all of our independent and dependent variables for these three states. New York was used as the training dataset for the prediction model; Trump and Clinton won this state. The testing dataset covered the states of Indiana and Pennsylvania, and the model fitted on the training dataset was used to calculate predicted values for who would win in Indiana and Pennsylvania. It is important to note that for this analysis it was necessary to look only at tweets that were geo-tagged and belonged to one of the three states used in the panel data. This undoubtedly decreased the total number of tweets available to analyze, as most of the tweets collected were not geo-tagged at all. However, we were still able to obtain thousands of tweets for most days for each candidate.

The dependent variable used in the analysis, electionresults, was equal to 0 if the candidate did not win the primary election (or the FiveThirtyEight polling average for that day) and equal to 1 otherwise. The researchers at FiveThirtyEight have collected, and continue to collect, national polls for the Republican and Democratic primaries, and they generate a polling average from all polls collected for each candidate. For the Democratic primary, a total of 671 polls have been collected thus far, and 681 polls have been collected for the Republican primary. This polling average is adjusted for pollster quality, sample size, and recency, and as a result it is a good indicator of public opinion regarding the primaries and the candidates. Furthermore, FiveThirtyEight offers daily polling averages from as early as July 10, 2015 up to the current day.
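A minimal sketch of how the electionresults variable could be coded, assuming hypothetical data structures for the primary winners and the FiveThirtyEight daily polling averages (the paper does not spell out its exact coding procedure):

```python
def code_election_result(candidate, day, primary_winners, poll_averages):
    """Return the 0/1 dependent variable for one candidate-day.

    primary_winners: {(state, date): {winning candidates}} for primary days.
    poll_averages:   {date: {candidate: polling average}} for all days.
    Both structures are hypothetical stand-ins for the paper's data."""
    state, date = day
    if (state, date) in primary_winners:
        # On a primary day, the candidate gets 1 if they won that primary.
        return int(candidate in primary_winners[(state, date)])
    # Otherwise fall back to that day's polling average: the leader gets 1.
    day_polls = poll_averages[date]
    leader = max(day_polls, key=day_polls.get)
    return int(candidate == leader)

primary_winners = {("NY", "2016-04-19"): {"Clinton", "Trump"}}
poll_averages = {"2016-04-10": {"Clinton": 49.1, "Sanders": 43.0}}

print(code_election_result("Clinton", ("NY", "2016-04-19"), primary_winners, poll_averages))
print(code_election_result("Sanders", ("NY", "2016-04-10"), primary_winners, poll_averages))
```

Primary-day outcomes take precedence; on all other days the binary label simply marks the candidate leading the adjusted polling average.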
Hence, it was fairly simple for us to collect the daily polling average and determine which candidate won the polls on each day in our dataset.

For the independent variables, we used the tweet volume for each candidate in each state, the average sentiment score (calculated from each candidate's tweet corpus for each day and each state), and, lastly, the network parameters described above. These parameters (average degree, average clustering coefficient, network diameter, and average path length) were not derived for each day and each specific state; they were derived from the entire Twitter dataset and were thus constant over all days. In addition to the independent variables, we also added control variables to the panel dataset: the population of each state and the average income in each state. Lastly, we used a lagged dependent variable as an independent variable because, in time series analysis, the poll results from the previous day are expected to predict the poll results for the current day, and we needed to account for this correlation.

8 Findings

In order to train and test our dataset, we used three different statistical methods: Logistic Regression, Random Forests, and Support Vector Machines. We wanted to see whether one of these three models performed better than the others. The regression equation for the logistic regression model is shown in the figure below.

Figure 20: Logistic Regression Equation

All three models performed well in terms of prediction results, which surprised us. As we mentioned earlier, Sanders is very popular on Twitter compared to Clinton, so we expected this to skew the results, but it does not appear to have done so.
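The three-model comparison can be sketched with scikit-learn on a synthetic stand-in for the panel dataset; the column names and data below are illustrative, not the authors', and the area under each ROC curve serves as the accuracy summary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 70  # one row per day in the panel, as in the paper

# Synthetic stand-in for the panel: random feature values, with the lagged
# dependent variable mirroring the outcome (as a strong predictor would).
y = np.tile([0, 1], n // 2)
panel = pd.DataFrame({
    "tweet_volume": rng.integers(100, 5000, n),
    "avg_sentiment": rng.normal(0.1, 0.05, n),
    "avg_degree": rng.normal(2.0, 0.1, n),
    "lagged_result": y,            # lagged dependent variable
    "electionresults": y,          # 0/1 outcome to predict
})

X = panel.drop(columns="electionresults")
target = panel["electionresults"]
# Stand-in for the New York (train) vs Indiana/Pennsylvania (test) split.
X_train, X_test = X.iloc[:50], X.iloc[50:]
y_train, y_test = target.iloc[:50], target.iloc[50:]

models = {
    "logit": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(probability=True, random_state=0),
}
aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # The area under the ROC curve summarizes each model's ranking quality.
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(aucs)
```

Comparing the three AUC values (or plotting the full ROC curves from the same predicted probabilities) reproduces the kind of evaluation shown in Figure 21.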
After training the models on the New York dataset, we tested them on the Indiana and Pennsylvania datasets. For both datasets, the models correctly predicted that Clinton and Trump would win Pennsylvania (which they did) and that Sanders and Trump would win Indiana (which they did). We used ROC curves (depicted below) to evaluate the predictive accuracy of our models, and the Random Forests and Support Vector Machine models appear to have performed better than the Logistic Regression model.

9 Conclusion

In this paper, we looked at Twitter data from the months of February, March, and April 2016 in order to predict election outcomes for the 2016 Presidential Primaries. We analyzed several variables to explore the Twitter data, including network parameters, the text sentiment of the tweets, and the tweet volume for each of the four candidates. In order to
Figure 21: ROC Curves for All Three Prediction Models

visualize our results, we built several static and interactive visualizations. The prediction models that we developed performed very well in predicting the election outcomes. However, we only tested our models on two states, and we would like to run further tests using other state primaries in order to assess the predictive accuracy of our models.

10 URLs

All D3 graphics used in this project are available for viewing online. The R Shiny application was too large to upload, but its source code can be viewed by clicking on the menu items at http://aboutmonica.com/final%20D3/

Republican Sentiments: http://aboutmonica.com/final%20D3/republican%20sentiments/

Democrat Sentiments: http://aboutmonica.com/final%20D3/democrat%20sentiments/

Volume of Tweets per Candidate: http://aboutmonica.com/final%20D3/candidate%20tweet%20volume%20prop%20D3/
