This document summarizes an analysis of tweets containing the hashtag "#prayforparis" following the November 2015 Paris attacks, conducted in R for text mining. It preprocessed 3,200 tweets by cleaning and stemming the text and building a term-document matrix, then identified frequent words and word associations. Tweets were clustered into 10 groups and sentiment analysis was performed, finding that most tweets expressed sadness and anger about the attacks and prayed for Paris, with the majority classified as positive in overall sentiment. Key text mining techniques such as preprocessing, matrix building, clustering, and sentiment analysis were implemented in R to analyze information from Twitter about this event.
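The pipeline summarized above (clean, tokenize, build a term-document matrix, count frequent words) can be sketched in base R. This is a hedged illustration: the original analysis used packages such as 'tm' and 'twitteR' on 3,200 real tweets, while the three tweets below are invented and the helper names are ours.

```r
# Toy tweets standing in for the scraped #prayforparis collection.
tweets <- c("Pray for Paris #prayforparis",
            "Praying for Paris tonight http://t.co/xyz",
            "#prayforparis we stand with Paris")

# Cleaning: lower-case, drop URLs, strip punctuation/digits, squeeze spaces.
clean <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+", "", x)
  x <- gsub("[^a-z# ]", " ", x)
  gsub("\\s+", " ", trimws(x))
}
cleaned <- vapply(tweets, clean, character(1), USE.NAMES = FALSE)

# Term-document matrix: rows are terms, columns are tweets.
tokens <- strsplit(cleaned, " ")
terms  <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(tk) tabulate(match(tk, terms), nbins = length(terms)))
rownames(tdm) <- terms

# Word frequencies across the whole collection.
freq <- sort(rowSums(tdm), decreasing = TRUE)
head(freq)
```

In the real analysis the same steps run through tm's Corpus, tm_map, and TermDocumentMatrix, with stemming applied before the matrix is built.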
In the age of social media communication, it is easy to manipulate the minds of users and, in some cases, instigate them to violent action. There is a need for a system that can analyze the threat level of tweets from influential users and rank their Twitter handles, so that dangerous tweets can be kept from going public before fact-checking; such tweets can hurt people's sentiments and escalate into violence. This study aims to analyse and rank Twitter users according to their influence and the extremism of their tweets, to help prevent major protests and violent events. We scraped top trending topics and fetched tweets using those hashtags. We propose a custom ranking algorithm that combines source-based and content-based features of tweets with a knowledge graph to generate a score and rank Twitter users accordingly. Our aim is to identify and rank high-impact extremist Twitter users with regard to their influence.
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY (cscpconf)
This research work discusses how an integrated open-source intelligence framework can help law enforcement and government entities investigating crimes, based on statistical and graph analysis of Twitter data. The solution supports both real-time and offline analysis of tweet collections. The framework employs tools with big data processing capabilities to collect, process, and analyze huge amounts of data. The solution supports content-based and textual analysis, helping investigators dig into a person and the community linked to that person based on a tweet. It supports an investigative process composed of the following phases: (i) find suspicious tweets and individuals based on hashtag analysis; (ii) classify user profiles based on Twitter features; (iii) identify influencers in the FOAF networks of the senders; (iv) analyze these influencers' backgrounds and histories to find hints of past or current criminal activity.
Fuzzy and ANN Based Mining Approach Testing for Social Network Analysis (IJERA Editor)
Fast and appropriate Social Network Analysis (SNA) tools and techniques are required to collect and classify opinion scores on social network sites, as a group forming around a wrong opinion may create problems for a society or country. SNA is a popular means for researchers, as the number of users and groups on social sites increases day by day, and a large group may influence others. In this paper, we recommend a hybrid model of opinion recommendation systems, for single users and for collective communities respectively, based on social liking and influence network theory. By collecting data on users' social networks and preferences (likes), we designed an improved hybrid prototype to imitate social influence through liking and sharing information among groups. The significance of this paper is to analyze the suitability of ANN and fuzzy-set methods, combined in a hybrid manner, for classifying social web sites. First, we use Artificial Neural Network (ANN) techniques for social media data classification, applying contemporary methods different from conventional statistics and data analysis; next, we apply the fuzzy approach as a way to overcome the uncertainty that is always present in social media analysis. We give a brief overview of the main ideas and recent results of social network analysis, and we point to relationships between the social network analysis and classification approaches. This research suggests a hybrid classification model built on fuzzy sets and artificial neural networks (HFANN). Information gain and three popular social sites are used to collect features, which are then used to train and test the proposed methods. This approach combines the advantages of ANN and fuzzy sets in classification accuracy while utilizing social data and the knowledge base available in hate lexicons.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Social media posts on platforms such as Twitter or Instagram use hashtags, which are author-created labels representing topics or themes, to assist in categorization of posts and searches for posts of interest. The structural analysis of hashtags is a necessary precursor to understanding their meanings. This paper describes our work on segmenting non-delimited strings of hashtag-type English text. We adapt and extend methods used mostly in non-Eng
News Reliability Evaluation using Latent Semantic Analysis (TELKOMNIKA JOURNAL)
The rapid rise and widespread circulation of 'fake news' has severe implications for society today. Much effort has been directed in recent years toward developing methods to verify news reliability on the Internet. In this paper, an automated news reliability evaluation system is proposed. The system utilizes several Natural Language Processing (NLP) techniques, such as Term Frequency-Inverse Document Frequency (TF-IDF), phrase detection, and cosine similarity, in tandem with Latent Semantic Analysis (LSA). A collection of 9,203 labelled articles from both reliable and unreliable sources was gathered, and a random train-test split was applied to create the training and testing datasets. The final results show 81.87% precision and 86.95% recall, with an accuracy of 73.33%.
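A minimal base-R sketch of the TF-IDF and cosine-similarity components named above; the paper's code is not published, so the toy documents and the unsmoothed IDF variant here are our own assumptions.

```r
# Three toy 'articles' as token vectors (invented for illustration).
docs <- list(c("election", "fraud", "claim"),
             c("election", "result", "official"),
             c("fraud", "claim", "viral"))
terms <- sort(unique(unlist(docs)))

# Term frequencies (terms x documents) and inverse document frequency.
tf  <- sapply(docs, function(d) tabulate(match(d, terms), nbins = length(terms)))
idf <- log(length(docs) / rowSums(tf > 0))
tfidf <- tf * idf

# Cosine similarity between two TF-IDF vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(tfidf[, 1], tfidf[, 3])   # docs 1 and 3 share 'fraud' and 'claim'
```

In the actual system these vectors would additionally pass through phrase detection and an LSA (truncated SVD) projection before being compared.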
Prediction of Reaction towards Textual Posts in Social Networks (Mohamed El-Geish)
Posting on social networks can be a gratifying or a terrifying experience, depending on the reaction that the post, and by association its author, receive from readers. To better understand what makes a post popular, this project inquires into the factors that determine the number of likes, comments, and shares a textual post gets on LinkedIn, and finds a predictor function that can estimate those quantitative social gestures.
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection (IJERA Editor)
This paper involves two approaches to finding trending topics in social networks: a keyword-based approach and a link-based approach. Conventional keyword-based approaches to topic detection focus mainly on the frequencies of (textual) words. We propose a link-based approach that focuses on the mentioning behaviour reflected in the posts of hundreds of users. Anomaly detection on the Twitter dataset is carried out by retrieving trending topics sequentially through an API, along with the corresponding users for training; the computed anomaly scores are then aggregated across users. The aggregated anomaly score is fed into change-point analysis or burst detection in order to pinpoint emerging topics. We used a real-time Twitter account, so results vary according to current tweet trends. The experiments show that the proposed link-based approach performs even better than the keyword-based approach.
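The aggregate-then-detect idea can be illustrated with a deliberately simplified base-R sketch. This is our own toy stand-in, not the paper's algorithm: the 'anomaly scores' are simulated, and the change-point rule is the naive split that maximises the gap in segment means.

```r
set.seed(1)
# Simulated aggregated anomaly scores over 30 time steps:
# ordinary mentioning behaviour, then a burst around an emerging topic.
scores <- c(rnorm(20, mean = 0.2, sd = 0.05),
            rnorm(10, mean = 0.8, sd = 0.05))

# Naive change-point: the split maximising the between-segment mean gap.
gap <- sapply(2:(length(scores) - 1), function(k)
  abs(mean(scores[1:k]) - mean(scores[(k + 1):length(scores)])))
changepoint <- which.max(gap) + 1
changepoint   # falls at the onset of the simulated burst
```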
The growth of social media over the last decade has revolutionized the way individuals interact and industries conduct business. Individuals produce data at an unprecedented rate by interacting, sharing, and consuming content through social media. Understanding and processing this new type of data to glean actionable patterns presents challenges and opportunities for interdisciplinary research, novel algorithms, and tool development. Social Media Mining integrates social media, social network analysis, and data mining to provide a convenient and coherent platform for students, practitioners, researchers, and project managers to understand the basics and potentials of social media mining. It introduces the unique problems arising from social media data and presents fundamental concepts, emerging issues, and effective algorithms for network analysis and data mining. Suitable for use in advanced undergraduate and beginning graduate courses as well as professional short courses, the text contains exercises of different degrees of difficulty that improve understanding and help apply concepts, principles, and methods in various scenarios of social media mining.
Details at: http://dmml.asu.edu/smm/
Highlighted notes on Deeper Inside PageRank.
While doing research work under Prof. Kishore Kothapalli.
This is a really "deep" review of PageRank! It should be a good read for a PhD student planning to work on PageRank optimizations.
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Adeleye
I wrote this brief lecture with the aim of showing the reader the simplicity of using R and its packages (such as 'twitteR') to perform powerful data mining exercises and analyses, as in this text mining example.
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs (Wagner Andreas)
Many databases today are text-rich, comprising not only structured but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing, as well as other tasks that can be solved through such queries, can be optimized. Existing work on selectivity estimation focuses either on string or on structured query predicates alone. Further, probabilistic models proposed to incorporate dependencies between predicates are focused on the relational setting. In this work, we propose a template-based probabilistic model, which enables selectivity estimation for general graph-structured data. Our probabilistic model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimates can be obtained for queries over text-rich graph-structured data, which may contain structured and string predicates (hybrid queries). In our experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising efficiency.
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint), by Frank Oellien
Presentation given at the 6th and last meeting of the European Commission "Licenses for Europe" Text and Data Mining Working Group (WG4).
The first part of the talk gives a very brief introduction to some basic concepts of text mining techniques used in the pharmaceutical industry, using the Accelrys PP text mining collection.
The second part of the talk focuses on the existing limitations that pharmaceutical companies face in the field of text mining.
http://ec.europa.eu/licences-for-europe-dialogue/en/content/text-and-data-mining-working-group-wg4
These slides explain the basic meaning of text mining, its comparison with other data retrieval methods, its subtasks and applications, its limitations, and the present and future of text mining. Also included is the topic of data mining, with its goals and applications.
Slides for the class, From Pattern Matching to Knowledge Discovery Using Text Mining and Visualization Techniques, presented June 13, 2010, at the Special Libraries Association 2010 annual meeting.
With the rise of the social networking epoch, there has been a surge of user-generated content. Microblogging sites have millions of people sharing their thoughts daily because of their characteristically short and simple manner of expression. We propose and investigate a paradigm for mining sentiment from a popular real-time microblogging service, Twitter, where users post real-time reactions to and opinions about "everything". In this paper, we expound a hybrid approach using both corpus-based and dictionary-based methods to determine the semantic orientation of the opinion words in tweets. A case study is presented to illustrate the use and effectiveness of the proposed system.
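The dictionary-based half of such a hybrid can be sketched in a few lines of base R; the lexicon entries and tweets here are invented, and a corpus-based component would extend the lexicon with words that co-occur with these seeds.

```r
# Tiny seed lexicons (illustrative only).
pos <- c("great", "love", "beautiful")
neg <- c("stupid", "evil", "waste")

# Semantic orientation of a tweet: positive minus negative word counts.
orientation <- function(tweet) {
  words <- strsplit(tolower(tweet), "[^a-z]+")[[1]]
  score <- sum(words %in% pos) - sum(words %in% neg)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

orientation("What a great and beautiful day")   # "positive"
orientation("Such a waste of time")             # "negative"
```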
BUS 625 Week 4 Response to Discussion 2 - Guided Response Your.docx (jasoninnes20)
BUS 625 Week 4 Response to Discussion 2
Guided Response: Your initial response should be a minimum of 300 words in length. Respond to at least two of your classmates by commenting on their posts. Though two replies are the basic expectation for class discussions, for deeper engagement and learning, you are encouraged to provide responses to any comments or questions others have given to you.
Below are two of my classmates' discussions that I need to respond to; their names are Umadevi Sayana and Britney Graves.
Umadevi Sayana
Tuesday, Mar 17 at 7:50am
Twitter mining analyzes Twitter messages to predict, discover, or investigate causation. It includes text mining designed specifically to leverage the content and context of tweets. Using text mining, Twitter analysis can draw on additional information associated with tweets, such as hashtags, names, and other related characteristics, and can employ information such as counts of tweets, likes, retweets, and favorites to understand the conversation better. Text mining of Twitter has been successful in capturing and reflecting events covered in both conventional and social media. By 2013 there were over 500 million Twitter messages per day, which made it impossible for any human to analyze them and made it important to develop computer-based algorithms, including data mining. Twitter text mining is used to analyze the sentiment associated with messages, based on whether keywords carry a negative, positive, or neutral sentiment (Sunmoo, Noémie & Suzanne, n.d.). Positive words such as great, beautiful, and love, and negative words such as stupid, evil, and waste, regularly appear in sentiment lexicons. Using text mining, sentiment can also be captured from abbreviations, emoticons, and repeated characters and symbols.
Sentiments on topics such as economics, politics, and security are usually negative, and sentiments related to sports are harmful. Twitter data is also collected and analyzed with topic modeling techniques over time. To pull data from Twitter, the twitteR package is used. "Someone well versed in database architecture and data storage is needed to extract the relevant information in different databases and to merge them into a form that is useful for analysis" (Sharpe, De Veaux & Velleman, 2019, p. 753). The package provides an interface to the Twitter web API; retweetedby/ids is also used, combined with the RCurl package, to find out how often tweets were retweeted. Text mining is also used on Twitter to clean the text by removing hyperlinks, numbers, stop words, and punctuation, followed by stem completion. Text mining is also implemented for social network analysis.
Web mining focuses on data knowledge discovery ...
A Baseline Based Deep Learning Approach of Live Tweets (ijtsrd)
In this scenario, social media plays a vital role in influencing people's lives. Twitter, Facebook, Instagram, etc. are the major social media platforms. They act as a platform for users to raise their opinions on things and events around them. Twitter is one such microblogging site; it allows the user to post 6,000 tweets per day, each up to 280 characters long. Data analysts rely on this data to reach conclusions about events happening around them and also to rate products. But due to the massive volume of reviews, analysts find it difficult to go through them all and reach conclusions. To solve this problem, we adopt the method of sentiment analysis. Sentiment analysis is an approach to classifying the sentiment of user reviews, documents, etc. as positive (good), negative (bad), or neutral (surprise). I suggest an enhanced Twitter sentiment analysis that retrieves data based on a baseline in a particular predefined time span and performs sentiment analysis using TextBlob. This scheme differs from the traditional and existing ones, which perform sentiment analysis on pre-saved data, by performing sentiment analysis on real-time data fetched via the Twitter API, thereby providing a much more recent and relevant conclusion. Anjana Jimmington, "A Baseline Based Deep Learning Approach of Live Tweets", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019. URL: https://www.ijtsrd.com/papers/ijtsrd23918.pdf
Paper URL: https://www.ijtsrd.com/computer-science/other/23918/a-baseline-based-deep-learning-approach-of-live-tweets/anjana-jimmington
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU... (csandit)
Predicting the behavior of shares in the stock market is a complex problem that involves variables that are not always known and can be subject to various influences, from collective emotion to high-profile news. Such volatility can represent considerable financial losses for investors. In order to anticipate such changes in the market, various mechanisms have been proposed to try to predict the behavior of an asset in the stock market based on previously existing information.
Such mechanisms use statistical data only, without considering collective feeling. This article uses natural language processing (NLP) algorithms to determine the collective mood regarding assets, and later the SVM algorithm to extract patterns in an attempt to predict asset behavior. Nevertheless, it is important to note that this approach is not intended to be the main factor in the decision-making process, but rather an aid which, combined with other information, can provide higher accuracy for the solution of this problem.
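As a toy illustration of the overall idea (entirely synthetic data, and base R's glm() standing in for the paper's SVM to keep the sketch dependency-free): derive a daily collective-mood score, pair it with the previous day's return, and fit a classifier for next-day direction.

```r
set.seed(7)
n <- 200
mood     <- runif(n, -1, 1)     # NLP-derived collective mood (simulated)
prev_ret <- rnorm(n, 0, 0.01)   # previous day's return (simulated)
# Synthetic label: next-day direction driven mostly by mood, plus noise.
up <- as.integer(0.8 * mood + rnorm(n, 0, 0.3) > 0)

# Logistic-regression stand-in for the SVM pattern extractor.
fit <- glm(up ~ mood + prev_ret, family = binomial)
mean((predict(fit, type = "response") > 0.5) == up)   # in-sample accuracy
```

A real replication would replace the simulated mood with scores mined from news or social media text and use an SVM (e.g. via the e1071 package).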
Text mining on Twitter information based on R platform
Qiaoyang ZHANG∗
Computer and Information Science System, Macau University of Science and Technology
3269046927@qq.com

Fayan TAO†
Computer and Information Science System, Macau University of Science and Technology
fytao2015@gmail.com

Junyi LU‡
Computer and Information Science System, Macau University of Science and Technology
448673862@qq.com
ABSTRACT
Twitter is one of the most popular social networks and plays a vital role in this new era. Exploring information diffusion on Twitter is attractive and useful.
In this report, we apply R to text mining and analysis of the topic "#prayforparis" on Twitter. We first perform data preprocessing, such as data cleaning and stemming words. Then we show word frequencies and associations in the tweets. We find that the word "prayforparis" has the highest frequency, and that most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show layouts of the whole tweet set and of some extracted tweets. Additionally, we cluster the tweets into 10 groups to see the connections between different topics. Since tweets indicate users' attitudes and emotions well, we further perform sentiment analysis. We find that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris; nevertheless, the majority of tweets were classified as positive in polarity, mainly reflecting hope for Paris.
Keywords
text mining; Twitter; R; ”#prayforparis”; sentiment analysis
1. INTRODUCTION AND MOTIVATION
As data mining and big data become hot research topics in this new era, much more is demanded of data-analysis techniques as well. It is difficult to store and analyze large data sets using traditional database methodologies, so we employ the powerful statistics platform R for big data mining and analysis, since R provides many kinds of statistical models and data-analysis methods, such as classic statistical tests, time-series analysis, classification and clustering.
∗We rank the authors' names in inverse alphabetical order of the first letter of the authors' last names.
Stu ID:1509853G-II20-0033
†Stu ID:1509853F-II20-0019
‡Stu ID:1509853G-II20-0061
ACM ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235
In this project, we analyze a large social-network data set, focusing mainly on Twitter users and their expressions about the latest news. The analysis is carried out to discover characteristics of those tweets. By analyzing a large amount of social-network data, we can gain better knowledge of users' preferences and habits, which is helpful for anyone interested in such data. For example, business firms and companies can provide better services after analyzing similar social-network data. That is why we chose this topic.
2. RELATED WORKS
2.1 Sentiment analysis on Twitter and Weibo 1
User-level sentiment evolution can be analyzed on Weibo. ZHANG Lumin, JIA Yan et al. [16] first proposed a multidimensional sentiment model with a hierarchical structure to analyze users' complicated sentiments.
Michael Mathioudakis and Nick Koudas [11] presented "TwitterMonitor", a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. "trends") on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and by submitting their own descriptions for each trend.
Twitter, in particular, is currently the major microblogging service, with more than 50 million subscribers. Twitter users generate short text messages, the so-called "tweets", to report their current thoughts and actions, comment on breaking news and engage in discussions. [11]
Agarwal, Xie, Vovsha, Rambow and Passonneau [1] introduced a model based on a tree kernel to analyze the POS-specific prior-polarity features of Twitter data, using the Partial Tree (PT) kernel, first proposed by Moschitti (2006), to calculate the similarity between two trees (an example is shown in figure 1). They divided the sentiment in tweets into three categories: positive, negative and neutral. They marked the sentiment expressed by emoticons with an emoticon dictionary and translated acronyms (e.g. gr8, gr8t = great; lol = laughing out loud) with an acronym dictionary; these dictionaries map emoticons or acronyms to their polarity. They also used an English stop-word list taken from WordNet to identify stop words, and a sentiment dictionary containing many positive, negative and neutral words to map the words in tweets to their polarity.
1 This part of the related works is provided by Qiaoyang Zhang.
Figure 1: A tree kernel for a synthesized tweet: "@Fernando this isn't a great day for playing the HARP! :)"
The accuracy of their model is 4.02% higher than that of the unigram model, and its standard deviation is 0.52% lower than that of the unigram model.
2.2 Studies of information diffusion on Twitter 2
A number of recent papers have explored information diffusion on Twitter, which is one of the most popular social networks.
In 2011, Shaomei Wu et al. [14] focused on the production, flow and consumption of information in the context of Twitter. They exploited Twitter "lists" to distinguish elite users (celebrities, media, organizations, bloggers) from ordinary users, and they found strong homophily within categories, which means that each category mainly follows itself. They also re-examined the classical "two-step flow" theory of communications [10], finding considerable support for it on Twitter. Additionally, the lifespans of various URLs were examined under the different categories. Finally, they examined the attention paid by the different user categories to different news topics.
This paper sheds clear light on how media information is transmitted on Twitter. The approach of defining a limited set of predetermined user categories could be extended to automatic classification schemes. However, the authors focus on only one narrow cross-section of media information (URLs); it would be better if their methods were applied to other channels (TV, radio). Another weakness of the paper is that it does not connect information flow on Twitter with other sources of outcome data (e.g. users' opinions and actions).
Daniel Ramage et al. [13] studied search behavior on Twitter, especially the information that users prefer to search for. They also compared Twitter search with web search in terms of users' queries. They found that Twitter results contain more social events and content, while web results include more facts and navigation.
Eytan Bakshy et al. [3] built a regression model to analyze Twitter data. They explored word-of-mouth marketing to study users' influence on Twitter, not only through communication but also through URLs. They found that the largest cascades tend to be generated by users who have been influential in the past and who have a large number of followers. They also found that URLs that were rated more interesting and/or elicited more positive feelings from workers on Mechanical Turk were more likely to spread.
2 This part of the related works is provided by Fayan Tao.
All three papers mentioned above focus on large numbers of tweets and employ different methods to analyze various characteristics of tweets from different aspects. However, they are all limited to Twitter data rather than extending to other social networks.
2.3 Semantic Analysis and Text Mining 3
Much research has been done to gain a better understanding of people's characteristics in a specific field by analyzing the semantics of social-network content. This has many applications, especially for business-marketing purposes.
Topic mining and sentiment analysis have been applied to followers' comments on a company's Facebook fan page: the authors obtained the most frequent terms in each domain (TF, TF-IDF, three sentiments) and the sentiment distributions throughout one year, together with their relation to "Likes" [5]. This can help marketing staff keep track of the sentiment trend, as well as the dominant sentiment, so as to adjust their marketing techniques. A Support Vector Machine (SVM) classification model is used in their analysis. Before classification, word segmentation and feature extraction are performed; feature extraction is based on a semantic dictionary and some additional rules. They found that the sentiment distribution of the comments can be a contributing factor to the distribution of "Likes".
Hsin-Ying Wu et al. [15] presented a method of analyzing Facebook posts that serves as a marketing tool to help young entrepreneurs identify existing competitors in the market, as well as their success factors and features, during the decision-making process. The overall mining process consists of three stages:
1 Extracting Facebook posts;
2 Preprocessing the text data;
3 Filtering and extracting key phrases and terms.
In detail, they segmented the original comments into words based on lexicons and on morphological rules for quantifier words and reduplicated words. The words and phrases are extracted from text files and transformed into a key-phrase matrix based on frequencies. Next, a k-means clustering algorithm based on the phrase-frequency matrix and phrase similarity is used to identify the most important phrases (i.e. the features and success factors of each shop). Various tools were utilized in their study: CKIP for Chinese word segmentation, PERL for extracting the text files and WEKA for key-phrase clustering.
3 This part of the related works is provided by Junyi Lu.
Social-network mining has also been applied in the educational field. Chen et al. [4] conducted initial research on mining tweets to understand students' learning experiences. They first used Radian6, a commercial social-monitoring tool, to acquire students' posts carrying the hashtag #engineeringProblems, collecting 19,799 unique tweets. Due to the ambiguity and complexity of natural language, they conducted an inductive content analysis and categorized the tweets into five prominent themes and one group called "others". The main hashtag, non-letter symbols, repeated letters and stop words were removed in the preprocessing stage. A multi-label naive Bayes classifier was used because a tweet can reflect several problems. They then obtained another data set, using the geocode of Purdue University with a radius of 1.3 miles, to demonstrate the effectiveness of the classifier and to try to detect the students' problems. They also demonstrated that the multi-label naive Bayes classifier performs better than other state-of-the-art classifiers (SVM and M3L) according to four metrics (accuracy, precision, recall, F1). However, there is a main defect in their method, since they assume the categories are independent when they transform the problem into single-label classification problems.
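The decomposition criticized here can be made concrete. The sketch below (Python, toy data; a deliberately simplified stand-in for a real naive Bayes, not Chen et al.'s implementation) trains one independent yes/no classifier per category; the independence assumption enters exactly because each label's decision ignores the others.

```python
# Sketch of "transform into single-label problems" (binary relevance):
# one independent per-label classifier, built from word counts.
from collections import Counter

def train_binary(docs, labels, target):
    """Word counts for documents with vs. without one target label."""
    pos = Counter(w for d, ls in zip(docs, labels) if target in ls for w in d.split())
    neg = Counter(w for d, ls in zip(docs, labels) if target not in ls for w in d.split())
    return pos, neg

def predict(doc, pos, neg):
    """Crude stand-in for a naive Bayes score: does the doc lean toward the label?"""
    score = sum(pos[w] - neg[w] for w in doc.split())
    return score > 0

docs = ["heavy workload tonight", "no sleep heavy workload", "great game"]
labels = [{"workload"}, {"workload", "sleep"}, set()]

# One classifier per label; the "workload" and "sleep" decisions are
# made separately, ignoring any correlation between the two labels.
pos, neg = train_binary(docs, labels, "workload")
print(predict("heavy workload again", pos, neg))   # True
```

Modeling label correlations jointly (as M3L attempts) is one way to avoid this independence assumption.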
Most text-mining processes are much the same. Generally, text preprocessing (removal of stop words, punctuation and strange symbols or characters; segmentation) is conducted at the beginning. (Some studies, such as sentiment analysis, also need part-of-speech tagging.) Then a term-frequency matrix is built from the data set to calculate the term frequencies. Finally, classification and clustering are most commonly used to analyze the data and generate knowledge.
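This generic pipeline can be sketched in a few lines. The following is an illustrative Python sketch with toy data (the studies surveyed above use R, WEKA and other tools); the stop-word list and tweets are assumptions for the example only.

```python
# Minimal sketch of the generic pipeline: preprocessing, a
# term-frequency matrix, then a simple frequency-based analysis.
import re
from collections import Counter

STOPWORDS = {"the", "for", "a", "is"}

def preprocess(doc):
    doc = re.sub(r"http\S+", "", doc.lower())   # strip URLs
    doc = re.sub(r"[^a-z\s]", "", doc)          # keep letters and spaces only
    return [w for w in doc.split() if w not in STOPWORDS]

docs = ["Pray for Paris! http://t.co/x", "The attack in Paris", "Pray, pray."]
tokens = [preprocess(d) for d in docs]

# term-frequency "matrix" as {term: [count per document]}
vocab = sorted({w for t in tokens for w in t})
tf = {w: [t.count(w) for t in tokens] for w in vocab}

# analysis step: overall most frequent terms
total = Counter(w for t in tokens for w in t)
print(total.most_common(2))   # [('pray', 3), ('paris', 2)]
```

The clustering or classification stage would then operate on the rows of `tf`.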
3. TEXT MINING UNDER R PLATFORM
3.1 About R
R[18] is a language and environment for statistical com-
puting and graphics. It is a GNU project which is similar to
the S language and environment which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by
John Chambers and colleagues. R can be considered as a
different implementation of S.
R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering) and graphical techniques, and is
highly extensible. R also provides an Open Source route to
participation in statistical research. R is available as Free
Software under the terms of the Free Software Foundation’s
GNU General Public License in source code form. It com-
piles and runs on a wide variety of UNIX platforms and
similar systems (including FreeBSD and Linux), Windows
and MacOS.
3.2 The idea
Text mining[2][17] is the discovery of interesting knowl-
edge in text documents. It is a challenging issue to find
the accurate knowledge from unstructured text documents
to help users to find what they want. It can be defied as
the art of extracting data from large amount of texts. It
allows to structure and categorize the text contents which
are initially non-organized and heterogeneous. Text mining
is an important data mining technique which includes the
most successful technique to extract the effective patterns.
This report presents examples of text mining with R. Twitter text ("prayforparis") is used as the data to analyze. The process starts with extracting text from Twitter. The extracted text is then transformed to build a term-document matrix. After that, frequent words and associations are found in the matrix. Next, words and tweets are clustered to find groups of words and topics of tweets. Finally, a sentiment analysis of the tweets is performed, and a word cloud is used to present the important words in the documents.
In this report, "tweet" and "document" are used interchangeably, as are "word" and "term". Three important packages are used in the examples: twitteR, tm and wordcloud. Package twitteR [8] provides access to Twitter data, tm [6] provides functions for text mining, and wordcloud [7] visualizes the results with a word cloud.
4. IMPLEMENTATIONS 4
4.1 Data Preprocessing
We first mined 3200 tweets from Twitter by searching for the main topic "prayforparis" over the period from 13 November 2015 to 13 December 2015. Then we performed some data preprocessing.
4.1.1 Data Cleaning
The tweets are first converted to a data frame and then to a corpus, which is a collection of text documents. After that, the corpus needs a couple of transformations, including changing letters to lower case, adding "pray" and "for" as extra stop words, and removing URLs, punctuation, numbers, extra whitespace and stop words.
Next, we keep a copy of the corpus for later use as a dictionary for stem completion.
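The same sequence of transformations can be sketched compactly. The report's actual code uses R's tm package (see the appendix); the Python sketch below, with toy tweets and a toy stop-word list, only illustrates the order of the cleaning steps.

```python
# Sketch of the cleaning transformations: lower-case, URL removal,
# punctuation/number removal, stop-word removal (with "pray" and
# "for" added as extra stops), whitespace normalization.
import re

STOPWORDS = {"the", "in", "a"} | {"pray", "for"}   # extra stop words added

def clean(tweet):
    t = tweet.lower()
    t = re.sub(r"http\S+", " ", t)                 # remove URLs
    t = re.sub(r"[^a-z\s]", " ", t)                # remove punctuation and numbers
    return " ".join(w for w in t.split() if w not in STOPWORDS)

corpus = ["Pray for Paris!! http://t.co/abc", "RT: 129 victims in Paris"]
cleaned = [clean(t) for t in corpus]
copy_for_completion = list(cleaned)                # kept aside, as in this section
print(cleaned)                                     # ['paris', 'rt victims paris']
```

Keeping the pre-stemming copy aside mirrors the dictionary needed later for stem completion.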
4.1.2 Stemming Words
Stemming [19] is the term used in linguistic morphology and information retrieval for the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. A stemmer for English, for example, should identify the strings "stems", "stemmer", "stemming" and "stemmed" as based on "stem". Stem completion then makes the stemmed words look like normal words again; this can be achieved with the function "stemCompletion()" in R.
In the following steps, we use "stemCompletion()" to complete the stems, with the unstemmed corpus "myCorpusCopy" as a dictionary. With the default setting, it takes the most frequent match in the dictionary as the completion.
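The stem-then-complete idea can be illustrated as follows. This Python sketch uses a deliberately tiny suffix stripper (not the Porter stemmer that tm's stemDocument applies) and mimics stemCompletion()'s default of choosing the most frequent matching dictionary word.

```python
# Sketch: toy stemmer plus completion by most frequent dictionary match.
from collections import Counter

def stem(word):
    for suf in ("ing", "ed", "er", "s"):           # toy rules, not Porter
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            word = word[: -len(suf)]
            break
    if len(word) > 3 and word[-1] == word[-2]:     # undo doubled consonant
        word = word[:-1]
    return word

def complete(stem_, dictionary_counts):
    """Most frequent dictionary word starting with the stem."""
    matches = [(n, w) for w, n in dictionary_counts.items() if w.startswith(stem_)]
    return max(matches)[1] if matches else stem_

dictionary = Counter(["stemming", "stemming", "stemmer", "paris"])
s = stem("stemmed")                 # -> "stem"
print(complete(s, dictionary))      # "stemming" (most frequent match)
```

In the report the dictionary is the unstemmed corpus copy kept in §4.1.1.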
4.1.3 Building a Term-Document Matrix
A term-document matrix describes the relationship between terms and documents: each row stands for a term, each column stands for a document, and each entry is the number of occurrences of the term in the document.
4 All of our implementation code is attached at the end of this report.
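A small sketch of this definition, with illustrative toy documents (Python; the report itself builds the matrix with tm's TermDocumentMatrix() in R):

```python
# Rows are terms, columns are documents, entries count occurrences.
docs = ["pray paris", "paris attack paris", "pray pray hope"]
tokens = [d.split() for d in docs]
terms = sorted({w for t in tokens for w in t})

# one row per term, one column per document
tdm = [[t.count(term) for t in tokens] for term in terms]

for term, row in zip(terms, tdm):
    print(term, row)        # e.g. paris [1, 2, 0]
```

Swapping the two loop variables would give the document-term matrix mentioned below.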
terms: 3621, documents: 3200
Non-/sparse entries: 27543/11559657
Sparsity: 100%
Maximal term length: 38
Weighting: term frequency (tf)
Table 1: TermDocumentMatrix
Figure 2: layout of whole tweets
Alternatively, one can build a document-term matrix by swapping rows and columns. In this report, we build a term-document matrix from the processed corpus above with the function "TermDocumentMatrix()".
As table 1 shows, there are in total 3621 terms and 3200 documents in the term-document matrix. We can see that it is very sparse, with nearly 100% of the entries being zero, which means that most terms are not contained in most documents.
We can also see the layout of the whole tweet set in figure 2; the tweets are mainly located in two parts. Because of the large amount of data, we cannot make out the individual words, so we select some terms from the full data and show their distributions in figures 3 and 4. We can see that most of the terms are connected within a bounded zone, which means that they are more or less associated.
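The sparsity figure reported in table 1 can be checked directly from the entry counts:

```python
# Checking table 1: 27543 non-zero entries in a 3621 x 3200 matrix.
terms, docs = 3621, 3200
nonzero, zero = 27543, 11559657
assert terms * docs == nonzero + zero    # 11,587,200 entries in total

sparsity = zero / (terms * docs)
print(f"{sparsity:.2%}")                 # ~99.76%, which tm rounds to 100%
```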
5. FREQUENT TERMS AND ASSOCIATIONS
Based on the above data processing, we now show the frequent words. Note that there are 3200 tweets in total. We first choose the words that appear more than 100 times; the results are shown in table 2. We can see, for example, that the counts of "parisattack", "pour" and "victim" are all above 100, which means they have high frequency within the topic "prayforparis".
Next, we show the counts of all of the words that appear at least 100 times; the result is shown in figure 5. Since figure 5 contains so many terms that the count of each term cannot be made out, we select 70 terms and show their counts in figure 6. As figure 6 shows, it is not surprising that the count of "prayforparis" is the highest, at more than 3000. The second is "pari", with "parisattack" following. This result indicates that most people care about the Paris attack and pray for Paris.
Figure 3: layout-1 of some parts selected from the whole tweets
Figure 4: layout-2 of some parts selected from the whole tweets
Figure 5: All words that appear at least 100 times
Figure 6: Selected words that appear at least 100 times

à, attentat, aux, ça, de, déjà, et, everyon, fait, franc, go, il, jamai, jour, la, les, louistomlinson, moi, ne, noubliera, novembr, pari, parisattack, pas, pensé, pour, prayforpari, que, rt, simoncr, thought, un, victim, vous, y, ytbclara
Table 2: the words that appear more than 100 times

lose 0.56, over 0.56, papajackadvic 0.56, struggl 0.56, trust 0.56, worri 0.56,
prayfor 0.40, think 0.40, hope 0.32, simoncowel 0.29, scare 0.28, stay 0.25
Table 3: words associated with "pray" with correlation no less than 0.25
To find associations among words, we take "pray" as an example and look for the words associated with "pray" with correlation no less than 0.25.
From table 3, we can see that 12 terms, including "lose", "struggl", "trust" and "hope", are connected with "pray". Six terms, such as "lose", "papajackadvic" and "trust", are associated with "pray" with a correlation of 0.56, while "prayfor" and "hope" have correlations of 0.40 and 0.32 with "pray", respectively.
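The "correlation" reported here is, as in tm's findAssocs(), the Pearson correlation between two terms' occurrence vectors across documents. The sketch below computes it for two toy vectors (illustrative numbers, not the real data):

```python
# Pearson correlation between two term-occurrence vectors.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# occurrence counts of two terms over six documents (toy data)
pray = [1, 0, 2, 1, 0, 0]
hope = [1, 0, 1, 1, 0, 0]
r = pearson(pray, hope)
print(round(r, 2))    # 0.89: the terms tend to appear in the same tweets
```

Terms that co-occur in the same documents score close to 1; unrelated terms score near 0.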
6. CLUSTERING WORDS
We then try to find clusters of words with hierarchical clustering. Sparse terms are removed first, so that the plot of the clustering is not crowded with words, and we cut the related data into 10 clusters. The agglomeration method is set to Ward's method, which at each step merges the two clusters whose merger gives the smallest increase in total within-cluster variance.
In figure 7, we can see different topics related to "prayforparis" in the tweets. The words "les", "parisattack", "fait" and some other words are clustered into one group, because there are a couple of tweets on the Paris attack. Another group contains "everyone" and "thought", because everyone is focused on this event. We can also see that "moi", "déjà" and "prayforpari" are in a single group, which means they have few relationships with the other terms.
Figure 7: cluster (10 groups)
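The agglomerative procedure behind the dendrogram can be sketched as follows. This Python sketch uses single linkage for brevity rather than the Ward criterion used in the actual R code (hclust with method "ward.D2" and cutree, see the appendix), and the term vectors are toy data.

```python
# Naive agglomerative clustering: repeatedly merge the two closest
# clusters until k clusters remain (single linkage, for brevity).
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(points, k):
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

# toy term-occurrence vectors (rows of a small term-document matrix)
vectors = [[3, 0, 0], [2, 1, 0], [0, 4, 4], [0, 3, 5], [9, 9, 9]]
print(agglomerate(vectors, 3))           # [[0, 1], [2, 3], [4]]
```

Cutting a dendrogram at k groups, as cutree(fit, k = 10) does, corresponds to stopping the merging loop when k clusters remain.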
7. EXPERIMENTS ABOUT SENTIMENTS
Figure 8: Emotion categories of #prayforparis
Figure 9: Classification by polarity of #prayforparis
Figure 10: A wordcloud of #prayforparis
Stage 1: literature survey; determine the project topic
Stage 2: R programming and text-mining learning
Stage 3: implementations
Stage 4: presentation and final report
Qiaoyang Zhang: mainly read references [1], [9], [11], [12] and [16]; sentiment-analysis implementation.
Fayan Tao: mainly read [2], [3], [10], [13] and [14]; data preprocessing and data analysis.
Junyi Lu: mainly read [4], [5] and [15]; analyzed data associations and clustered words.
Remark: all of us read [6], [7], [8], [17], [18] and [19].
Table 4: Timetable and working plan
We also conducted an experiment on sentiment in R, with the method mentioned in the related works. We loaded a package named "sentiment" in R and analyzed the sentiment of tweets carrying the hashtag "#prayforparis" on Twitter. We used the "sentiment" package to mine more than 6800 tweets and built a corpus [12] in R, mainly analyzing the related parts of speech, frequencies and correlations of the words. Figure 8 shows the emotion categories of "#prayforparis" obtained with an emotion dictionary. In this figure, we can see that nearly 1000 people felt sad or angry about the terrorist attacks in Paris (angry about the terrorist attack by ISIS), and a small number of people felt afraid or surprised.
In figure 9, we can see that nearly 5000 people used positive words and more than 1500 people used negative words in their tweets. In addition, fewer than 500 people used words with no polarity under the hashtag "#prayforparis".
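The polarity classification behind figure 9 is dictionary based. The sketch below illustrates the idea in Python (the actual experiment uses R's "sentiment" package, whose lexicons are far larger; the word lists here are toy assumptions):

```python
# Sketch of dictionary-based polarity: count positive and negative
# lexicon hits per tweet and classify by the net score.
POSITIVE = {"pray", "hope", "love", "peace", "strong"}
NEGATIVE = {"sad", "angry", "attack", "terror", "fear"}

def polarity(tweet):
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweets = ["pray hope peace", "sad angry attack", "paris november"]
print([polarity(t) for t in tweets])   # ['positive', 'negative', 'neutral']
```

Counting tweets per class over the whole corpus yields a polarity distribution like the one in figure 9.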
From the word cloud in figure 10 [17], we can intuitively see the most frequently used words about "#prayforparis" on Twitter (the larger the font, the more often the word is used in tweets). Most of the emotional words were concentrated in the categories of sadness, anger and disgust.
From these experimental data, we can conclude that the general attitude of people around the world toward the terrorist attack is sadness and anger. Most people feel sorry for the victims and pray for the victims in Paris; they are also strongly against terrorism.
8. WORKING PLAN
To finish this project, we made a timetable and working
plan as table 4 shows.
9. CONCLUSION AND FUTURE WORKS
In this report, we applied R to text mining and analysis of "#prayforparis" on Twitter. We first performed data preprocessing, such as data cleaning and stemming words. Then we showed the word frequencies and associations, finding that "prayforparis" has the highest frequency and that most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also showed the layout of the whole tweet set and of some extracted tweets. Additionally, we clustered the tweet topics into 10 groups to see the connections between terms. Since tweets indicate users' attitudes and emotions well, we further performed sentiment analysis. We found that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris. As the results show, the majority of tweets were positive in polarity, mainly because of hope for a good future for Paris and for the whole world.
The data we mined are limited to one topic and are not very large, which may result in data incompleteness. Additionally, some problems remain in the data preprocessing; for example, the term-document matrix is very sparse, which is likely to have a bad influence on the subsequent analysis and evaluation. In future work, we plan to develop a better model or algorithm that can be used to mine and analyze different kinds of social-network data with R. We will also focus on improving the data preprocessing, so as to make the results more precise.
10. ACKNOWLEDGMENT
We wish to thank Dr. Hong-Ning DAI for his patient
guidance and vital suggestions on this report.
11. REFERENCES
[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media, 39(4):620–622, 2011.
[2] V. Aswini and S. K. Lavanya. Pattern discovery for text mining. In Computation of Power, Energy, Information and Communication (ICCPEIC), 2014 International Conference on, IEEE, pp. 412–416, 2014.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on Twitter. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, pp. 65–74, 2011. DOI: http://dx.doi.org/10.1145/1935826.1935845
[4] X. Chen, M. Vorvoreanu, and K. P. C. Madhavan. Mining social media data for understanding students' learning experiences. IEEE Trans. Learn. Technol., vol. 7, no. 3, pp. 246–259, 2014.
[5] Kuan-Cheng Lin et al. Mining the user clusters on Facebook fan pages based on topic and sentiment analysis. In Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on, pp. 627–632, 13–15 Aug. 2014.
[6] I. Feinerer. tm: Text Mining Package. R package version 0.5-7.1, 2012.
[7] I. Fellows. wordcloud: Word Clouds. R package version 2.0, 2012.
[8] J. Gentry. twitteR: R based Twitter client. R package version 0.99.19, 2012.
[9] I. Guellil and K. Boukhalfa. Social big data mining: a survey focused on opinion mining and sentiments analysis. In Programming and Systems (ISPS), 2015 12th International Symposium on, pp. 1–10, April 2015.
[10] E. Katz. The two-step flow of communication: an up-to-date report on an hypothesis. Public Opinion Quarterly, 21(1):61–78, 1957.
[11] M. Mathioudakis and N. Koudas. TwitterMonitor: trend detection over the Twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10), ACM, New York, NY, pp. 1155–1157, 2010.
[12] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Seventh Conference on International Language Resources and Evaluation, 2010.
[13] J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: a comparison of microblog search and web search. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, pp. 35–44, 2011. DOI: http://dx.doi.org/10.1145/1935826.1935842
[14] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on Twitter. In Proceedings of the 20th International Conference on World Wide Web (WWW '11), ACM, New York, NY, USA, pp. 705–714, 2011. DOI: http://dx.doi.org/10.1145/1963405.1963504
[15] Hsin-Ying Wu, Kuan-Liang Liu, and C. Trappey. Understanding customers using Facebook pages: data mining users' feedback using text analysis. In Computer Supported Cooperative Work in Design (CSCWD), Proceedings of the 2014 IEEE 18th International Conference on, pp. 346–350, 21–23 May 2014.
[16] L. M. Zhang, Y. Jia, X. Zhu, B. Zhou, and Y. Han. User-level sentiment evolution analysis in microblog. China Communications, vol. 11, issue 12, pp. 152–163, 2011.
[17] Y. Zhao. R and Data Mining: Examples and Case Studies. Elsevier, 2012.
[18] More details about R: https://www.r-project.org/about.html
[19] More information about stemming: https://en.wikipedia.org/wiki/Stemming
APPENDIX
A. CODES FOR TEXTMINING
8. 1 l i b r a r y (ROAuth)
2 l i b r a r y ( bitops )
3 l i b r a r y ( RCurl )
4 l i b r a r y ( twitteR )
5 l i b r a r y (NLP)
6 l i b r a r y (tm)
7 l i b r a r y ( RColorBrewer )
8 l i b r a r y ( wordcloud )
9 l i b r a r y (XML)
10 #Set t w i t t e r auth url
11 reqTokenURL <− ”https :// api . t w i t t e r . com/oauth/ request token ”
12 accessTokenURL <− ”https :// api . t w i t t e r . com/oauth/ access token ”
13 authURL <− ”https :// api . t w i t t e r . com/oauth/ authorize ”
14 #Set t w i t t e r key
15 consumerkey <− ”PXoumpl5ndvroikd1DPeGkcqE ”
16 consumerSecret <− ”raDtyWXPYBS5zAH0WVjUGKoiObIAEpHroWJ8G6UjlVn5DBdzbv”
17 accessToken <− ”3954258018−HALNbJ0Jo0pPVK844ZvNBnz5yRCXcdyTPKNE4rq”
18 acce ss Secr e t <− ”K45pUUUpWjqwSM0VgQZWDzx7D7F7RN74fB7gDg1EAh05B”
19 setup twitter oauth ( consumerkey , consumerSecret , accessToken ,
20 +acce ss Secr e t )
21 l i b r a r y ( twitteR )
22 tweets <− searchTwitter ( ”PrayforParis ” , s i nc e = ”2015−11−13” ,
23 + u n t i l = ”2015−12−14” , n = 3200)
24 ( nDocs <− length ( tweets ))
25 #[ 1 ] 3200
26 # convert tweets to a data frame
27 tweets . df <− twListToDF ( tweets )
28 dim( tweets . df )
29 # 3200 16
30 #Text cleaning
31 l i b r a r y (tm)
32 # build a corpus , and s p e c i f y the source to be character vectors
33 myCorpus <− Corpus ( VectorSource ( tweets . df$text ))
34 # convert to lower case
35 # tm v0 .6
36 myCorpus <− tm map(myCorpus , content transformer ( tolower ))
37 # tm v0.5−10
38 # myCorpus <− tm map(myCorpus , tolower )
39 # remove URLs
40 removeURL <− function (x) gsub ( ”http [ ˆ [ : space : ] ] ∗ ” , ”” , x)
41 # tm v0 .6
42 myCorpus <− tm map(myCorpus , content transformer (removeURL ))
43 # tm v0.5−10
44 # myCorpus <− tm map(myCorpus , removeURL)
45 # remove anything other than English l e t t e r s or space
46 removeNumPunct <− function (x) gsub ( ” [ ˆ [ : alpha : ] [ : space : ] ] ∗ ” , ”” , x)
47 myCorpus <− tm map(myCorpus , content transformer (removeNumPunct ))
48 # remove punctuation
49 # myCorpus <− tm map(myCorpus , removePunctuation )
50 # remove numbers
51 # myCorpus <− tm map(myCorpus , removeNumbers )
52 # add two extra stop words : ”pray ” and ”f o r ”
53 myStopwords <− c ( stopwords ( ’ e n g l i s h ’ ) , ”pray ” , ”f o r ”)
54 # remove ”ISIS ” and ”Paris ” from stopwords
55 myStopwords <− s e t d i f f ( myStopwords , c ( ”ISIS ” , ”Paris ”))
56 # remove stopwords from corpus
57 myCorpus <− tm map(myCorpus , removeWords , myStopwords )
58 # remove extra whitespace
59 myCorpus <− tm map(myCorpus , stripWhitespace )
60 # keep a copy of corpus to use l a t e r as a dictionary
61 #f o r stem completion
62 myCorpusCopy <− myCorpus
63 # stem words
64 myCorpus <− tm map(myCorpus , stemDocument )
65 # inspect the f i r s t 5 documents ( tweets )
66 # inspect (myCorpus [ 1 : 5 ] )
67 # The code below i s used f o r to make text f i t f o r paper width
68 f o r ( i in c ( 1 : 2 , 320)) {
69 cat ( paste0 ( ” [ ” , i , ” ] ”))
9. 1
2 writeLines ( strwrap ( as . character (myCorpus [ [ i ] ] ) , 60))}
3 #[ 1 ] RT BahutConfess PrayForPari
4 #[ 2 ] FCBayern dontbombsyria i s i PrayForUmmah i s r a i l spdbpt bbc
5 #PrayforPari Merkel franc BVBPAOK saudi
6 #[ 3 2 0 ] RT RodrigueDLG Rip aux victim du bataclan AMAs PrayForParid
7 # tm v0.5−10
8 # myCorpus <− tm map(myCorpus , stemCompletion )
9 # tm v0 .6
10 stemCompletion2 <− function (x , dictionary ) {
11 x <− u n l i s t ( s t r s p l i t ( as . character (x ) , ” ”))
12 # Unexpectedly , stemCompletion completes an empty s t r i n g to
13 # a word in dictionary . Remove empty s t r i n g to avoid above i s s u e .
14 x <− x [ x != ”” ]
15 x <− stemCompletion (x , dictionary=dictionary )
16 x <− paste (x , sep=”” , c o l l a p s e=” ”)
17 PlainTextDocument ( stripWhitespace (x ))
18 }
19 myCorpus <− lapply (myCorpus , stemCompletion2 ,
20 +dictionary=myCorpusCopy)
21 myCorpus <− Corpus ( VectorSource (myCorpus ))
22 # count frequency of ”ISIS ”
23 ISISCases <− lapply (myCorpusCopy ,
24 function (x) { grep ( as . character (x ) , pattern = ” <ISIS ”) } )
25 sum( u n l i s t ( ISISCases ))
26 ## [ 1 ] 8
27 # count frequency of ”pray ”
28 prayCases <− lapply (myCorpusCopy ,
29 function (x) { grep ( as . character (x ) , pattern = ” <pray ”) } )
30 sum( u n l i s t ( prayCases ))
31 ## [ 1 ] 1136
32 # replace ”Islam ” with ”ISIS ”
33 myCorpus <− tm map(myCorpus , content transformer ( gsub ) ,
34 pattern = ”Islam ” , replacement = ”ISIS ”)
35 tdm <− TermDocumentMatrix (myCorpus , control =
36 +l i s t ( wordLengths = c (1 , Inf ) ) )
37 tdm
38 #<<TermDocumentMatrix ( terms : 3621 , documents : 3200)>>
39 #Non−/sparse e n t r i e s : 27543/11559657
40 #Sparsity : 100%
41 #Maximal term length : 38
42 #Weighting : term frequency ( t f )
# Frequent Words and Associations
idx <- which(dimnames(tdm)$Terms == "pray")
inspect(tdm[idx + (0:5), 10:16])
## <<TermDocumentMatrix (terms: 6, documents: 7)>>
## Non-/sparse entries: 2/40
## Sparsity           : 95%
## Maximal term length: 14
## Weighting          : term frequency (tf)
##                 Docs
## Terms            10 11 12 13 14 15 16
##   pray            0  1  0  0  0  0  0
##   prayed          0  0  0  0  0  0  0
##   prayer          0  0  0  0  1  0  0
##   prayersburundi  0  0  0  0  0  0  0
##   prayersforfr    0  0  0  0  0  0  0
##   prayersforpari  0  0  0  0  0  0  0
# inspect frequent words (appearing in at least 100 tweets)
(freq.terms <- findFreqTerms(tdm, lowfreq = 100))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 100)
df <- data.frame(term = names(term.freq), freq = term.freq)
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# select some terms only
ggplot(df[30:60, ], aes(x = term, y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# which words are associated with "pray"?
findAssocs(tdm, "pray", 0.25)
# clustering words
# remove sparse terms
tdm2 <- removeSparseTerms(tdm, sparse = 0.95)
m2 <- as.matrix(tdm2)
# cluster terms
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D2")
# other methods: complete, average, centroid
plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k = 10)
(groups <- cutree(fit, k = 10))
##         l'   attentat          à       déjà         et    everyon
##          1          2          2          3          1          4
##       fait         il      jamai        les        moi  noubliera
##          2          5          2          2          6          2
##       pari parisattack       pour prayforpari         rt    simoncr
##          7          2          1          8          9          2
##    thought         un     victim          y   ytbclara
##          4         10          1          5          1
# change tdm to a Boolean matrix
termDocMatrix <- as.matrix(tdm)
# termDocMatrix <- as.matrix(tdm[40:240, 40:240])
# remove "pray", "paris" and "shoot"
idx <- which(dimnames(termDocMatrix)$Terms %in% c("pray", "paris", "shoot"))
M <- termDocMatrix[-idx, ]
# build a tweet-tweet adjacency matrix
tweetMatrix <- t(M) %*% M
library(igraph)
g <- graph.adjacency(tweetMatrix, weighted = TRUE, mode = "undirected")
V(g)$degree <- degree(g)
g <- simplify(g)
# set labels of vertices to tweet IDs
V(g)$label <- V(g)$name
V(g)$label.cex <- 1
V(g)$label.color <- rgb(.4, 0, 0, .7)
V(g)$size <- 2
V(g)$frame.color <- NA
barplot(table(V(g)$degree))
tdm <- tdm[1:200, 1:200]
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# load twitter text
# library(twitteR); load(file = "data/rdmTweets.RData")
# convert tweets to a data frame
df <- do.call("rbind", lapply(tdm, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
  substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
# termDocMatrix <- as.matrix(tdm[40:100, 140:200])
dim(termDocMatrix)
termDocMatrix[termDocMatrix >= 1] <- 1
# transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)
dim(termMatrix)   # a square terms-by-terms matrix
# inspect terms numbered 5 to 10
termMatrix[5:10, 5:10]
## Terms             abrahammateomus abzzni accept account acontecem across
##   abrahammateomus               1      0      0       0         0      0
##   abzzni                        0      1      0       0         0      0
##   accept                        0      0      2       0         0      0
##   account                       0      0      0       1         0      0
##   acontecem                     0      0      0       0         2      0
##   across                        0      0      0       0         0      2
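The matrix product above can be illustrated on a toy example (base R only; the three terms and three documents are made up for illustration): for a 0/1 term-document matrix `M`, the entry (i, j) of `M %*% t(M)` counts the documents in which terms i and j co-occur, and the diagonal gives each term's document frequency.

```r
# toy 0/1 term-document matrix: 3 terms x 3 documents
M <- matrix(c(1, 1, 0,    # "pray"   appears in docs 1 and 2
              1, 0, 1,    # "paris"  appears in docs 1 and 3
              0, 1, 1),   # "attack" appears in docs 2 and 3
            nrow = 3, byrow = TRUE,
            dimnames = list(c("pray", "paris", "attack"), NULL))
M %*% t(M)
# diagonal entries: document frequency of each term (2, 2, 2)
# off-diagonal entries: number of shared documents (1 for each pair here)
```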
library(igraph)
# build a graph from the above matrix
g <- graph.adjacency(termMatrix, weighted = TRUE, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# set seed to make the layout reproducible
set.seed(30)
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout = layout1)
set.seed(3000) # 3152
layout2 <- layout.fruchterman.reingold(g)
V(g)$label[idx] <- paste(V(g)$name[idx],
  substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
termMatrix <- termMatrix[1500:2000, 1500:2000]
# create a graph
# g <- graph.incidence(termDocMatrix, mode = c("all"))
g <- graph.incidence(termMatrix, mode = c("all"))
# get index for term vertices and tweet vertices
nTerms <- nrow(M)
nDocs <- ncol(M)
idx.terms <- 1:nTerms
idx.docs <- (nTerms + 1):(nTerms + nDocs)
# set colors and sizes for vertices
V(g)$degree <- degree(g)
V(g)$color[idx.terms] <- rgb(0, 1, 0, .5)
V(g)$size[idx.terms] <- 6
V(g)$color[idx.docs] <- rgb(1, 0, 0, .4)
V(g)$size[idx.docs] <- 4
V(g)$frame.color <- NA
# set vertex labels and their colors and sizes
V(g)$label <- V(g)$name
V(g)$label.color <- rgb(0, 0, 0, 0.5)
V(g)$label.cex <- 1.4 * V(g)$degree / max(V(g)$degree) + 1
# set edge width and color
E(g)$width <- .3
E(g)$color <- rgb(.5, .5, 0, .3)
set.seed(1500)
plot(g, layout = layout.fruchterman.reingold)
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# convert tweets to a data frame
df <- do.call("rbind", lapply(termMatrix, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
  substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
############### sentiment analysis ###############
# harvest some tweets (requires the twitteR package and API credentials)
library(twitteR)
some_txt_tweets = searchTwitter("#prayforparis", n = 10000, lang = "en")
# get the text
some_txt = sapply(some_txt_tweets, function(x) x$getText())
# remove retweet entities
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove at people
some_txt = gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt = gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt = gsub("[[:digit:]]", "", some_txt)
# remove html links
some_txt = gsub("http\\w+", "", some_txt)
# remove unnecessary spaces
some_txt = gsub("[ \t]{2,}", "", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)
# define "tolower error handling" function
try.error = function(x)
{
  # create missing value
  y = NA
  # tryCatch error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y = tolower(x)
  # result
  return(y)
}
# lower case using try.error with sapply
some_txt = sapply(some_txt, try.error)
# remove NAs in some_txt
some_txt = some_txt[!is.na(some_txt)]
names(some_txt) = NULL
# classify emotion (requires the 'sentiment' package)
library(sentiment)
class_emo = classify_emotion(some_txt, algorithm = "bayes", prior = 1.0)
# get emotion best fit
emotion = class_emo[, 7]
# substitute NA's by "unknown"
emotion[is.na(emotion)] = "unknown"
# classify polarity
class_pol = classify_polarity(some_txt, algorithm = "bayes")
# get polarity best fit
polarity = class_pol[, 4]
# data frame with results
sent_df = data.frame(text = some_txt, emotion = emotion,
  polarity = polarity, stringsAsFactors = FALSE)
# sort data frame
sent_df = within(sent_df,
  emotion <- factor(emotion,
    levels = names(sort(table(emotion), decreasing = TRUE))))
# plot distribution of emotions
ggplot(sent_df, aes(x = emotion)) +
  geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "emotion categories", y = "number of tweets",
       title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by emotion)") +
  theme(plot.title = element_text(size = 12))
# plot distribution of polarity
ggplot(sent_df, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  labs(x = "polarity categories", y = "number of tweets",
       title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by polarity)") +
  theme(plot.title = element_text(size = 12))
# separating text by emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
  tmp = some_txt[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse = " ")
}
# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))
# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos
# comparison word cloud (requires the wordcloud package)
library(wordcloud)
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
  scale = c(3, .5), random.order = FALSE, title.size = 1.5)