Twitter data analysis using R

Twitter Data Analytics using R
By:
Santoshi Kumari
RUAS

Outline
• Big data and Social Media data
• Social media analytics
• Why Social media Analytics
• Real time twitter data analytics
• Text Mining for Twitter Data
• Basics of Twitter data analytics using R
• Summary

Introduction
• An enormous volume of data is produced online, growing every second and every minute.
• It is humanly difficult to keep up with the rate of data generation on Twitter and Facebook.
• Devise advanced analytical models that combine machine learning, data mining and NLP algorithms to make informed decisions for time-sensitive events.
Big Data Analytics
Volume
• Petabyte
• Exabyte
• Faster processing
Velocity
• Batch
• Near real time
• Real time
• Improve performance
Variety
• Structured
• Semi-structured
• Unstructured
• Increase accuracy
Social Media Sentiment Analytics
• Positive
• Neutral :-/
• Negative

Social media: a new way to predict the future by understanding the present and to make informed decisions
SOCIAL MEDIA REVOLUTION

Social Media Analytics
• Social media has given people a new way to communicate and to share their opinions, interests and sentiments with the world.
• A huge amount of unstructured data is generated on social media platforms such as Facebook, Twitter and LinkedIn.
• Social media analytics deals with the development and evaluation of tools and frameworks to collect, monitor, analyze, summarize and visualize social media data.
• It extracts useful patterns and information.

Why Social Media Analytics?
• Social media – An integral part of the daily routine, changing the way people communicate across the globe
• The opinion of the masses is important – Political parties; government policies; movies; products and services; individuals; organizations
• Trending topics can reveal people’s intentions and interests and, importantly, current happenings

Applications of Social Media Analytics
• Retail companies: to harness brand awareness, improve service, shape advertising and marketing strategies, and identify influencers
• Finance: to determine market sentiment and to use news data for trading
• Government and public officials
• Monitoring public perception of political candidates, election campaigns and announcements
• Prediction at the national level of happiness, unemployment, etc.
• Social media job loss index: econprediction.eecs.umich.edu
• An article on real world applications
• Sudden change in behavior

Real time analytics
• The pulse of society can be found on social media in real time.
• Analyzing social media content in real time helps social scientists predict the future and take quick, relevant action in time.
• What is trending on social media right now may not have been popular an hour or so ago.

Why Twitter?
• Twitter is a social microblogging platform (short text messages of 140 characters).
• 500 million tweets are generated every day (http://www.internetlivestats.com/twitter-statistics/).
• Users often discuss current affairs and share personal views on various subjects.
• Views and sentiments cover any subject, from newly launched products to favorite movies and music to political decisions.
• The Twitter audience ranges from the common man to celebrities.
• Tweets are public and hence accessible to researchers, unlike content on most social network sites.
• Tweets are reliably time-stamped, so they can be analyzed from a temporal perspective.

Facts
• The male vs. female ratio of social media users is as follows:
• Facebook – 60% female/40% male;
• Twitter – 60% female/40% male;
• Pinterest – 79% female/21% male;
• Google Plus – 29% female/71% male;
• LinkedIn – 55% female/45% male.
• YouTube has over 1 billion unique visitors per month
• 91% of mobile Internet access is for social activities with 73% of smartphone owners accessing social networks
through apps at least once per day.
• Every minute of every day, 684,478 pieces of content are shared on Facebook, 3,600 new photos are posted on Instagram and 2,083 check-ins are made on Foursquare.
• LinkedIn has over 3 million company pages
• According to this study, mothers with children under the age of 5 are the most active on social media.

Challenges
• Tweets are highly unstructured and also non-grammatical
• Out of Vocabulary Words
• Lexical Variation
• Extensive usage of acronyms like asap, lol, afaik

Text Analysis
• Text analysis: extract or classify information from text such as tweets, emails, chats and documents.
• Some popular examples are:
• Spam filtering: one of the best-known and most widely used text classification applications (assigning a category to a text). Spam filters learn to classify an email or message as spam depending on its content and subject.
• Sentiment analysis: another text classification application, in which an algorithm must learn to classify an opinion as positive, neutral or negative depending on the mood expressed by the writer.
• Information extraction: from a text, learn to extract a particular piece of information or data, for example addresses, entities or keywords.
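As a rough illustration of the sentiment classification task described above, the following base R sketch labels short texts as positive, neutral or negative. It does not learn anything: it is a simple word-counting stand-in, and the word lists, the sample tweets and the function name `score_sentiment` are all assumptions made for this example.

```r
# Hypothetical, tiny opinion lexicons (assumptions for illustration only)
pos_words <- c("good", "great", "love", "happy")
neg_words <- c("bad", "poor", "hate", "sad")

# Count positive and negative words and map the difference to a label
score_sentiment <- function(text) {
  cleaned <- tolower(gsub("[^A-Za-z ]", " ", text))   # keep letters and spaces only
  words <- strsplit(cleaned, "\\s+")[[1]]
  score <- sum(words %in% pos_words) - sum(words %in% neg_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

# Made-up sample tweets
tweets <- c(
  "I love this phone, great battery!",
  "Bad service and poor support",
  "The package arrived today"
)

labels <- vapply(tweets, score_sentiment, character(1), USE.NAMES = FALSE)
print(labels)
```

A trained classifier would replace the fixed word lists with weights learned from labeled examples; the input and output shapes would stay the same.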

Why is Sentiment Analysis Important?
• 93% of marketers use social media. However, only 9% of marketing companies have full-time bloggers.
• Around 46% of web users look to social media when making a purchase.
• A government or political party may want to know whether people support its program.
• Before investing in a company, one can use public sentiment about the company to find out where it stands.
• A company like Amazon might want to find out the reviews of its products.
• Economics: predicting financial markets. Used by corporates to monitor stock markets.
• Elections:
1. Analyzing election-related chatter
2. Finding party-wise and person-wise sentiment
3. Finding what people like and dislike about a party or person
4. Finding major reasons behind success or failure
5. Finding major trends in the election
6. Analyzing the impact of non-political movements that link to politics (movements like those of Anna and Ramdev)

Text Mining
https://manoharswamynathan.wordpress.com/2015/03/01/text-mining-101/

Data Processing Steps
• Explore corpus – Understand the types of variables, their functions, permissible values, and so on.
• Some formats, including HTML and XML, contain tags and other data structures that provide more metadata.
• Convert text to lowercase – This avoids distinguishing between words simply by case.
• Remove numbers (if required) – Numbers may or may not be relevant to our analyses.
• Remove English stop words – Stop words are common words found in a language.
• Words like “for”, “of” and “are” are common stop words.
• Remove own stop words (if required) – Instead of, or in addition to, the English stop words, we could remove our own stop words.
• Strip white space – Eliminate extra white spaces.
• Stemming – Transforms words to their root form.
• Stemming uses an algorithm that removes common word endings for English words, such as “es”, “ed” and “’s”.
• For example, “computer” and “computers” both become “comput”.
• Lemmatisation – Transforms words to their dictionary base form, e.g. “produce” and “produced” both become “produce”.
• Sparse terms – We are often not interested in infrequent terms in our documents. Such “sparse” terms should be removed from the document term matrix.
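Some of the cleaning steps above can be sketched in base R as follows. This is a minimal sketch, not the deck's own code: the function name `clean_text`, the short stop word list and the sample documents are assumptions made for illustration. Stemming and lemmatisation are left out because base R has no built-in stemmer; an add-on package would be needed for those steps.

```r
# Lowercase, drop numbers, drop stop words, strip extra white space
clean_text <- function(x, stop_words) {
  x <- tolower(x)                                   # convert text to lowercase
  x <- gsub("[0-9]+", " ", x)                       # remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)           # remove whole-word stop words
  x <- gsub("\\s+", " ", x)                         # collapse repeated white space
  trimws(x)                                         # strip leading/trailing space
}

# Hypothetical stop word list; a real analysis would use a fuller one
stop_words <- c("for", "of", "are", "the", "is", "and")

docs <- c("The 3 Computers ARE   ready for use", "Analysis of 2 tweets")
cleaned <- clean_text(docs, stop_words)
print(cleaned)
# [1] "computers ready use" "analysis tweets"
```

Own stop words can be handled by appending them to `stop_words` before calling the function.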

Document Term Matrix
• Document term matrix – A matrix with documents as the rows, terms as the columns and the frequency of each term as the cell values.
• Calculate term weight – TF-IDF
• How frequently does a term appear?
Term Frequency: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
• How important is a term?
Document Frequency: DF(t) = d (number of documents containing term t) / D (the size of the collection of documents)
• To normalize: taking the logarithm compresses the scale of values so that very large and very small quantities can be compared smoothly.
• IDF: Inverse Document Frequency
IDF(t) = log10(Total number of documents / Number of documents with term t in it)
Example:
Consider a document containing 100 words in which the word CAR appears 3 times.
TF(CAR) = 3 / 100 = 0.03
Now assume we have 10 million documents and the word CAR appears in one thousand of them.
IDF(CAR) = log10(10,000,000 / 1,000) = 4
The TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12
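The worked example can be checked in R, and the same formulas applied to a whole document term matrix. This is a sketch under assumptions: the helper `tf_idf` and the toy matrix are invented for illustration, and the base-10 logarithm is used because it reproduces the value 4 in the example.

```r
# Worked example: CAR appears 3 times in a 100-word document,
# and in 1,000 of 10,000,000 documents
tf_car  <- 3 / 100
idf_car <- log10(10000000 / 1000)
tf_idf_car <- tf_car * idf_car
print(tf_idf_car)

# Hypothetical helper: TF-IDF weights for a documents-by-terms count matrix
tf_idf <- function(dtm) {
  tf  <- dtm / rowSums(dtm)                    # term counts / total terms per document
  idf <- log10(nrow(dtm) / colSums(dtm > 0))   # log10(N / documents containing term)
  sweep(tf, 2, idf, `*`)                       # multiply each column by its IDF
}

# Toy matrix (made-up counts): 2 documents, 3 terms
dtm <- matrix(c(3, 1, 0,
                0, 1, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("d1", "d2"), c("car", "road", "tree")))
weights <- tf_idf(dtm)
print(round(weights, 3))
```

In the toy matrix, "road" occurs in every document, so its IDF is log10(2 / 2) = 0 and its weight is 0 everywhere, which is the intended effect of the IDF term.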

Similarity Distance Measure (Cosine)
• Why cosine?
• A general observation is that cosine similarity works better than Euclidean distance for text data.

Calculate Cosine similarity
• Example:
• Text 1: statistics skills and programming skills are equally important for analytics
• Text 2: statistics skills and domain knowledge are important for analytics
• Text 3 : I like reading books and travelling
• The document term matrix for the above three texts gives the following three vectors:
• T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
• T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
• T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)
• Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%
• Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
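The example can be reproduced in base R by building the document term matrix directly from the three texts. This is a sketch: the variable names and the `cosine` helper are assumptions, and the term order in the matrix may differ from the vectors shown above, which does not affect cosine similarity.

```r
texts <- c(
  T1 = "statistics skills and programming skills are equally important for analytics",
  T2 = "statistics skills and domain knowledge are important for analytics",
  T3 = "I like reading books and travelling"
)

# Tokenize on white space after lowercasing
tokens <- strsplit(tolower(texts), "\\s+")
vocab  <- unique(unlist(tokens))

# Document term matrix: one row per text, one column per term, cells are counts
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
dtm <- t(counts)
dimnames(dtm) <- list(names(texts), vocab)

# Cosine similarity: dot product divided by the product of the vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim_12 <- cosine(dtm["T1", ], dtm["T2", ])
sim_13 <- cosine(dtm["T1", ], dtm["T3", ])
print(round(c(T1_T2 = sim_12, T1_T3 = sim_13), 2))
# T1_T2 T1_T3 
#  0.77  0.12
```

T1 and T3 share only the word "and", which is why their similarity is low but not zero; removing stop words first would bring it to 0.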

Reference
• http://www.rdatamining.com/docs

Thank You
Final Thought ;-)
word is mightier than the sword
tweet is mightier than the sword
