Analyzing Realtime News
Raffaele Lorusso – Marco Fusi
Milan, November 2015 #RateMe
CONTEXT
This project was carried out during the 2015-2016 master “Business Intelligence and Big Data Analytics” at Università di Milano - Bicocca.
TWITTER
Twitter as an example of new media and real-time news sharing
NEWS LIFECYCLE
How news spreads on Twitter and other new media
[Timeline figure: a news item is followed by a growing number of tweets over time, ending with a new news item.]
CREATE
Twitter is an easy way to create and share news and opinions. It is a new flow of content and information that brings huge opportunities.
ANALYZE
The collected data supports statistical analysis, from which we can extract quantitative and qualitative indicators to identify trends, correlations, flows, sentiment, and so on.
FOLLOW
Follow how a news item evolves over time by analyzing it, placing it in its real-world context, and comparing it with the external events that can generate or modify the news itself.
ARCHITECTURE
The Lambda Architecture
Main Components
• DATA SOURCES
• BATCH LAYER
• SPEED LAYER
• Machine Learning
• PRESENTATION LAYER
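As a rough illustration of how the batch and speed layers listed above can cooperate, here is a minimal in-memory sketch in Python. It is not the project's implementation: the class name `LambdaCounter`, its methods, and the choice of counting tweets per hashtag are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LambdaCounter:
    """Toy Lambda Architecture that counts tweets per hashtag (hypothetical example)."""

    # Append-only master dataset fed by the data sources.
    master: list = field(default_factory=list)
    # Batch view: recomputed from the whole master dataset.
    batch_view: Counter = field(default_factory=Counter)
    # Speed view: incremental counts for data not yet covered by a batch run.
    speed_view: Counter = field(default_factory=Counter)

    def ingest(self, hashtag: str) -> None:
        # Every new record goes to the master dataset and to the speed layer.
        self.master.append(hashtag)
        self.speed_view[hashtag] += 1

    def run_batch(self) -> None:
        # Batch layer: rebuild the view from all data, then reset the speed view,
        # because everything it held is now included in the batch view.
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, hashtag: str) -> int:
        # Presentation side: merge the batch view with the real-time view.
        return self.batch_view[hashtag] + self.speed_view[hashtag]


counts = LambdaCounter()
counts.ingest("#hadoop")
counts.ingest("#spark")
counts.run_batch()
counts.ingest("#spark")  # arrives after the batch run, served by the speed layer
print(counts.query("#spark"))  # 2
```

The point of the split is that the batch view can be recomputed slowly and exactly, while the speed view keeps answers fresh in between batch runs.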
Case Study: Big Data Ecosystem on Twitter
Big Data Ecosystem
• BIG DATA FRONTEND
• BIG DATA BACKEND
Big Data Ecosystem at a glance
1 Month · 40k · 100k · 28k · 170k · 1.2k · 30k (figure labels not preserved)
SENTIMENT ANALYSIS
From the text of a Tweet it is possible to compute a measure of the sentiment associated with it.
In this project we built two different models:
• DICTIONARY ALGORITHM
• CLUSTER THEN PREDICT
SENTIMENT ANALYSIS: DICTIONARY ALGORITHM
The idea of this model is to split a Tweet into tokens, one per word, and then assign a score to each word by looking it up in a dictionary table of positive and negative words, each with a numerical score.
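The dictionary approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `LEXICON`, the tokenizer, and the choice to sum word scores are hypothetical and are not taken from the project.

```python
import re

# Hypothetical dictionary table: word -> numerical score (positive or negative).
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "slow": -1.0, "hate": -2.0,
}

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(tweet):
    """Split a tweet into lowercase word tokens."""
    return TOKEN_RE.findall(tweet.lower())


def score_tweet(tweet, lexicon=LEXICON):
    """Sum the scores of the known words; return None if no word is in the dictionary."""
    scores = [lexicon[token] for token in tokenize(tweet) if token in lexicon]
    return sum(scores) if scores else None


def coverage(tweets, lexicon=LEXICON):
    """Fraction of tweets that received a score at all."""
    if not tweets:
        return 0.0
    scored = sum(1 for tweet in tweets if score_tweet(tweet, lexicon) is not None)
    return scored / len(tweets)


sample = ["I love Spark, great tool!", "Hadoop is slow and bad", "Kafka 0.9 released"]
print([score_tweet(t) for t in sample])  # [4.0, -2.0, None]
```

A tweet with no dictionary word gets no score, so the share of scored tweets (the `coverage` helper here) depends directly on how many words the dictionary contains.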
SENTIMENT ANALYSIS: CLUSTER THEN PREDICT
This model clusters Tweets that contain similar words and then applies a Random Forest algorithm to each cluster.
Reference: “Improved Twitter Sentiment prediction through Cluster then Predict Model”, International Journal of Computer Science and Network, August 2015
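A minimal sketch of the cluster-then-predict idea, assuming scikit-learn is available. The use of bag-of-words features, k-means for the clustering step, the class name `ClusterThenPredict`, and the toy training data are all assumptions for illustration; the slides do not say which features or clustering method the project used.

```python
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer


class ClusterThenPredict:
    """Cluster tweets by their words, then train one Random Forest per cluster."""

    def __init__(self, n_clusters=2, seed=0):
        self.seed = seed
        self.vectorizer = CountVectorizer()  # bag-of-words features (assumed)
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        self.forests = {}     # cluster id -> Random Forest trained on that cluster
        self.fallback = None  # forest trained on all data, used if a cluster has no model

    def _new_forest(self):
        return RandomForestClassifier(n_estimators=50, random_state=self.seed)

    def fit(self, tweets, labels):
        X = self.vectorizer.fit_transform(tweets)
        clusters = [int(c) for c in self.kmeans.fit_predict(X)]
        self.fallback = self._new_forest().fit(X, labels)
        for c in set(clusters):
            rows = [i for i, ci in enumerate(clusters) if ci == c]
            self.forests[c] = self._new_forest().fit(X[rows], [labels[i] for i in rows])
        return self

    def predict(self, tweets):
        X = self.vectorizer.transform(tweets)
        clusters = [int(c) for c in self.kmeans.predict(X)]
        predictions = [None] * len(tweets)
        for c in set(clusters):
            rows = [i for i, ci in enumerate(clusters) if ci == c]
            model = self.forests.get(c, self.fallback)
            for i, label in zip(rows, model.predict(X[rows])):
                predictions[i] = str(label)
        return predictions


# Hypothetical toy data, only to show the calls.
train_tweets = ["love spark great", "spark is great love it", "hate hadoop slow",
                "hadoop slow bad hate", "love spark", "hate hadoop"]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]
model = ClusterThenPredict(n_clusters=2).fit(train_tweets, train_labels)
print(model.predict(["spark is great", "hadoop is slow"]))
```

Each tweet is first routed to its nearest cluster and then classified by that cluster's own forest, so each forest only has to model the vocabulary of similar tweets.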
DASHBOARD
*LIVE DEMO
CREATING THE NEWS
CONCLUSIONS
Big Data Ecosystem - Architecture
• The «Lambda Architecture» seems a good approach because it balances the need for real-time analysis with batch computation
• The Big Data Ecosystem is composed of heterogeneous technologies, each of which solves just one part of the whole problem
• Many technologies are easily interoperable and composable
• There are many first movers in the Big Data market, but also established players that are nowadays a must-have in a Big Data Architecture
CONCLUSIONS
Big Data Ecosystem – Data Analysis
• The most tweeted technologies are not always the ones that have the largest market share
• There seems to be no correlation between real Big Data events and tweet volumes
• In this case study, sentiment analysis with the cluster-then-predict model performed worse than with the dictionary algorithm
• The dictionary algorithm depends heavily on a good dictionary with many words. With the dictionary we used, only 42% of tweets were scored
• The analysis of senders and mentioned users highlighted that many influencers are closely connected to the technologies, or are even the official accounts of those technologies
• 45% of the tweets were sent from the official apps for the Web platform, Android and iOS
Case Study: Data Science seminar @masterBIBDA
Milan, 19 November 2015
#RateMe Rules
Game: Rate this seminar
Players: Our speakers and YOU!
Objectives: Have Fun!
Tweet to: @masterbibda
Reference the keyword by using a hashtag: #datascientistprofiles
Vote: alto – medio – basso (high, medium, low)
Example: [tweet image not preserved]
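The voting rules above could be checked automatically with a small parser like the following Python sketch. The function name `parse_vote`, the tokenizer, and the decision to ignore the `#RateMe` tag when looking for the keyword are assumptions for illustration; the slides do not describe how the votes were actually extracted.

```python
import re

TARGET = "@masterbibda"  # account the tweet must be addressed to
GAME_TAG = "#rateme"     # game hashtag, not treated as a keyword (assumption)
VOTES = {"alto": "high", "medio": "medium", "basso": "low"}

TOKEN_RE = re.compile(r"[@#]?\w+")


def parse_vote(tweet):
    """Return (keyword_hashtag, vote) if the tweet follows the rules, else None."""
    tokens = TOKEN_RE.findall(tweet.lower())
    if TARGET not in tokens:
        return None
    keywords = [t for t in tokens if t.startswith("#") and t != GAME_TAG]
    votes = [t for t in tokens if t in VOTES]
    # Require a keyword hashtag and exactly one vote word.
    if not keywords or len(votes) != 1:
        return None
    return keywords[0], VOTES[votes[0]]


print(parse_vote("@masterbibda #datascientistprofiles alto #RateMe"))
# ('#datascientistprofiles', 'high')
```

Tweets that mention the account but carry no keyword hashtag or no vote word return `None`, so they can still be collected for free-text analysis without counting as votes.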
and…
Feel free to Tweet your thoughts @masterbibda!
Every Tweet will be analyzed!
DASHBOARD
*LIVE DEMO
Enjoy #RateMe
THANKS!
