Mining Big Data in Real Time
Albert Bifet
Motivation
• BIG DATA is an OPEN SOURCE
Software Revolution
• BIG DATA Analytics 2.0
• What is happening right now
• Why do we need new tools?
• Improve decision making:
• Measure and react in REAL-TIME
2 7/6/2013
Real Time Decision Making
Companies need to know what is happening right now, in real time, to be able to:
• react
• anticipate and detect new business opportunities.
Big Data 6 Vs
• Volume
• Variety
• Velocity
• Value
• Variability
• Veracity
Controversy of Big Data
• All data is BIG now
• Hype to sell Hadoop-based systems
• Ethical concerns about accessibility
• Limited access to Big Data creates new digital divides
Controversy of Big Data
• Statistical Significance:
– When the number of variables grows, the number of spurious correlations also grows
– Leinweber: S&P 500 stock index correlated with butter production in Bangladesh
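The statistical-significance point can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: the function names `pearson` and `max_abs_correlation`, the sample size, and the seed are all illustrative assumptions. It draws a random target series and many unrelated random series, then reports the strongest correlation found purely by chance.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def max_abs_correlation(num_variables, num_samples=20, seed=0):
    """Strongest |correlation| between a random target and unrelated noise.

    Assumption for illustration: every series is independent Gaussian noise,
    so any correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_samples)]
    best = 0.0
    for _ in range(num_variables):
        candidate = [rng.gauss(0, 1) for _ in range(num_samples)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # With a fixed seed, larger runs extend the candidates of smaller runs,
    # so the reported maximum can only stay equal or increase.
    for k in (10, 100, 1000):
        print(k, round(max_abs_correlation(k), 3))
```

Searching more variables can only raise the best chance correlation, which is why a striking match such as the butter-production example is weak evidence on its own.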
Need for Big Data
• McKinsey Global Institute
(MGI) Report on Big
Data, 2011
More data or better models?
Xavier Amatriain
Netflix Research/Engineering Director
http://recsys.acm.org/more-data-or-better-models/
Future Challenges for Big Data
• Evaluation
• Time evolving data
• Distributed mining
• Compression
• Visualization
• Hidden Big Data
Hadoop Architecture
Apache Mahout
Pig
• Similar to SQL
Apache S4
Twitter Storm
Runaway Complexity
“Lambda Architecture”
[Figure: new data stream → all data → precomputed batch view → query; new data stream → precomputed realtime view → query]
• Tools: Hadoop, Storm, ElephantDB, Voldemort, Cassandra, Riak, HBase, Kafka
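The query path of the Lambda Architecture named on this slide can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: `batch_view`, `RealtimeView`, and `query` are hypothetical names, and plain Python counters stand in for the batch and realtime stores; none of this is the API of Hadoop, Storm, or the other tools listed.

```python
from collections import Counter


def batch_view(all_data):
    """Batch layer (assumed): recompute counts from the full dataset."""
    return Counter(all_data)


class RealtimeView:
    """Speed layer (assumed): incrementally count events not yet batched."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        self.counts[event] += 1


def query(key, batch, realtime):
    """Answer a query by merging the batch view with the realtime view."""
    return batch[key] + realtime.counts[key]


if __name__ == "__main__":
    batch = batch_view(["click", "view", "click"])  # already in "all data"
    realtime = RealtimeView()
    for event in ["click", "purchase"]:             # arrived after the batch run
        realtime.update(event)
    print(query("click", batch, realtime))
```

The design choice this shows is that the batch view can be slow and stale as long as the realtime view covers the events that arrived since the last batch run.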
What is SAMOA?
• NEW Software framework for mining distributed data streams
• Big Data mining for evolving streams in REAL-TIME
Big Data Stream Mining
BIG DATA Streams
• Sequence is potentially infinite
• High amount of data, high speed of arrival
• Change over time
• Process elements from a data stream in only one pass
• Approximation algorithms
– Small error rate with high probability
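One well-known example of a one-pass approximation algorithm with this kind of guarantee is the Count-Min Sketch, which estimates item frequencies in fixed memory. The talk does not name a specific algorithm, so this is an illustrative choice. The sketch below is minimal and rests on assumptions: the class name, the default `epsilon` and `delta`, and the use of SHA-256 as a stand-in for the pairwise-independent hash functions that the formal analysis requires.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counts over a stream in one pass.

    With ideal hash functions, an estimate never falls below the true count
    and exceeds it by more than epsilon * total with probability at most delta.
    """

    def __init__(self, epsilon=0.01, delta=0.01):
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _index(self, item, row):
        # Assumption: SHA-256 salted with the row number approximates
        # independent hash functions well enough for illustration.
        digest = hashlib.sha256(f"{row}:{item!r}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        """Process one stream element; memory use does not grow."""
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return the minimum counter across rows (never an underestimate)."""
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for word in ["storm"] * 50 + ["s4"] * 30 + ["moa"] * 20:
        sketch.add(word)
    print(sketch.estimate("storm"), "of", sketch.total)
```

Each element is seen once and then discarded, and the table size depends only on the chosen error parameters, not on the length of the stream.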
Big Data Stream Mining
Distributed BIG DATA
• BIG DATA Analytics 2.0
– Apache S4 (Yahoo!, 2010)
– Storm (Twitter, 2011)
Machine Learning
• Distributed
– Batch: Hadoop, Mahout
– Stream: S4, Storm, SAMOA
• Non-distributed
– Batch: R, WEKA, …
– Stream: MOA
SAMOA Architecture
• Use S4, Storm, or other distributed stream processing platform
• Use MOA, or other streaming machine learning library
• Easy to extend through PACKAGES
[Figure: SAMOA layered over S4, Storm, …, with Classifier Methods, Clustering Methods, and Frequent Pattern Mining]
Thanks!
http://samoa-project.net/
G. De Francisci Morales. SAMOA: A Platform for Mining Big Data Streams.
Keynote Talk at RAMSS ’13: 2nd International Workshop on Real-Time Analysis and
Mining of Social Streams @WWW, Rio de Janeiro, 2013.
