Mining Big Data in Real Time           Albert Bifet  Turing/SLAIS 2012 Conference
BIG Data   BIG DATA           Measure and React
Motivation     Source: IDC’s Digital Universe Study (EMC), June 2011               Data is growing
Motivation             Memory unit        Size   Binary size             kilobyte (kB/KB)   103        210             meg...
Motivation     Source: IDC’s Digital Universe Study (EMC), June 2011               Data is growing
Motivation     Source: IDC’s Digital Universe Study (EMC), June 2011               Data is growing
Motivation     Source: IDC’s Digital Universe Study (EMC), June 2011               Data is growing
Streaming Data       Big Data & Real Time
Big Data           McKinsey Global Institute (MGI) Report on Big Data, 2011.       Big data refers to datasets whose size ...
Big Data           McKinsey Global Institute (MGI) Report on Big Data, 2011.       Big data refers to datasets whose size ...
BIG Data     Volume     Variety     Velocity                3 Vs
Methodology              Sampling and distributed systems
Methodology                     Paolo Boldi          Facebook Four degrees of separation          Big Data does not need b...
Real time analytics          We want to analyze what is happening now.
Real time analytics          We want to analyze what is happening now.
Time and Memory                    Number 8 Wire Mentality      Time and memory are the resource dimensions of            ...
Time and Memory      Time and memory are the resource dimensions of                     the process.
Algorithms        Classification, Regression, Clustering, Frequent                        Pattern Mining.
Applications      sensor data: industry, cities      telecomm data      social networks: twitter, facebook, yahoo      mar...
New applications: social networks   Twitter: A Massive Data Stream      Micro-blogging service      Built to discover what...
Sentiment Analysis on Twitter   Sentiment analysis   Classifying messages into two categories depending on   whether they ...
New problem: structured classification   New methods for structured classification                         D                ...
New problem: structured classification   New methods for structured classification             D           D                ...
New problem: structured classification   New methods for structured classification             D           D                ...
New Techniques: Distributed Systems              Hadoop, S4 and Storm
Hadoop         Hadoop
Hadoop         Hadoop architecture
Apache Mahout            Mahout: open source framework
Pig      Pig: Similar to SQL
Pig      A = LOAD ’data’ USING PigStorage() AS      (f1:int, f2:int, f3:int);      B = GROUP A BY f1;      C = FOREACH B G...
Apache S4            Apache S4
Apache S4
Storm        Storm from Twitter
Storm        Stream, Spout, Bolt, Topology
Storm        Tools                                            ElephantDB, Voldemort                       All     Hadoop P...
Data Streams       Big Data & Real Time
Data Streams               Thanks!
Upcoming SlideShare
Loading in …5
×

Mining Big Data in Real Time

4,316 views

Published on

Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. We are creating the same quantity of data every two days, as we created from the dawn of time up until 2003. Evolving data streams methods are becoming a low-cost, green methodology for real time online prediction and analysis. We discuss the current and future trends of mining evolving data streams, and the challenges that the field will have to overcome during the next years.

Mining Big Data in Real Time

  1. 1. Mining Big Data in Real Time Albert Bifet Turing/SLAIS 2012 Conference
  2. 2. BIG Data BIG DATA Measure and React
  3. 3. Motivation Source: IDC’s Digital Universe Study (EMC), June 2011 Data is growing
  4. 4. Motivation Memory unit Size Binary size kilobyte (kB/KB) 103 210 megabyte (MB) 106 220 gigabyte (GB) 109 230 terabyte (TB) 1012 240 petabyte (PB) 1015 250 exabyte (EB) 1018 260 zettabyte (ZB) 1021 270 yottabyte (YB) 1024 280 Data is growing
  5. 5. Motivation Source: IDC’s Digital Universe Study (EMC), June 2011 Data is growing
  6. 6. Motivation Source: IDC’s Digital Universe Study (EMC), June 2011 Data is growing
  7. 7. Motivation Source: IDC’s Digital Universe Study (EMC), June 2011 Data is growing
  8. 8. Streaming Data Big Data & Real Time
  9. 9. Big Data McKinsey Global Institute (MGI) Report on Big Data, 2011. Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
  10. 10. Big Data McKinsey Global Institute (MGI) Report on Big Data, 2011. Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
  11. 11. BIG Data Volume Variety Velocity 3 Vs
  12. 12. Methodology Sampling and distributed systems
  13. 13. Methodology Paolo Boldi Facebook Four degrees of separation Big Data does not need big machines, it needs big intelligence
  14. 14. Real time analytics We want to analyze what is happening now.
  15. 15. Real time analytics We want to analyze what is happening now.
  16. 16. Time and Memory Number 8 Wire Mentality Time and memory are the resource dimensions of the process.
  17. 17. Time and Memory Time and memory are the resource dimensions of the process.
  18. 18. Algorithms Classification, Regression, Clustering, Frequent Pattern Mining.
  19. 19. Applications sensor data: industry, cities telecomm data social networks: twitter, facebook, yahoo marketing: sales business Data may come from: humans, sensors, or machines.
  20. 20. New applications: social networks Twitter: A Massive Data Stream Micro-blogging service Built to discover what is happening at any moment in time, anywhere in the world. 3 billion requests a day via its API. MOA-TweetReader: a real-time system to read tweets in real time detect changes find the terms whose frequency changed
  21. 21. Sentiment Analysis on Twitter Sentiment analysis Classifying messages into two categories depending on whether they convey positive or negative feelings Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification Positive Emoticons Negative Emoticons :) :( :-) :-( :) :( :D =) Table : List of positive and negative emoticons.
  22. 22. New problem: structured classification New methods for structured classification D D B B B → C C C C A sequences, trees, graphs
  23. 23. New problem: structured classification New methods for structured classification D D D D B , B B B , B → C C C C C A sequences, trees, graphs frequent pattern mining techniques
  24. 24. New problem: structured classification New methods for structured classification D D D D B , B B B , B → C C C C C A a,b → class1, class2 sequences, trees, graphs frequent pattern mining techniques multi-label data mining Example: Lord of the Rings → Action, Adventure, Fantasy
  25. 25. New Techniques: Distributed Systems Hadoop, S4 and Storm
  26. 26. Hadoop Hadoop
  27. 27. Hadoop Hadoop architecture
  28. 28. Apache Mahout Mahout: open source framework
  29. 29. Pig Pig: Similar to SQL
  30. 30. Pig A = LOAD ’data’ USING PigStorage() AS (f1:int, f2:int, f3:int); B = GROUP A BY f1; C = FOREACH B GENERATE COUNT ($0); DUMP C; Pig: Similar to SQL
  31. 31. Apache S4 Apache S4
  32. 32. Apache S4
  33. 33. Storm Storm from Twitter
  34. 34. Storm Stream, Spout, Bolt, Topology
  35. 35. Storm Tools ElephantDB, Voldemort All Hadoop Precomputed batch view data Storm Query Precomputed realtime view New data stream Storm Cassandra, Riak, HBase Kafka “Lambda Architecture” Runaway complexity in Big Data Nathan Marz, 2012
  36. 36. Data Streams Big Data & Real Time
  37. 37. Data Streams Thanks!

×