Mining Big Data in Real Time

Big Data is a new term for datasets that cannot be managed with current methodologies or data mining software tools because of their size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary because of the volume, variability, and velocity of such data.

Published in: Technology, Business

  1. Mining Big Data in Real Time (Albert Bifet)
  2. Motivation • BIG DATA is an OPEN SOURCE Software Revolution • BIG DATA Analytics 2.0 • What is happening right now • Why do we need new tools? • Improve decision making: measure and react in REAL-TIME 2 7/6/2013
  3. Real Time Decision Making • Companies need to know what is happening right now, in real time, to be able to: • react • anticipate and detect new business opportunities.
  4. Big Data: 6 Vs • Volume • Variety • Velocity • Value • Variability • Veracity
  5. Controversy of Big Data • All data is BIG now • Hype to sell Hadoop-based systems • Ethical concerns about accessibility • Limited access to Big Data creates new digital divides
  6. Controversy of Big Data • Statistical Significance: – When the number of variables grows, the number of spurious correlations also grows – Leinweber: S&P 500 stock index correlated with butter production in Bangladesh
  7. Need for Big Data • McKinsey Global Institute (MGI) Report on Big Data, 2011
  8. Need for Big Data • McKinsey Global Institute (MGI) Report on Big Data, 2011
  9. More data or better models? • Xavier Amatriain, Netflix Research/Engineering Director • http://recsys.acm.org/more-data-or-better-models/
  10. Future Challenges for Big Data • Evaluation • Time-evolving data • Distributed mining • Compression • Visualization • Hidden Big Data
  11. Hadoop Architecture
  12. Apache Mahout
  13. Pig • Similar to SQL
  14. Apache S4
  15. Twitter Storm
  16. Runaway Complexity [Diagram: “Lambda Architecture”. Labels: All data, Precomputed batch view, New data stream, Precomputed realtime view, Query. Tools: Hadoop, Storm, ElephantDB, Voldemort, Cassandra, Riak, HBase, Kafka]
  17. What is SAMOA? • NEW Software framework for mining distributed data streams • Big Data mining for evolving streams in REAL-TIME
  18. Big Data Stream Mining: BIG DATA Streams • Sequence is potentially infinite • High amount of data, high speed of arrival • Change over time • Process elements from a data stream in only one pass • Approximation algorithms – Small error rate with high probability
  19. Big Data Stream Mining: Distributed BIG DATA • BIG DATA Analytics 2.0 – Apache S4 (Yahoo!, 2010) – Storm (Twitter, 2011) • Machine Learning tools: Distributed: Batch: Hadoop, Mahout; Stream: S4, Storm, SAMOA. Non-distributed: Batch: R, WEKA,…; Stream: MOA
  20. SAMOA Architecture • Use S4, Storm, or other distributed stream processing platform • Use MOA, or other streaming machine learning library • Easy to extend through PACKAGES [Diagram labels: SAMOA; S4, Storm, …; Classifier Methods, Clustering Methods, Frequent Pattern Mining]
  21. Thanks! http://samoa-project.net/ • G. De Francisci Morales. SAMOA: A Platform for Mining Big Data Streams. Keynote Talk at RAMSS ’13: 2nd International Workshop on Real-Time Analysis and Mining of Social Streams @WWW, Rio de Janeiro, 2013.
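
Slide 18 says that stream algorithms must process each element in only one pass and rely on approximation with a small error rate with high probability. As a minimal sketch of that idea (not taken from the talk, and not SAMOA or MOA code), the Python below implements a Count-Min sketch, one well-known algorithm of this kind for estimating item frequencies. The class name, the `epsilon` and `delta` parameters, and the demo stream are illustrative assumptions; the salted SHA-256 hashes are a convenient stand-in for the pairwise-independent hash family that the formal guarantee assumes.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate per-item counts over a stream in one pass and fixed memory."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Standard sizing: an estimate exceeds the true count by at most
        # epsilon * (items seen so far) with probability at least 1 - delta.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _index(self, row, item):
        # Assumption: a row-salted SHA-256 digest plays the role of the
        # row's hash function; it is deterministic across runs.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        # Each stream element is touched exactly once (one pass).
        self.total += 1
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += 1

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )


def demo_stream(n=10_000):
    # Hypothetical skewed stream: "storm" is half of all items,
    # the rest is spread over 250 low-frequency tags.
    for i in range(n):
        yield "storm" if i % 2 == 0 else f"tag{i % 500}"


if __name__ == "__main__":
    sketch = CountMinSketch(epsilon=0.01, delta=0.01)
    for item in demo_stream():
        sketch.add(item)
    print("items seen:", sketch.total)
    print("estimated count of 'storm':", sketch.estimate("storm"))
```

Memory use is fixed by `width * depth` counters regardless of how long the stream runs, which is what makes this style of algorithm suitable for potentially infinite sequences. The estimate can only overcount, and the overcount shrinks as `epsilon` is lowered at the cost of a wider table.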
