The big data dead valley dilemma and much more.


  1. The Big Data Dead Valley Dilemma and Much More. francis@qmining.com, Founder, QMining, @fraka6
  2. Unhidden Agenda ● Big Data big picture ● Big Data Dead Valley Dilemma ● Elastic MapReduce (EMR) numbers ● Scaling learning (MPI & Hadoop)
  3. Big Data = lots of data (the evidence) + CPU bound (what gets forgotten)
  4. Big Data = lots of data (the evidence) - IO bound (the reality)
  5. IO bound: CPU < 100%, waiting on data ● HD/bus speed ● network ● file server
  6. Big Data scalability (e.g. Hadoop) = cluster + locality + node-failure handling (keep the data close to the CPU that processes it)
  7. The Big Data Dilemma
  8. Big Data Dead Valley [chart: techno maturity/risk vs. enterprise size, with start-ups, SMBs and enterprises placed along the axis]
  9. Big Data = SMALL MARKET (B2B vs B2C)
  10. Small market... hum?
  11. Why? Maturity: data, process, QA, infrastructure, talent, $, long-term vision
  12. Data -> Analytics -> BI -> Big Data -> Data Mining -> ...
  13. Data access & quality: user data privacy, IT outsourcing protection, data quality
  14. Enterprise slowness: 1. Boston CXO Forum, October 24: Best Practices on Global Innovation (IBM, EMC, P&G, Intuit), exploit vs. explore, M&A; 2. Brad Feld (Managing Director at Foundry Group): hierarchy vs. network
  15. Big Data Dead Valley [chart: techno maturity/risk vs. enterprise maturity, with start-ups, SMBs and enterprises placed along the axis]
  16. QMarketing example, leveraging Hadoop ● map = hits to sessions ● reduce = sessions to ROI
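A minimal Hadoop Streaming sketch of what "map = hits to sessions, reduce = sessions to ROI" could look like in Python. The input layout (tab-separated hits: user_id, timestamp, channel, revenue), the 30-minute session gap, and the first-touch revenue attribution are illustrative assumptions, not QMining's actual pipeline.

```python
#!/usr/bin/env python
"""Sketch of a QMarketing-style job: the mapper keys hits by user, the reducer
builds sessions and sums revenue per entry channel. Input layout is assumed."""
import sys
import itertools
import operator

def mapper():
    # Emit hits keyed by user_id so every hit of a user reaches the same reducer.
    for line in sys.stdin:
        user_id, timestamp, channel, revenue = line.rstrip("\n").split("\t")
        print("\t".join([user_id, timestamp, channel, revenue]))

def reducer(session_gap=1800):
    # Group a user's hits into sessions (a gap > 30 min starts a new session)
    # and attribute each session's revenue to its entry channel.
    rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
    revenue_by_channel = {}
    for user_id, hits in itertools.groupby(rows, key=operator.itemgetter(0)):
        hits = sorted(hits, key=lambda h: float(h[1]))   # order hits by timestamp
        last_ts, entry_channel = None, None
        for _, timestamp, channel, revenue in hits:
            ts = float(timestamp)
            if last_ts is None or ts - last_ts > session_gap:
                entry_channel = channel                  # a new session begins here
            revenue_by_channel[entry_channel] = (
                revenue_by_channel.get(entry_channel, 0.0) + float(revenue))
            last_ts = ts
    for channel, total in sorted(revenue_by_channel.items()):
        print("%s\t%.2f" % (channel, total))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

It would run under Hadoop Streaming with -mapper "sessions.py map" and -reducer "sessions.py reduce"; pairing the resulting per-channel revenue with each channel's spend gives the ROI column of the next slide.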
  17. Online Marketing Management
      Channel          % budget   ROI
      --------------------------------
      PPC              50%        ?
      Organic          20%        ?
      Email campaign   20%        ?
      Social media     10%        ?
  18. ROI Dashboard
  19. All abstractions leak. Abstract -> procrastinate! http://www.aleax.it/pycon_abst.pdf (Alex Martelli: "Abstraction as Leverage")
  20. Minimize the tower of abstraction: simplify & lower the layers of abstraction. Examples: ● work on files, not a DB, if possible ● HD directly connected to the server ● low-level Linux command lines (cut, grep, sed, etc.) ● high-level languages: Python. Abstraction = 20x benefits
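As a concrete illustration of "work on files, not a DB": a few lines of Python (or an equivalent cut/sort pipeline) can aggregate a flat log directly, with no database, ORM, or cluster in the path. The file name and column layout below are assumptions carried over from the earlier sketch.

```python
# Staying low in the abstraction tower: aggregate a flat, tab-separated log
# directly instead of loading it into a database first. hits.tsv and its
# column order (user_id, timestamp, channel, revenue) are assumed.
from collections import defaultdict

def revenue_per_channel(path="hits.tsv"):
    totals = defaultdict(float)
    with open(path) as f:
        for line in f:                        # stream the file, nothing else in the way
            fields = line.rstrip("\n").split("\t")
            channel, revenue = fields[2], fields[3]
            totals[channel] += float(revenue)
    return dict(totals)

if __name__ == "__main__":
    for channel, total in sorted(revenue_per_channel().items()):
        print("%s\t%.2f" % (channel, total))
```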
  21. EMR vs AWS & S3, 1.0 (no data-locality optimization; network and ~IO bound): EMR = 45 min, AWS = 4 min
  22. EMR vs AWS & S3, 2.0: EMR = 5+10 min*, AWS = ~4 min (*after 30 min of preprocessing ;) ); EMR = 5+4 min with big, compressed files
  23. Scaling machine learning ● scaling data preprocessing = Hadoop ● small dataset = GPU ● training with a big dataset = ?? Communication infrastructures = MPI & MapReduce (John Langford, http://hunch.net/?p=2094)
  24. MPI AllReduce
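A minimal mpi4py sketch of the AllReduce pattern this slide refers to (the binding and the gradient-averaging use case are my assumptions; the slide itself only names the primitive): every node contributes its local vector and every node gets back the same aggregate.

```python
# AllReduce in a nutshell: after the call, every MPI rank holds the same sum,
# so averaging gradients across nodes is literally one function call.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def allreduce_average(local_vec):
    total = np.zeros_like(local_vec)
    comm.Allreduce(local_vec, total, op=MPI.SUM)   # sum contributions from all ranks
    return total / comm.Get_size()                 # identical averaged vector on every rank

if __name__ == "__main__":
    # Each rank pretends it computed a gradient on its own data shard.
    local_grad = np.full(5, float(comm.Get_rank()))
    avg = allreduce_average(local_grad)
    print("rank %d -> %s" % (comm.Get_rank(), avg))
```

Launched with mpirun -np 4 python allreduce_sketch.py, all four ranks print the same averaged vector.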
  25. Hadoop vs MPI
      MPI: ● no fault tolerance by default ● poor understanding of where the data is (manual split across the nodes, plus communication and programming complexity) ● scale limited to ~100 nodes in practice (sharing is unavoidable) ● on a shared cluster, slow nodes become a problem before disk/node failures do
      MapReduce: ● setup and teardown costs are significant (talking to the scheduler and shipping the program to a large number of nodes) ● worse: each MapReduce iteration waits for free nodes, and many iterations are needed to reach high-quality predictions ● flaw: requires refactoring the code into map/reduce
  26. Hadoop-compatible AllReduce: Vowpal Wabbit (Hadoop + MPI)
      ● MPI: AllReduce (all nodes end up in the same state) ● MapReduce: conceptual simplicity ● MPI: no need to refactor code ● MapReduce: data locality (map-only) ● MPI: ability to use local storage (or RAM): temp files on the local disk, which the OS can cache in RAM ● MapReduce: automatic cleanup of local resources (temp files) ● MPI: fast optimization approaches remain within the conceptual scope (AllReduce = a function call) ● MapReduce: robustness (speculative execution to deal with slow nodes)
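A conceptual sketch (not Vowpal Wabbit's actual code) of why this combination works: Hadoop launches a map-only job so each node gets a local shard (data locality, speculative execution and cleanup for free), while the learner inside each mapper runs an ordinary optimization loop and synchronizes with one AllReduce call per iteration, the "AllReduce = a function call" point above. allreduce_average is a stand-in for the primitive, e.g. the mpi4py version sketched earlier.

```python
# Conceptual sketch of a mapper-side training loop under Hadoop-compatible AllReduce:
# gradient descent on the local shard, with one AllReduce per iteration keeping every
# node's weight vector in the same state, and no map/reduce refactoring of the learner.
import numpy as np

def train_shard(X, y, allreduce_average, iters=100, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        pred = X.dot(w)
        local_grad = X.T.dot(pred - y) / len(y)    # gradient on this node's shard only
        w -= lr * allreduce_average(local_grad)    # sync: same averaged gradient everywhere
    return w                                       # identical weights on every node
```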
  27. Summary ● Big Data big picture ○ Big Data: cluster + IO bound (locality) ● Big Data Dead Valley Dilemma (MMID) ○ small market / maturity / data access & quality / slowness ● EMR (AWS) = slow ● minimize the tower of abstraction ● scaling ML: the bottleneck is the learning step ○ MPI: no fault tolerance + where is the data? ○ Hadoop: slow setup & teardown + requires refactoring ○ Hadoop-compatible AllReduce
  28. References
      MPI & Hadoop blog: http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html
      http://hunch.net/?p=2094
      Video & slides of the presentation: John Langford, "Learning From Lots Of Data" (Speaker: John Langford, Senior Research Scientist, Microsoft Research). Slides: http://lisaweb.iro.umontrea...
      Implementation: vowpal_wabbit
  29. hum... Questions? francis@qmining.com
