The Big Data Dead Valley Dilemma
and Much More
francis@qmining.com
Founder QMining
@fraka6
Unhidden Agenda
● Big Data Big Picture
● Big Data Dead Valley Dilemma
● Elastic Map Reduce (EMR) numbers
● Scaling Learnin...
Big Data
=
Lots of Data
(evidence)
+
CPU bounded
(forgotten)
Big Data
=
Lots of Data
(evidence)
-
IO bounded
(reality)
IO bounded
CPU < 100%
Data
● HD/Bus speed
● Network
● File server
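A minimal sketch of the "IO bounded, CPU < 100%" point, assuming a single pass over the data in which reads and compute overlap perfectly. The function names and all throughput numbers are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from the talk.

```python
def bottleneck(data_bytes, io_bytes_per_s, cpu_bytes_per_s):
    """Return ('io' or 'cpu', io_seconds, cpu_seconds) for one pass over the data."""
    io_seconds = data_bytes / io_bytes_per_s      # time to read the data
    cpu_seconds = data_bytes / cpu_bytes_per_s    # time to process the data
    kind = "io" if io_seconds > cpu_seconds else "cpu"
    return kind, io_seconds, cpu_seconds


def cpu_utilization(io_seconds, cpu_seconds):
    """Upper bound on CPU utilization when reads and compute fully overlap."""
    return min(1.0, cpu_seconds / io_seconds)


# Hypothetical job: 1 TB of data, 100 MB/s from disk or network,
# and a CPU that could process 1 GB/s if the data were already in memory.
kind, io_s, cpu_s = bottleneck(1e12, 1e8, 1e9)
utilization = cpu_utilization(io_s, cpu_s)
print(kind, io_s, cpu_s, utilization)
```

With these assumed numbers the CPU waits on the disk, network, or file server most of the time, which is the situation the slide describes.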
Big Data Scalability
(ex: hadoop)
=
Cluster
+
Locality + node failure
(Data moves close to CPU)
The Big Data Dilemma
Big Data Dead Valley
Techno Maturity/
Risk
Enterprise size
SMB
Enterprise
Start-ups
Techno Maturity
Risk
Big Data
=
SMALL
MARKET
(B2B vs B2C)
Small Market......hum?
WHY?????
Maturity
Data, Process, QA, infra, talent, $, Long term vision
Data->Analytics ->BI-> Big-Data -> Data-Mining ->
Data Access & Quality
User data privacy, IT outsourcing protection, Data Quality
Enterprise Slowness
1. Boston CXO Forum 24 October : Best Practice on Global
Innovation (IBM, EMC, P&G, Intuit)
Exploit vs...
Big Data Dead Valley
Techno Maturity/
Risk
Enterprise Maturity
SMB
Enterprise
Start-ups
Techno Maturity
Risk
QMarketing example
Leveraging hadoop
● map = hits to session
● reduce = sessions to ROI
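The two steps above can be sketched in plain Python, without Hadoop, to show the shape of the computation. Everything specific here is an assumption made for illustration: the hit fields (`visitor`, `channel`, `ts`, `revenue`), the 30 minute inactivity timeout, crediting a session to the channel of its first hit, and ROI defined as (revenue - spend) / spend. None of these come from the QMarketing implementation.

```python
from collections import defaultdict

SESSION_GAP_S = 30 * 60  # assumed inactivity timeout between sessions


def map_hits_to_sessions(hits):
    """Map step: group raw hits per visitor and cut them into sessions."""
    by_visitor = defaultdict(list)
    for hit in hits:
        by_visitor[hit["visitor"]].append(hit)

    sessions = []
    for visitor_hits in by_visitor.values():
        visitor_hits.sort(key=lambda h: h["ts"])
        current = None
        for hit in visitor_hits:
            # Start a new session on the first hit or after a long pause.
            if current is None or hit["ts"] - current["last_ts"] > SESSION_GAP_S:
                current = {"channel": hit["channel"], "last_ts": hit["ts"], "revenue": 0.0}
                sessions.append(current)
            current["last_ts"] = hit["ts"]
            current["revenue"] += hit.get("revenue", 0.0)
    return sessions


def reduce_sessions_to_roi(sessions, spend_by_channel):
    """Reduce step: sum session revenue per channel and compare it to spend."""
    revenue = defaultdict(float)
    for session in sessions:
        revenue[session["channel"]] += session["revenue"]
    return {
        channel: (revenue[channel] - spend) / spend
        for channel, spend in spend_by_channel.items()
        if spend > 0
    }


# Made-up hits: timestamps in seconds, revenue in arbitrary currency units.
hits = [
    {"visitor": "a", "channel": "ppc", "ts": 0, "revenue": 0.0},
    {"visitor": "a", "channel": "ppc", "ts": 60, "revenue": 150.0},
    {"visitor": "a", "channel": "organic", "ts": 10000, "revenue": 30.0},
    {"visitor": "b", "channel": "organic", "ts": 5, "revenue": 0.0},
]
sessions = map_hits_to_sessions(hits)
roi = reduce_sessions_to_roi(sessions, {"ppc": 100.0, "organic": 20.0})
print(len(sessions), roi)
```

On a cluster, the grouping by visitor would be done by the framework's shuffle rather than by an in-memory dictionary; the sketch only shows what each step consumes and produces.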
Online Marketing
Management
Channel % budget ROI
----------------------------------------------
PPC 50% ?
Organic 20% ?
Em...
ROI Dashboard
All abstractions leak
Abstract -> Procrastinate!
http://www.aleax.it/pycon_abst.pdf (Alex Martelli : "Abstraction as a Lev...
Minimize A Tower of Abstraction
Simplify & lower the layer of abstraction
Examples:
● Work on files, not a DB, if possible
● HD...
EMR vs AWS & S3 1.0
(no data locality optimization + network &
~IO bounded)
EMR = 45 min
AWS = 4 min
EMR vs AWS & S3 2.0
EMR = 5+10 min*
AWS = ~4 min
*30 min prepro ;)
EMR = 5+4 if (big files & compress files)
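The timings on the two EMR slides can be put side by side as simple ratios. This sketch assumes that "5+10" and "5+4" are two phases whose minutes add up (the slides do not say what each term is) and it leaves out the 30 minutes of preprocessing mentioned in the footnote.

```python
AWS_MIN = 4  # "AWS = 4 min" (the 2.0 slide says ~4 min)

# EMR wall-clock minutes as quoted on the slides; the sums are an assumption.
emr_minutes = {
    "1.0": 45,
    "2.0": 5 + 10,
    "2.0, big compressed files": 5 + 4,
}

# How many times slower each EMR run is than the plain AWS run.
slowdown = {name: minutes / AWS_MIN for name, minutes in emr_minutes.items()}

for name, factor in slowdown.items():
    print(f"EMR {name}: {emr_minutes[name]} min, {factor:.2f}x the AWS run")
```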
Scaling Machine Learning
● Scaling Data-Preprocessing = Hadoop
● Small dataset = GPU
● Train with Big Dataset = ?? Communi...
MPI allreduce
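As an illustration of what allreduce means, here is a single-process simulation in plain Python: each "node" holds a local vector (for example a partial gradient), and after the call every node holds the same element-wise sum. The function name `allreduce_sum` is hypothetical, and no MPI library is involved; a real cluster would do the same thing over the network with the MPI standard's `MPI_Allreduce`.

```python
def allreduce_sum(node_vectors):
    """Simulated allreduce: every node ends up with the element-wise sum."""
    # Reduce: combine the local vectors of all nodes into one total.
    total = [sum(values) for values in zip(*node_vectors)]
    # Broadcast: hand every node its own copy of the total.
    return [list(total) for _ in node_vectors]


# Three simulated nodes, each with a local two-element vector.
local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
after = allreduce_sum(local)
print(after)
```

The point of the operation for learning is that all nodes leave the call in the same state, so each can continue the next optimization step without a separate collect-and-redistribute job.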
Hadoop vs MPI
MPI
● No fault tolerance by default
● Poor understanding of where data is (manual split on nodes + bad
commu...
Hadoop-compatible AllReduce -
Vowpal Wabbit (Hadoop + MPI)
● MPI = All reduce (all nodes same state)
● MapReduce = Concep...
Summary
● Big Data Big Picture
○ BigData : Cluster + IO bounded (Locality)
● Big Data Dead Valley Dilemma (MMID)
○ Small M...
Reference MPI & hadoop
blog:
http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html
http://hunch.net/?p=2094
Video & slides...
hum...
Questions?
francis@qmining.com
Published in: Technology, Education
The big data dead valley dilemma and much more.

  1. The Big Data Dead Valley Dilemma and Much More francis@qmining.com Founder QMining @fraka6
  2. Unhidden Agenda ● Big Data Big Picture ● Big Data Dead Valley Dilemma ● Elastic Map Reduce (EMR) numbers ● Scaling Learning (MPI & hadoop)
  3. Big Data = Lots of Data (evidence) + CPU bounded (forgotten)
  4. Big Data = Lots of Data (evidence) - IO bounded (reality)
  5. IO bounded CPU < 100% Data ● HD/Bus speed ● Network ● File server
  6. Big Data Scalability (ex: hadoop) = Cluster + Locality + node failure (Data moves close to CPU)
  7. The Big Data Dilemma
  8. Big Data Dead Valley Techno Maturity / Risk Enterprise size SMB Enterprise Start-ups Techno Maturity Risk
  9. Big Data = SMALL MARKET (B2B vs B2C)
  10. Small Market......hum?
  11. WHY????? Maturity: Data, Process, QA, infra, talent, $, Long term vision
  12. Data -> Analytics -> BI -> Big-Data -> Data-Mining ->
  13. Data Access & Quality: User data privacy, IT outsourcing protection, Data Quality
  14. Enterprise Slowness
      1. Boston CXO Forum, 24 October: Best Practice on Global Innovation (IBM, EMC, P&G, Intuit) Exploit vs Explore - M&A
      2. Brad Feld (Managing Director at Foundry Group) Hierarchy vs network
  15. Big Data Dead Valley Techno Maturity / Risk Enterprise Maturity SMB Enterprise Start-ups Techno Maturity Risk
  16. QMarketing example: Leveraging hadoop ● map = hits to session ● reduce = sessions to ROI
  17. Online Marketing Management
      Channel          % budget   ROI
      --------------------------------
      PPC              50%        ?
      Organic          20%        ?
      Email Campaign   20%        ?
      Social Media     10%        ?
  18. ROI Dashboard
  19. All abstractions leak. Abstract -> Procrastinate! http://www.aleax.it/pycon_abst.pdf (Alex Martelli: "Abstraction as a Leverage")
  20. Minimize A Tower of Abstraction: simplify & lower the layer of abstraction. Examples:
      ● Work on files, not a DB, if possible
      ● HD directly connected to the server
      ● Low level linux command lines (cut, grep, sed etc.)
      ● High level languages: python
      Abstraction = 20X benefits
  21. EMR vs AWS & S3 1.0 (no data locality optimization + network & ~IO bounded)
      EMR = 45 min
      AWS = 4 min
  22. EMR vs AWS & S3 2.0
      EMR = 5+10 min*
      AWS = ~4 min
      *30 min prepro ;)
      EMR = 5+4 if (big files & compressed files)
  23. Scaling Machine Learning ● Scaling Data-Preprocessing = Hadoop ● Small dataset = GPU ● Train with Big Dataset = ?? Communication Infrastructures = MPI & MapReduce (John Langford http://hunch.net/?p=2094)
  24. MPI allreduce
  25. Hadoop vs MPI
      MPI
      ● No fault tolerance by default
      ● Poor understanding of where data is (manual split on nodes + bad communication & prog complexity)
      ● Limit scale to ~100 nodes in practice (sharing unavoidable)
      ● Cluster shared -> slower nodes issues before disk/node failure
      MapReduce
      ● Setup and teardown costs are significant (interaction with the scheduler & communicating the prog + large number of nodes)
      ● Worst: mapreduce waits for free nodes + many mapreduce iterations + reach high quality prediction
      ● Flaw: requires refactoring code into map/reduce
  26. Hadoop-compatible AllReduce - Vowpal Wabbit (Hadoop + MPI)
      ● MPI = All reduce (all nodes same state)
      ● MapReduce = Conceptual Simplicity
      ● MPI: No need to refactor code
      ● MapReduce: Data Locality (Map only)
      ● MPI: Ability to use local storage (or RAM): temp file on local disk + allowed to be cached in RAM by the OS
      ● MapReduce: Automatic cleanup of local resources (tmp files)
      ● MPI: Fast optimization approaches remain within the conceptual scope: AllReduce = fct call
      ● MapReduce: robustness (speculative execution to deal with slow nodes)
  27. Summary
      ● Big Data Big Picture
        ○ BigData : Cluster + IO bounded (Locality)
      ● Big Data Dead Valley Dilemma (MMID)
        ○ Small Market/Maturity/Data:access,quality/Slowness
      ● EMR (aws) = Slow
      ● Minimize Tower of abstraction
      ● Scaling MP: bottleneck = ML
        ○ MPI: no fault tolerance + where is the data?
        ○ Hadoop: slow setup & teardown + Requires Refactoring
        ○ Hadoop compatible AllReduce
  28. Reference MPI & hadoop
      blog: http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html http://hunch.net/?p=2094
      Video & slides presentation: John Langford, Learning From Lots Of Data (full). Speaker: John LANGFORD, Senior Research Scientist, Microsoft Research
      Slides: http://lisaweb.iro.umontrea...
      Implementation: vowpal_wabbit
  29. hum... Questions? francis@qmining.com
