The big data dead valley dilemma and much more.
Published in: Technology, Education
Transcript

  • 1. The Big Data Dead Valley Dilemma and Much More francis@qmining.com Founder QMining @fraka6
  • 2. Unhidden Agenda ● Big Data Big Picture ● Big Data Dead Valley Dilemma ● Elastic Map Reduce (EMR) numbers ● Scaling Learning (MPI & hadoop)
  • 3. Big Data = Lots of Data (evident) + CPU bound (forgotten)
  • 4. Big Data = Lots of Data (evident) - IO bound (reality)
  • 5. IO bound (CPU < 100%) Data ● HD/bus speed ● Network ● File server
  • 6. Big Data Scalability (e.g., Hadoop) = Cluster + Locality + node failure (data moves close to the CPU)
  • 7. The Big Data Dilemma
  • 8. Big Data Dead Valley [Figure: Techno Maturity / Risk vs. enterprise size. Labels: SMB, Enterprise, Start-ups. Curves: Techno Maturity, Risk]
  • 9. Big Data = SMALL MARKET (B2B vs B2C)
  • 10. Small Market......hum?
  • 11. WHY? Maturity: data, process, QA, infra, talent, $, long-term vision
  • 12. Data -> Analytics -> BI -> Big Data -> Data Mining ->
  • 13. Data Access & Quality: user data privacy, IT outsourcing protection, data quality
  • 14. Enterprise Slowness
    1. Boston CXO Forum, 24 October: Best Practice on Global Innovation (IBM, EMC, P&G, Intuit); Exploit vs Explore - M&A
    2. Brad Feld (Managing Director at Foundry Group): hierarchy vs network
  • 15. Big Data Dead Valley [Figure: Techno Maturity / Risk vs. enterprise maturity. Labels: SMB, Enterprise, Start-ups. Curves: Techno Maturity, Risk]
  • 16. QMarketing example: leveraging Hadoop ● map = hits to sessions ● reduce = sessions to ROI
  • 17. Online Marketing Management
    Channel          % budget   ROI
    ----------------------------------
    PPC              50%        ?
    Organic          20%        ?
    Email Campaign   20%        ?
    Social Media     10%        ?
  • 18. ROI Dashboard
  • 19. All abstractions leak Abstract -> Procrastinate! http://www.aleax.it/pycon_abst.pdf (Alex Martelli : "Abstraction as a Leverage" )
  • 20. Minimize the Tower of Abstraction: simplify & lower the layers of abstraction
    Examples:
    ● Work on files, not a DB, if possible
    ● HD directly connected to the server
    ● Low-level Linux command lines (cut, grep, sed, etc.)
    ● High-level languages: Python
    Abstraction = 20X benefits
  • 21. EMR vs AWS & S3 1.0 (no data locality optimization + network & ~IO bound) ● EMR = 45 min ● AWS = 4 min
  • 22. EMR vs AWS & S3 2.0 ● EMR = 5+10 min* ● AWS = ~4 min (*30 min preprocessing ;) ) ● EMR = 5+4 min if big files & compressed files
  • 23. Scaling Machine Learning ● Scaling Data-Preprocessing = Hadoop ● Small dataset = GPU ● Train with Big Dataset = ?? Communication Infrastructures = MPI & MapReduce (John Langford http://hunch.net/?p=2094)
  • 24. MPI allreduce
  • 25. Hadoop vs MPI
    MPI
    ● No fault tolerance by default
    ● Poor understanding of where the data is (manual split across nodes + bad communication & programming complexity)
    ● Scale limited to ~100 nodes in practice (sharing unavoidable)
    ● Shared cluster -> slow-node issues show up before disk/node failures
    MapReduce
    ● Setup and teardown costs are significant (interaction with the scheduler & communicating the program to a large number of nodes)
    ● Worst: MapReduce waits for free nodes + many MapReduce iterations are needed to reach a high-quality prediction
    ● Flaw: requires refactoring code into map/reduce
  • 26. Hadoop-compatible AllReduce - Vowpal Wabbit (Hadoop + MPI)
    ● MPI: AllReduce (all nodes end in the same state)
    ● MapReduce: conceptual simplicity
    ● MPI: no need to refactor code
    ● MapReduce: data locality (map only)
    ● MPI: ability to use local storage (or RAM): temp files on local disk, which the OS can cache in RAM
    ● MapReduce: automatic cleanup of local resources (tmp files)
    ● MPI: fast optimization approaches remain within the conceptual scope: AllReduce = function call
    ● MapReduce: robustness (speculative execution to deal with slow nodes)
  • 27. Summary
    ● Big Data Big Picture
      ○ Big Data: Cluster + IO bound (Locality)
    ● Big Data Dead Valley Dilemma (MMID)
      ○ Small Market / Maturity / Data: access, quality / Slowness
    ● EMR (AWS) = slow
    ● Minimize the Tower of Abstraction
    ● Scaling MP: bottleneck = ML
      ○ MPI: no fault tolerance + where is the data?
      ○ Hadoop: slow setup & teardown + requires refactoring
      ○ Hadoop-compatible AllReduce
  • 28. References
    MPI & Hadoop blogs: http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html http://hunch.net/?p=2094
    Video & slides of the presentation: John Langford, "Learning From Lots Of Data" (full). Speaker: John LANGFORD, Senior Research Scientist, Microsoft Research
    Slides: http://lisaweb.iro.umontrea...
    Implementation: vowpal_wabbit
  • 29. hum... Questions? francis@qmining.com
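
Slide 16 summarizes the QMarketing job as "map = hits to session, reduce = sessions to ROI". The slides do not show the code, so the following is only a minimal single-process sketch of that idea in Python. The hit record fields (`visitor`, `ts`, `channel`, `revenue`), the 30-minute session timeout, the rule that a session belongs to the channel of its first hit, and the ROI formula (revenue - cost) / cost are all assumptions made for illustration, not details taken from the talk.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

SESSION_GAP = 30 * 60  # assumed inactivity timeout between sessions, in seconds


def map_hits_to_sessions(hits):
    """Map step: turn raw hits into (channel, session_revenue) pairs."""
    ordered = sorted(hits, key=itemgetter("visitor", "ts"))
    for _, visitor_hits in groupby(ordered, key=itemgetter("visitor")):
        session = None
        for hit in visitor_hits:
            # Start a new session on the first hit or after a long pause.
            if session is None or hit["ts"] - session["last_ts"] > SESSION_GAP:
                if session is not None:
                    yield session["channel"], session["revenue"]
                # Assumption: the first hit's channel owns the whole session.
                session = {"channel": hit["channel"], "revenue": 0.0}
            session["revenue"] += hit["revenue"]
            session["last_ts"] = hit["ts"]
        if session is not None:
            yield session["channel"], session["revenue"]


def reduce_sessions_to_roi(sessions, cost_by_channel):
    """Reduce step: sum session revenue per channel, then compute ROI."""
    revenue = defaultdict(float)
    for channel, amount in sessions:
        revenue[channel] += amount
    return {
        channel: (revenue[channel] - cost) / cost
        for channel, cost in cost_by_channel.items()
    }


if __name__ == "__main__":
    hits = [
        {"visitor": "a", "ts": 0, "channel": "PPC", "revenue": 0.0},
        {"visitor": "a", "ts": 600, "channel": "PPC", "revenue": 30.0},
        {"visitor": "a", "ts": 10000, "channel": "Email", "revenue": 10.0},
        {"visitor": "b", "ts": 50, "channel": "Organic", "revenue": 5.0},
    ]
    costs = {"PPC": 20.0, "Email": 10.0, "Organic": 5.0}
    print(reduce_sessions_to_roi(map_hits_to_sessions(hits), costs))
```

On a real cluster the grouping by visitor would be spread across nodes rather than done with an in-memory sort; the sketch only shows the shape of the two steps that would fill the "?" column of the ROI table on slide 17.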

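
Slides 24 and 26 describe AllReduce as a single function call after which all nodes hold the same state. As a rough illustration of that contract, here is a pure-Python simulation in which each "node" is just a list entry and the values are summed up a binary tree and then broadcast back down. The function name `allreduce_sum` and the tree layout (parent of node i is (i - 1) // 2) are choices made for this sketch; it does not use MPI or reproduce any particular library's implementation.

```python
def allreduce_sum(local_vectors):
    """Simulate AllReduce: every node ends with the element-wise sum."""
    n = len(local_vectors)
    state = [list(v) for v in local_vectors]  # copy so inputs stay untouched

    # Reduce phase: visit nodes from the leaves up; each node adds its
    # accumulated vector into its parent, so node 0 ends with the total.
    for node in range(n - 1, 0, -1):
        parent = (node - 1) // 2
        state[parent] = [a + b for a, b in zip(state[parent], state[node])]

    # Broadcast phase: visit nodes from the root down; each node copies
    # the final total from its parent.
    for node in range(1, n):
        parent = (node - 1) // 2
        state[node] = list(state[parent])

    return state


if __name__ == "__main__":
    # Three nodes, each holding a local partial result (e.g. a gradient).
    print(allreduce_sum([[1, 2], [3, 4], [5, 6]]))
```

From the caller's side the whole exchange is one call, which is why the slides note that existing learning code does not need to be refactored into map and reduce steps to use it.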