The big data dead valley dilemma and much more.
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

The big data dead valley dilemma and much more.

on

  • 1,035 views

 

Statistics

Views

Total Views
1,035
Views on SlideShare
1,035
Embed Views
0

Actions

Likes
0
Downloads
2
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

The big data dead valley dilemma and much more. Presentation Transcript

  • 1. The Big Data Dead Valley Dilemma and Much More francis@qmining.com Founder QMining @fraka6
  • 2. Unhidden Agenda ● Big Data Big Picture ● Big Data Dead Valley Dilemma ● Elastic Map Reduce (EMR) numbers ● Scaling Learning (MPI & hadoop)
  • 3. Big Data = Lot of Data (evidence) + CPU bounded (forgotten)
  • 4. Big Data = Lot of Data (evidence) - IO bounded (reality)
  • 5. IO bounded CPU <100%Data ● HD/Bus speed ● Network ● File server
  • 6. Big Data Scalability (ex: hadoop) = Cluster + Locality+ node failure (Data move close to CPU)
  • 7. The Big Data Dilemma
  • 8. Big Data Dead Valley TechnoMaturtity/ Risk Enterprise size SMB Enterprise Start-ups Techno Maturity Risk
  • 9. Big Data = SMALL MARKET (B2B vs B2C)
  • 10. Small Market......hum?
  • 11. WHY????? Maturity Data, Process, QA, infra, talent, $, Long term vision
  • 12. Data->Analytics ->BI-> Big-Data -> Data-Mining ->
  • 13. Data Access & Quality User data privacy, IT outsourcing protection, Data Quality
  • 14. Enterprise Slowness 1. Boston CXO Forum 24 October : Best Practice on Global Innovation (IBM, EMC, P&G, Intuit) Exploit vs Explore - M&A 2. Brad Feld (Managing Director at Foundry Group) Hierarchy vs network
  • 15. Big Data Dead Valley TechnoMaturtity/ Risk Enterprise Maturity SMB Enterprise Start-ups Techno Maturity Risk
  • 16. QMarketing example Leveraging hadoop ● map = hits to session ● reduce = sessions to ROI
  • 17. Online Marketing Management Channel % budget ROI ---------------------------------------------- PPC 50% ? Organic 20% ? Email Campaign 20% ? Social Media 10% ?
  • 18. ROI Dashboard
  • 19. All abstractions leak Abstract -> Procrastinate! http://www.aleax.it/pycon_abst.pdf (Alex Martelli : "Abstraction as a Leverage" )
  • 20. Minimize A Tower of Abstraction Simplify & lower the layer of abstraction Examples: ● Work on file not BD if possible ● HD direct connect on server ● Low level linux command lines (cut, grep, sed etc.) ● High level languages : python Abstraction = 20X benefits
  • 21. EMR vs AWS & S3 1.0 (no data locality optimization + network & ~IO bounded) EMR = 45 min AWS = 4 min
  • 22. EMR vs AWS & S3 2.0 EMR = 5+10 min* AWS = ~4 min *30 min prepro ;) EMR = 5+4 if (big files & compress files)
  • 23. Scaling Machine Learning ● Scaling Data-Preprocessing = Hadoop ● Small dataset = GPU ● Train with Big Dataset = ?? Communication Infrastructures = MPI & MapReduce (John Langford http://hunch.net/?p=2094)
  • 24. MPI allreduce
  • 25. Hadoop vs MPI MPI ● No fault tolerance by default ● Poor understanding of where data is (manual split on nodes + bad communication & prog complexity) ● Limit scale to ~100 nodes in practice (sharing unavoidable) ● Cluster shared -> slower nodes issues before disk/node failure MapReduce ● Setup and teardown costs are significant (interaction schedular & communicating the prog + large number of node) ● Worst: mapreduce wait for free nodes + many mapreduce iteration + reach high quality prediction ● Flaw: required refactoring code in map/reduce
  • 26. Hadoop-compatible AllReduce - Vowpall Rabbit (Hadoop + MPI) ● MPI = All reduce (all nodes same state) ● MapReduce = Conceptual Simplicity ● MPI: No need to refactor code ● MapReduce: Data Locality (Map only) ● MPI: Ability to use local storage (or RAM): temp file on local disk + allow to be cached in RAM by OS ● MapReduce: Automatic cleanup of local resources (tmp files) ● MPI: Fast Optimization approach remain within the conceptual scope: AllReduce = fct call ● MapReduce robustness (speculative execution to deal with slow nodes)
  • 27. Summary ● Big Data Big Picture ○ BigData : Cluster + IO bounded (Locality) ● Big Data Dead Valley Dilemma (MMID) ○ Small Market/Maturity/Data:access,quality/Slowness ● EMR (aws) = Slow ● Minimize Tower or abstraction ● Scaling MP: bottleneck = ML ○ MPI:no fault tolerance + where is the data? ○ Hadoop: slow setup & teardown + Require Refactoring ○ Hadoop compatible AllReduce
  • 28. Reference MPI & hadoop blog: http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html http://hunch.net/?p=2094 Video & slides presentaiton John Langford Learning From Lots Of Data (full) CONFÉRENCIER: John LANGFORD, Senior Research Scientist, Microsoft Research Slides: http://lisaweb.iro.umontrea... Implementation : vowpal_wabbit
  • 29. hum... Questions? francis@qmining.com