Dataiku, Big Data Paris - The Rise of the Hadoop Ecosystem

Snapshot of the Hadoop ecosystem at the beginning of 2014, covering the rise of real-time and in-memory distributed processing frameworks that complement and supplant the MapReduce paradigm.

Published in: Technology, Business

  • EVERYTHING IS ABOUT THE PRICE / PERFORMANCE RATIO OF MEMORY, CPU AND DISK

    1. The Rise of the Hadoop Ecosystem
    2. Florian Douetteau, CEO, Dataiku • DATA PREPARATION • MODELING • STATISTICS • VISUALIZATION • ALL-IN-ONE DATA SCIENCE STUDIO
    3. DRIVERS FOR THE NEW “REAL-TIME” HADOOP ECOSYSTEM • KEY TOOLS AND FRAMEWORKS TO BE AWARE OF
    4. RAM - CPU - DISK
    5. 2000 vs 2013 • Memory: $1,000 / GB, then $6 / GB (memory cost divided by 150) • Disk: $10 / GB, then $0.06 / GB (disk cost divided by 250) • MAP REDUCE times • HACK REDUCE times
    6. WHOLE DATA • REFINED DATA
    7. NEEDLE IN A HAYSTACK?
    8. REFINE BEFORE USE
    9. Web Site • $1B revenue per year • 10 million unique visitors per month • 100 million orders / actions per day • 10TB RAW DATA • 1TB REFINED DATA
    10. FITS IN MEMORY: 1TB
    11. GOOGLE • 1st Circle: OPEN SOURCE - YAHOO - IBM - LINKEDIN - FACEBOOK • 2nd Circle: STANFORD - BERKELEY - STARTUPS
    12. $1B per year invested in Big Data TECH • Years shown: 2009, 2010, 2011, 2012, 2013 • Amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$, 223m$, 301m$
    13. HADOOP = HDFS + MAP REDUCE • 1. Safe large storage (HDFS) 2. Distributed computation paradigm (Map Reduce) 3. Resilient long jobs 4. Disk-CPU locality-aware resource allocation
    14. LOVELY TANGLED TOGETHER
    15. THE NEW ECOSYSTEM • HDFS • YARN • map reduce • provider 1 • Other cluster provider • …
    16. REALLY FASTER?
    17. REAL-TIME QUERIES • REAL-TIME UPDATES • FAST MACHINE LEARNING
    18. REAL-TIME QUERIES • REAL-TIME UPDATES • FAST MACHINE LEARNING
    19. DEVELOPERS CAN WAIT
    20. BUSINESS WON’T WAIT
    21. Not all queries are born equal
    22. IMPALA: MPP-database-like performance for Hadoop - Created in 2012 by Cloudera - 100x performance over Hive (for certain queries)
    23. Apache DRILL: Extensible architecture for SQL querying • Started in 2013 • Apache Incubator project • Lucidworks • MapR • ElasticSearch • … • Alpha status • Open architecture for supporting SQL-like queries to various data sources: • Cassandra • MongoDB • HDFS • HBase
    24. REAL-TIME QUERIES • REAL-TIME UPDATES • FAST MACHINE LEARNING
    25. Update the model once per week using the whole history • Apply the model for each user using the very latest events • Real-Time Navigation • Real-Time Recommendation
    26. STORM: Reliable Distributed Real-Time Computations - Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.) - Runs computations in Java (native) or Python, Ruby, Perl … - Guarantees that events are processed - Distributes workload
    27. SUMMINGBIRD: Write Map-Reduce-like programs and execute them in • Batch • Real-Time • Hybrid Batch / Real-Time • Open-sourced by Twitter in 2013 • Built on top of Storm (and Cascading) • Programs written in Scala
    28. REAL-TIME QUERIES • REAL-TIME UPDATES • FAST MACHINE LEARNING
    29. GOOD PUPILS ITERATE
    30. … • Stochastic Gradient Descent: ITERATE • K-Means: ITERATE • PageRank: ITERATE • …
    31. GRAPHLAB: “Graph” Analytics in Memory • Created at Carnegie Mellon in 2009 • Generic graph traversal framework • Packaged machine learning - Recommender Systems - Graph Analytics - Clustering • Easy Python integration
    32. H2O: In-Memory Distributed Prediction Engine • Machine Learning - Classification - Regression - Clustering - Easy R/Python integration
    33. SPARK: Real-Time Resilient Distributed Memory Framework • Abstraction with any DAG operation on data: - Filter - Map - Reduce - Cache
    34. SPARK • SHARK: Real-Time Queries • STREAMING: Real-Time Updates • MLBASE: In-Memory Learning
    35. HDFS • YARN • map reduce • SPARK • GRAPHLAB • H2O • STREAMING • MLBASE • SHARK • PIG • HIVE • CASCADING • STORM • DRILL • other storage • IMPALA
    36. dataiku.com • DATAIKU STAND A4 • DEMO DATA SCIENCE STUDIO • Questions now or later • florian.douetteau@dataiku.com
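
The slides name Map Reduce as Hadoop's distributed computation paradigm without showing what a job looks like. As a rough illustration only, here is a single-process Python sketch of the three phases applied to a word count. It is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count`, and the sample input, are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all the values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "spark caches data"]))
```

In a real cluster the map and reduce calls would run on many machines, with intermediate results written to disk between phases; this sketch keeps everything in one process to show only the shape of the computation.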
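
The "GOOD PUPILS ITERATE" slides list PageRank among the algorithms that loop over the same data many times, which is the case where keeping the data in memory pays off. A minimal plain-Python sketch of that iteration follows. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the toy graph are all assumptions for illustration, and the sketch assumes every link target also appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank.

    `links` maps each page to the list of pages it links to.
    Every pass re-reads the whole graph, which is why repeated
    disk reads are costly for this kind of algorithm.
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each pass with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks

    return ranks


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```

Stochastic gradient descent and K-Means follow the same pattern: a loop whose every pass touches the full (refined) dataset.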
