
State of the BDAS Union


by Ion Stoica

Published in: Data & Analytics

  1. State of the BDAS Union. Ion Stoica, November 19th, 2015, UC Berkeley.
  2. We Came a Long Way. August 2012: AMP Camp 1. Since then we have trained tens of thousands of people through AMP Camps, Spark Summits, and MOOCs. Today: AMP Camp 6, with 210+ people.
  3. AMPLab: Public/Private Partnership (2011-2017). Goal: the next generation of open source data analytics stack for industry and academia, the Berkeley Data Analytics Stack (BDAS).
  4. BDAS Stack (diagram). Processing layer: Spark Core, Spark Streaming, SparkSQL, GraphX, MLlib, KeystoneML, BlinkDB, SampleClean, SparkR, Velox. Storage layer: Tachyon, Succinct; HDFS, S3, Ceph, … Resource management layer: Mesos, Hadoop YARN. Legend: BDAS stack, 3rd party.
  5. BDAS Stack (same diagram as slide 4), annotated "AMP Camp 6".
  6. BDAS Stack (same diagram as slide 4), annotated "AMP Camp 6".
  7. Industry Impact Accelerating. Thousands of companies are using BDAS components. Three startups stand behind the main BDAS components: Mesos, Spark, Tachyon.
  8. Spark. Unifies batch, interactive, and streaming computations. Easy to build sophisticated applications: supports iterative, graph-parallel algorithms; powerful APIs in Scala, Python, Java, R. Components: Spark Core, Spark Streaming, SparkSQL, MLlib, GraphX, SparkR.
  9. Meetup Groups: January 2015 (figure; source: meetup.com)
  10. Meetup Groups: October 2015 (figure; source: meetup.com)
  11. Community Growth, 2014 vs. 2015 (bar charts). Summit attendees: 1100, 3900. Meetup members: 12K, 42K. Developers contributing: 350, 600.
  12. Massive Open Online Courses (MOOCs). “Intro to Big Data with Apache Spark”: Anthony Joseph, UC Berkeley; June 1st, 5 weeks; 78,000+ registrations, 12% finishing (2x average). “Scalable Machine Learning with Apache Spark”: Ameet Talwalkar, UCLA; June 22nd, 5 weeks; 55,000+ registrations, 15% finishing (2.5x average).
  13. Large-Scale Usage. Largest cluster: 8000 nodes. Largest single job: 1 petabyte. Top streaming intake: 1 TB/hour. 2014 on-disk sort record.
  14. Spark Ecosystem: Distributions, Applications.
  15. Databricks Survey: Spark Summit SF ’15. 1400 respondents from 840 companies. Three trends: 1) diverse applications, 2) more runtime environments, 3) more types of users.
  16. Top Applications: Fraud Detection / Security, User-Facing Services, Log Processing, Recommendation, Data Warehousing, Business Intelligence.
  17. Spark Components Used: MLlib + GraphX, Spark Streaming, DataFrames, Spark SQL. 75% of users use more than one component.
  18. Diverse Storages. Hadoop: combined compute + storage (MapReduce, HDFS). Spark: independent of the storage layer (HDFS; SQL, e.g. Oracle; NoSQL, e.g. Cassandra).
  19. Diverse Storages (bar charts, each split into "use a little" and "use a lot"). 2014: Hadoop HDFS; values shown: 61%, 31%. 2015: Hadoop HDFS, NoSQL, proprietary SQL; values shown: 46%, 34%, 43%, 36%, 37%, 21%.
  20. Diverse Runtime Environments. How respondents are running Spark: 51% on a public cloud. Most common Spark deployment environments (cluster managers): standalone mode 48%, YARN 40%, Mesos 11%.
  21. Diversity of Users (charts: Languages Used, 2014 and 2015). Values shown: 84%, 38%, 38%, 71%, 31%, 58%, 18%.
  22. Fastest Growing User Segments: +280% increase in Windows users; +56% production use of Streaming; +380% production use of SQL.
  23. What Next? Ease of use: DataFrames and Datasets. Performance: Tungsten. Integration: rich, powerful libraries; data sources. (Diagram: SQL, Streaming, ML, Graph, …)
  24. Tachyon (shown on the BDAS stack diagram, storage layer). Non-persistent storage engine (in-memory, SSDs): supports a variety of APIs; supports a variety of underlying file systems. Enables innovation in storage: no need to change existing persistent storage systems.
  25. Succinct (shown on the BDAS stack diagram, storage layer). Queries on compressed data: arbitrary substring searches; gzip level of compression. Numerous applications: regex support; graph query engine:
  26. KeystoneML (shown on the BDAS stack diagram, processing layer). Simplifies building ML pipelines. Rich set of operators. Type-safe interface.
  27. Velox (shown on the BDAS stack diagram, processing layer). Serving layer. Online management and maintenance of models. Supports a variety of predictive models.
  28. Today. Learn about the latest developments in BDAS: Spark, Tachyon, Succinct, KeystoneML. Applications & tools for BDAS: ADAM, a framework for fast genomic processing; Plank, which predicts the optimal number and type of nodes to run parallel apps; Splash, an easy-to-use API for stochastic ML.
  29. Summary. Adoption is accelerating, e.g., Spark increased 2-4x YoY on all adoption metrics. Large-scale production deployments. Deployed by major enterprises. Impact well beyond our expectations.
  30. Thanks!
  31. AMPLab Still Driving Many Projects (BDAS stack diagram; legend: BDAS stack, 3rd party).
  32. AMPLab Still Driving Many Projects (BDAS stack diagram; legend: BDAS stack, components driven by AMPLab).
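
The MOOC figures on slide 12 can be turned into rough head counts with a little arithmetic. The sketch below is not part of the talk; the function names are hypothetical, and the registration counts are treated as lower bounds because the slide gives them as "78,000+" and "55,000+".

```python
# Back-of-the-envelope check of the MOOC figures quoted on slide 12.
# Assumption: registrations are lower bounds ("78,000+", "55,000+"),
# so the finisher counts computed here are lower bounds as well.


def min_finishers(registrations: int, finish_pct: int) -> int:
    """Lower bound on the number of people who finished the course."""
    return registrations * finish_pct // 100


def implied_average_pct(finish_pct: float, multiple_of_average: float) -> float:
    """Average completion rate implied by 'N% finishing (Kx average)'."""
    return finish_pct / multiple_of_average


# (title, registrations, finishing %, multiple of average) as given on the slide
courses = [
    ("Intro to Big Data with Apache Spark", 78_000, 12, 2.0),
    ("Scalable Machine Learning with Apache Spark", 55_000, 15, 2.5),
]

for title, registrations, pct, multiple in courses:
    finishers = min_finishers(registrations, pct)
    baseline = implied_average_pct(pct, multiple)
    print(f"{title}: at least {finishers} finishers, "
          f"implied average completion {baseline}%")
```

Under these assumptions the two courses had at least 9,360 and 8,250 finishers respectively, and both "2x" and "2.5x" imply the same average completion rate of 6%.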
