State of the BDAS Union
Ion Stoica
November 19th, 2015
UC Berkeley

  1. State of the BDAS Union. Ion Stoica. November 19th, 2015. UC Berkeley.
  2. We Came a Long Way. August 2012: AMP Camp 1. Since then we have trained 10,000s of people! • AMP Camps, Spark Summits, MOOCs. Today: AMP Camp 6 • 210+ people.
  3. AMPLab: Public/Private Partnership (2011-2017). Goal: next generation of open source data analytics stack for industry & academia: the Berkeley Data Analytics Stack (BDAS).
  4. BDAS Stack (diagram). Processing Layer: Spark Core, Spark Streaming, SparkSQL, GraphX, MLlib, KeystoneML, BlinkDB, SampleClean, SparkR, Velox. Storage Layer: Tachyon, Succinct; HDFS, S3, Ceph, … Resource Management Layer: Mesos; Hadoop YARN. Legend: BDAS Stack, 3rd party.
  5. BDAS Stack (diagram). Processing: Spark Core, Spark Streaming, SparkSQL, GraphX, MLlib, KeystoneML, BlinkDB, SampleClean, SparkR, Velox. Storage Layer: Tachyon, Succinct; HDFS, S3, Ceph, … Resource Management Layer: Mesos; Hadoop YARN. Legend: BDAS Stack, 3rd party. Label: AMP Camp 6.
  6. BDAS Stack (diagram, same content as slide 5). Label: AMP Camp 6.
  7. Industry Impact Accelerating. Thousands of companies using BDAS components. Three startups behind BDAS main components: Mesos, Spark, Tachyon.
  8. Spark. Unifies batch, interactive, streaming computations. Easy to build sophisticated applications • Support iterative, graph-parallel algorithms • Powerful APIs in Scala, Python, Java, R. Components: Spark Core, Spark Streaming, SparkSQL, MLlib, GraphX, SparkR.
  9. Meetup Groups: January 2015 (figure; source: meetup.com).
  10. Meetup Groups: October 2015 (figure; source: meetup.com).
  11. Community Growth (2014 vs. 2015). Summit Attendees: 3900 and 1100. Meetup Members: 42K and 12K. Developers Contributing: 350 and 600.
  12. Massive Open Online Courses (MOOCs). “Intro to Big Data with Apache Spark” • Anthony Joseph, UC Berkeley • June 1st, 5 weeks • 78,000+ registrations, 12% finishing (2x average). “Scalable Machine Learning with Apache Spark” • Ameet Talwalkar, UCLA • June 22nd, 5 weeks • 55,000+ registrations, 15% finishing (2.5x average).
  13. Large-Scale Usage. Largest cluster: 8000 nodes. Largest single job: 1 petabyte. Top streaming intake: 1 TB/hour. 2014 on-disk sort record.
  14. Spark Ecosystem: Distributions, Applications.
  15. Databricks Survey: Spark Summit SF ’15. 1400 respondents from 840 companies. Three trends: 1) Diverse applications 2) More runtime environments 3) More types of users.
  16. Top Applications: Fraud Detection / Security, User-Facing Services, Log Processing, Recommendation, Data Warehousing, Business Intelligence.
  17. Spark Components Used: MLlib + GraphX, Spark Streaming, DataFrames, Spark SQL. 75% of users use more than one component.
  18. Diverse Storages. Hadoop: combined compute + storage (HDFS, MapReduce). Spark: independent of storage layer (HDFS; SQL, e.g. Oracle; NoSQL, e.g. Cassandra).
  19. Diverse Storages (survey chart; legend: “Use a little”, “Use a lot”). 2014: Hadoop (HDFS): 61%, 31%. 2015: Hadoop (HDFS), NoSQL, Proprietary SQL: 46%, 34%, 43%, 36%, 37%, 21%.
  20. Diverse Runtime Environments. How respondents are running Spark: 51% on a public cloud. Most common Spark deployment environments (cluster managers): 48% Standalone mode, 40% YARN, 11% Mesos.
  21. Diversity of Users. Languages Used: 2014 vs. Languages Used: 2015 (chart values: 84%, 38%, 38%, 71%, 31%, 58%, 18%).
  22. Fastest Growing User Segments. +280% increase in Windows users. +56% production use of Streaming. +380% production use of SQL.
  23. What Next? Ease of use: DataFrames and Datasets. Performance: Tungsten. Integration • Rich, powerful libraries • Data sources. Libraries shown: SQL, Streaming, ML, Graph, …
  24. Tachyon (highlighted in the BDAS Stack diagram from slide 4). Non-persistent storage engine (in-memory, SSDs) • Support a variety of APIs • Support a variety of underlying file systems. Enable innovation in storage • Don’t need to change existing persistent storage systems.
  25. Succinct (highlighted in the BDAS Stack diagram from slide 4). Queries on compressed data • Arbitrary substring searches • Gzip level of compression. Numerous applications • Regex support • Graph query engine:
  26. KeystoneML (highlighted in the BDAS Stack diagram from slide 4). Simplify building ML pipelines. Rich set of operators. Type safe interface.
  27. Velox (highlighted in the BDAS Stack diagram from slide 4). Serving layer. Online management and maintenance of models. Support a variety of predictive models.
  28. Today. Learn about latest developments in BDAS • Spark, Tachyon, Succinct, KeystoneML. Applications & tools for BDAS • ADAM: framework for fast genomic processing • Plank: predict optimal number & type of nodes to run parallel apps • Splash: easy-to-use API for stochastic ML.
  29. Summary. Adoption is accelerating • E.g., Spark increased 2-4x YoY on all adoption metrics. Large scale production deployments. Deployed by major enterprises. Impact well beyond our expectations.
  30. Thanks!
  31. AMPLab Still Driving Many Projects (BDAS Stack diagram as on slide 4; legend: BDAS Stack, 3rd party).
  32. AMPLab Still Driving Many Projects (BDAS Stack diagram as on slide 4; legend: BDAS Stack, Components Driven by AMPLab).
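
The "2-4x YoY" figure on the Summary slide can be compared against the Community Growth numbers with a few lines of arithmetic. The sketch below is illustrative only. It assumes that in each pair on the Community Growth slide the smaller value is the 2014 figure and the larger is the 2015 figure, because the transcript does not preserve which number belongs to which year. The names `metrics` and `growth_factor` are hypothetical and introduced here for the example.

```python
# Figures from the Community Growth slide. Pairing each number with a year
# is an ASSUMPTION: smaller value = 2014, larger value = 2015.
metrics = {
    "Summit attendees": (1100, 3900),
    "Meetup members": (12_000, 42_000),
    "Developers contributing": (350, 600),
}


def growth_factor(before: int, after: int) -> float:
    """Return the year-over-year multiple (2.0 means the metric doubled)."""
    return after / before


# Compute the multiple for every metric.
factors = {name: growth_factor(*pair) for name, pair in metrics.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x")
```

Under that assumption, summit attendance and meetup membership both come out at roughly 3.5x, inside the quoted 2-4x range, while developer contributions come out at roughly 1.7x.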
