The BDAS Open Source Community
UC BERKELEY
Ion Stoica 
UC Berkeley and Databricks
Growing Beyond AMPLab 
As software matures and becomes successful, more and more contributors come from outside AMPLab
New startups have anchored development
» Databricks (Spark Stack)
» Mesosphere (Mesos)
» …
This enables AMPLab to focus more resources on future systems instead of software maintenance
Apache Spark
[Figure: BDAS stack diagram. Components shown, top to bottom: Cancer Genomics, Energy Debugging, Smart Buildings; MLBase, SparkR, Velox Model Serving, SampleClean, Spark Streaming, SparkSQL, BlinkDB, GraphX, MLlib; Apache Spark (core); Tachyon; HDFS, S3, …; Apache Mesos, Yarn.]
Apache Spark 
Open Source: end of 2010 
Apache Project: 2013 
Over time, it has grown to include key libraries
» Spark Streaming, SparkSQL, MLlib, GraphX
Becoming a platform for Big Data apps
Apache Spark Today
[Figure: two bar charts comparing MapReduce, YARN, HDFS, Storm, and Spark by activity in the past 6 months. Panels: Commits (axis 0 to 2000) and Lines of Code Changed (axis 0 to 350000).]
2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, …
Meetups Around the World
Monthly Contributors
[Figure: chart of monthly contributors (axis 0 to 100) for 2011 through 2014, with the founding of Databricks marked.]
370+ contributors in the last 12 months
Spark Stack (2013)
[Figure: stack diagram. Components shown, top to bottom: Cancer Genomics, Energy Debugging, Smart Buildings; BlinkDB, Spark Streaming, MLlib, MLBase, SampleClean, Shark; Apache Spark (core); Tachyon; HDFS, S3, …; Apache Mesos, Yarn.]
Last Year Developments
[Figure: stack diagram with the past year's changes overlaid. Components shown: Cancer Genomics, Energy Debugging, Smart Buildings; BlinkDB, MLBase, SparkR, SparkSQL (overlaid on Shark), GraphX, MLlib, Spark Streaming, SampleClean, Velox Model Serving; Apache Spark (core); Tachyon; HDFS, S3, …; Apache Mesos, Yarn.]
Wide Adoption 
All major Hadoop distributions include Spark 
Beyond Hadoop
Databricks: spurred Spark’s enterprise growth
Apache Mesos
[Figure: BDAS stack diagram. Components shown, top to bottom: Cancer Genomics, Energy Debugging, Smart Buildings; MLBase, SparkR, Velox Model Serving, SampleClean, Spark Streaming, SparkSQL, BlinkDB, GraphX, MLlib; Apache Spark; Tachyon; HDFS, S3, …; Apache Mesos, Yarn.]
Apache Mesos 
Open Source: 2010 
Apache Project: 2012 
Used in production at Twitter for the past 2.5 years
» 10,000+ machines
» 500+ engineers using it
Most development moved outside Berkeley starting in 2012
Monthly Contributors
[Figure: chart of monthly contributors, with the founding of Mesosphere marked.]
65 contributors in the last 12 months
BDAS Stack
[Figure: BDAS stack diagram. Components shown, top to bottom: Cancer Genomics, Energy Debugging, Smart Buildings; MLBase, SparkR, Velox Model Serving, SampleClean, Spark Streaming, SparkSQL, Tachyon, BlinkDB, GraphX, MLlib; Apache Spark; HDFS, S3, …; Apache Mesos, Yarn.]
Release Growth
» Tachyon 0.1 (Dec ‘12): 1 contributor
» Tachyon 0.2 (Apr ‘13): 3 contributors
» Tachyon 0.3 (Oct ‘13): 15 contributors
» Tachyon 0.4 (Feb ‘14): 30 contributors
» Tachyon 0.5 (July ‘14): 46 contributors
Fast Growing Community
[Figure: chart of Berkeley contributors vs. non-Berkeley contributors (20+ companies).]
~80% of contributors are already outside AMPLab
Reaching Tipping Point 
Research to Real-World Impact
[Figure: BDAS projects placed between "Research" and "Real-world Impact", with AMPLab/Berkeley vs. non-Berkeley committers / commits. Projects shown: MLlib, Spark Streaming, Spark SQL, Apache Spark (core), Apache Mesos, GraphX, Tachyon, Succinct, Velox, ADAM, BlinkDB.]
Impact on AMPLab 
Created a blueprint and ecosystem for other BDAS components to succeed
» MLlib, GraphX, Tachyon, …
Enabled AMPLab to increase focus on new research projects
» Velox, ADAM, Succinct, …
