Spark Community Update
Matei Zaharia & Patrick Wendell
June 15th, 2015
A Great Year for Spark
Most active open source project in data processing
New language: R
Many new features & community projects
Community Growth
                     June 2014   June 2015
total contributors   255         730
contributors/month   75          135
lines of code        175,000     400,000
Mostly in libraries
Users: 1000+ companies …
Distributors + Apps: 50+ companies …
Large-Scale Usage
Largest cluster: 8000 nodes
Largest single job: 1 petabyte
Top streaming intake: 1 TB/hour
2014 on-disk sort record
Open Source Ecosystem
[Diagram: Spark at the center of Applications, Environments, and Data Sources]
Current Spark Components
Spark Core
Spark Streaming, Spark SQL, MLlib, GraphX
Unified engine across diverse workloads & environments
Major Directions in 2015
Data Science: similar interfaces to single-node tools
Platform APIs: growing the ecosystem
Data Science
DataFrames: popular API for data transformation (Spark 1.3)
Machine Learning Pipelines: inspired by scikit-learn (Spark 1.4)
R Language (Spark 1.4)
Platform APIs
Data Sources
•  Uniform interface to diverse sources (DataFrames + SQL)
Spark Packages
•  Community site with 70+ libraries
•  spark-packages.org
[Diagram: {JSON} and other sources … feeding Spark and your app]
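The data-sources idea above is a single read call that dispatches on a format name. A toy sketch of that dispatch pattern, in plain Python rather than Spark's actual implementation (the names `read_df`, `register_format`, and `_READERS` are illustrative, not Spark's):

```python
# Toy sketch of a uniform data-source interface: one entry point,
# pluggable per-format readers. Illustrative only, not Spark's API.
import csv
import io
import json

_READERS = {}

def register_format(name):
    """Register a reader function under a format name."""
    def wrap(fn):
        _READERS[name] = fn
        return fn
    return wrap

@register_format("json")
def _read_json(text):
    # One JSON object per line, like Spark's JSON source.
    return [json.loads(line) for line in text.splitlines() if line.strip()]

@register_format("csv")
def _read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def read_df(text, fmt):
    """Single entry point: dispatch to whichever source is registered."""
    return _READERS[fmt](text)

rows = read_df('{"name": "Alice"}\n{"name": "Bob"}', "json")
print(rows)  # → [{'name': 'Alice'}, {'name': 'Bob'}]
```

Third-party packages extend the system the same way: registering a new format makes it reachable through the one uniform entry point, which is the shape of the Spark Packages ecosystem described above.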
Ongoing Engine Improvements
Project Tungsten
•  Code generation, binary processing, off-heap memory
DAG visualization & debugging tools
Spark’s 1.4 Release
R Language Support
R API based on Spark’s DataFrames
An R Runtime for Big Data
Spark’s scale: thousands of machines and cores
Spark’s performance: runtime optimizer, code generation, memory management
Access to Spark’s I/O Packages
# DataFrame from JSON
> people <- read.df("people.json", "json")

# ... from MySQL
> people <- read.df("jdbc:mysql://sql01", "jdbc")

# ... from Hive
> people <- read.table("orders")
CSV, Avro, Parquet, SequoiaDB, ElasticSearch, Cassandra, MongoDB, HBase
ML Pipelines
[Diagram: DataFrame → tokenizer → hashingTF → lr → Pipeline Model]
# create pipeline
tok = Tokenizer(in="text", out="words")
tf = HashingTF(in="words", out="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tok, tf, lr])

# train pipeline
df = sqlCtx.table("training")
model = pipeline.fit(df)

# make predictions
df = sqlCtx.read.json("/path/to/test")
model.transform(df)
  .select("id", "text", "prediction")
ML Pipelines
Stable API with hooks for third party pipeline components
Feature transformers: VectorAssembler, String/VectorIndexer, OneHotEncoder, PolynomialExpansion, …
New algorithms: GLM with elastic-net, tree classifiers, tree regressors, OneVsRest, …
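The hooks for third-party components come down to every stage sharing the same transform-a-DataFrame contract, so stages compose into a pipeline. A minimal Python sketch of that pattern, with class names echoing the slide's pseudocode but not Spark's actual MLlib API (`CountFeaturizer` is a made-up stand-in for HashingTF):

```python
# Minimal sketch of the pipeline pattern: every stage exposes
# transform(rows), so a Pipeline just chains stages in order.
# Illustrative only; real Spark stages operate on DataFrames.

class Tokenizer:
    def __init__(self, in_col, out_col):
        self.in_col, self.out_col = in_col, out_col

    def transform(self, rows):
        # Split the input column's text into a list of words.
        return [{**r, self.out_col: r[self.in_col].split()} for r in rows]

class CountFeaturizer:
    """Hypothetical stand-in for HashingTF: word count as the feature."""
    def __init__(self, in_col, out_col):
        self.in_col, self.out_col = in_col, out_col

    def transform(self, rows):
        return [{**r, self.out_col: len(r[self.in_col])} for r in rows]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        # Feed each stage's output into the next.
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

pipe = Pipeline([Tokenizer("text", "words"),
                 CountFeaturizer("words", "features")])
out = pipe.transform([{"text": "spark is fast"}])
print(out[0]["features"])  # → 3
```

Because the contract is just `transform`, a third-party feature transformer drops into the `stages` list alongside the built-ins, which is what makes the stable API extensible.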
Performance and Project Tungsten
Managed memory for aggregations
Memory efficient shuffles
[Chart: customer pipeline latency, Spark 1.3 vs 1.4]
What’s Coming in Spark 1.5+?
Project Tungsten: Code generation, improved sort + aggregation
Spark Streaming: Flow control, optimized state management
ML: Single machine solvers, scalability to many features
SparkR: Integration with Spark’s machine learning APIs
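Tungsten's code generation replaces per-row interpretation of an expression with a function compiled once up front. A toy Python illustration of that idea (Tungsten itself generates JVM bytecode, not Python; `compile_predicate` is a made-up name):

```python
# Toy illustration of the code-generation idea: instead of walking
# an expression tree for every row, build the source for a
# specialized function once and compile it. Illustrative only.

def compile_predicate(column, op, value):
    """Build and compile a row filter like `age > 21` once."""
    src = f"def pred(row): return row[{column!r}] {op} {value!r}"
    namespace = {}
    exec(src, namespace)   # codegen step: source text -> function
    return namespace["pred"]

pred = compile_predicate("age", ">", 21)
rows = [{"age": 18}, {"age": 30}, {"age": 45}]
print([r for r in rows if pred(r)])  # → [{'age': 30}, {'age': 45}]
```

The win is that the interpretation cost is paid once at compile time rather than once per row, which is why codegen matters most for the hot inner loops of sort and aggregation.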
Join Us Today at Office Hours!
Location: Databricks booth (A1)
1:00-1:45  Spark Core, YARN; Spark Streaming
1:45-2:30  Spark SQL
3:00-3:40  Spark Ops
3:40-4:15  Spark SQL
4:30-5:15  Spark Core, PySpark; Spark MLlib
5:15-6:00  Spark MLlib
More tomorrow…
Thanks!

Spark Community Update - Spark Summit San Francisco 2015