A look ahead at Spark 2.0
Reynold Xin @rxin
2016-03-30, Strata Conference
About Databricks
Founded by the creators of Spark in 2013
Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …
Today’s Talk
Looking back at the last 12 months
Looking forward to Spark 2.0
• Project Tungsten, Phase 2
• Structured Streaming
• Unifying DataFrame & Dataset
Best resource for learning Spark
A slide from 2013 …
Programmability
WordCount in 50+ lines of Java MR
WordCount in 3 lines of Spark
What is Spark?
Unified engine across data workloads and platforms
SQL, Streaming, ML, Graph, Batch, …
2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
“Spark is the Taylor Swift
of big data software.”
- Derrick Harris, Fortune
Top Applications
• Fraud Detection / Security: 29%
• User-Facing Services: 36%
• Log Processing: 40%
• Recommendation: 44%
• Data Warehousing: 52%
• Business Intelligence: 68%
Diverse Runtime Environments
How respondents are running Spark: 51% on a public cloud
Most common Spark deployment environments (cluster managers):
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%
Spark 2.0
Next major release, coming in May
Builds on all we learned in the past 2 years
Versioning in Spark
In reality, we hate breaking APIs!
Will not do so except for dependency conflicts (e.g. Guava) and experimental APIs
1.6.0
• 1 = major version (may change APIs)
• 6 = minor version (adds APIs/features)
• 0 = patch version (only bug fixes)
Major Features in 2.0
• Tungsten Phase 2: speedups of 5-10x
• Structured Streaming: real-time engine on SQL/DataFrames
• Unifying Datasets and DataFrames
Datasets & DataFrames
API foundation for the future
Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
Example
case class User(name: String, id: Int)
case class Message(user: User, text: String)

import sqlContext.implicits._  // encoders for .as[...] and typed operations

val dataframe = sqlContext.read.json("log.json")        // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message]                     // Dataset[Message]
val users = messages.filter(m => m.text.contains("Spark"))
  .map(m => m.user)                                      // Dataset[User]
pipeline.train(users)  // MLlib takes either DataFrames or Datasets
Benefits
Simpler to understand
• Only kept Dataset separate to preserve binary compatibility in 1.x
Libraries can take data of both forms (see the sketch below)
With Streaming, the same API will also work on streams
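Because DataFrame is just Dataset[Row] in 2.0, a library entry point written once against Dataset accepts either form. A minimal sketch, with a hypothetical helper name:

import org.apache.spark.sql.Dataset

// Hypothetical library helper: one signature serves both the untyped and the typed API.
def numRows(ds: Dataset[_]): Long = ds.count()

// numRows(dataframe)   // works on a DataFrame, i.e. Dataset[Row]
// numRows(messages)    // works on a Dataset[Message]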
Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as the interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
Structured Streaming
How do we simplify streaming?
Integration Example
A streaming engine consumes a stream of page-view events:
(home.html, 10:08), (product.html, 10:09), (home.html, 10:10), . . .
and maintains per-minute visit counts in a MySQL table:

Page     Minute   Visits
home     10:09    21
pricing  10:10    30
...      ...      ...

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
Complex Programming Models
• Data: late arrival, varying distribution over time, …
• Processing: business logic changes & new ops (windows, sessions)
• Output: how do we define output over time & correctness?
The simplest way to perform streaming analytics
is not having to reason about streaming.
Spark 1.3: static DataFrames
Spark 2.0: infinite DataFrames
Single API!
Structured Streaming
High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
See Michael/TD's talks tomorrow for a deep dive, and the sketch below for a feel of the API!
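A minimal sketch of the unified API (as of the Spark 2.0 previews), assuming a SparkSession named spark and a hypothetical directory of JSON page-view logs; readStream/writeStream mirror read/write from the batch API:

import org.apache.spark.sql.types._

val schema = new StructType().add("page", StringType).add("minute", StringType)

val logs = spark.readStream.schema(schema).json("/logs")   // streaming DataFrame
val counts = logs.groupBy("page", "minute").count()        // same query as on a static DataFrame

val query = counts.writeStream
  .outputMode("complete")   // emit the full updated counts on each trigger
  .format("console")
  .start()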
Tungsten Phase 2
Can we speed up Spark by 10X?
Demo
Run a join on a large table with 1 billion records and a small table with 1,000 records.
In Spark 1.6, took 60+ seconds.
In Spark 2.0, took ~3 seconds.
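A rough sketch of the kind of query in the demo, assuming a SparkSession named spark; the row counts come from the slide, but the code is illustrative rather than the actual demo notebook:

val large = spark.range(1000L * 1000 * 1000)   // ~1 billion records
val small = spark.range(1000)                  // 1,000 records

large.join(small, "id").count()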
Query plan: Scan → Filter → Project → Aggregate
select count(*) from store_sales
where ss_item_sk = 1000
Volcano Iterator Model
Standard for 30 years: almost all databases do it.
Each operator is an "iterator" that consumes records from its input operator:
class Filter {
  // Advance to the next input row that satisfies the predicate.
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }
  // Return the row the last successful next() stopped on.
  def fetch(): InternalRow = {
    child.fetch()
  }
  …
}
What if we hire a college freshman to implement this query in Java in 10 mins?
select count(*) from store_sales
where ss_item_sk = 1000
var count = 0
for (ss_item_sk <- store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}
Volcano model (30+ years of database research) vs. college freshman (hand-written code in 10 mins)
High throughput:
• Volcano model: 13.95 million rows/sec
• College freshman: 125 million rows/sec
Note: end-to-end, single thread, single column, data originated in Parquet on disk.
How does a student beat 30 years of research?
Volcano
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining
hand-written code
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
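A toy (non-Spark) illustration of the gap, assuming an in-memory array of item keys: the iterator chain pays a virtual call per row, while the fused loop keeps its counter in a register and lets the JIT unroll and pipeline it:

val storeSales: Array[Int] = Array.fill(10000000)(scala.util.Random.nextInt(2000))

// Volcano-style: a chain of iterators, one next() call per row per operator.
val viaIterators = storeSales.iterator.filter(_ == 1000).size

// Hand-written style: one tight loop over the data.
var count = 0L
var i = 0
while (i < storeSales.length) {
  if (storeSales(i) == 1000) count += 1
  i += 1
}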
Tungsten Phase 2: Spark as a "Compiler"
The whole query plan (Scan → Filter → Project → Aggregate) collapses into a single fused loop:

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}

Functionality of a general-purpose execution engine; performance as if you hand-built a system just to run your query.
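A quick way to see this at work, assuming a SparkSession named spark: in the physical plan printed by explain(), operators fused by whole-stage code generation appear with a leading asterisk:

val q = spark.range(1000L * 1000 * 1000)
  .filter("id = 1000")
  .selectExpr("count(*)")

q.explain()   // fused operators are prefixed with '*' in the plan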
Databricks
Community Edition
Best place to try & learn Spark.
Today’s talk
Spark has been growing explosively
Spark 2.0 doubles down on what made Spark attractive:
• elegant APIs
• cutting-edge performance
Learn Spark on Databricks Community Edition
• join beta waitlist https://databricks.com/
Thank you.
@rxin
