A look ahead at Spark 2.0
Reynold Xin @rxin
2016-03-30, Strata Conference
About Databricks
Founded by the creators of Spark in 2013
Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …
Today’s Talk
Looking back at the last 12 months
Looking forward to Spark 2.0
• Project Tungsten, Phase 2
• Structured Streaming
• Unifying DataFrame & Dataset
Best resource for learning Spark
A slide from 2013 …
Programmability
WordCount in 50+ lines of Java MR
WordCount in 3 lines of Spark
What is Spark?
Unified engine across data workloads and platforms
SQL, Streaming, ML, Graph, Batch, …
2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
“Spark is the Taylor Swift
of big data software.”
- Derrick Harris, Fortune
Top Applications
• Fraud Detection / Security: 29%
• User-Facing Services: 36%
• Log Processing: 40%
• Recommendation: 44%
• Data Warehousing: 52%
• Business Intelligence: 68%
Diverse Runtime Environments
How respondents are running Spark: 51% on a public cloud
Most common Spark deployment environments (cluster managers):
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%
Spark 2.0
Next major release, coming in May
Builds on all we learned in the past 2 years
Versioning in Spark
In reality, we hate breaking APIs!
Will not do so except for dependency conflicts (e.g. Guava) and experimental APIs
1.6.0
• 1 = major version (may change APIs)
• 6 = minor version (adds APIs/features)
• 0 = patch version (only bug fixes)
Major Features in 2.0
• Tungsten Phase 2: speedups of 5-10x
• Structured Streaming: real-time engine on SQL/DataFrames
• Unifying Datasets and DataFrames
Datasets & DataFrames
API foundation for the future
Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
Example
case class User(name: String, id: Int)
case class Message(user: User, text: String)

import sqlContext.implicits._  // encoders for .as[...] and typed operations

val dataframe = sqlContext.read.json("log.json")        // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message]                     // Dataset[Message]
val users = messages.filter(m => m.text.contains("Spark"))
  .map(m => m.user)                                      // Dataset[User]
pipeline.train(users)  // MLlib takes either DataFrames or Datasets
Benefits
Simpler to understand
• Only kept Dataset separate to preserve binary compatibility in 1.x
Libraries can take data of both forms (see the sketch below)
With Streaming, the same API will also work on streams
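Because DataFrame is just Dataset[Row] in 2.0, a library entry point written once against Dataset accepts either form. A minimal sketch, with a hypothetical helper name:

import org.apache.spark.sql.Dataset

// Hypothetical library helper: one signature serves both the untyped and the typed API.
def numRows(ds: Dataset[_]): Long = ds.count()

// numRows(dataframe)   // works on a DataFrame, i.e. Dataset[Row]
// numRows(messages)    // works on a Dataset[Message]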
Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as the interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
Structured Streaming
How do we simplify streaming?
Integration Example
A streaming engine consumes a stream of page-view events:
(home.html, 10:08), (product.html, 10:09), (home.html, 10:10), . . .
and maintains per-minute visit counts in a MySQL table:

Page     Minute   Visits
home     10:09    21
pricing  10:10    30
...      ...      ...

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
Complex Programming Models
• Data: late arrival, varying distribution over time, …
• Processing: business logic changes & new ops (windows, sessions)
• Output: how do we define output over time & correctness?
The simplest way to perform streaming analytics
is not having to reason about streaming.
Spark 1.3: static DataFrames
Spark 2.0: infinite DataFrames
Single API!
Structured Streaming
High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
See Michael/TD's talks tomorrow for a deep dive, and the sketch below for a feel of the API!
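A minimal sketch of the unified API (as of the Spark 2.0 previews), assuming a SparkSession named spark and a hypothetical directory of JSON page-view logs; readStream/writeStream mirror read/write from the batch API:

import org.apache.spark.sql.types._

val schema = new StructType().add("page", StringType).add("minute", StringType)

val logs = spark.readStream.schema(schema).json("/logs")   // streaming DataFrame
val counts = logs.groupBy("page", "minute").count()        // same query as on a static DataFrame

val query = counts.writeStream
  .outputMode("complete")   // emit the full updated counts on each trigger
  .format("console")
  .start()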
Tungsten Phase 2
Can we speed up Spark by 10X?
Demo
Run a join on a large table with 1 billion records and a small table with 1,000 records.
In Spark 1.6, took 60+ seconds.
In Spark 2.0, took ~3 seconds.
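A rough sketch of the kind of query in the demo, assuming a SparkSession named spark; the row counts come from the slide, but the code is illustrative rather than the actual demo notebook:

val large = spark.range(1000L * 1000 * 1000)   // ~1 billion records
val small = spark.range(1000)                  // 1,000 records

large.join(small, "id").count()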
Query plan: Scan → Filter → Project → Aggregate
select count(*) from store_sales
where ss_item_sk = 1000
Volcano Iterator Model
Standard for 30 years: almost all databases do it.
Each operator is an "iterator" that consumes records from its input operator:
class Filter {
  // Advance to the next input row that satisfies the predicate.
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }
  // Return the row the last successful next() stopped on.
  def fetch(): InternalRow = {
    child.fetch()
  }
  …
}
What if we hire a college freshman to implement this query in Java in 10 mins?
select count(*) from store_sales
where ss_item_sk = 1000
var count = 0
for (ss_item_sk <- store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}
Volcano model (30+ years of database research) vs. college freshman (hand-written code in 10 mins)
High throughput:
• Volcano model: 13.95 million rows/sec
• College freshman: 125 million rows/sec
Note: end-to-end, single thread, single column, data originated in Parquet on disk.
How does a student beat 30 years of research?
Volcano
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining
hand-written code
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
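A toy (non-Spark) illustration of the gap, assuming an in-memory array of item keys: the iterator chain pays a virtual call per row, while the fused loop keeps its counter in a register and lets the JIT unroll and pipeline it:

val storeSales: Array[Int] = Array.fill(10000000)(scala.util.Random.nextInt(2000))

// Volcano-style: a chain of iterators, one next() call per row per operator.
val viaIterators = storeSales.iterator.filter(_ == 1000).size

// Hand-written style: one tight loop over the data.
var count = 0L
var i = 0
while (i < storeSales.length) {
  if (storeSales(i) == 1000) count += 1
  i += 1
}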
Tungsten Phase 2: Spark as a "Compiler"
The whole query plan (Scan → Filter → Project → Aggregate) collapses into a single fused loop:

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}

Functionality of a general-purpose execution engine; performance as if you hand-built a system just to run your query.
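A quick way to see this at work, assuming a SparkSession named spark: in the physical plan printed by explain(), operators fused by whole-stage code generation appear with a leading asterisk:

val q = spark.range(1000L * 1000 * 1000)
  .filter("id = 1000")
  .selectExpr("count(*)")

q.explain()   // fused operators are prefixed with '*' in the plan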
Databricks
Community Edition
Best place to try & learn Spark.
Today’s talk
Spark has been growing explosively
Spark 2.0 doubles down on what made Spark attractive:
• elegant APIs
• cutting-edge performance
Learn Spark on Databricks Community Edition
• join beta waitlist https://databricks.com/
Thank you.
@rxin
