Jump Start with
Apache® Spark™ 2.x
on Databricks
Jules S. Damji
Spark Community Evangelist
Big Data Trunk Meetup, Fremont 12/17/2016
@2twitme
I have used Apache Spark Before…
I know the difference between
DataFrame and RDDs…
$ whoami
Spark Community Evangelist @ Databricks
Developer Advocate @ Hortonworks
Software engineering @: Sun Microsystems,
Netscape, @Home, VeriSign, Scalix, Centrify,
LoudCloud/Opsware, ProQuest
https://www.linkedin.com/in/dmatrix
@2twitme
Agenda for the next 3+ hours
Hour 1.5:
• Get to know Databricks
• Overview of Spark Fundamentals & Architecture
• What’s New in Spark 2.0
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets…
• Workshop Notebook 1
• Lunch

Hour 1.5:
• Introduction to DataFrames, Datasets and Spark SQL
• Workshop Notebook 2
• Break
• Introduction to Structured Streaming Concepts
• Workshop Notebook 3
• Go Home…
Get to know Databricks
• Get Databricks community edition: http://databricks.com/try-databricks
We are Databricks, the company behind Apache Spark
Founded by the creators of
Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Created Databricks on top of Spark to make big data simple.
The Best Place to Run Apache Spark
Why Spark? Big Data Systems of Yesterday…
MapReduce: general batch processing
Specialized systems for new workloads: Pregel, Dremel, Mahout, Drill, Giraph, Impala, Storm, S4, . . .
Hard to manage, tune, deploy. Hard to combine in pipelines.
Why Spark? Big Data Systems of Yesterday…
Could a unified engine replace both MapReduce (general batch processing) and the specialized systems for new workloads (Pregel, Dremel, Drill, Giraph, Impala, Storm, Mahout, . . .)?
An Analogy: specialized devices vs. a unified device that enables new applications.
Unified engine across diverse workloads & environments
Apache Spark
Fundamentals
&
Architecture
A Resilient Distributed Dataset
(RDD)
2 kinds of Actions:
• return results to the driver: collect, count, reduce, take, show, …
• write to external storage: saveAsTextFile, … (HDFS, S3, SQL, NoSQL, etc.)
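As a quick illustration (a minimal sketch, not from the slides; the paths and the pre-created SparkSession `spark` are assumed), transformations are lazy and only actions trigger execution:

val sc = spark.sparkContext

val lines  = sc.textFile("/tmp/input.txt")        // transformation: nothing runs yet
val errors = lines.filter(_.contains("ERROR"))    // transformation: still lazy

// Kind 1: actions that return a result to the driver
val numErrors = errors.count()
val firstFive = errors.take(5)

// Kind 2: actions that write to external storage (HDFS, S3, etc.)
errors.saveAsTextFile("/tmp/errors-output")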
Apache Spark Architecture
Deployment Modes
• Local
• Standalone
• YARN
• Mesos
[Diagram: in Databricks, each student notebook attaches to its own driver + executor running in a JVM inside a container on an EC2 machine.]
Standalone Mode: Apache Spark Architecture
[Diagram: 30 GB containers each host a 22 GB JVM; executor JVMs (Ex.) expose task slots (S), and one JVM hosts the driver (Dr).]
An Anatomy of an Application
Spark Application
• Jobs
• Stages
• Tasks
[Diagram: an executor JVM inside a container runs tasks (T) in slots (S), each task operating on a partition (*) of a DataFrame/RDD.]
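As a rough sketch of that anatomy (not from the slides; assumes the pre-created SparkSession `spark`): each action submits a job, the scheduler splits the job into stages at shuffle boundaries, and each stage runs one task per partition.

val sc = spark.sparkContext

val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)
val counts = words.map(w => (w, 1))     // narrow transformation: stays in the same stage
                  .reduceByKey(_ + _)   // shuffle boundary: starts a new stage

counts.collect()   // the action submits one job with 2 stages and one task per partition
// The Spark UI shows the job, its stages, and the tasks in each stage.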
How did we Get Here..? Where are we Going..?
A Brief History
2010: Started @ UC Berkeley
2013: Databricks started & donated to ASF
2014: Spark 1.0 & libraries (SQL, ML, GraphX)
2015: DataFrames/Datasets, Tungsten, ML Pipelines
2016: Apache Spark 2.0 – Easier, Smarter, Faster
Apache Spark 2.0
• Steps to Bigger & Better Things….
Builds on all we learned in the past 2 years
Major Themes in Apache Spark 2.0
Faster: Tungsten Phase 2 (speedups of 5-10x) & Catalyst Optimizer
Smarter: Structured Streaming, a real-time engine on SQL / DataFrames
Easier: Unifying Datasets and DataFrames & SparkSessions
Unified API Foundation for the Future: SparkSession, DataFrames, Datasets, MLlib, Structured Streaming…
SparkSession – A Unified entry point to
Spark
• Conduit to Spark
– Creates Datasets/DataFrames
– Reads/writes data
– Works with metadata
– Sets/gets Spark Configuration
– Driver uses for Cluster
resource management
SparkSession vs SparkContext
SparkSession subsumes:
• SparkContext
• SQLContext
• HiveContext
• StreamingContext
• SparkConf
SparkSession – A Unified entry point to
Spark
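A minimal sketch (not the workshop notebook; the paths and table names are made up) of SparkSession as the single conduit to Spark. In Databricks notebooks a SparkSession named `spark` is already created for you:

import org.apache.spark.sql.SparkSession

// One entry point instead of SparkContext + SQLContext + HiveContext
val spark = SparkSession.builder()
  .appName("jump-start")
  .config("spark.sql.shuffle.partitions", "8")     // set configuration
  .getOrCreate()

// Create DataFrames/Datasets and read/write data
val events = spark.read.json("/tmp/events.json")              // hypothetical path
events.write.mode("overwrite").parquet("/tmp/events.parquet")

// Work with metadata through the catalog
spark.catalog.listTables().show()

// Get/set configuration; the underlying SparkContext is still reachable
spark.conf.get("spark.sql.shuffle.partitions")
val sc = spark.sparkContext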
Datasets and DataFrames
Impose Structure
Long Term
• RDD as the low-level API in Spark
  • For control and certain type-safety in Java/Scala
• Datasets & DataFrames give richer semantics & optimizations
  • For semi-structured data and DSL-like operations
• New libraries will increasingly use these as the interchange format
  • Examples: Structured Streaming, MLlib, GraphFrames
Spark 1.6 vs Spark 2.x
Spark 1.6 vs Spark 2.x
Towards SQL 2003
• Today, Spark can run all 99 TPC-DS queries!
- New standard-compliant parser (with good error messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
- https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
- https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
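For example, Spark 2.0 can run a correlated subquery like the one below (a sketch with a made-up schema; `customers` and `orders` are assumed to be registered temp views):

val highValueCustomers = spark.sql("""
  SELECT c.customer_id, c.name
  FROM customers c
  WHERE EXISTS (                      -- correlated subquery
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.amount > 1000)
  """)
highValueCustomers.show()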
[Chart: Preliminary TPC-DS runtime in seconds, Spark 2.0 vs. 1.6 – lower is better.]
Other notable API improvements
• DataFrame-based ML pipeline API becoming the main MLlib API
• ML model & pipeline persistence with almost complete coverage
  • In all programming languages: Scala, Java, Python, R
• Improved R support
  • (Parallelizable) user-defined functions in R
  • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means
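A small sketch of the DataFrame-based pipeline API with the new persistence support (not from the slides; `trainingDF` is an assumed DataFrame with "text" and "label" columns):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(trainingDF)       // assumed training DataFrame

// Pipeline & model persistence, loadable from Scala, Java, Python, or R
model.write.overwrite().save("/tmp/spark-lr-pipeline")
val restored = PipelineModel.load("/tmp/spark-lr-pipeline")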
Workshop: Notebook on SparkSession
• Import Notebook into your Spark 2.0 Cluster
– http://dbricks.co/sswksh1
– http://docs.databricks.com
– http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession
• Familiarize yourself with the Databricks Notebook environment
• Work through each cell
• Ctrl + Return / Shift + Return runs a cell
• Try challenges
• Break…
DataFrames/Datasets & Spark
SQL & Catalyst Optimizer
The not-so-secret truth…
Spark SQL is not about SQL; it is about more than SQL.
Spark SQL: The whole story
It is about creating and running Spark programs faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
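As a rough illustration of "write less code / do less work" (a sketch, not from the slides; assumes the SparkSession `spark`), compare computing an average per key by hand with RDDs versus declaring it with the DataFrame API and letting the optimizer plan the execution:

import spark.implicits._

val pairs = Seq(("a", 1.0), ("b", 4.0), ("a", 3.0))

// RDD version: we spell out how to compute the average
val rddAvg = spark.sparkContext.parallelize(pairs)
  .mapValues(v => (v, 1))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }
  .collect()

// DataFrame version: we declare what we want; Catalyst decides how to do it
val dfAvg = pairs.toDF("key", "value")
  .groupBy("key")
  .avg("value")
  .collect()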
Spark SQL Architecture
[Diagram: SQL, DataFrame, and Dataset queries flow into a logical plan; the optimizer, consulting the catalog, produces a physical plan, and the code generator emits code that runs over RDDs, reading data through the Data Source API.]
Using Catalyst in Spark SQL
SQL ASTs, DataFrames, and Datasets all enter the same pipeline:
Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning (a Cost Model selects among candidate Physical Plans) → Selected Physical Plan → Code Generation → RDDs

Analysis: analyzing a logical plan to resolve references
Logical Optimization: logical plan optimization
Physical Planning: physical planning
Code Generation: compile parts of the query to Java bytecode
Catalyst Optimizations

Logical Optimizations:
• Push filter predicates down to the data source, so irrelevant data can be skipped
• Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding
• RDBMS: reduce the amount of data traffic by pushing predicates down

Create Physical Plan & generate JVM bytecode:
• Catalyst compiles operations into physical plans for execution and generates JVM bytecode
• Intelligently choose between broadcast joins and shuffle joins to reduce network traffic
• Lower-level optimizations: eliminate expensive object allocations and reduce virtual function calls
# Load partitioned Hive table
def add_demographics(events):
    u = sqlCtx.table("users")
    return (events
            .join(u, events.user_id == u.user_id)          # Join on user_id
            .withColumn("city", zipToCity(events.zip)))    # Run udf to add city column
events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "New York").select(events.timestamp).collect()

[Diagram: the logical plan applies the filter on top of the join of the events file and the users table; the plain physical plan filters the events scan before the join, while the physical plan with predicate pushdown and column pruning performs the join over optimized scans of events and users.]
Columns: Predicate pushdown

You write:
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "people")
  .load()
  .where($"name" === "michael")

Spark translates it, for Postgres, into:
SELECT * FROM people WHERE name = 'michael'
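One way to verify the pushdown (a sketch, not from the slides; the connection details are placeholders) is to print the physical plan, where the pushed predicate shows up in the scan's PushedFilters:

import spark.implicits._   // for the $"name" column syntax

val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "people")
  .load()
  .where($"name" === "michael")

// The plan should show the JDBC scan with something like PushedFilters: [EqualTo(name,michael)]
people.explain()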
Foundational Spark 2.0 Components
[Diagram: Spark SQL, ML Pipelines, Structured Streaming, and GraphFrames build on the DataFrame/Dataset and SQL APIs, which run through Catalyst on top of Spark Core (RDD); data sources include { JSON }, JDBC, and more.]
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Dataset Spark 2.0 APIs
Background: What is in an RDD?
• Dependencies
• Partitions (with optional locality info)
• Compute function: Partition => Iterator[T]
Opaque Computation & Opaque Data
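To make "opaque" concrete (a sketch, not from the slides; assumes the SparkSession `spark`): Spark cannot look inside an RDD lambda or its element types, whereas a DataFrame/Dataset expression is visible to Catalyst:

import spark.implicits._

case class Event(userId: Long, city: String)
val events = Seq(Event(1L, "Fremont"), Event(2L, "New York")).toDS()

// RDD: the filter function is an arbitrary closure; Spark just runs it on every element
val rddFiltered = events.rdd.filter(e => e.city == "New York")

// Dataset/DataFrame expression: Catalyst sees the column and the comparison,
// so it can push the predicate into the data source or skip Parquet blocks
val dfFiltered = events.filter($"city" === "New York")
dfFiltered.explain()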
Structured APIs In Spark
                  SQL        DataFrames     Datasets
Syntax Errors     Runtime    Compile Time   Compile Time
Analysis Errors   Runtime    Runtime        Compile Time
Analysis errors are reported before a distributed job starts
Type-safe: operate on domain objects with compiled lambda functions
Dataset API in Spark 2.0
import org.apache.spark.sql.Dataset
import spark.implicits._

val df = spark.read.json("people.json")

// Convert data to domain objects
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(_.age > 30)

// groupBy returns a DataFrame = Dataset[Row]
val groupDF = ds.filter(p => p.name.startsWith("M"))
  .groupBy("name")
  .avg("age")
Source: michaelmalak
Project Tungsten II
Project Tungsten
• Substantially speed up execution by optimizing CPU
efficiency, via: SPARK-12795
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
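A quick way to see the runtime code generation at work (a sketch, not from the slides; assumes the SparkSession `spark`): in Spark 2.0, operators fused by whole-stage code generation are marked with an asterisk in the physical plan:

val agg = spark.range(0, 1000000).selectExpr("sum(id)")
agg.explain()
// Operators printed with a leading '*' (e.g., *HashAggregate, *Range) are compiled
// together into a single generated function instead of being interpreted row by row.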
Tungsten’s Compact Row Format
[Diagram: the tuple (123, “data”, “bricks”) stored as a null bitmap (0x0), the fixed-width value 123, and offset/length entries (offset 32, length 4 → “data”; offset 48, length 6 → “bricks”) pointing to the variable-length fields at the end of the row.]
Datasets: Lightning-fast Serialization with Encoders
Encoders translate between domain objects and Spark's internal representation.
[Diagram: the JVM object MyClass(123, “data”, “bricks”) is encoded to and decoded from the compact internal row 0x0 | 123 | 32L | 48L | 4 “data” | 6 “bricks”.]
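A minimal sketch of an encoder in action (not from the slides; MyClass mirrors the example tuple and is hypothetical; assumes the SparkSession `spark`):

import org.apache.spark.sql.Encoders

case class MyClass(id: Int, name: String, org: String)   // hypothetical domain class

import spark.implicits._   // brings implicit encoders for case classes into scope

// The encoder describes how MyClass maps to Tungsten's compact binary row format
val enc = Encoders.product[MyClass]
enc.schema.printTreeString()   // id: int, name: string, org: string

// Creating a Dataset serializes objects with the encoder, not Java serialization
val ds = Seq(MyClass(123, "data", "bricks")).toDS()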
Performance of Core Primitives
cost per row (single thread)
primitive Spark 1.6 Spark 2.0
filter 15 ns 1.1 ns
sum w/o group 14 ns 0.9 ns
sum w/ group 79 ns 10.7 ns
hash join 115 ns 4.0 ns
sort (8 bit entropy) 620 ns 5.3 ns
sort (64 bit entropy) 620 ns 40 ns
sort-merge join 750 ns 700 ns
Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11
Workshop: Notebook on
DataFrames/Datasets & Spark SQL
• Import Notebook into your Spark 2.0 Cluster
– http://dbricks.co/sswksh2A
– http://dbricks.co/sswksh2
– https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
• Work through each Notebook cell
• Try challenges
• Break..
Introduction to Structured
Streaming
Streaming in Apache Spark
Streaming demands new types of requirements…
[Diagram: Spark Streaming sits alongside SQL, MLlib, and GraphX on top of Spark Core.]
Functional, concise and expressive
Fault-tolerant state management
Unified stack with batch processing
More than 51% of users say it is the most important part of Apache Spark
Spark Streaming in production jumped to 22% from 14%
Streaming apps are
growing more complex
Streaming computations
don’t run in isolation
• Need to interact with batch data,
interactive analysis, machine learning, etc.
Use case: IoT Device Monitoring
IoT event stream from Kafka:
• ETL into long-term storage: prevent data loss, prevent duplicates
• Status monitoring: handle late data, aggregate on windows on event time
• Interactively debug issues: consistency
• Anomaly detection: learn models offline, use online + continuous learning
Use case: IoT Device Monitoring (the same pipeline viewed as a whole: ETL, status monitoring, interactive debugging, and anomaly detection on one event stream)
Continuous Applications: not just streaming any more
The simplest way to perform streaming analytics
is not having to reason about streaming at all
Stream as an unbounded DataFrame
[Diagram: a static, bounded table vs. a streaming, unbounded table – a single API for both.]
Gist of Structured Streaming
High-level streaming API built on the Spark SQL engine
Runs the same computation as batch queries in Datasets / DataFrames
Event time, windowing, sessions, sources & sinks
Guarantees end-to-end exactly-once semantics
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve using JDBC
Add, remove, change queries at runtime
Build and apply ML models to your stream
Advantages over DStreams
1. Processing with event-time, dealing with late data
2. Exactly the same API for batch, streaming, and interactive
3. End-to-end exactly-once guarantees from the system
4. Performance through SQL optimizations
   - Logical plan optimizations, Tungsten, Codegen, etc.
   - Faster state management for stateful stream processing
Structured Streaming Model – Trigger: every 1 sec
[Diagram: at triggers 1, 2, 3 the input table grows to hold the data up to 1, up to 2, up to 3, and the query runs over it each time.]
Input: data from the source as an append-only table
Trigger: new rows appended to the table
Query: operations on the input – the usual map/filter/reduce, plus new window and session ops
Structured Streaming Model – Trigger: every 1 sec
[Diagram: at each trigger the query runs over the growing input table (data up to 1, 2, 3) and updates a result table, whose output for data up to 1, 2, 3 is written out – here in complete mode.]
Result: the final operated table, updated at every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Structured Streaming Model – Trigger: every 1 sec
[Diagram: the same input/query/result flow; in append mode only the rows added to the result since the last trigger are written out.]
Result: the final operated table, updated at every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Append output: write only the new rows that got added to the result table since the previous batch
*Not all output modes are feasible with all queries
Example WordCount
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
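The linked guide's word count, paraphrased as a sketch (assumes the SparkSession `spark` and a test socket source such as `nc -lk 9999`):

import org.apache.spark.sql.functions._
import spark.implicits._

// Lines arriving on the socket become an unbounded table with a single "value" column
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame operations as a batch word count
val wordCounts = lines
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word")
  .count()

// Complete mode: the full result table is emitted at every trigger
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()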
Batch ETL with DataFrame
inputDF = spark.read
  .format("json")
  .load("source-path")

resultDF = inputDF
  .select("device", "signal")
  .where("signal > 15")

resultDF.write
  .format("parquet")
  .save("dest-path")

Read from a JSON file
Select some devices
Write to a parquet file
Streaming ETL with DataFrame
input = ctxt.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

read…stream() creates a streaming DataFrame but does not start any computation.
write…startStream() defines where & how to output the data, and starts the processing.
(In the final Spark 2.0 API these became spark.readStream and writeStream…start(), as in the next example.)
Streaming ETL with DataFrames
[Diagram: at triggers 1, 2, 3 the input grows; in append mode only the new result rows of 2 and of 3 are written to the output.]

input = spark.readStream
  .format("json")
  .load("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.writeStream
  .format("parquet")
  .start("dest-path")
Continuous Aggregations
Continuously compute the average signal of each type of device:

input.groupBy("device-type")
  .avg("signal")
Continuous Windowed Aggregations
Continuously compute the average signal of each type of device over the last 10 minutes, using the event-time column:

input.groupBy(
    $"device-type",
    window($"event-time", "10 min"))
  .avg("signal")
Joining streams with static data
kafkaDataset = spark.readStream
  .format("kafka")
  .option("subscribe", "iot-updates")
  .option("kafka.bootstrap.servers", ...)
  .load()

staticDataset = ctxt.read
  .jdbc("jdbc://", "iot-device-info")

joinedDataset =
  kafkaDataset.join(staticDataset, "device-type")

Join streaming data from Kafka with static data via JDBC to enrich the streaming data…
…without having to think about the fact that you are joining streaming data.
Query Management
query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()

query: a handle to the running streaming computation, for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time.
Each query has a unique name for keeping track.
Watermarking for handling late data
[Spark 2.1]
input.withWatermark("event-time", "15 min")
  .groupBy(
    $"device-type",
    window($"event-time", "10 min"))
  .avg("signal")

Continuously compute the 10-minute average while data can be up to 15 minutes late.
Structured Streaming
Underneath the Hood
Logically: Dataset operations on a table (i.e. as easy to understand as batch)
Physically: Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously)
[Diagram: query execution – a DataFrame's logical plan goes through the Catalyst optimizer into continuous, incremental execution.]
Batch/Streaming Execution on Spark SQL
[Diagram: the Planner takes a SQL AST, DataFrame, or Dataset through the Catalyst pipeline – an unresolved logical plan, Analysis (using the Catalog), Logical Optimization, Physical Planning with a Cost Model, and Code Generation – to a selected physical plan executed over RDDs.]
A helluva lot of magic!
Batch Execution on Spark SQL
[Diagram: the Planner turns the DataFrame/Dataset logical plan into an execution plan and runs super-optimized Spark jobs to compute the results.]

Project Tungsten – Phase 1 and 2
Code optimizations: bytecode generation; JVM intrinsics, vectorization; operations on serialized data
Memory optimizations: compact and fast encoding; off-heap memory
Continuous Incremental Execution
The Planner knows how to convert streaming logical plans into a continuous series of incremental execution plans.
[Diagram: one DataFrame/Dataset logical plan feeds the Planner, which emits Incremental Execution Plans 1, 2, 3, 4, …]
Continuous Incremental Execution
The Planner polls for new data from the sources, then incrementally executes the new data and writes the results to the sink.
[Diagram: Incremental Execution 1 covers offsets [19-105] (count: 87); Incremental Execution 2 covers offsets [106-197] (count: 92).]
Continuous Aggregations
Maintain the running aggregate as in-memory state, backed by a WAL in the file system for fault tolerance. State data is generated and used across incremental executions.
[Diagram: Incremental Execution 1 (offsets [19-105]) leaves a running count of 87 in memory as state; Incremental Execution 2 (offsets [106-179]) adds 92, giving a count of 87 + 92 = 179.]
Structured Streaming: Recap
• High-level streaming API built on Datasets/DataFrames
• Event time, windowing, sessions, sources & sinks
• End-to-end exactly-once semantics
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Add, remove, change queries at runtime
• Build and apply ML models
Current Status: Spark 2.0.2
Basic infrastructure and API
- Event time, windows, aggregations
- Append and Complete output modes
- Support for a subset of batch queries
Source and sink
- Sources: Files, Kafka, Flume, Kinesis
- Sinks: Files, in-memory table, unmanaged sinks (foreach)
Experimental release to set the future direction
Not ready for production, but good for experiments
Coming Soon (Dec 2016): Spark 2.1.0
Event-time watermarks and eviction of old state
Status monitoring and metrics
- Current status (processing, waiting for data, etc.)
- Current metrics (input rates, processing rates, state data size, etc.)
- Codahale / Dropwizard metrics (reporting to Ganglia, Graphite, etc.)
Many stability & performance improvements
Future Direction
Stability, stability, stability
Support for more queries
Sessionization
More output modes
Support for more sinks
ML integrations
Make Structured Streaming ready for production workloads by Spark 2.2/2.3
Blogs for a Deeper Technical Dive
Blog 1: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
Blog 2: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Resources
• Getting Started Guide with Apache Spark on Databricks
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Databricks Engineering Blogs
• sparkhub.databricks.com
• spark-packages.org
https://spark-summit.org/east-2017/
Do you have any questions
for my prepared answers?
Demo & Workshop: Structured Streaming
• Import Notebook into your Spark 2.0 Cluster
• http://dbricks.co/sswksh3 (Demo)
• http://dbricks.co/sswksh4 (Workshop)
• Done!
Pain points with DStreams
1. Processing with event-time, dealing with late data
   - The DStream API exposes batch time, making it hard to incorporate event-time
2. Interoperating streaming with batch AND interactive
   - RDD/DStream have similar APIs, but still require translation
3. Reasoning about end-to-end guarantees
   - Requires carefully constructing sinks that handle failures correctly
   - Data consistency in the storage while being updated
Output Modes
Defines what is written every time there is a trigger.
Different output modes make sense for different queries.

Append mode with non-aggregation queries:
input.select("device", "signal")
  .write
  .outputMode("append")
  .format("parquet")
  .startStream("dest-path")

Complete mode with aggregation queries:
input.agg(count("*"))
  .write
  .outputMode("complete")
  .format("parquet")
  .startStream("dest-path")
