Jump Start with
Apache® Spark™ 2.x
on Databricks
Jules S. Damji
Spark Community Evangelist
Big Data Trunk Meetup, Fremont 12/17/2016
@2twitme
I have used Apache Spark Before…
I know the difference between
DataFrame and RDDs…
$ whoami
Spark Community Evangelist @ Databricks
Developer Advocate @ Hortonworks
Software engineering @: Sun Microsystems,
Netscape, @Home, VeriSign, Scalix, Centrify,
LoudCloud/Opsware, ProQuest
https://www.linkedin.com/in/dmatrix
@2twitme
Agenda for the next 3+ hours
Hour 1.5:
• Get to know Databricks
• Overview of Spark Fundamentals & Architecture
• What’s New in Spark 2.0
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets…
• Workshop Notebook 1
• Lunch

Hour 1.5:
• Introduction to DataFrames, Datasets and Spark SQL
• Workshop Notebook 2
• Break
• Introduction to Structured Streaming Concepts
• Workshop Notebook 3
• Go Home…
Get to know Databricks
• Get Databricks community edition: http://databricks.com/try-databricks
We are Databricks, the company behind Apache Spark
Founded by the creators of
Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Created Databricks on top of Spark to make big data simple.
The Best Place to Run Apache Spark
Why Spark? Big Data Systems of Yesterday…
MapReduce: general batch processing
Specialized systems for new workloads: Pregel, Dremel, Mahout, Drill, Giraph, Impala, Storm, S4, . . .
Hard to manage, tune, deploy. Hard to combine in pipelines.
Why Spark? Big Data Systems of Yesterday…
Could a unified engine replace both MapReduce (general batch processing) and the specialized systems for new workloads (Pregel, Dremel, Drill, Giraph, Impala, Storm, Mahout, . . .)?
An Analogy: specialized devices vs. a unified device that enables new applications.
Unified engine across diverse workloads & environments
Apache Spark
Fundamentals
&
Architecture
A Resilient Distributed Dataset
(RDD)
2 kinds of Actions:
• return results to the driver: collect, count, reduce, take, show, …
• write to external storage: saveAsTextFile, … (HDFS, S3, SQL, NoSQL, etc.)
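As a quick illustration (a minimal sketch, not from the slides; the paths and the pre-created SparkSession `spark` are assumed), transformations are lazy and only actions trigger execution:

val sc = spark.sparkContext

val lines  = sc.textFile("/tmp/input.txt")        // transformation: nothing runs yet
val errors = lines.filter(_.contains("ERROR"))    // transformation: still lazy

// Kind 1: actions that return a result to the driver
val numErrors = errors.count()
val firstFive = errors.take(5)

// Kind 2: actions that write to external storage (HDFS, S3, etc.)
errors.saveAsTextFile("/tmp/errors-output")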
Apache Spark Architecture
Deployment Modes
• Local
• Standalone
• YARN
• Mesos
[Diagram: in Databricks, each student notebook attaches to its own driver + executor running in a JVM inside a container on an EC2 machine.]
Standalone Mode: Apache Spark Architecture
[Diagram: 30 GB containers each host a 22 GB JVM; executor JVMs (Ex.) expose task slots (S), and one JVM hosts the driver (Dr).]
An Anatomy of an Application
Spark Application
• Jobs
• Stages
• Tasks
[Diagram: an executor JVM inside a container runs tasks (T) in slots (S), each task operating on a partition (*) of a DataFrame/RDD.]
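As a rough sketch of that anatomy (not from the slides; assumes the pre-created SparkSession `spark`): each action submits a job, the scheduler splits the job into stages at shuffle boundaries, and each stage runs one task per partition.

val sc = spark.sparkContext

val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)
val counts = words.map(w => (w, 1))     // narrow transformation: stays in the same stage
                  .reduceByKey(_ + _)   // shuffle boundary: starts a new stage

counts.collect()   // the action submits one job with 2 stages and one task per partition
// The Spark UI shows the job, its stages, and the tasks in each stage.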
How did we Get Here..? Where are we Going..?
A Brief History
2010: Started @ UC Berkeley
2013: Databricks started & donated to ASF
2014: Spark 1.0 & libraries (SQL, ML, GraphX)
2015: DataFrames/Datasets, Tungsten, ML Pipelines
2016: Apache Spark 2.0 – Easier, Smarter, Faster
Apache Spark 2.0
• Steps to Bigger & Better Things….
Builds on all we learned in the past 2 years
Major Themes in Apache Spark 2.0
Faster: Tungsten Phase 2 (speedups of 5-10x) & Catalyst Optimizer
Smarter: Structured Streaming, a real-time engine on SQL / DataFrames
Easier: Unifying Datasets and DataFrames & SparkSessions
Unified API Foundation for the Future: SparkSession, DataFrames, Datasets, MLlib, Structured Streaming…
SparkSession – A Unified entry point to
Spark
• Conduit to Spark
– Creates Datasets/DataFrames
– Reads/writes data
– Works with metadata
– Sets/gets Spark Configuration
– Driver uses for Cluster
resource management
SparkSession vs SparkContext
SparkSession subsumes:
• SparkContext
• SQLContext
• HiveContext
• StreamingContext
• SparkConf
SparkSession – A Unified entry point to
Spark
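A minimal sketch (not the workshop notebook; the paths and table names are made up) of SparkSession as the single conduit to Spark. In Databricks notebooks a SparkSession named `spark` is already created for you:

import org.apache.spark.sql.SparkSession

// One entry point instead of SparkContext + SQLContext + HiveContext
val spark = SparkSession.builder()
  .appName("jump-start")
  .config("spark.sql.shuffle.partitions", "8")     // set configuration
  .getOrCreate()

// Create DataFrames/Datasets and read/write data
val events = spark.read.json("/tmp/events.json")              // hypothetical path
events.write.mode("overwrite").parquet("/tmp/events.parquet")

// Work with metadata through the catalog
spark.catalog.listTables().show()

// Get/set configuration; the underlying SparkContext is still reachable
spark.conf.get("spark.sql.shuffle.partitions")
val sc = spark.sparkContext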
Datasets and DataFrames
Impose Structure
Long Term
• RDD as the low-level API in Spark
  • For control and certain type-safety in Java/Scala
• Datasets & DataFrames give richer semantics & optimizations
  • For semi-structured data and DSL-like operations
• New libraries will increasingly use these as the interchange format
  • Examples: Structured Streaming, MLlib, GraphFrames
Spark 1.6 vs Spark 2.x
Spark 1.6 vs Spark 2.x
Towards SQL 2003
• Today, Spark can run all 99 TPC-DS queries!
- New standard-compliant parser (with good error messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
- https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
- https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
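For example, Spark 2.0 can run a correlated subquery like the one below (a sketch with a made-up schema; `customers` and `orders` are assumed to be registered temp views):

val highValueCustomers = spark.sql("""
  SELECT c.customer_id, c.name
  FROM customers c
  WHERE EXISTS (                      -- correlated subquery
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.amount > 1000)
  """)
highValueCustomers.show()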
[Chart: Preliminary TPC-DS runtime in seconds, Spark 2.0 vs. 1.6 – lower is better.]
Other notable API improvements
• DataFrame-based ML pipeline API becoming the main MLlib API
• ML model & pipeline persistence with almost complete coverage
  • In all programming languages: Scala, Java, Python, R
• Improved R support
  • (Parallelizable) user-defined functions in R
  • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means
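A small sketch of the DataFrame-based pipeline API with the new persistence support (not from the slides; `trainingDF` is an assumed DataFrame with "text" and "label" columns):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(trainingDF)       // assumed training DataFrame

// Pipeline & model persistence, loadable from Scala, Java, Python, or R
model.write.overwrite().save("/tmp/spark-lr-pipeline")
val restored = PipelineModel.load("/tmp/spark-lr-pipeline")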
Workshop: Notebook on SparkSession
• Import Notebook into your Spark 2.0 Cluster
– http://dbricks.co/sswksh1
– http://docs.databricks.com
– http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession
• Familiarize yourself with the Databricks Notebook environment
• Work through each cell
• Ctrl + Return / Shift + Return runs a cell
• Try challenges
• Break…
DataFrames/Datasets & Spark
SQL & Catalyst Optimizer
The not-so-secret truth…
Spark SQL is not about SQL; it is about more than SQL.
Spark SQL: The whole story
It is about creating and running Spark programs faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
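As a rough illustration of "write less code / do less work" (a sketch, not from the slides; assumes the SparkSession `spark`), compare computing an average per key by hand with RDDs versus declaring it with the DataFrame API and letting the optimizer plan the execution:

import spark.implicits._

val pairs = Seq(("a", 1.0), ("b", 4.0), ("a", 3.0))

// RDD version: we spell out how to compute the average
val rddAvg = spark.sparkContext.parallelize(pairs)
  .mapValues(v => (v, 1))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }
  .collect()

// DataFrame version: we declare what we want; Catalyst decides how to do it
val dfAvg = pairs.toDF("key", "value")
  .groupBy("key")
  .avg("value")
  .collect()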
Spark SQL Architecture
[Diagram: SQL, DataFrame, and Dataset queries flow into a logical plan; the optimizer, consulting the catalog, produces a physical plan, and the code generator emits code that runs over RDDs, reading data through the Data Source API.]
Using Catalyst in Spark SQL
SQL ASTs, DataFrames, and Datasets all enter the same pipeline:
Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning (a Cost Model selects among candidate Physical Plans) → Selected Physical Plan → Code Generation → RDDs

Analysis: analyzing a logical plan to resolve references
Logical Optimization: logical plan optimization
Physical Planning: physical planning
Code Generation: compile parts of the query to Java bytecode
Catalyst Optimizations

Logical Optimizations:
• Push filter predicates down to the data source, so irrelevant data can be skipped
• Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding
• RDBMS: reduce the amount of data traffic by pushing predicates down

Create Physical Plan & generate JVM bytecode:
• Catalyst compiles operations into physical plans for execution and generates JVM bytecode
• Intelligently choose between broadcast joins and shuffle joins to reduce network traffic
• Lower-level optimizations: eliminate expensive object allocations and reduce virtual function calls
# Load partitioned Hive table
def add_demographics(events):
    u = sqlCtx.table("users")
    return (events
            .join(u, events.user_id == u.user_id)          # Join on user_id
            .withColumn("city", zipToCity(events.zip)))    # Run udf to add city column
events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "New York").select(events.timestamp).collect()

[Diagram: the logical plan applies the filter on top of the join of the events file and the users table; the plain physical plan filters the events scan before the join, while the physical plan with predicate pushdown and column pruning performs the join over optimized scans of events and users.]
Columns: Predicate pushdown

You write:
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "people")
  .load()
  .where($"name" === "michael")

Spark translates it, for Postgres, into:
SELECT * FROM people WHERE name = 'michael'
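One way to verify the pushdown (a sketch, not from the slides; the connection details are placeholders) is to print the physical plan, where the pushed predicate shows up in the scan's PushedFilters:

import spark.implicits._   // for the $"name" column syntax

val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "people")
  .load()
  .where($"name" === "michael")

// The plan should show the JDBC scan with something like PushedFilters: [EqualTo(name,michael)]
people.explain()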
Foundational Spark 2.0 Components
[Diagram: Spark SQL, ML Pipelines, Structured Streaming, and GraphFrames build on the DataFrame/Dataset and SQL APIs, which run through Catalyst on top of Spark Core (RDD); data sources include { JSON }, JDBC, and more.]
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Dataset Spark 2.0 APIs
Background: What is in an RDD?
• Dependencies
• Partitions (with optional locality info)
• Compute function: Partition => Iterator[T]
Opaque Computation & Opaque Data
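To make "opaque" concrete (a sketch, not from the slides; assumes the SparkSession `spark`): Spark cannot look inside an RDD lambda or its element types, whereas a DataFrame/Dataset expression is visible to Catalyst:

import spark.implicits._

case class Event(userId: Long, city: String)
val events = Seq(Event(1L, "Fremont"), Event(2L, "New York")).toDS()

// RDD: the filter function is an arbitrary closure; Spark just runs it on every element
val rddFiltered = events.rdd.filter(e => e.city == "New York")

// Dataset/DataFrame expression: Catalyst sees the column and the comparison,
// so it can push the predicate into the data source or skip Parquet blocks
val dfFiltered = events.filter($"city" === "New York")
dfFiltered.explain()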
Structured APIs In Spark
                  SQL        DataFrames     Datasets
Syntax Errors     Runtime    Compile Time   Compile Time
Analysis Errors   Runtime    Runtime        Compile Time
Analysis errors are reported before a distributed job starts
Type-safe: operate on domain objects with compiled lambda functions
Dataset API in Spark 2.0
import org.apache.spark.sql.Dataset
import spark.implicits._

val df = spark.read.json("people.json")

// Convert data to domain objects
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(_.age > 30)

// groupBy returns a DataFrame = Dataset[Row]
val groupDF = ds.filter(p => p.name.startsWith("M"))
  .groupBy("name")
  .avg("age")
Source: michaelmalak
Project Tungsten II
Project Tungsten
• Substantially speed up execution by optimizing CPU
efficiency, via: SPARK-12795
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
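A quick way to see the runtime code generation at work (a sketch, not from the slides; assumes the SparkSession `spark`): in Spark 2.0, operators fused by whole-stage code generation are marked with an asterisk in the physical plan:

val agg = spark.range(0, 1000000).selectExpr("sum(id)")
agg.explain()
// Operators printed with a leading '*' (e.g., *HashAggregate, *Range) are compiled
// together into a single generated function instead of being interpreted row by row.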
Tungsten’s Compact Row Format
[Diagram: the tuple (123, “data”, “bricks”) stored as a null bitmap (0x0), the fixed-width value 123, and offset/length entries (offset 32, length 4 → “data”; offset 48, length 6 → “bricks”) pointing to the variable-length fields at the end of the row.]
Datasets: Lightning-fast Serialization with Encoders
Encoders translate between domain objects and Spark's internal representation.
[Diagram: the JVM object MyClass(123, “data”, “bricks”) is encoded to and decoded from the compact internal row 0x0 | 123 | 32L | 48L | 4 “data” | 6 “bricks”.]
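A minimal sketch of an encoder in action (not from the slides; MyClass mirrors the example tuple and is hypothetical; assumes the SparkSession `spark`):

import org.apache.spark.sql.Encoders

case class MyClass(id: Int, name: String, org: String)   // hypothetical domain class

import spark.implicits._   // brings implicit encoders for case classes into scope

// The encoder describes how MyClass maps to Tungsten's compact binary row format
val enc = Encoders.product[MyClass]
enc.schema.printTreeString()   // id: int, name: string, org: string

// Creating a Dataset serializes objects with the encoder, not Java serialization
val ds = Seq(MyClass(123, "data", "bricks")).toDS()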
Performance of Core Primitives
cost per row (single thread)
primitive Spark 1.6 Spark 2.0
filter 15 ns 1.1 ns
sum w/o group 14 ns 0.9 ns
sum w/ group 79 ns 10.7 ns
hash join 115 ns 4.0 ns
sort (8 bit entropy) 620 ns 5.3 ns
sort (64 bit entropy) 620 ns 40 ns
sort-merge join 750 ns 700 ns
Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11
Workshop: Notebook on
DataFrames/Datasets & Spark SQL
• Import Notebook into your Spark 2.0 Cluster
– http://dbricks.co/sswksh2A
– http://dbricks.co/sswksh2
– https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
• Work through each Notebook cell
• Try challenges
• Break..
Introduction to Structured
Streaming
Streaming in Apache Spark
Streaming demands new types of requirements…
[Diagram: Spark Streaming sits alongside SQL, MLlib, and GraphX on top of Spark Core.]
Functional, concise and expressive
Fault-tolerant state management
Unified stack with batch processing
More than 51% of users say it is the most important part of Apache Spark
Spark Streaming in production jumped to 22% from 14%
Streaming apps are
growing more complex
Streaming computations
don’t run in isolation
• Need to interact with batch data,
interactive analysis, machine learning, etc.
Use case: IoT Device Monitoring
IoT event stream from Kafka:
• ETL into long-term storage: prevent data loss, prevent duplicates
• Status monitoring: handle late data, aggregate on windows on event time
• Interactively debug issues: consistency
• Anomaly detection: learn models offline, use online + continuous learning
Use case: IoT Device Monitoring (the same pipeline viewed as a whole: ETL, status monitoring, interactive debugging, and anomaly detection on one event stream)
Continuous Applications: not just streaming any more
The simplest way to perform streaming analytics
is not having to reason about streaming at all
Stream as an unbounded DataFrame
[Diagram: a static, bounded table vs. a streaming, unbounded table – a single API for both.]
Gist of Structured Streaming
High-level streaming API built on the Spark SQL engine
Runs the same computation as batch queries in Datasets / DataFrames
Event time, windowing, sessions, sources & sinks
Guarantees end-to-end exactly-once semantics
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve using JDBC
Add, remove, change queries at runtime
Build and apply ML models to your stream
Advantages over DStreams
1. Processing with event-time, dealing with late data
2. Exactly the same API for batch, streaming, and interactive
3. End-to-end exactly-once guarantees from the system
4. Performance through SQL optimizations
   - Logical plan optimizations, Tungsten, Codegen, etc.
   - Faster state management for stateful stream processing
Structured Streaming Model – Trigger: every 1 sec
[Diagram: at triggers 1, 2, 3 the input table grows to hold the data up to 1, up to 2, up to 3, and the query runs over it each time.]
Input: data from the source as an append-only table
Trigger: new rows appended to the table
Query: operations on the input – the usual map/filter/reduce, plus new window and session ops
Structured Streaming Model – Trigger: every 1 sec
[Diagram: at each trigger the query runs over the growing input table (data up to 1, 2, 3) and updates a result table, whose output for data up to 1, 2, 3 is written out – here in complete mode.]
Result: the final operated table, updated at every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Structured Streaming Model – Trigger: every 1 sec
[Diagram: the same input/query/result flow; in append mode only the rows added to the result since the last trigger are written out.]
Result: the final operated table, updated at every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Append output: write only the new rows that got added to the result table since the previous batch
*Not all output modes are feasible with all queries
Example WordCount
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
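The linked guide's word count, paraphrased as a sketch (assumes the SparkSession `spark` and a test socket source such as `nc -lk 9999`):

import org.apache.spark.sql.functions._
import spark.implicits._

// Lines arriving on the socket become an unbounded table with a single "value" column
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame operations as a batch word count
val wordCounts = lines
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word")
  .count()

// Complete mode: the full result table is emitted at every trigger
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()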
Batch ETL with DataFrame
inputDF = spark.read
  .format("json")
  .load("source-path")

resultDF = inputDF
  .select("device", "signal")
  .where("signal > 15")

resultDF.write
  .format("parquet")
  .save("dest-path")

Read from a JSON file
Select some devices
Write to a parquet file
Streaming ETL with DataFrame
input = ctxt.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

read…stream() creates a streaming DataFrame but does not start any computation.
write…startStream() defines where & how to output the data, and starts the processing.
(In the final Spark 2.0 API these became spark.readStream and writeStream…start(), as in the next example.)
Streaming ETL with DataFrames
[Diagram: at triggers 1, 2, 3 the input grows; in append mode only the new result rows of 2 and of 3 are written to the output.]

input = spark.readStream
  .format("json")
  .load("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.writeStream
  .format("parquet")
  .start("dest-path")
Continuous Aggregations
Continuously compute the average signal of each type of device:

input.groupBy("device-type")
  .avg("signal")
Continuous Windowed Aggregations
Continuously compute the average signal of each type of device over the last 10 minutes, using the event-time column:

input.groupBy(
    $"device-type",
    window($"event-time", "10 min"))
  .avg("signal")
Joining streams with static data
kafkaDataset = spark.readStream
  .format("kafka")
  .option("subscribe", "iot-updates")
  .option("kafka.bootstrap.servers", ...)
  .load()

staticDataset = ctxt.read
  .jdbc("jdbc://", "iot-device-info")

joinedDataset =
  kafkaDataset.join(staticDataset, "device-type")

Join streaming data from Kafka with static data via JDBC to enrich the streaming data…
…without having to think about the fact that you are joining streaming data.
Query Management
query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()

query: a handle to the running streaming computation, for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time.
Each query has a unique name for keeping track.
Watermarking for handling late data
[Spark 2.1]
input.withWatermark("event-time", "15 min")
  .groupBy(
    $"device-type",
    window($"event-time", "10 min"))
  .avg("signal")

Continuously compute the 10-minute average while data can be up to 15 minutes late.
Structured Streaming
Underneath the Hood
Logically: Dataset operations on a table (i.e. as easy to understand as batch)
Physically: Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously)
[Diagram: query execution – a DataFrame's logical plan goes through the Catalyst optimizer into continuous, incremental execution.]
Batch/Streaming Execution on Spark SQL
[Diagram: the Planner takes a SQL AST, DataFrame, or Dataset through the Catalyst pipeline – an unresolved logical plan, Analysis (using the Catalog), Logical Optimization, Physical Planning with a Cost Model, and Code Generation – to a selected physical plan executed over RDDs.]
A helluva lot of magic!
Batch Execution on Spark SQL
[Diagram: the Planner turns the DataFrame/Dataset logical plan into an execution plan and runs super-optimized Spark jobs to compute the results.]

Project Tungsten – Phase 1 and 2
Code optimizations: bytecode generation; JVM intrinsics, vectorization; operations on serialized data
Memory optimizations: compact and fast encoding; off-heap memory
Continuous Incremental Execution
The Planner knows how to convert streaming logical plans into a continuous series of incremental execution plans.
[Diagram: one DataFrame/Dataset logical plan feeds the Planner, which emits Incremental Execution Plans 1, 2, 3, 4, …]
Continuous Incremental Execution
The Planner polls for new data from the sources, then incrementally executes the new data and writes the results to the sink.
[Diagram: Incremental Execution 1 covers offsets [19-105] (count: 87); Incremental Execution 2 covers offsets [106-197] (count: 92).]
Continuous Aggregations
Maintain the running aggregate as in-memory state, backed by a WAL in the file system for fault tolerance. State data is generated and used across incremental executions.
[Diagram: Incremental Execution 1 (offsets [19-105]) leaves a running count of 87 in memory as state; Incremental Execution 2 (offsets [106-179]) adds 92, giving a count of 87 + 92 = 179.]
Structured Streaming: Recap
• High-level streaming API built on Datasets/DataFrames
• Event time, windowing, sessions, sources & sinks
• End-to-end exactly-once semantics
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Add, remove, change queries at runtime
• Build and apply ML models
Current Status: Spark 2.0.2
Basic infrastructure and API
- Event time, windows, aggregations
- Append and Complete output modes
- Support for a subset of batch queries
Source and sink
- Sources: Files, Kafka, Flume, Kinesis
- Sinks: Files, in-memory table, unmanaged sinks (foreach)
Experimental release to set the future direction
Not ready for production, but good for experiments
Coming Soon (Dec 2016): Spark 2.1.0
Event-time watermarks and eviction of old state
Status monitoring and metrics
- Current status (processing, waiting for data, etc.)
- Current metrics (input rates, processing rates, state data size, etc.)
- Codahale / Dropwizard metrics (reporting to Ganglia, Graphite, etc.)
Many stability & performance improvements
Future Direction
Stability, stability, stability
Support for more queries
Sessionization
More output modes
Support for more sinks
ML integrations
Make Structured Streaming ready for production workloads by Spark 2.2/2.3
Blogs for a Deeper Technical Dive
Blog 1: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
Blog 2: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Resources
• Getting Started Guide with Apache Spark on Databricks
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Databricks Engineering Blogs
• sparkhub.databricks.com
• spark-packages.org
https://spark-summit.org/east-2017/
Do you have any questions
for my prepared answers?
Demo & Workshop: Structured Streaming
• Import Notebook into your Spark 2.0 Cluster
• http://dbricks.co/sswksh3 (Demo)
• http://dbricks.co/sswksh4 (Workshop)
• Done!
Pain points with DStreams
1. Processing with event-time, dealing with late data
   - The DStream API exposes batch time, making it hard to incorporate event-time
2. Interoperating streaming with batch AND interactive
   - RDD/DStream have similar APIs, but still require translation
3. Reasoning about end-to-end guarantees
   - Requires carefully constructing sinks that handle failures correctly
   - Data consistency in the storage while being updated
Output Modes
Defines what is written every time there is a trigger.
Different output modes make sense for different queries.

Append mode with non-aggregation queries:
input.select("device", "signal")
  .write
  .outputMode("append")
  .format("parquet")
  .startStream("dest-path")

Complete mode with aggregation queries:
input.agg(count("*"))
  .write
  .outputMode("complete")
  .format("parquet")
  .startStream("dest-path")
