Jump Start with
Apache® Spark™ 2.x
on Databricks
Jules S. Damji
Spark Community Evangelist
Big Data Trunk Meetup, Fremont, 12/17/2016
Apache Spark 2.x has laid the foundation for many new features and functionality. Its three main themes, easier, faster, and smarter, are pervasive in its unified and simplified high-level APIs for structured data.

In this introductory, part-lecture and part hands-on workshop, you'll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What's new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/DataFrames and Spark SQL
Introduction to Structured Streaming concepts and APIs

  1. 1. Jump Start with Apache® Spark™ 2.x on Databricks Jules S. Damji Spark Community Evangelist Big Data Trunk Meetup, Fremont 12/17/2016 @2twitme
  2. 2. I have used Apache Spark Before…
  3. 3. I know the difference between DataFrame and RDDs…
  4. 4. $ whoami Spark Community Evangelist @ Databricks Developer Advocate @ Hortonworks Software engineering @: Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest https://www.linkedin.com/in/dmatrix @2twitme
  5. 5. Agenda for the next 3+ hours • Get to know Databricks • Overview of Spark Fundamentals & Architecture • What's New in Spark 2.0 • Unified APIs: SparkSessions, SQL, DataFrames, Datasets… • Workshop Notebook 1 • Lunch Hour 1.5 • Introduction to DataFrames, Datasets and Spark SQL • Workshop Notebook 2 • Break • Introduction to Structured Streaming Concepts • Workshop Notebook 3 • Go Home… Hour 1.5
  6. 6. Get to know Databricks • Get Databricks community edition http://databricks.com/try-databricks
  7. 7. We are Databricks, the company behind Apache Spark. Founded by the creators of Apache Spark in 2013. Share of Spark code contributed by Databricks in 2014: 75%. Databricks builds on top of Spark to make big data simple: the best place to run Apache Spark.
  8. 8. Why Spark? Big Data Systems of Yesterday… MapReduce: general batch processing. Specialized systems for new workloads: Pregel, Dremel, Mahout, Drill, Giraph, Impala, Storm, S4, … Hard to manage, tune, deploy; hard to combine in pipelines.
  9. 9. Why Spark? Big Data Systems of Yesterday… MapReduce (general batch processing) gave way to a unified engine; and the specialized systems for new workloads (Pregel, Dremel, Drill, Giraph, Impala, Storm, Mahout, …) gave way to ?
  10. 10. An Analogy Specialized devices Unified device New applications
  11. 11. Unified engine across diverse workloads & environments
  12. 12. Unified engine across diverse workloads & environments
  13. 13. Apache Spark Fundamentals & Architecture
  14. 14. A Resilient Distributed Dataset (RDD)
  15. 15. 2 kinds of Actions: those that return results to the driver (collect, count, reduce, take, show, …) and those that save output to storage, e.g. saveAsTextFile (HDFS, S3, SQL, NoSQL, etc.)
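As a minimal sketch of the two kinds (assuming a SparkSession named spark is available, as on Databricks; the output path is made up):

// Kind 1: actions that bring results back to the driver
val nums = spark.sparkContext.parallelize(1 to 100)
val total = nums.reduce(_ + _)      // 5050, computed on the executors, returned to the driver
val firstFive = nums.take(5)        // Array(1, 2, 3, 4, 5)

// Kind 2: actions that write results out to external storage (HDFS, S3, …)
nums.map(_.toString).saveAsTextFile("/tmp/jumpstart-numbers")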
  16. 16. Apache Spark Architecture: Deployment Modes • Local • Standalone • YARN • Mesos
  17. 17. [Diagram: on an EC2 machine, each student notebook (Student-1, Student-2) gets its own container running a JVM with a Driver + Executor]
  18. 18. Standalone Mode: [Diagram: 30 GB containers each running a 22 GB JVM; executors (Ex.) expose task slots (S), and one JVM hosts the driver (Dr)]
  19. 19. Apache Spark Architecture: An Anatomy of an Application. Spark Application • Jobs • Stages • Tasks
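A rough sketch of how those pieces map onto code (assuming a SparkSession named spark; the file path and partition count are illustrative):

val words = spark.sparkContext.textFile("/tmp/words.txt", 4)   // 4 partitions => 4 tasks per stage
val counts = words
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)   // wide dependency: the shuffle here creates a stage boundary
counts.count()          // the action triggers one job with two stages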
  20. 20. [Diagram: an executor JVM inside a container, with slots (S) running tasks (T) over DF/RDD partitions (*)]
  21. 21. How did we Get Here..? Where are we Going..?
  22. 22. A Brief History: 2010, started @ UC Berkeley; 2013, Databricks started & Spark donated to ASF; 2014, Spark 1.0 & libraries (SQL, ML, GraphX); 2015, DataFrames/Datasets, Tungsten, ML Pipelines; 2016, Apache Spark 2.0: Easier, Smarter, Faster
  23. 23. Apache Spark 2.0 • Steps to Bigger & Better Things…. Builds on all we learned in the past 2 years
  24. 24. Major Themes in Apache Spark 2.0. Faster: Tungsten Phase 2 (speedups of 5-10x) & Catalyst Optimizer. Smarter: Structured Streaming, a real-time engine on SQL / DataFrames. Easier: unifying Datasets and DataFrames, and SparkSessions
  25. 25. Unified API Foundation for the Future: Spark Sessions, DataFrame, Dataset, MLlib, Structured Streaming…
  26. 26. SparkSession – A Unified entry point to Spark • Conduit to Spark – Creates Datasets/DataFrames – Reads/writes data – Works with metadata – Sets/gets Spark Configuration – Driver uses for Cluster resource management
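For orientation, a minimal sketch of using that single entry point (on Databricks a session named spark already exists; the builder call, app name, config key, and path below are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("jump-start-2x")
  .config("spark.sql.shuffle.partitions", "8")   // set Spark configuration through the session
  .getOrCreate()

val people = spark.read.json("/tmp/people.json") // create a DataFrame by reading data
spark.conf.get("spark.sql.shuffle.partitions")   // get configuration back
spark.catalog.listTables()                       // work with metadata via the catalog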
  27. 27. SparkSession vs SparkContext SparkSessions Subsumes • SparkContext • SQLContext • HiveContext • StreamingContext • SparkConf
  28. 28. SparkSession – A Unified entry point to Spark
  29. 29. Datasets and DataFrames
  30. 30. Impose Structure
  31. 31. Long Term • RDD as the low-level API in Spark • For control and certain type-safety in Java/Scala • Datasets & DataFrames give richer semantics & optimizations • For semi-structured data and DSL-like operations • New libraries will increasingly use these as the interchange format • Examples: Structured Streaming, MLlib, GraphFrames
  32. 32. Spark 1.6 vs Spark 2.x
  33. 33. Spark 1.6 vs Spark 2.x
  34. 34. Towards SQL 2003 • Today, Spark can run all 99 TPC-DS queries! - New standard-compliant parser (with good error messages!) - Subqueries (correlated & uncorrelated) - Approximate aggregate stats - https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html - https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
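For illustration, a correlated subquery of the kind the new parser accepts (table and column names are invented for this sketch):

spark.sql("""
  SELECT c.customer_id, c.name
  FROM customers c
  WHERE c.total_spend > (SELECT avg(o.amount)
                         FROM orders o
                         WHERE o.customer_id = c.customer_id)
""")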
  35. 35. [Chart: Preliminary TPC-DS runtime in seconds, Spark 2.0 vs 1.6; lower is better]
  36. 36. Other notable API improvements • DataFrame-based ML pipeline API becoming the main MLlib API • ML model & pipeline persistence with almost complete coverage • In all programming languages: Scala, Java, Python, R • Improved R support • (Parallelizable) User-defined functions in R • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means
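A minimal sketch of the pipeline persistence mentioned above (stages, column names, and paths are illustrative; trainingDF is an assumed DataFrame with text and label columns):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.classification.LogisticRegression

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF)

model.write.overwrite().save("/tmp/jumpstart-model")   // persist the fitted pipeline
val reloaded = PipelineModel.load("/tmp/jumpstart-model")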
  37. 37. Workshop: Notebook on SparkSession • Import Notebook into your Spark 2.0 Cluster – http://dbricks.co/sswksh1 – http://docs.databricks.com – http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession • Familiarize yourself with the Databricks Notebook environment • Work through each cell • Ctrl + Return / Shift + Return • Try challenges • Break…
  38. 38. DataFrames/Datasets & Spark SQL & Catalyst Optimizer
  39. 39. The not so secret truth… SQL is not about SQL; it is about more than SQL
  40. 40. Spark SQL: The whole story. It is about creating and running Spark programs faster: • Write less code • Read less data • Do less work • Let the optimizer do the hard work
  41. 41. Spark SQL Architecture Logical Plan Physical Plan Catalog Optimizer RDDs … Data Source API SQL DataFrames Code Generator Datasets
  42. 42. Using Catalyst in Spark SQL. SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> (Analysis, using the Catalog) -> Logical Plan -> (Logical Optimization) -> Optimized Logical Plan -> (Physical Planning) -> Physical Plans -> (Cost Model) -> Selected Physical Plan -> (Code Generation) -> RDDs. Analysis: analyzing a logical plan to resolve references. Logical Optimization: logical plan optimization. Physical Planning: physical planning. Code Generation: compile parts of the query to Java bytecode
  43. 43. Catalyst Optimizations. Logical Optimizations: • Push filter predicates down to the data source, so irrelevant data can be skipped • Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding • RDBMS: reduce the amount of data traffic by pushing predicates down. Create Physical Plan & generate JVM bytecode: • Catalyst compiles operations into physical plans for execution and generates JVM bytecode • Intelligently choose between broadcast joins and shuffle joins to reduce network traffic • Lower-level optimizations: eliminate expensive object allocations and reduce virtual function calls
  44. 44.
# Load partitioned Hive table
def add_demographics(events):
    u = sqlCtx.table("users")                       # Load partitioned Hive table
    events \
      .join(u, events.user_id == u.user_id) \       # Join on user_id
      .withColumn("city", zipToCity(df.zip))        # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "New York").select(events.timestamp).collect()
Logical Plan: scan (events file), scan (users table) -> join -> filter. Physical Plan: scan (events), scan (users) -> join -> filter. Physical Plan with Predicate Pushdown and Column Pruning: optimized scan (events), optimized scan (users) -> join.
  45. 45. Columns: Predicate pushdown. You write:
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "people")
  .load()
  .where($"name" === "michael")
Spark translates, for Postgres: SELECT * FROM people WHERE name = 'michael'
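To check the pushdown in a sketch like the one above, explain() can be used (same made-up JDBC source; import spark.implicits._ is needed for the $ column syntax):

import spark.implicits._

val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "people")
  .load()
  .where($"name" === "michael")

// The physical plan's JDBC scan should list the name filter under PushedFilters
// instead of a separate Spark-side Filter operator.
people.explain()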
  46. 46. Foundational Spark 2.0 Components: Spark Core (RDD), Catalyst, Spark SQL, DataFrame/Dataset, ML Pipelines, Structured Streaming, GraphFrames; data sources: { JSON }, JDBC, and more…
  47. 47. http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
  48. 48. Dataset Spark 2.0 APIs
  49. 49. Background: What is in an RDD? • Dependencies • Partitions (with optional locality info) • Compute function: Partition => Iterator[T]. Opaque computation & opaque data
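To make "opaque" concrete, a small sketch (assuming a SparkSession named spark; the Event case class is invented for illustration): with the low-level RDD API, Spark sees only a closure over JVM objects, so it cannot optimize what happens inside the lambda.

case class Event(userId: Long, city: String)

val events = spark.sparkContext.parallelize(Seq(Event(1L, "Fremont"), Event(2L, "New York")))

// Spark only knows this is a Partition => Iterator[Event] computation; the predicate
// inside the lambda is a black box, so no predicate pushdown or column pruning happens.
val ny = events.filter(e => e.city == "New York")
ny.count()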
  50. 50. Structured APIs In Spark: when errors are caught. SQL: syntax errors at runtime, analysis errors at runtime. DataFrames: syntax errors at compile time, analysis errors at runtime. Datasets: syntax errors at compile time, analysis errors at compile time. Analysis errors are reported before a distributed job starts
  51. 51. Dataset API in Spark 2.0. Type-safe: operate on domain objects with compiled lambda functions.
val df = spark.read.json("people.json")
// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(_.age > 30)
// will return DataFrame = Dataset[Row]
val groupDF = ds.filter(p => p.name.startsWith("M"))
  .groupBy("name")
  .avg("age")
  52. 52. Source: michaelmalak
  53. 53. Project Tungsten II
  54. 54. Project Tungsten • Substantially speed up execution by optimizing CPU efficiency, via: SPARK-12795 (1) Runtime code generation (2) Exploiting cache locality (3) Off-heap memory management
  55. 55. Tungsten's Compact Row Format: the tuple (123, "data", "bricks") is laid out as 0x0 | 123 | 32L | 48L | 4 "data" | 6 "bricks" (null bitmap, fixed-width value, offsets to data, field lengths)
  56. 56. Encoders: the JVM object MyClass(123, "data", "bricks") and its internal representation 0x0 | 123 | 32L | 48L | 4 "data" | 6 "bricks". Encoders translate between domain objects and Spark's internal format
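A small sketch of an encoder at work (the case class mirrors the slide's MyClass; a SparkSession named spark is assumed):

import org.apache.spark.sql.Encoders

case class MyClass(id: Int, name: String, org: String)

val enc = Encoders.product[MyClass]
enc.schema.printTreeString()   // the schema the encoder maps onto Tungsten's row format

import spark.implicits._
val ds = Seq(MyClass(123, "data", "bricks")).toDS()   // the same encoder is picked up implicitly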
  57. 57. Datasets: Lightning-fast Serialization with Encoders
  58. 58. Performance of Core Primitives: cost per row (single thread).
primitive | Spark 1.6 | Spark 2.0
filter | 15 ns | 1.1 ns
sum w/o group | 14 ns | 0.9 ns
sum w/ group | 79 ns | 10.7 ns
hash join | 115 ns | 4.0 ns
sort (8-bit entropy) | 620 ns | 5.3 ns
sort (64-bit entropy) | 620 ns | 40 ns
sort-merge join | 750 ns | 700 ns
Intel Haswell i7 4960HQ 2.6 GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11
  59. 59. Workshop: Notebook on DataFrames/Datasets & Spark SQL • Import Notebook into your Spark 2.0 Cluster – http://dbricks.co/sswksh2A – http://dbricks.co/sswksh2 – https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset • Work through each Notebook cell • Try challenges • Break…
  60. 60. Introduction to Structured Streaming
  61. 61. Streaming in Apache Spark. Streaming demands new types of streaming requirements… Spark Streaming on the unified stack (SQL, MLlib, Spark Core, GraphX): functional, concise and expressive; fault-tolerant state management; unified stack with batch processing. More than 51% of users say it is the most important part of Apache Spark; Spark Streaming in production jumped to 22% from 14%
  62. 62. Streaming apps are growing more complex
  63. 63. Streaming computations don’t run in isolation • Need to interact with batch data, interactive analysis, machine learning, etc.
  64. 64. Use case: IoT Device Monitoring. IoT event stream from Kafka. ETL into long term storage (prevent data loss, prevent duplicates). Status monitoring (handle late data, aggregate on windows on event time). Interactively debug issues (consistency). Anomaly detection (learn models offline, use online + continuous learning)
  65. 65. Use case: IoT Device Monitoring. IoT event stream from Kafka. ETL into long term storage (prevent data loss, prevent duplicates). Status monitoring (handle late data, aggregate on windows on event time). Interactively debug issues (consistency). Anomaly detection (learn models offline, use online + continuous learning). Continuous Applications: not just streaming any more
  66. 66. The simplest way to perform streaming analytics is not having to reason about streaming at all
  67. 67. Stream as an unbounded DataFrame: static, bounded table vs streaming, unbounded table. Single API!
  68. 68. Stream as unbounded DataFrame
  69. 69. Gist of Structured Streaming. High-level streaming API built on the Spark SQL engine. Runs the same computation as batch queries in Datasets / DataFrames. Event time, windowing, sessions, sources & sinks. Guarantees end-to-end exactly-once semantics. Unifies streaming, interactive and batch queries. Aggregate data in a stream, then serve using JDBC. Add, remove, change queries at runtime. Build and apply ML models to your stream
  70. 70. Advantages over DStreams 1. Processing with event-time, dealing with late data 2. Exactly same API for batch, streaming, and interactive 3. End-to-end exactly-once guarantees from the system 4. Performance through SQL optimizations - Logical plan optimizations, Tungsten, Codegen, etc. - Faster state management for stateful stream processing
  71. 71. Structured Streaming Model. Trigger: every 1 sec. Input: data from the source as an append-only table (data up to 1, up to 2, up to 3 at trigger times 1, 2, 3). Trigger: new rows appended to the table. Query: operations on the input; the usual map/filter/reduce plus new window and session ops
  72. 72. Structured Streaming Model. Trigger: every 1 sec. Result: the final operated table, updated at every trigger interval (output for data up to 1, up to 2, up to 3). Output: what part of the result to write to the data sink after every trigger. Complete output: write the full result table every time
  73. 73. Structured Streaming Model. Trigger: every 1 sec. Result: the final operated table, updated at every trigger interval. Output: what part of the result to write to the data sink after every trigger. Complete output: write the full result table every time. Append output: write only the new rows that got added to the result table since the previous batch. *Not all output modes are feasible with all queries
  74. 74. Example WordCount https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
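The word-count example from that guide looks roughly like this (socket source, host, and port are illustrative; a SparkSession named spark and import spark.implicits._ are assumed):

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")   // aggregation query, so complete mode makes sense
  .format("console")
  .start()

query.awaitTermination()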
  75. 75. Batch ETL with DataFrame:
inputDF = spark.read
  .format("json")
  .load("source-path")        # Read from JSON file
resultDF = inputDF
  .select("device", "signal")
  .where("signal > 15")       # Select some devices
resultDF.write
  .format("parquet")
  .save("dest-path")          # Write to Parquet file
  76. 76. Streaming ETL with DataFrame:
input = ctxt.read
  .format("json")
  .stream("source-path")      # read…stream() creates a streaming DataFrame, does not start any of the computation
result = input
  .select("device", "signal")
  .where("signal > 15")
result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")   # write…startStream() defines where & how to output the data and starts the processing
  77. 77. Streaming ETL with DataFrames (output in append mode: only the new rows in the result of triggers 2 and 3 are written):
input = spark.readStream
  .format("json")
  .load("source-path")
result = input
  .select("device", "signal")
  .where("signal > 15")
result.writeStream
  .format("parquet")
  .start("dest-path")
  78. 78. Continuous Aggregations: continuously compute the average signal of each type of device. input.groupBy("device-type").avg("signal")
  79. 79. Continuous Windowed Aggregations: continuously compute the average signal of each type of device in the last 10 minutes, using the event-time column. input.groupBy($"device-type", window($"event-time", "10 min")).avg("signal")
  80. 80. Joining streams with static data: join streaming data from Kafka with static data via JDBC to enrich the streaming data, without having to think about the fact that you are joining streaming data.
kafkaDataset = spark.readStream
  .format("kafka")
  .option("subscribe", "iot-updates")
  .option("kafka.bootstrap.servers", ...)
  .load()
staticDataset = ctxt.read
  .jdbc("jdbc://", "iot-device-info")
joinedDataset = kafkaDataset.join(staticDataset, "device-type")
  81. 81. Query Management: query is a handle to the running streaming computation, for managing it (stop it, wait for it to terminate, get status, get the error if it terminated). Multiple queries can be active at the same time; each query has a unique name for keeping track.
query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")
query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()
  82. 82. Watermarking for handling late data [Spark 2.1]: continuously compute a 10-minute average while data can be 15 minutes late. input.withWatermark("event-time", "15 min").groupBy($"device-type", window($"event-time", "10 min")).avg("signal")
  83. 83. Structured Streaming Underneath the Hood
  84. 84. Query Execution. Logically: Dataset operations on a table (i.e. as easy to understand as batch). Physically: Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously). DataFrame -> Logical Plan -> Catalyst optimizer -> continuous, incremental execution
  85. 85. Batch/Streaming Execution on Spark SQL. SQL AST / DataFrame / Dataset -> Unresolved Logical Plan -> (Analysis, using the Catalog) -> Logical Plan -> (Logical Optimization) -> Optimized Logical Plan -> (Physical Planning) -> Physical Plans -> (Cost Model) -> Selected Physical Plan -> (Code Generation) -> RDDs. Helluva lot of magic!
  86. 86. Batch Execution on Spark SQL. DataFrame/Dataset -> Logical Plan -> Planner -> Execution Plan. Run super-optimized Spark jobs to compute results. Project Tungsten, Phase 1 and 2: Code Optimizations (bytecode generation, JVM intrinsics, vectorization, operations on serialized data); Memory Optimizations (compact and fast encoding, off-heap memory)
  87. 87. Continuous Incremental Execution: the Planner knows how to convert streaming logical plans to a continuous series of incremental execution plans (DataFrame/Dataset -> Logical Plan -> Planner -> Incremental Execution Plan 1, 2, 3, 4, …)
  88. 88. Continuous Incremental Execution: the Planner polls for new data from the sources, then incrementally executes the new data and writes to the sink (e.g. Incremental Execution 1: offsets [19-105], count 87; Incremental Execution 2: offsets [106-197], count 92)
  89. 89. Continuous Aggregations: maintain the running aggregate as in-memory state, backed by a WAL in the file system for fault tolerance; state data is generated and used across incremental executions (Incremental Execution 1: state 87, offsets [19-105], running count 87; Incremental Execution 2: state 179, offsets [106-179], count 87 + 92 = 179)
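In the released API, that fault tolerance is wired up through a checkpoint location; a sketch (the streaming DataFrame input is the one from the earlier slides, and the sink, query name, and path are illustrative):

val query = input.groupBy("device-type").avg("signal")
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "/checkpoints/device-avg")  // WAL and state snapshots go here
  .format("memory")
  .queryName("device_avg")
  .start()

spark.sql("SELECT * FROM device_avg").show()   // serve the continuously updated aggregate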
  90. 90. Structured Streaming: Recap • High-level streaming API built on Datasets/DataFrames • Event time, windowing, sessions, sources & sinks • End-to-end exactly-once semantics • Unifies streaming, interactive and batch queries • Aggregate data in a stream, then serve using JDBC • Add, remove, change queries at runtime • Build and apply ML models
  91. 91. Current Status: Spark 2.0.2. Basic infrastructure and API - Event time, windows, aggregations - Append and Complete output modes - Support for a subset of batch queries. Source and sink - Sources: Files, Kafka, Flume, Kinesis - Sinks: Files, in-memory table, unmanaged sinks (foreach). Experimental release to set the future direction; not ready for production but good for experiments
  92. 92. Coming Soon (Dec 2016): Spark 2.1.0. Event-time watermarks and old state eviction. Status monitoring and metrics - Current status (processing, waiting for data, etc.) - Current metrics (input rates, processing rates, state data size, etc.) - Codahale / Dropwizard metrics (reporting to Ganglia, Graphite, etc.). Many stability & performance improvements
  93. 93. Future Direction. Stability, stability, stability. Support for more queries. Sessionization. More output modes. Support for more sinks. ML integrations. Make Structured Streaming ready for production workloads by Spark 2.2/2.3
  94. 94. Blogs for Deeper Technical Dive. Blog 1: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html Blog 2: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  95. 95. Resources • Getting Started Guide with Apache Spark on Databricks • docs.databricks.com • Spark Programming Guide • Structured Streaming Programming Guide • Databricks Engineering Blogs • sparkhub.databricks.com • spark-packages.org
  96. 96. https://spark-summit.org/east-2017/
  97. 97. Do you have any questions for my prepared answers?
  98. 98. Demo & Workshop: Structured Streaming • Import Notebook into your Spark 2.0 Cluster • http://dbricks.co/sswksh3 (Demo) • http://dbricks.co/sswksh4 (Workshop) • Done!
  99. 99. Pain points with DStreams: 1. Processing with event-time, dealing with late data - DStream API exposes batch time, hard to incorporate event-time. 2. Interoperate streaming with batch AND interactive - RDD/DStream has a similar API, but still requires translation. 3. Reasoning about end-to-end guarantees - Requires carefully constructing sinks that handle failures correctly - Data consistency in the storage while being updated
  100. 100. Output Modes. Defines what is written every time there is a trigger; different output modes make sense for different queries.
Append mode with non-aggregation queries:
input.select("device", "signal")
  .write
  .outputMode("append")
  .format("parquet")
  .startStream("dest-path")
Complete mode with aggregation queries:
input.agg(count("*"))
  .write
  .outputMode("complete")
  .format("parquet")
  .startStream("dest-path")