Continuous Application
with
Apache® Spark™ 2.0
Jules S. Damji
Spark Community Evangelist
QCon SF, 11/10/2016
@2twitme
$ whoami
• Spark Community Evangelist @ Databricks
• Previously Developer Advocate @ Hortonworks
• In the past engineering roles at:
• Sun Microsystems, Netscape, @Home, VeriSign,
Scalix, Centrify, LoudCloud/Opsware, ProQuest
• jules@databricks.com
• https://www.linkedin.com/in/dmatrix
Introduction to Structured
Streaming
Streaming in Apache Spark
Streaming demands new types of requirements…
Spark stack: SQL, Streaming, MLlib, and GraphX on Spark Core
Functional, concise and expressive
Fault-tolerant state management
Unified stack with batch processing
More than 51% of users say it is the most important part of Apache Spark
Spark Streaming in production jumped to 22% from 14%
Streaming apps are
growing more complex
Streaming computations
don’t run in isolation
• Need to interact with batch data,
interactive analysis, machine learning, etc.
Use case: IoT Device Monitoring
IoT events arrive from Kafka as an event stream
ETL into long-term storage
- Prevent data loss
- Prevent duplicates
Status monitoring
- Handle late data
- Aggregate on windows on event time
Interactively debug issues
- Consistency
Anomaly detection
- Learn models offline
- Use online + continuous learning
Use case: IoT Device Monitoring (the same pipeline as above)
Continuous Applications
Not just streaming any more
The simplest way to perform streaming analytics
is not having to reason about streaming at all
Static, bounded table vs. streaming, unbounded table:
a stream is treated as an unbounded DataFrame
Single API!
Stream as unbounded DataFrame
Gist of Structured Streaming
High-level streaming API built on the Spark SQL engine
Runs the same computation as batch queries on Datasets / DataFrames
Event time, windowing, sessions, sources & sinks
Guarantees end-to-end exactly-once semantics
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve it using JDBC
Add, remove, and change queries at runtime
Build and apply ML models to your stream
Advantages over DStreams
1. Processing with event time, dealing with late data
2. Exactly the same API for batch, streaming, and interactive
3. End-to-end exactly-once guarantees from the system
4. Performance through SQL optimizations
- Logical plan optimizations, Tungsten, codegen, etc.
- Faster state management for stateful stream processing
Structured Streaming Model
Trigger: every 1 sec
[Diagram: at each trigger (1, 2, 3), the input table grows to include all data up to that point, and the query runs over it]
Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input (the usual map/filter/reduce, plus new window and session ops)
Structured Streaming Model
Trigger: every 1 sec
[Diagram: at each trigger (1, 2, 3), the query runs over the input accumulated so far and produces the output for data up to that point]
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Structured Streaming Model
Trigger: every 1 sec
[Diagram: the same model, now writing only the delta of the result table at each trigger]
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Delta output: write only the rows that changed in the result since the previous batch
Append output: write only new rows
*Not all output modes are feasible with all queries
Example WordCount
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
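The linked programming guide walks through a streaming word count; here is a minimal PySpark sketch of it using the API names that eventually shipped in Spark 2.x (readStream / writeStream / start, rather than the preview read…stream() / write…startStream() used on the slides that follow). The socket host and port are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# Source: lines read from a socket become an unbounded, append-only input table.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")   # placeholder host
         .option("port", 9999)          # placeholder port
         .load())

# Query: split each line into words and count occurrences of each word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Output: complete mode writes the full result table at every trigger.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="1 second")   # the "every 1 sec" trigger above
         .start())

query.awaitTermination()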
Batch ETL with DataFrame
inputDF = spark.read
.format("json")
.load("source-path")
resultDF = inputDF
.select("device", "signal")
.where("signal > 15")
resultDF.write
.format("parquet")
.save("dest-path")
Read from JSON file
Select some devices
Write to parquet file
Streaming ETL with DataFrame
input = ctxt.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.outputMode("append")
.startStream("dest-path")
read…stream() creates a streaming
DataFrame; it does not start any of the
computation
write…startStream() defines where & how
to output the data and starts the
processing
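For reference, here is the same ETL expressed with the names the API took in the released Spark 2.x line, where read…stream() became readStream…load() and write…startStream() became writeStream…start(); the schema, paths, and checkpoint location below are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("StreamingETL").getOrCreate()

# Streaming file sources need an explicit schema (placeholder columns here).
schema = (StructType()
          .add("device", StringType())
          .add("signal", DoubleType()))

# readStream...load() creates a streaming DataFrame; nothing runs yet.
input = (spark.readStream
         .format("json")
         .schema(schema)
         .load("source-path"))

result = (input
          .select("device", "signal")
          .where("signal > 15"))

# writeStream...start() defines where & how to output the data and starts it.
query = (result.writeStream
         .format("parquet")
         .outputMode("append")
         .option("checkpointLocation", "checkpoint-path")  # required by the file sink
         .start("dest-path"))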
Streaming ETL with DataFrame
input = spark.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.outputMode("append")
.startStream("dest-path")
[Diagram: at each trigger (1, 2, 3), new input is processed, the result is an append-only table, and in append mode only the new result rows are written out]
Continuous Aggregations
Continuously compute the average
signal of each type of device
input.groupBy("device-type")
.avg("signal")
input.groupBy(
window("event-time",
"10min"),
"device-type")
.avg("signal")
Continuously compute the average signal of
each type of device over the last 10 minutes of
event time
- Windowing is just a type of aggregation
- Simple API for event-time-based windowing
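A hedged PySpark rendering of the two aggregations above with the released API, assuming the streaming DataFrame input carries device_type, signal, and event_time columns (the column names and the 10-minute interval string are illustrative):

from pyspark.sql.functions import window, avg

# Running average signal per device type, updated at every trigger.
per_device_avg = input.groupBy("device_type").avg("signal")

# Average signal per device type over 10-minute event-time windows.
windowed_avg = (input
                .groupBy(window("event_time", "10 minutes"), "device_type")
                .agg(avg("signal").alias("avg_signal")))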
Joining streams with static data
kafkaDataset = spark.read
.kafka("iot-updates")
.stream()
staticDataset = ctxt.read
.jdbc("jdbc://", "iot-device-info")
joinedDataset =
kafkaDataset.join(
staticDataset, "device-type")
Join streaming data from Kafka with
static data via JDBC to enrich the
streaming data …
… without having to think that you
are joining streaming data
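A sketch of the same stream-static join against the Kafka source and JDBC batch source that shipped later in the 2.x line (it needs the spark-sql-kafka-0-10 package); the brokers, topic, JDBC URL, table, and event schema are all placeholders, and the Kafka value is parsed from JSON before the join:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

# Placeholder schema for the JSON payload carried in the Kafka value field.
event_schema = (StructType()
                .add("device_type", StringType())
                .add("signal", DoubleType()))

# Streaming side: IoT updates from Kafka.
kafka_stream = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
                .option("subscribe", "iot-updates")
                .load())

events = (kafka_stream
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Static side: device info loaded once via JDBC (placeholder connection details).
device_info = (spark.read
               .format("jdbc")
               .option("url", "jdbc:postgresql://db-host/iot")
               .option("dbtable", "iot_device_info")
               .load())

# The static DataFrame is joined against every micro-batch of the stream.
enriched = events.join(device_info, "device_type")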
Output Modes
Defines what is outputted every time there is a trigger
Different output modes make sense for different queries
input.select("device", "signal")
.write
.outputMode("append")
.format("parquet")
.startStream("dest-path")
Append mode with
non-aggregation queries
input.agg(count("*"))
.write
.outputMode("complete")
.format("parquet")
.startStream("dest-path")
Complete mode with
aggregation queries
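That feasibility constraint is enforced when a query starts: in the released API, asking for complete mode without an aggregation is rejected with an AnalysisException before any data is processed. A hedged sketch, reusing the streaming DataFrame input from the ETL example:

from pyspark.sql.functions import count
from pyspark.sql.utils import AnalysisException

try:
    # Complete mode with a plain projection: rejected before any data is read.
    (input.select("device", "signal")
          .writeStream
          .outputMode("complete")
          .format("console")
          .start())
except AnalysisException as err:
    print("Rejected at start():", err)

# Complete mode with an aggregation query: accepted.
query = (input.agg(count("*").alias("total"))
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())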
Query Management
query = result.write
.format("parquet")
.outputMode("append")
.startStream("dest-path")
query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()
query: a handle to the running streaming
computation, for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time
Each query has a unique name for keeping track
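For reference, a sketch of the same query management with the handles exposed in released Spark 2.x, where the preview's sourceStatuses() / sinkStatus() were folded into status and lastProgress; the query name and paths are placeholders:

query = (result.writeStream
         .queryName("device_etl")                           # unique, trackable name
         .format("parquet")
         .outputMode("append")
         .option("checkpointLocation", "checkpoint-path")
         .start("dest-path"))

print(query.status)          # whether the query is active, waiting for data, etc.
print(query.lastProgress)    # metrics for the most recent trigger
print(query.exception())     # the error if the query terminated, else None

for q in spark.streams.active:   # all queries running on this SparkSession
    print(q.name, q.id)

query.awaitTermination(timeout=10)   # block for up to 10 seconds
query.stop()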
Query Execution
Logically:
Dataset operations on a table
(i.e. as easy to understand as batch)
Physically:
Spark automatically runs the query in
a streaming fashion
(i.e. incrementally and continuously)
[Diagram: DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution]
Batch/Streaming Execution on Spark SQL
[Diagram of the Catalyst pipeline: SQL AST / DataFrame / Dataset → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]
Helluva lot of magic!
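One way to peek at that pipeline from user code is explain(True), which prints the parsed, analyzed, optimized, and physical plans for any DataFrame, batch or streaming:

# `result` is assumed to be the streaming DataFrame from the ETL example above.
result.explain(True)   # parsed/analyzed/optimized logical plans + the physical plan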
Continuous Incremental Execution
The planner knows how to convert
streaming logical plans to a
continuous series of incremental
execution plans
[Diagram: DataFrame/Dataset → Logical Plan → Planner → Incremental Execution Plans 1, 2, 3, 4, …]
Structured Streaming: Recap
• High-level streaming API built on Datasets/DataFrames
• Event time, windowing, sessions, sources & sinks
• End-to-end exactly-once semantics
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve it using JDBC
• Add, remove, and change queries at runtime
• Build and apply ML models
Demo & Workshop: Structured Streaming
• Import Notebook into your Spark 2.0 Cluster
• http://dbricks.co/sswksh3 (Demo)
• http://dbricks.co/sswksh4 (Workshop)
Resources
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Databricks Engineering Blogs
• sparkhub.databricks.com
• https://spark-packages.org/
Do you have any questions
for my prepared answers?
