© 2015 IBM Corporation
! Agenda
- Spark Streaming 1.X
•  Features
•  Areas for Improvement
- Spark Streaming 2.0 – Structured Streaming
•  Addressing the Improvement Areas
•  API
•  Fault Tolerance
•  Event Time
•  Managing Streaming queries
- Structured Streaming Examples
https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
- Summary thoughts
Spark Streaming 1.X
! Features of Spark Streaming
-  High Level API (stateful, joins, aggregates, windows etc.)
•  Overlap with RDD API (batch)
-  Fault-tolerant (exactly-once semantics achievable)
-  Back Pressure
-  Deep Integration with Spark Ecosystem (MLlib, SQL, GraphX etc.)
Apache Hadoop Day 2015
Spark Streaming 1.X – Areas of improvement
! Fault-tolerance
For end-to-end exactly-once guarantees, the user has to do all the heavy lifting in the Sink.
Can that be made very simple for the end user?
Fault-Tolerant Semantics
Exactly once, if outputs are idempotent or transactional
Exactly once, as long as received data is not lost
Exactly once needs re-playable sources (e.g. Kafka Direct)
Pipeline stages: Source, Receiver, Transforming, Outputting, Sink
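Why idempotent output matters when a batch is replayed can be sketched in a few lines. This is a minimal illustration, not Spark code: `AppendSink`, `IdempotentSink` and `run_with_replay` are hypothetical names, and keying the output by batch id is just one assumed way to make a write idempotent.

```python
# Hypothetical in-memory sinks; not part of any Spark API.

class AppendSink:
    """Non-idempotent: every write adds rows, so a replayed batch duplicates them."""
    def __init__(self):
        self.rows = []

    def write(self, batch_id, records):
        self.rows.extend(records)


class IdempotentSink:
    """Idempotent: output is keyed by batch id, so a replayed batch overwrites itself."""
    def __init__(self):
        self.batches = {}

    def write(self, batch_id, records):
        self.batches[batch_id] = list(records)

    @property
    def rows(self):
        return [r for batch_id in sorted(self.batches) for r in self.batches[batch_id]]


def run_with_replay(sink):
    """Write batch 0, then batch 1 twice, as if a failure forced batch 1 to be replayed."""
    sink.write(0, ["a", "b"])
    sink.write(1, ["c"])
    sink.write(1, ["c"])  # replay after a simulated failure
    return sink.rows


append_rows = run_with_replay(AppendSink())
idempotent_rows = run_with_replay(IdempotentSink())
print(append_rows)      # ['a', 'b', 'c', 'c']
print(idempotent_rows)  # ['a', 'b', 'c']
```

The replayed batch shows up twice in the append-only sink and once in the idempotent one, which is the "heavy lifting" the user has to arrange in the Sink.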
Spark Streaming 1.X – Areas of improvement
! Fault-tolerance
-  For end-to-end exactly-once guarantees, the user has to do all the heavy lifting in the Sink
! API
-  Requests for a more seamless API between Batch & Stream
-  Reduce the complexity of streaming apps
! No Event Time support
-  Hard to support when processing time/batch time is exposed in the user-facing API
! Streaming Query Management
! Micro-batch
Spark Streaming 2.0 API
! Built on top of Spark SQL Engine
! Implicit Benefits
- Extend the primary Batch API to Streaming as well
- Gain the optimizer and all other enhancements made in Spark SQL
! Challenge
- Remove streaming complexities, or keep them to a minimum
Let's Dive In
SQL Batch vs SQL Streaming - Conceptually
Batch vs Streaming - Programmatically
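The idea that one query definition serves both batch and streaming can be sketched without Spark. In this hedged illustration, `word_count` stands in for the query and plain lists stand in for the input table; all names are assumptions made for the example, not the Spark API.

```python
from collections import Counter

def word_count(rows):
    """The 'query': count occurrences of each word in the input rows."""
    return Counter(rows)

# Batch: the whole input table is available up front.
batch_input = ["spark", "flink", "spark", "beam"]
batch_result = word_count(batch_input)

# Streaming: the same rows arrive in chunks (one per trigger) and are
# conceptually appended to an unbounded input table; the result table
# is updated incrementally instead of being recomputed from scratch.
stream_result = Counter()
for new_rows in (["spark", "flink"], ["spark"], ["beam"]):
    stream_result.update(word_count(new_rows))

print(batch_result == stream_result)  # True
```

After all chunks have arrived, the incrementally maintained result table matches what the batch query computes over the same data.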
Output Modes - Sink
! Defines what gets written from the Result table to external storage (Sink)
! Output modes
-  Complete – The entire updated Result table is written to external storage.
-  Append – Only the new rows added to the Result table since the last incremental query execution are written to external storage.
-  Update – Only the rows updated in the Result table since the last incremental query execution are written to external storage.
It is up to the storage connector implementation to decide how to write.
* Aggregate queries support only Complete mode and non-aggregate queries only Append mode
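The three output modes can be compared on a single Result table update. The sketch below is an assumption-laden illustration in plain Python: `emit` is a hypothetical helper, the Result table is modelled as a dict from grouping key to aggregate value, and "updated" is taken to include newly added rows.

```python
def emit(previous, current, mode):
    """Return the rows of the result table `current` that a sink would receive.

    `previous` and `current` map a grouping key to its aggregate value
    before and after one incremental execution.
    """
    if mode == "complete":
        # The whole result table, changed or not.
        return dict(current)
    if mode == "append":
        # Only keys that did not exist before this execution.
        return {k: v for k, v in current.items() if k not in previous}
    if mode == "update":
        # New keys plus keys whose value changed (assumed meaning of "updated").
        return {k: v for k, v in current.items() if previous.get(k) != v}
    raise ValueError(f"unknown output mode: {mode}")


previous = {"spark": 2, "flink": 1}
current = {"spark": 3, "flink": 1, "beam": 1}

complete_rows = emit(previous, current, "complete")
append_rows = emit(previous, current, "append")
update_rows = emit(previous, current, "update")
```

Here Complete hands the sink all three rows, Append only the new `beam` row, and Update the changed `spark` row plus the new `beam` row.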
Supported Sinks & Modes in 2.0
(Table of supported sinks and output modes; two entries are marked * DEBUG ONLY)
Windowing in Structured Streaming
Window operations
!  Continuous time-based aggregations are the most common in streaming applications.
-  Sliding windows & tumbling windows
E.g. top x hashtags on Twitter in the last half hour, computed every 5 minutes
! New function that treats windowing as a regular aggregation
!  Used in a Group By clause
Can be used in Batch as well
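Treating a window as just another grouping key can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Spark function itself: `windows_for` and `count_by_window` are hypothetical helpers, times are plain integers in minutes, and windows are assumed to start at multiples of the slide interval.

```python
from collections import Counter

def windows_for(event_time, length, slide):
    """Return the (start, end) windows containing event_time; times are in minutes.

    Windows start at multiples of `slide` and cover [start, start + length).
    With slide == length this gives tumbling windows.
    """
    start = (event_time // slide) * slide
    result = []
    while start > event_time - length:
        result.append((start, start + length))
        start -= slide
    return sorted(result)


def count_by_window(events, length, slide):
    """Group (event_time, hashtag) pairs by (window, hashtag) and count them."""
    counts = Counter()
    for event_time, tag in events:
        for window in windows_for(event_time, length, slide):
            counts[(window, tag)] += 1
    return counts


events = [(11, "#spark"), (17, "#spark"), (22, "#flink")]

tumbling = count_by_window(events, length=10, slide=10)
sliding = count_by_window(events, length=10, slide=5)
```

With a 10-minute tumbling window each event lands in exactly one group; with a 5-minute slide each event lands in two overlapping groups, and the counting itself is an ordinary group-by.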
Event Time Windows
! Event time is the time embedded within the data itself
It is not the time at which Spark received the data
! What about processing-time windows, if you want them?
Handling Late Arrival in Event-Time
! Since the ‘Result’ table is updated by Spark, late data is put in its correct window group
! Use a normal filter in the SQL?
! Watermarks
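How a watermark bounds late data can be sketched in plain Python. This is a generic illustration, not Spark's implementation: `process` and `allowed_lateness` are hypothetical names, times are plain integers, and the watermark is advanced after every record as a simplification (an engine may instead advance it once per trigger).

```python
from collections import Counter

def process(events, window, allowed_lateness):
    """Count events per tumbling event-time window, dropping data behind the watermark.

    `events` is a list of event times in arrival order. The watermark trails the
    largest event time seen so far by `allowed_lateness`; an event whose window
    has already ended at or before the watermark is dropped, any other late
    event is added to its correct window.
    """
    counts = Counter()
    dropped = []
    max_event_time = 0
    for event_time in events:
        watermark = max_event_time - allowed_lateness
        window_start = (event_time // window) * window
        if window_start + window <= watermark:
            dropped.append(event_time)
        else:
            counts[window_start] += 1
        max_event_time = max(max_event_time, event_time)
    return counts, dropped


# Arrival order; 8 and 3 arrive after later events have been seen.
counts, dropped = process([1, 12, 8, 25, 3], window=10, allowed_lateness=10)
```

The event at time 8 is late but still within the watermark, so it updates window 0; the event at time 3 arrives after the watermark has passed the end of its window and is dropped.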
Fault Tolerance
! Why care?
! Different guarantees against data loss
-  At-least once
-  Exactly once
! What can fail?
-  Driver
-  Executor
Spark 1.x Best Fault tolerance - Kafka Direct API
Benefits of this approach
•  Simplified parallelism
•  Less storage needed
•  Exactly-once semantics (source & processing)
Fault Tolerance in Structured Streaming
(Diagram: active Driver, checkpoint to HDFS)
! Structured Streaming Checkpointing
The offset ranges decided for a trigger interval are logged to the checkpoint directory *before* any processing is started for that trigger
The Nth entry in the log indicates the data that is currently being processed
The (N-1)th entry in the log indicates offsets that have been idempotently written to the Sink
Log entries are numbered with monotonically increasing integers
! On Recovery
Restart processing of the Nth entry in the WAL (write-ahead log)
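The log-then-process protocol and the recovery step can be sketched as below. This is a minimal in-memory model under stated assumptions, not Spark internals: `OffsetLog`, `IdempotentSink`, `run_trigger` and `recover` are hypothetical names, the source is a replayable list, and the failure is simulated with an exception.

```python
class OffsetLog:
    """Hypothetical write-ahead log of offset ranges, one entry per trigger."""
    def __init__(self):
        self.entries = []  # entry n holds the (start, end) offsets of trigger n

    def append(self, offset_range):
        self.entries.append(offset_range)
        return len(self.entries) - 1  # monotonically increasing batch id


class IdempotentSink:
    """Keyed by batch id, so writing the same batch twice has no extra effect."""
    def __init__(self):
        self.batches = {}

    def write(self, batch_id, records):
        self.batches[batch_id] = list(records)


def run_trigger(source, log, sink, batch_size, fail_before_write=False):
    """Log the offset range first, then process it and write to the sink."""
    start = log.entries[-1][1] if log.entries else 0
    end = min(start + batch_size, len(source))
    batch_id = log.append((start, end))  # logged *before* any processing
    if fail_before_write:
        raise RuntimeError("simulated driver failure")
    sink.write(batch_id, source[start:end])


def recover(source, log, sink):
    """Re-run the latest logged entry; earlier entries are assumed to be in the sink."""
    batch_id = len(log.entries) - 1
    start, end = log.entries[batch_id]
    sink.write(batch_id, source[start:end])


source = ["r0", "r1", "r2", "r3"]  # a replayable source
log, sink = OffsetLog(), IdempotentSink()

run_trigger(source, log, sink, batch_size=2)
try:
    run_trigger(source, log, sink, batch_size=2, fail_before_write=True)
except RuntimeError:
    recover(source, log, sink)
recover(source, log, sink)  # a second replay changes nothing
```

Because the offsets were logged before the failure, recovery replays exactly the same range, and because the sink is idempotent, replaying it more than once leaves the output unchanged.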
Fault Tolerance in Structured Streaming
! End-to-end exactly-once guarantees with
-  Idempotent Sinks (built in for commonly used sinks, e.g. Files / JDBC)
-  Replayable Sources: the built-in Sources will *mostly* be only ones that support replay
https://issues.apache.org/jira/browse/SPARK-15842
Managing Streaming Queries
!  Streaming in 1.x was definitely lacking in
-  Starting / stopping individual Streaming Queries
-  Changing the computation done in a Query
-  Handling abnormal termination of a Streaming Query more gracefully than an app crash
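What per-query management buys you can be sketched with a toy registry. This is a hedged illustration only: `Query` and `QueryManager` are hypothetical classes written for this example and do not reproduce Spark's own query-management API.

```python
class Query:
    """Hypothetical stand-in for one streaming query; not a Spark class."""
    def __init__(self, name, step):
        self.name = name
        self.step = step        # function run once per trigger
        self.active = True
        self.exception = None

    def stop(self):
        self.active = False


class QueryManager:
    """Tracks queries so each can be started, stopped, or fail independently."""
    def __init__(self):
        self.queries = {}

    def start(self, name, step):
        self.queries[name] = Query(name, step)
        return self.queries[name]

    def active(self):
        return sorted(q.name for q in self.queries.values() if q.active)

    def run_trigger(self, batch):
        """Run one trigger of every active query; a failure stops only that query."""
        for query in self.queries.values():
            if not query.active:
                continue
            try:
                query.step(batch)
            except Exception as error:
                query.exception = error
                query.stop()


def failing_step(batch):
    raise ValueError("bad record")

totals = []
manager = QueryManager()
manager.start("sum", lambda batch: totals.append(sum(batch)))
bad = manager.start("broken", failing_step)
counter = manager.start("count", lambda batch: None)

manager.run_trigger([1, 2, 3])   # "broken" fails, the others keep running
counter.stop()                   # stop one query explicitly
manager.run_trigger([4, 5])
```

The failing query records its exception and stops, one query is stopped on request, and the remaining query keeps processing, so a single bad query does not take the whole application down.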
Summary
!  Overall, a good set of features
-  Easier code sharing between Batch and Streaming (no separate type hierarchies)
-  Windows are not tied to the batch interval
-  No StreamingContext
-  The optimizer is now available for your queries
!  Getting started
-  The combination of 3 things (Output Mode, Sink type & Query type) takes some time to wrap your head around,
and there is not much control over them
-  You only get runtime exceptions when you get the combination wrong
!  How does it compare to Apache Beam?
For Each Sink
Thank You

More Related Content

What's hot

Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
datamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
datamantra
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
HostedbyConfluent
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
datamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
datamantra
 
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
datamantra
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
datamantra
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
datamantra
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
Shashank L
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
confluent
 
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
Joey Bolduc-Gilbert
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
Robert Sanders
 

What's hot (20)

Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
 
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 

Similar to Introduction to Structured Streaming

Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
Sachin Aggarwal
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
Rajesh Balamohan
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
Jianfeng Zhang
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
Tim Ellison
 
SD Times - Docker v2
SD Times - Docker v2SD Times - Docker v2
SD Times - Docker v2
Alvin Richards
 
Using Databases and Containers From Development to Deployment
Using Databases and Containers  From Development to DeploymentUsing Databases and Containers  From Development to Deployment
Using Databases and Containers From Development to Deployment
Aerospike, Inc.
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
Yifeng Jiang
 
QCon Shanghai: Trends in Application Development
QCon Shanghai: Trends in Application DevelopmentQCon Shanghai: Trends in Application Development
QCon Shanghai: Trends in Application Development
Chris Bailey
 
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application Meetup
Thomas Weise
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache Spark
Matt Ingenthron
 
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
Benoit Hudzia
 
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
vasuballa
 
Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Red Hat for IBM System z IBM Enterprise2014 Las Vegas Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Filipe Miranda
 
NTTs Journey with Openstack-final
NTTs Journey with Openstack-finalNTTs Journey with Openstack-final
NTTs Journey with Openstack-final
shintaro mizuno
 
Tech trends 2018 2019
Tech trends 2018 2019Tech trends 2018 2019
Tech trends 2018 2019
Johan Norm
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
Tim Ellison
 

Similar to Introduction to Structured Streaming (20)

Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
 
SD Times - Docker v2
SD Times - Docker v2SD Times - Docker v2
SD Times - Docker v2
 
Using Databases and Containers From Development to Deployment
Using Databases and Containers  From Development to DeploymentUsing Databases and Containers  From Development to Deployment
Using Databases and Containers From Development to Deployment
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
 
QCon Shanghai: Trends in Application Development
QCon Shanghai: Trends in Application DevelopmentQCon Shanghai: Trends in Application Development
QCon Shanghai: Trends in Application Development
 
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application Meetup
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache Spark
 
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
 
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
 
Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Red Hat for IBM System z IBM Enterprise2014 Las Vegas Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Red Hat for IBM System z IBM Enterprise2014 Las Vegas
 
NTTs Journey with Openstack-final
NTTs Journey with Openstack-finalNTTs Journey with Openstack-final
NTTs Journey with Openstack-final
 
Tech trends 2018 2019
Tech trends 2018 2019Tech trends 2018 2019
Tech trends 2018 2019
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 

More from datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
datamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
datamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
datamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
datamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
datamantra
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
datamantra
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
datamantra
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 

More from datamantra (15)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 

Recently uploaded

Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 

Introduction to Structured Streaming

  • 1. Agenda
      - Spark Streaming 1.X
          • Features
          • Areas for Improvement
      - Spark Streaming 2.0 – Structured Streaming
          • Addressing the Improvement Areas
          • API
          • Fault Tolerance
          • Event Time
          • Managing Streaming queries
      - Structured Streaming Examples: https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
      - Summary thoughts
      © 2015 IBM Corporation
  • 2. Spark Streaming 1.X: Features of Spark Streaming
      - High-level API (stateful operations, joins, aggregates, windows, etc.)
          • Overlap with the RDD API (batch)
      - Fault-tolerant (exactly-once semantics achievable)
      - Back pressure
      - Deep integration with the Spark ecosystem (MLlib, SQL, GraphX, etc.)
      Apache Hadoop Day 2015
  • 3. Spark Streaming 1.X – Areas of improvement
      - Fault tolerance
          • For end-to-end exactly-once guarantees, the user has to do all the heavy lifting in the sink.
          • Can that be handled in a very simple way for the end user?
  • 4. Fault-Tolerant Semantics
      - Pipeline stages: Source, Receiver, Transforming, Outputting, Sink
      - Exactly once, if outputs are idempotent or transactional
      - Exactly once, as long as received data is not lost
      - Exactly once needs re-playable sources (e.g. Kafka Direct)
  • 5. Spark Streaming 1.X – Areas of improvement
      - Fault tolerance
          • For end-to-end exactly-once guarantees, the user has to do all the heavy lifting in the sink.
      - API
          • Requests for a more seamless API between batch and stream
          • Reduce the complexity of streaming apps
      - No event-time support
          • Hard to support when processing time / batch time is exposed in the externals
      - Streaming query management
      - Micro-batch
  • 6. Spark Streaming 2.0 API
      - Built on top of the Spark SQL engine
      - Implicit benefits
          • Extends the primary batch API to streaming
          • Gains the optimizer and all other enhancements made in Spark SQL
      - Challenge
          • Remove streaming complexities or keep them to a minimum
  • 7. Let's Dive In
  • 8. SQL Batch vs SQL Streaming: Conceptually
  • 9. Batch vs Streaming: Programmatically
  • 10. Output Modes (Sink)
      - The output is what gets written from the Result table to external storage (the sink).
      - Output modes
          • Complete: the entire updated Result table is written to external storage.
          • Append: only the new rows added to the Result table since the last incremental query execution are written to external storage.
          • Update: only the rows updated in the Result table since the last incremental query execution are written to external storage.
      - It is up to the storage connector implementation to decide how to write.
      - * In Spark 2.0, aggregate queries support only Complete mode and non-aggregate queries only Append mode.
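The three output modes can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark code: the Result table is modeled as a `Counter` (for an aggregate word count) or a list (for a non-aggregate filter), and the function names `apply_trigger`, `complete_output`, `update_output` and `append_output` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter


def apply_trigger(result_table, words):
    """Fold one trigger's worth of input into the running word-count Result table.

    Returns a snapshot of the table as it was before the trigger.
    """
    previous = dict(result_table)
    result_table.update(words)  # Counter.update adds to the existing counts
    return previous


def complete_output(result_table):
    """Complete mode: the entire updated Result table is written."""
    return dict(result_table)


def update_output(previous, result_table):
    """Update mode: only rows that are new or whose value changed are written."""
    return {k: v for k, v in result_table.items() if previous.get(k) != v}


def append_output(previous_row_count, result_rows):
    """Append mode: only rows added since the last execution are written.

    Meaningful when existing rows never change (a non-aggregate query).
    """
    return result_rows[previous_row_count:]


# Aggregate query (running word count) across two triggers.
counts = Counter()
apply_trigger(counts, ["spark", "kafka", "spark"])
previous = apply_trigger(counts, ["spark", "flink"])
print(complete_output(counts))          # all three words with their totals
print(update_output(previous, counts))  # only the rows touched by the second trigger

# Non-aggregate query (a simple filter): the Result table only grows.
rows = [w for w in ["spark", "kafka", "spark"] if w != "kafka"]
seen = len(rows)
rows += [w for w in ["spark", "flink"] if w != "kafka"]
print(append_output(seen, rows))        # only the rows added by the second trigger
```

The sketch also suggests why Append is the natural fit for non-aggregate queries: once a row is written it never changes, whereas an aggregate row such as the count for "spark" is revised by later triggers.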
  • 11. Supported Sinks & Modes in 2.0 (two entries in the slide's table are marked *DEBUG ONLY)
  • 12. Windowing in Structured Streaming
  • 13. Window operations
      - Continuous time-based aggregations are the most common in streaming applications.
          • Sliding windows and tumbling windows
          • E.g. top x hashtags on Twitter in the last half hour, every 5 minutes
      - A new function treats windowing as a regular aggregation.
          • Used in a Group By clause
          • Can be used in batch as well
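To make "windowing as a regular aggregation" concrete, here is a small plain-Python sketch (not Spark code) in which the window is simply part of the grouping key. It assumes integer timestamps in seconds and windows aligned to time zero; `windows_for` and `count_by_window` are hypothetical names used only here. In Spark itself this role is played by the `window` function used inside a group-by.

```python
from collections import Counter


def windows_for(ts, length, slide):
    """Return every [start, end) window of the given length that contains ts.

    Window starts are multiples of `slide`. With slide == length the windows
    are tumbling and each timestamp falls into exactly one of them.
    """
    start = ts - (ts % slide)          # latest window start at or before ts
    starts = []
    while start > ts - length:         # the window must still cover ts
        starts.append(start)
        start -= slide
    return [(s, s + length) for s in reversed(starts)]


def count_by_window(events, length, slide):
    """Group (timestamp, hashtag) events by (window, hashtag) and count them."""
    counts = Counter()
    for ts, tag in events:
        for window in windows_for(ts, length, slide):
            counts[(window, tag)] += 1
    return counts


events = [(125, "#spark"), (130, "#spark"), (170, "#kafka")]

# Tumbling 60-second windows: every event lands in one window.
print(count_by_window(events, length=60, slide=60))

# Sliding 60-second windows every 30 seconds: every event lands in two windows.
print(count_by_window(events, length=60, slide=30))
```

The slide's example (the last half hour, every 5 minutes) would correspond to `length=1800, slide=300` in this sketch.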
  • 14. Event Time Windows
      - Event time is the time embedded within the data itself.
          • It is not the time at which Spark received the data.
      - What about processing-time windows, if you want them?
  • 15. Handling Late Arrival in Event-Time
      - Since the Result table is updated by Spark, late data is put into its correct window group.
      - Use a normal filter in the SQL?
      - Watermarks
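The interplay of late data and watermarks can be sketched in plain Python. This is a simplified model, not Spark's actual watermark algorithm: it assumes the watermark is recomputed on every record as the maximum event time seen so far minus an allowed lateness, and that any record older than the watermark is dropped. The function name and its parameters are hypothetical.

```python
def aggregate_with_watermark(arrivals, window, allowed_lateness):
    """Count events per tumbling event-time window, in arrival order.

    `arrivals` is a list of event times in the order the records arrive.
    Returns (counts keyed by window start, list of dropped event times).
    """
    counts = {}
    dropped = []
    max_event_time = None
    for event_time in arrivals:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped.append(event_time)   # too late: its window is considered final
            continue
        start = event_time - (event_time % window)
        counts[start] = counts.get(start, 0) + 1   # late data updates its own window
    return counts, dropped


# Event time 7 arrives after 12 (late, but within the allowed lateness),
# event time 3 arrives after 31 (later than the watermark allows).
counts, dropped = aggregate_with_watermark(
    [5, 12, 7, 31, 3], window=10, allowed_lateness=20
)
print(counts)
print(dropped)
```

Without a bound of this kind the Result table would have to keep every old window open indefinitely, which is the state-size problem that watermarks are meant to address.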
  • 16. Fault Tolerance
      - Why care?
      - Different guarantees for data loss
          • At least once
          • Exactly once
      - What can fail?
          • Driver
          • Executor
  • 17. Spark 1.x Best Fault Tolerance: Kafka Direct API
      - Benefits of this approach
          • Simplified parallelism
          • Less storage needed
          • Exactly-once semantics (source and processing)
  • 18. Fault Tolerance in Structured Streaming
      - Diagram labels: Active Driver; Checkpoint to HDFS
      - Structured Streaming checkpointing
          • The offset ranges decided for a trigger interval are logged to the checkpoint directory *before* any processing starts for that trigger.
          • The Nth entry in the log indicates the data that is currently being processed.
          • The (N-1)th entry in the log indicates offsets whose output has been idempotently written to the sink.
          • Log entries are numbered with monotonically increasing integers.
      - On recovery
          • Restart processing of the Nth entry in the WAL (write-ahead log).
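The checkpointing scheme described on this slide can be sketched as a write-ahead offset log. The following plain-Python model is an illustration under simplifying assumptions, not Spark's implementation: the log is an in-memory list of `(start, end)` offset ranges, the sink is a dict keyed by batch id so that rewriting a batch is harmless, and `process`, `run_trigger` and `recover` are hypothetical names.

```python
def process(records):
    """Stand-in for the query's computation: double every record."""
    return [r * 2 for r in records]


def run_trigger(source, offset_log, sink, batch_size, crash=False):
    """Plan one trigger, log its offsets first, then process and write."""
    start = offset_log[-1][1] if offset_log else 0
    end = min(start + batch_size, len(source))
    offset_log.append((start, end))      # write-ahead: logged before any processing
    if crash:
        raise RuntimeError("simulated driver failure")
    batch_id = len(offset_log) - 1
    sink[batch_id] = process(source[start:end])  # keyed by batch id, so rewrites are idempotent


def recover(source, offset_log, sink):
    """On restart, re-run the last logged entry; earlier entries are already in the sink."""
    if not offset_log:
        return
    batch_id = len(offset_log) - 1
    start, end = offset_log[batch_id]
    sink[batch_id] = process(source[start:end])


source = list(range(10))
offset_log, sink = [], {}

run_trigger(source, offset_log, sink, batch_size=4)        # batch 0 completes
try:
    # Batch 1 is logged, but the "driver" fails before anything is written.
    run_trigger(source, offset_log, sink, batch_size=4, crash=True)
except RuntimeError:
    pass
recover(source, offset_log, sink)                          # batch 1 replayed from the log
run_trigger(source, offset_log, sink, batch_size=4)        # batch 2 continues after it

output = [row for batch_id in sorted(sink) for row in sink[batch_id]]
print(offset_log)
print(output)
```

Because the offsets are fixed in the log before processing starts, the replay after the failure covers exactly the same range as the failed attempt, so no record is skipped or processed under a different batch.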
  • 19. Fault Tolerance in Structured Streaming
      - End-to-end exactly-once guarantees with
          • Idempotent sinks (built in for commonly used sinks, e.g. Files / JDBC)
          • Sources that support replay; the built-in sources will *mostly* be the only ones that do
      - https://issues.apache.org/jira/browse/SPARK-15842
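Why both halves (a replayable source and an idempotent sink) are needed can be shown with a short plain-Python sketch. All class and method names below are hypothetical and are not Spark's sink or source interfaces; the idempotent sink is assumed to remember the highest batch id it has committed and to ignore anything at or below it.

```python
class ReplayableSource:
    """A source that can re-read any offset range on request."""

    def __init__(self, records):
        self._records = list(records)

    def read(self, start, end):
        return self._records[start:end]


class NaiveSink:
    """Appends whatever it is given, so a replayed batch is written twice."""

    def __init__(self):
        self.rows = []

    def add_batch(self, batch_id, rows):
        self.rows.extend(rows)
        return True


class IdempotentSink:
    """Skips any batch id it has already committed."""

    def __init__(self):
        self.rows = []
        self.last_committed = -1

    def add_batch(self, batch_id, rows):
        if batch_id <= self.last_committed:
            return False                 # replay of a committed batch: do nothing
        self.rows.extend(rows)
        self.last_committed = batch_id
        return True


source = ReplayableSource(["a", "b", "c", "d"])
naive, idempotent = NaiveSink(), IdempotentSink()

# Batch 0, batch 1, then batch 1 again, as would happen if the engine
# restarted after the write but before learning that it had succeeded.
deliveries = [(0, (0, 2)), (1, (2, 4)), (1, (2, 4))]
for batch_id, (start, end) in deliveries:
    rows = source.read(start, end)       # replay returns the same records
    naive.add_batch(batch_id, rows)
    idempotent.add_batch(batch_id, rows)

print(naive.rows)        # contains the second batch twice
print(idempotent.rows)   # each record exactly once
```

The replay itself is at-least-once delivery; it is the sink's deduplication on the batch id that turns it into an exactly-once result.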
  • 20. Managing Streaming Queries
      - Streaming in 1.x was definitely lacking in
          • Starting / stopping individual streaming queries
          • Changing the computation done in a query
          • Handling an abnormally terminated streaming query more gracefully than with an app crash
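The three gaps listed above amount to wanting a per-query handle. As a rough plain-Python sketch of that idea (not Spark's API; `QueryHandle` and its methods are hypothetical), each query runs on its own thread, can be stopped on its own, and records its failure instead of taking the application down:

```python
import threading


class QueryHandle:
    """Hypothetical handle for one independently managed streaming query."""

    def __init__(self, name, step, interval=0.01):
        self.name = name
        self.exception = None            # set if the query terminates abnormally
        self._step = step                # callable run once per trigger
        self._interval = interval
        self._stop_requested = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        try:
            while not self._stop_requested.is_set():
                self._step()
                self._stop_requested.wait(self._interval)
        except Exception as exc:         # record the failure instead of crashing the app
            self.exception = exc

    def stop(self):
        self._stop_requested.set()
        self._thread.join()

    def await_termination(self, timeout=None):
        self._thread.join(timeout)

    @property
    def is_active(self):
        return self._thread.is_alive()


def make_failing_step(fail_on_call):
    """Build a step function that raises on its Nth invocation."""
    calls = {"n": 0}

    def step():
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            raise ValueError("bad record in trigger %d" % calls["n"])

    return step


healthy = QueryHandle("healthy", step=lambda: None).start()
failing = QueryHandle("failing", step=make_failing_step(3)).start()

failing.await_termination(timeout=5)     # returns once the failing query has died
healthy_was_active = healthy.is_active   # the other query is unaffected
print(failing.is_active, repr(failing.exception))
print(healthy_was_active)

healthy.stop()                           # stop one query without touching the rest
print(healthy.is_active)
```

Changing the computation of a query then reduces to stopping its handle and starting a new one with a different `step`, while the other queries keep running.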
  • 21. Managing Streaming Queries
  • 22. Managing Streaming Queries
  • 23. Summary
      - Overall, it has a good set of features
          • Easier code sharing between batch and streaming (no separate type hierarchies)
          • Window not tied to the batch interval
          • No streaming context
          • The optimizer is now available for your queries.
      - Getting started
          • The combination of three things (output mode, sink type, and query type) takes some time to wrap your head around, and you do not have much control over them.
          • You only get runtime exceptions when you get that combination wrong.
      - How does it compare to Apache Beam?
  • 24. For Each Sink
  • 25. Thank You