Apache Spark Streaming: Architecture and Fault Tolerance
Apache Hadoop Day 2015
Paranth Thiruvengadam – Architect @ IBM
Sachin Aggarwal – Developer @ IBM
Spark Streaming
 Features of Spark Streaming
 High-level API (joins, windows, etc.)
 Fault-tolerant (exactly-once semantics achievable)
 Deep integration with the Spark ecosystem (MLlib, SQL, GraphX, etc.)
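As a taste of that high-level API, here is a minimal sketch combining a windowed aggregation with a stream-stream join. The master, ports, app name and durations are illustrative assumptions, not from the talk:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[4]: enough threads for two socket receivers plus processing.
val conf = new SparkConf().setMaster("local[4]").setAppName("HighLevelApi")
val ssc = new StreamingContext(conf, Seconds(2))

// Two keyed streams built from text lines (hypothetical sources).
val left  = ssc.socketTextStream("localhost", 9998).map(line => (line, 1))
val right = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

// A window: counts over the last 30s, recomputed every 10s.
val windowedCounts = left.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

// A stream-stream join on the key.
val joined = left.join(right)

windowedCounts.print()
joined.print()
ssc.start()
ssc.awaitTermination()
```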
Architecture
High Level Overview
Receiving Data
[Diagram: Input Source → Receiver running on an Executor → Data Blocks, replicated to a second Executor]
• The driver runs receivers as long-running tasks.
• The receiver divides the stream into blocks and keeps them in memory.
• Data blocks are replicated to another executor.
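For reference, this receiver contract maps onto Spark's `Receiver` API. A minimal sketch of a custom receiver (the socket source and reading thread are illustrative, not from the deck):

```scala
import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// store() hands records to Spark, which batches them into blocks;
// MEMORY_AND_DISK_2 asks for the two-copy replication described above.
class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("Socket Receiver") {
      override def run(): Unit = {
        val socket = new Socket(host, port)
        Source.fromInputStream(socket.getInputStream)
          .getLines()
          .foreach(line => store(line))
        restart("Source closed, restarting")
      }
    }.start()
  }

  def onStop(): Unit = {} // reading thread exits once stop/restart is called
}
```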
Processing Data
[Diagram: per batch, the driver launches tasks on the executors to process the in-memory data blocks and store the results]
• Every batch interval, the driver launches tasks to process the blocks.
• Results are written to the data store.
What’s different from other streaming applications?
Traditional Stream Processing
Load Balancing…
Node failure / Stragglers…
Word Count with Kafka
Fault Tolerance
 Why care?
 Different guarantees for data loss:
 At-least once
 Exactly once
 What can fail?
 Driver
 Executor
What happens when executor fails?
What happens when Driver fails?
Recovering Driver – Checkpointing
Driver restart
Driver restart – To-Do List
 Configure automatic driver restart
 Spark Standalone
 YARN
 Set the checkpoint directory in an HDFS-compatible file system:
streamingContext.checkpoint(hdfsDirectory)
 Ensure the code uses checkpoints for recovery:

def setupStreamingContext(): StreamingContext = {
  val context = new StreamingContext(…)
  val lines = KafkaUtils.createStream(…)
  …
  context.checkpoint(hdfsDir)
  context
}
val context = StreamingContext.getOrCreate(hdfsDir, setupStreamingContext)
context.start()

getOrCreate first tries to rebuild the context from checkpoint data in hdfsDir; only when no checkpoint exists does it call setupStreamingContext to build a fresh one.
WAL for no data loss
Recover using WAL
Configuration – Enabling WAL
 Enable checkpointing
 Enable the WAL in the Spark configuration:
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
 The receiver should acknowledge the input source only after the data is written to the WAL
 Disable in-memory replication (the WAL already keeps the data in fault-tolerant storage)
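Putting this checklist together, a minimal configuration sketch; the app name, batch interval, checkpoint path and socket source are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("WalEnabled")
// Turn on the receiver write-ahead log.
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(sparkConf, Seconds(2))
// WAL segments are written under the checkpoint directory,
// so checkpointing must be enabled (path is illustrative).
ssc.checkpoint("hdfs:///checkpoints/wal-demo")

// With the WAL providing durability, keep a single serialized copy
// instead of the default replicated MEMORY_AND_DISK_SER_2.
val lines = ssc.socketTextStream("localhost", 9998, StorageLevel.MEMORY_AND_DISK_SER)
```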
Normal Processing
Restarting Failed Driver
Fault-Tolerant Semantics
Source → Receiving → Transforming → Outputting → Sink
• Receiving: at-least once, with checkpointing / WAL
• Transforming: exactly once, as long as received data is not lost
• Outputting: exactly once, if outputs are idempotent or transactional
Fault-Tolerant Semantics
Source → Receiving → Transforming → Outputting → Sink
• Receiving: exactly once, with the Kafka Direct API
• Transforming: exactly once, as long as received data is not lost
• Outputting: exactly once, if outputs are idempotent or transactional
How to achieve an “exactly once” guarantee?
Before Kafka Direct API
Kafka Direct API
Benefits of this approach:
• Simplified parallelism
• Less storage needed
• Exactly-once semantics
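A minimal sketch of the direct approach (the Spark 1.x API against Kafka 0.8, as used at the time of this talk; the broker address and topic are illustrative, and `ssc` is an existing StreamingContext):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// No receiver and no WAL: each batch reads an exact offset range per
// Kafka partition, so RDD partitions map 1:1 to Kafka partitions.
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("test")

val directStream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

val lines = directStream.map(_._2) // (key, message) pairs; keep the message
```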
Demo
DEMO: SPARK STREAMING
OVERVIEW OF SPARK STREAMING
DISCRETIZED STREAMS (DSTREAMS)
• A DStream is the basic abstraction in Spark Streaming.
• It is represented by a continuous series of RDDs (of the same type).
• Each RDD in a DStream contains data from a certain interval.
• DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume, etc.) using a StreamingContext, or generated by transforming existing DStreams using operations such as `map`, `window` and `reduceByKeyAndWindow`.
WORD COUNT

Batch (Spark core):

val sparkConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("WordCount")
val sc = new SparkContext(sparkConf)
val file = sc.textFile("filePath")
val words = file
  .flatMap(_.split(" "))
val pairs = words
  .map(x => (x, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)
wordCounts.saveAsTextFile(args(1))

Streaming:

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("SocketStreaming")
val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc
  .socketTextStream("localhost", 9998)
val words = lines
  .flatMap(_.split(" "))
val pairs = words
  .map(word => (word, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
DEMO
KAFKA STREAM

Socket version (for comparison):

val lines = ssc
  .socketTextStream("localhost", 9998)
val words = lines
  .flatMap(_.split(" "))
val pairs = words
  .map(word => (word, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)

Kafka (receiver-based) version:

val zkQuorum = "localhost:2181"
val group = "test"
val topics = "test"
val numThreads = "1"
val topicMap = topics
  .split(",")
  .map((_, numThreads.toInt))
  .toMap
val lines = KafkaUtils
  .createStream(ssc, zkQuorum, group, topicMap)
  .map(_._2)
val words = lines
  .flatMap(_.split(" "))
……..
DEMO
OPERATIONS
• Repartition
• Operation on an RDD (example: print the partition count of each RDD)

val re_lines = lines
  .repartition(5)
re_lines
  .foreachRDD(x => fun(x))

def fun(rdd: RDD[String]) = {
  print("partition count: " + rdd.partitions.length)
}
DEMO
STATELESS TRANSFORMATIONS
• map(): apply a function to each element in the DStream and return a DStream of the result.
  • ds.map(x => x + 1)
• flatMap(): apply a function to each element in the DStream and return a DStream of the contents of the iterators returned.
  • ds.flatMap(x => x.split(" "))
• filter(): return a DStream consisting of only the elements that pass the condition passed to filter.
  • ds.filter(x => x != 1)
• repartition(): change the number of partitions of the DStream.
  • ds.repartition(10)
• reduceByKey(): combine values with the same key in each batch.
  • ds.reduceByKey((x, y) => x + y)
• groupByKey(): group values with the same key in each batch.
  • ds.groupByKey()
DEMO
STATEFUL TRANSFORMATIONS
Stateful transformations require checkpointing to be enabled in your StreamingContext for fault tolerance.
• Windowed transformations: windowed computations allow you to apply transformations over a sliding window of data.
• UpdateStateByKey transformation: enables stateful processing by providing access to a state variable for DStreams of key/value pairs.
DEMO
WINDOW OPERATIONS
Any window operation needs to specify two parameters:
• window length - the duration of the window.
• sliding interval - the interval at which the window operation is performed.
These two parameters must be multiples of the batch interval of the source DStream. For example, with a 2-second batch interval, a 30-second window sliding every 10 seconds is valid, while a 5-second window is not.
DEMO
WINDOWED TRANSFORMATIONS
• window(windowLength, slideInterval)
  • Return a new DStream, computed based on windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval)
  • Return a sliding window count of elements in the stream.
  • val totalWordCount = words.countByWindow(Seconds(30), Seconds(10))
• reduceByWindow(func, windowLength, slideInterval)
  • Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func.
  • The function should be associative so that it can be computed correctly in parallel.
  • val totalWordCount = pairs.reduceByWindow({(x, y) => x + y}, {(x, y) => x - y}, Seconds(30), Seconds(10))
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  • Returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
  • val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))
• countByValueAndWindow(windowLength, slideInterval)
  • Returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window.
  • val eachWordCount = words.countByValueAndWindow(Seconds(30), Seconds(10))
DEMO
UPDATE STATE BY KEY TRANSFORMATION
• updateStateByKey()
  • Enables stateful processing by providing access to a state variable for DStreams of key/value pairs.
  • The user provides a function updateFunc(events, oldState) and an initialRDD.
• val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))
• val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.foldLeft(0)(_ + _)
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }
• val stateCount = pairs.updateStateByKey[Int](updateFunc)
DEMO
TRANSFORM OPERATION
• The transform operation allows arbitrary RDD-to-RDD functions to be applied on a DStream.
• It can be used to apply any RDD operation that is not exposed in the DStream API.
• For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API.
• val cleanedDStream = wordCounts.transform(rdd => {
    rdd.join(data) // data: an RDD computed ahead of time, e.g. a lookup table
  })
DEMO
JOIN OPERATIONS
• Stream-stream joins:
  • Streams can be very easily joined with other streams.
  • val stream1: DStream[(String, String)] = ...
  • val stream2: DStream[(String, String)] = ...
  • val joinedStream = stream1.join(stream2)
• Windowed join:
  • val windowedStream1 = stream1.window(Seconds(20))
  • val windowedStream2 = stream2.window(Minutes(1))
  • val joinedStream = windowedStream1.join(windowedStream2)
• Stream-dataset joins:
  • val dataset: RDD[(String, String)] = ...
  • val windowedStream = stream.window(Seconds(20))...
  • val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
DEMO
USING FOREACHRDD()
• foreachRDD is a powerful primitive that allows data to be sent out to external systems.
• dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      // one connection per partition, reused for all records in it
      val connection = ConnectionPool.getConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      ConnectionPool.returnConnection(connection)
    }
  }
• Using foreachRDD, each RDD is converted to a DataFrame, registered as a temporary table and then queried using SQL.
• words.foreachRDD { rdd =>
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    val wordsDataFrame = rdd.toDF("word")
    wordsDataFrame.registerTempTable("words")
    val wordCountsDataFrame =
      sqlContext.sql("select word, count(*) as total from words group by word")
    wordCountsDataFrame.show()
  }
DEMO
DSTREAMS (SPARK CODE)
• Internally, a DStream is characterized by a few basic properties:
  • A list of other DStreams that the DStream depends on
  • A time interval at which the DStream generates an RDD
  • A function that is used to generate an RDD after each time interval
• Methods that should be implemented by subclasses of DStream:
  • Time interval after which the DStream generates an RDD
    • def slideDuration: Duration
  • List of parent DStreams on which this DStream depends
    • def dependencies: List[DStream[_]]
  • Method that generates an RDD for the given time
    • def compute(validTime: Time): Option[RDD[T]]
• This class contains the basic operations available on all DStreams, such as `map`, `filter` and `window`. In addition, PairDStreamFunctions contains operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and `join`. These operations are automatically available on any DStream of pairs (e.g., DStream[(Int, Int)]) through implicit conversions.
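To make those three methods concrete, a toy subclass sketch (purely illustrative, not from the deck; real input sources extend InputDStream or ReceiverInputDStream rather than plain DStream):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}
import org.apache.spark.streaming.dstream.DStream

// Toy DStream that "generates" a one-element RDD per batch, just to show
// where slideDuration, dependencies and compute fit.
class TickDStream(ssc_ : StreamingContext, interval: Duration)
    extends DStream[Long](ssc_) {

  // A new RDD is produced once per `interval`.
  override def slideDuration: Duration = interval

  // No parent DStreams: this behaves like an input stream.
  override def dependencies: List[DStream[_]] = List()

  // Build the RDD for the given batch time; None means "no data this batch".
  override def compute(validTime: Time): Option[RDD[Long]] =
    Some(ssc_.sparkContext.parallelize(Seq(validTime.milliseconds)))
}
```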
Editor's Notes

  1. Continuous operator processing model. Each node continuously receives records, updates internal state, and emits new records. The latency is low but Fault tolerance is typically achieved through replication, using a synchronization protocol like Flux. D-Stream processing model. In each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream
  2. Continuous operator processing model. Each node continuously receives records, updates internal state, and emits new records. The latency is low but Fault tolerance is typically achieved through replication, using a synchronization protocol like Flux. D-Stream processing model. In each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream
  3. Continuous operator processing model. Each node continuously receives records, updates internal state, and emits new records. The latency is low but Fault tolerance is typically achieved through replication, using a synchronization protocol like Flux. D-Stream processing model. In each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream
  4. Continuous operator processing model. Each node continuously receives records, updates internal state, and emits new records. The latency is low but Fault tolerance is typically achieved through replication, using a synchronization protocol like Flux. D-Stream processing model. In each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream
  5. Have to have a sample code before coming to this slide.
  6. reference ids of the blocks for locating their data in the executor memory, (ii) offset information of the block data in the logs
  7. Have to read on Kafka Direct API.