SlideShare a Scribd company logo
1 of 32
Download to read offline
NRT: Deep Dive into Spark Streaming
Agenda
Spark Streaming Introduction
 Computing Model in Spark Streaming
 System Model & Architecture
 Fault-tolerance, Check pointing
 Comb on Spark Streaming
What is Spark Streaming?
• Extends Spark for doing large scale stream processing
• Efficient and fault-tolerant stateful stream processing
• Provides a simple batch-like API for implementing complex algorithms
Spark
Spark
SQL
Spark
Streaming
GraphX MLLib …
Why Spark Streaming
• Scales to hundreds of nodes
• Achieves low latency
• Efficiently recover from failures
• Integrates with batch and interactive processing
Spark Streaming
Run a streaming computation as a series of very small, deterministic
batch jobs
Spark
Spark
Streaming
batches of X
seconds
live data stream
processed
results
 Chop up the live stream into batches of X
seconds (Batch Size as low as 0.5 Second,
Latency is about 1 second)
 Spark treats each batch of data as RDDs
and processes them using RDD operations
 Finally, the processed results of the RDD
operations are returned in batches
Agenda
 Spark Streaming Introduction
Computing Model in Spark Streaming
 System Model & Architecture
 Fault-tolerance, Check pointing
 Comb on Spark Streaming
Spark Computing Model
• Discretized Stream (DStream)
Represents a stream of data
Implemented as a sequence of RDDs
• DStreams can be either …
Created from Streaming input sources
Created by applying transformations on existing DStreams
Spark Streaming - Key Concepts
 DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
 Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduceByKey, join, …
- Stateful operations – window, countByValueAndWindow, …
 Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of distributed datasets (RDDs)
representing a distributed stream of data
batch @ t+1batch @ t batch @ t+2
tweets DStream
stored in memory as an RDD
(immutable, distributed dataset)
Twitter Streaming API
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
flatMap flatMap flatMap
…
transformation: modify data in one DStream to create
another DStream
new DStream
new RDDs created
for every batch
batch @ t+1batch @ t batch @ t+2
tweets DStream
hashTags Dstream
[#cat, #dog, … ]
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
flatMap flatMap flatMap
save save save
batch @ t+1batch @ t batch @ t+2
tweets DStream
hashTags DStream
every batch
saved to HDFS
DStream of data
Example 2 – Count the hashtags over last 1 min
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
sliding window
operation
window length sliding interval
window length
sliding interval
tagCounts
Example 2 – Count the hashtags over last 1 min
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
hashTags
t-1 t t+1 t+2 t+3
sliding window
countByValue
count over all
the data in the
window
Example 3 - Arbitrary Stateful Computations
 Maintain arbitrary state, track sessions
- Maintain per-user mood as state, and update it with his/her tweets
moods = tweets.updateStateByKey(tweet => updateMood(tweet))
updateMood(newTweets, lastMood) => newMood
tweets
t-1 t t+1 t+2 t+3
moods
Example 4 - Combine Batch and Stream Processing
 Do arbitrary Spark RDD computation within DStream
- Join incoming tweets with a spam file to filter out bad tweets
tweets.transform(tweetsRDD => {
tweetsRDD.join(spamHDFSFile).filter(...)
})
•Combine Live data Streams with Historical data
•Combine Streaming with MLLib, GraphX algos
•Query Streaming data using SQL
Agenda
 Spark Streaming Introduction
 Computing Model in Spark Streaming
System Model & Architecture
 Fault-tolerance, Check pointing
 Comb on Spark Streaming
System - Components
• Network Input Tracker – Keeps track of the data received by each network receiver and
maps them to the corresponding input DStreams
• Job Scheduler – Periodically queries the DStream graph to generate Spark jobs from
received data, and hands them to Job Manager for execution
• Job Manager – Maintains a job queue and executes the jobs in Spark
ssc = new StreamingContext
t = ssc.twitterStream(“…”)
t.filter(…).foreach(…)
Your program
DStream graph
Spark Context
Network Input Tracker
Job Manager
Job Scheduler
Spark Client
RDD graph Scheduler
Block manager Shuffle tracker
Spark Worker
Block
manager
Task threads
Cluster Manager
DStream Graph
Spark Streaming Program DStream Graph
t = ssc.twitterStream(“…”)
.map(…)
t.foreach(…)
t1 = ssc.twitterStream(“…”)
t2 = ssc.twitterStream(“…”)
t = t1.union(t2).map(…)
t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)
T
M
FE
Twitter Input DStream
Mapped DStream
Foreach DStream
T1
U
M
T2
M FFE
FE FE
Dummy DStream signifying
an output operation
DStream Graph -> RDD Graphs -> Spark Jobs
• Every interval, RDD graph is computed from DStream graph
• For each output operation, a Spark action is created
• For each action, a Spark job is created to compute it
T
U
M
T
M FFE
FE FE
DStream Graph
B
U
M
B
M FA
A A
RDD Graph
Spark actions
Block RDDs with data
received in last batch
interval
3 Spark jobs
Execution Model – Receiving Data
Spark Streaming + Driver Spark Workers
StreamingContext.start()
Network
Input
Tracker
Receiver
Data Received
Block Manager
Blocks replicated
Block Manager
Master
Block
Manager
Blocks pushed
Execution Model – Job Scheduling
Spark Streaming + Driver Spark Workers
Network
Input
Tracker
Job Scheduler
Spark’s
Schedulers
Receiver
Block
Manager
Block
Manager
Jobs executed on
worker nodes
DStream
Graph
Job
Manager
JobQueue
Jobs
Block IDsRDDs
Agenda
 Spark Streaming Introduction
 Computing Model in Spark Streaming
 System Model & Architecture
Fault-tolerance, Check pointing
 Comb on Spark Streaming
Fault-tolerance: Worker
• RDDs remember the operations that
created them
• Batches of input data are replicated in
memory for fault-tolerance
• Data lost due to worker failure, can be
recomputed from replicated input data
• All transformed data is fault-tolerant,
and exactly-once transformations
input data
replicated
in memory
flatMap
lost partitions
recomputed
on other
workers
tweets
RDD
hashTags
RDD
Fault-tolerance: Driver
• Master saves the state of the DStreams to a checkpoint file
• Checkpoint file saved to HDFS periodically
• If master fails, it can be restarted using the checkpoint file
• Automated master fault recovery
Check Pointing - RDD
Saving RDD to HDFS to prevent RDD graph from growing too large
 Done internally in Spark transparent to the user program
 Done lazily, saved to HDFS the first time it is computed
red_rdd.checkpoint()
HDFS file
Contents of red_rdd saved
to a HDFS file
transparent to
all child RDDs
Why is RDD checkpointing necessary?
Stateful DStream operators can have infinite lineages
Large lineages lead to …
 Large closure of the RDD object  large task sizes  high task launch times
 High recovery times under failure
data
t-1 t t+1 t+2 t+3
states
Check Pointing - Metadata
Saving of the information defining the streaming computation to fault-tolerant
storage like HDFS. Metadata includes:
 Configuration - The configuration that were used to create the streaming application.
 DStream operations - The set of DStream operations that define the streaming
application.
 Incomplete batches - Batches whose jobs are queued but have not completed yet.
This is used to recover from failure of the node running the driver of the streaming
application
Agenda
 Spark Streaming Introduction
 Computing Model in Spark Streaming
 System Model & Architecture
 Fault-tolerance, Check pointing
Comb on Spark Streaming
Data Processing – Window Operation
• Window based Computing
• countByWindow
• reduceByKeyAndWindow
• Scenarios
• App Metrics in time Window
• Cache/Persist events in time Window
• Notes:
• Checkpoint
• Performance tuning
val eventCount = inputs.countByWindow(Seconds(30), Seconds(10))
eventCount.print()
val idCount=lines.map(x=>(parser(x),1)).reduceByKey(_+_)
val idWindowCount=idCount.reduceByKeyAndWindow(
(a:Int,b:Int)=>(a+b),Seconds(30),Seconds(10))
idWindowCount.print()
Data Processing – Session Operation
• Sessionization: updateStateByKey
val events =inputs.map[(String, (Long, String))](event => {
val eventObj = event.getBondValue()
val id = eventObj.getAppInfo().getId()
val eventName = eventObj.getEventInfo.getName()
val time = System.currentTimeMillis()
(eventName, (time, id))
})
val latestSessions = events.map[(String, (Long, Long, Long))](imp => {
(imp._1, (imp._2._1, imp._2._1, 1))
})
.reduceByKey((imp1, imp2) => {
(Math.min(imp1._1, imp2._1), Math.max(imp1._2, imp2._2), imp1._3 + imp2._3)
})
.updateStateByKey(SessionObject.updateStatbyOfSessions)
1
1
2
How to Update
Session Status?
def updateStatbyOfSessions(newBatch: Seq[(Long, Long, Long)], session: Option[(Long, Long, Long)]
): Option[(Long, Long, Long)] = {
var result: Option[(Long, Long, Long)] = null
if (newBatch.isEmpty) {
if (System.currentTimeMillis() - session.get._2 > SESSION_TIMEOUT) {
result = None
} else {
result =session
}
}
newBatch.foreach(c => {
if (session.isEmpty || c._1 - session.get._2 > SESSION_TIMEOUT) {
result = Some(c)
} else {
result = Some((
Math.min(c._1, session.get._1), //newStartTime
Math.max(c._2, session.get._2), //newFinishTime
session.get._3 + c._3 //newSumOfPageView
))
}
})
result
}
Create New Session
• Previous Session is empty
OR
• Previous Session Timeout,
With new batches
Extend Going on Sessions
• Previous Session not empty
AND
• New Batch is not timed out to
previous Session
Retire Timed out Session
• Previous session is timed out
QA

More Related Content

What's hot

Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingSantosh Sahoo
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot modePrakash Chockalingam
 
Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming宇 傅
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Databricks
 

What's hot (20)

Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot mode
 
Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
 

Viewers also liked

Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionEmanuele Bezzi
 
Big data, Analytics and Beyond
Big data, Analytics and BeyondBig data, Analytics and Beyond
Big data, Analytics and BeyondQuantUniversity
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Rahul Kumar
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldUwe Printz
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)Hadoop / Spark Conference Japan
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopQuantUniversity
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part IIQuantUniversity
 
Apache Spark: Coming up to speed
Apache Spark: Coming up to speedApache Spark: Coming up to speed
Apache Spark: Coming up to speedAdarsh Pannu
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache SparkQuantUniversity
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 

Viewers also liked (20)

Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Big data, Analytics and Beyond
Big data, Analytics and BeyondBig data, Analytics and Beyond
Big data, Analytics and Beyond
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part II
 
Apache Spark: Coming up to speed
Apache Spark: Coming up to speedApache Spark: Coming up to speed
Apache Spark: Coming up to speed
 
Deep learning - Part I
Deep learning - Part IDeep learning - Part I
Deep learning - Part I
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 

Similar to Deep dive into spark streaming

strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.pptrveiga100
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.pptAbhijitManna19
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.pptsnowflakebatch
 
strata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingstrata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingShidrokhGoudarzi1
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Tathagata Das
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017ajay_ei
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseMapR Technologies
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...vcrisan
 

Similar to Deep dive into spark streaming (20)

strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
strata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingstrata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streaming
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Spark cep
Spark cepSpark cep
Spark cep
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
 

Recently uploaded

Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubaikojalkojal131
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge GraphsEleniIlkou
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdfMatthew Sinclair
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdfMatthew Sinclair
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableSeo
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...tanu pandey
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)Delhi Call girls
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...nilamkumrai
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC
 
Wadgaon Sheri $ Call Girls Pune 10k @ I'm VIP Independent Escorts Girls 80057...
Wadgaon Sheri $ Call Girls Pune 10k @ I'm VIP Independent Escorts Girls 80057...Wadgaon Sheri $ Call Girls Pune 10k @ I'm VIP Independent Escorts Girls 80057...
Wadgaon Sheri $ Call Girls Pune 10k @ I'm VIP Independent Escorts Girls 80057...SUHANI PANDEY
 
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...nirzagarg
 
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋nirzagarg
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...Neha Pandey
 
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...SUHANI PANDEY
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdfMatthew Sinclair
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftAanSulistiyo
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...SUHANI PANDEY
 

Recently uploaded (20)

Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
 
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
 
Wadgaon Sheri $ Call Girls Pune 10k @ I'm VIP Independent Escorts Girls 80057...
Wadgaon Sheri $ Call Girls Pune 10k @ I'm VIP Independent Escorts Girls 80057...Wadgaon Sheri $ Call Girls Pune 10k @ I'm VIP Independent Escorts Girls 80057...
Wadgaon Sheri $ Call Girls Pune 10k @ I'm VIP Independent Escorts Girls 80057...
 
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
 
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
 

Deep dive into spark streaming

  • 1. NRT: Deep Dive into Spark Streaming
  • 2. Agenda Spark Streaming Introduction  Computing Model in Spark Streaming  System Model & Architecture  Fault-tolerance, Check pointing  Comb on Spark Streaming
  • 3. What is Spark Streaming? • Extends Spark for doing large scale stream processing • Efficient and fault-tolerant stateful stream processing • Provides a simple batch-like API for implementing complex algorithms Spark Spark SQL Spark Streaming GraphX MLLib …
  • 4. Why Spark Streaming • Scales to hundreds of nodes • Achieves low latency • Efficiently recover from failures • Integrates with batch and interactive processing
  • 5. Spark Streaming Run a streaming computation as a series of very small, deterministic batch jobs Spark Spark Streaming batches of X seconds live data stream processed results  Chop up the live stream into batches of X seconds (Batch Size as low as 0.5 Second, Latency is about 1 second)  Spark treats each batch of data as RDDs and processes them using RDD operations  Finally, the processed results of the RDD operations are returned in batches
  • 6. Agenda  Spark Streaming Introduction Computing Model in Spark Streaming  System Model & Architecture  Fault-tolerance, Check pointing  Comb on Spark Streaming
  • 7. Spark Computing Model • Discretized Stream (DStream) Represents a stream of data Implemented as a sequence of RDDs • DStreams can be either … Created from Streaming input sources Created by applying transformations on existing DStreams
  • 8. Spark Streaming - Key Concepts  DStream – sequence of RDDs representing a stream of data - Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets  Transformations – modify data from one DStream to another - Standard RDD operations – map, countByValue, reduceByKey, join, … - Stateful operations – window, countByValueAndWindow, …  Output Operations – send data to external entity - saveAsHadoopFiles – saves to HDFS - foreach – do anything with each batch of results
  • 9. Example 1 – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data batch @ t+1batch @ t batch @ t+2 tweets DStream stored in memory as an RDD (immutable, distributed dataset) Twitter Streaming API
  • 10. Example 1 – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) flatMap flatMap flatMap … transformation: modify data in one DStream to create another DStream new DStream new RDDs created for every batch batch @ t+1batch @ t batch @ t+2 tweets DStream hashTags Dstream [#cat, #dog, … ]
  • 11. Example 1 – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") output operation: to push data to external storage flatMap flatMap flatMap save save save batch @ t+1batch @ t batch @ t+2 tweets DStream hashTags DStream every batch saved to HDFS
  • 12. DStream of data Example 2 – Count the hashtags over last 1 min val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue() sliding window operation window length sliding interval window length sliding interval
  • 13. tagCounts Example 2 – Count the hashtags over last 1 min val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue() hashTags t-1 t t+1 t+2 t+3 sliding window countByValue count over all the data in the window
  • 14. Example 3 - Arbitrary Stateful Computations  Maintain arbitrary state, track sessions - Maintain per-user mood as state, and update it with his/her tweets moods = tweets.updateStateByKey(tweet => updateMood(tweet)) updateMood(newTweets, lastMood) => newMood tweets t-1 t t+1 t+2 t+3 moods
  • 15. Example 4 - Combine Batch and Stream Processing  Do arbitrary Spark RDD computation within DStream - Join incoming tweets with a spam file to filter out bad tweets tweets.transform(tweetsRDD => { tweetsRDD.join(spamHDFSFile).filter(...) }) •Combine Live data Streams with Historical data •Combine Streaming with MLLib, GraphX algos •Query Streaming data using SQL
  • 16. Agenda  Spark Streaming Introduction  Computing Model in Spark Streaming System Model & Architecture  Fault-tolerance, Check pointing  Comb on Spark Streaming
  • 17. System - Components • Network Input Tracker – Keeps track of the data received by each network receiver and maps them to the corresponding input DStreams • Job Scheduler – Periodically queries the DStream graph to generate Spark jobs from received data, and hands them to Job Manager for execution • Job Manager – Maintains a job queue and executes the jobs in Spark ssc = new StreamingContext t = ssc.twitterStream(“…”) t.filter(…).foreach(…) Your program DStream graph Spark Context Network Input Tracker Job Manager Job Scheduler Spark Client RDD graph Scheduler Block manager Shuffle tracker Spark Worker Block manager Task threads Cluster Manager
  • 18. DStream Graph Spark Streaming Program DStream Graph t = ssc.twitterStream(“…”) .map(…) t.foreach(…) t1 = ssc.twitterStream(“…”) t2 = ssc.twitterStream(“…”) t = t1.union(t2).map(…) t.saveAsHadoopFiles(…) t.map(…).foreach(…) t.filter(…).foreach(…) T M FE Twitter Input DStream Mapped DStream Foreach DStream T1 U M T2 M FFE FE FE Dummy DStream signifying an output operation
  • 19. DStream Graph -> RDD Graphs -> Spark Jobs • Every interval, RDD graph is computed from DStream graph • For each output operation, a Spark action is created • For each action, a Spark job is created to compute it T U M T M FFE FE FE DStream Graph B U M B M FA A A RDD Graph Spark actions Block RDDs with data received in last batch interval 3 Spark jobs
  • 20. Execution Model – Receiving Data Spark Streaming + Driver Spark Workers StreamingContext.start() Network Input Tracker Receiver Data Received Block Manager Blocks replicated Block Manager Master Block Manager Blocks pushed
  • 21. Execution Model – Job Scheduling Spark Streaming + Driver Spark Workers Network Input Tracker Job Scheduler Spark’s Schedulers Receiver Block Manager Block Manager Jobs executed on worker nodes DStream Graph Job Manager JobQueue Jobs Block IDsRDDs
  • 22. Agenda  Spark Streaming Introduction  Computing Model in Spark Streaming  System Model & Architecture Fault-tolerance, Check pointing  Comb on Spark Streaming
  • 23. Fault-tolerance: Worker • RDDs remember the operations that created them • Batches of input data are replicated in memory for fault-tolerance • Data lost due to worker failure, can be recomputed from replicated input data • All transformed data is fault-tolerant, and exactly-once transformations input data replicated in memory flatMap lost partitions recomputed on other workers tweets RDD hashTags RDD
  • 24. Fault-tolerance: Driver • Master saves the state of the DStreams to a checkpoint file • Checkpoint file saved to HDFS periodically • If master fails, it can be restarted using the checkpoint file • Automated master fault recovery
  • 25. Check Pointing - RDD Saving RDD to HDFS to prevent RDD graph from growing too large  Done internally in Spark transparent to the user program  Done lazily, saved to HDFS the first time it is computed red_rdd.checkpoint() HDFS file Contents of red_rdd saved to a HDFS file transparent to all child RDDs
  • 26. Why is RDD checkpointing necessary? Stateful DStream operators can have infinite lineages Large lineages lead to …  Large closure of the RDD object  large task sizes  high task launch times  High recovery times under failure data t-1 t t+1 t+2 t+3 states
  • 27. Check Pointing - Metadata Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. Metadata includes:  Configuration - The configuration that were used to create the streaming application.  DStream operations - The set of DStream operations that define the streaming application.  Incomplete batches - Batches whose jobs are queued but have not completed yet. This is used to recover from failure of the node running the driver of the streaming application
  • 28. Agenda  Spark Streaming Introduction  Computing Model in Spark Streaming  System Model & Architecture  Fault-tolerance, Check pointing Comb on Spark Streaming
  • 29. Data Processing – Window Operation • Window based Computing • countByWindow • reduceByKeyAndWindow • Scenarios • App Metrics in time Window • Cache/Persist events in time Window • Notes: • Checkpoint • Performance tuning val eventCount = inputs.countByWindow(Seconds(30), Seconds(10)) eventCount.print() val idCount=lines.map(x=>(parser(x),1)).reduceByKey(_+_) val idWindowCount=idCount.reduceByKeyAndWindow( (a:Int,b:Int)=>(a+b),Seconds(30),Seconds(10)) idWindowCount.print()
  • 30. Data Processing – Session Operation • Sessionization: updateStateByKey val events =inputs.map[(String, (Long, String))](event => { val eventObj = event.getBondValue() val id = eventObj.getAppInfo().getId() val eventName = eventObj.getEventInfo.getName() val time = System.currentTimeMillis() (eventName, (time, id)) }) val latestSessions = events.map[(String, (Long, Long, Long))](imp => { (imp._1, (imp._2._1, imp._2._1, 1)) }) .reduceByKey((imp1, imp2) => { (Math.min(imp1._1, imp2._1), Math.max(imp1._2, imp2._2), imp1._3 + imp2._3) }) .updateStateByKey(SessionObject.updateStatbyOfSessions) 1 1 2 How to Update Session Status?
  • 31. def updateStatbyOfSessions(newBatch: Seq[(Long, Long, Long)], session: Option[(Long, Long, Long)] ): Option[(Long, Long, Long)] = { var result: Option[(Long, Long, Long)] = null if (newBatch.isEmpty) { if (System.currentTimeMillis() - session.get._2 > SESSION_TIMEOUT) { result = None } else { result =session } } newBatch.foreach(c => { if (session.isEmpty || c._1 - session.get._2 > SESSION_TIMEOUT) { result = Some(c) } else { result = Some(( Math.min(c._1, session.get._1), //newStartTime Math.max(c._2, session.get._2), //newFinishTime session.get._3 + c._3 //newSumOfPageView )) } }) result } Create New Session • Previous Session is empty OR • Previous Session Timeout, With new batches Extend Going on Sessions • Previous Session not empty AND • New Batch is not timed out to previous Session Retire Timed out Session • Previous session is timed out
  • 32. QA