SlideShare a Scribd company logo
1 of 72
Download to read offline
Joan Viladrosa, Billy Mobile
Apache Spark Streaming
+ Kafka 0.10: An Integration Story
#EUstr5
About me
Joan Viladrosa Riera
@joanvr
joanviladrosa
joan.viladrosa@billymob.com
2#EUstr5
Degree In Computer Science
Advanced Programming Techniques
System Interfaces and Integration
Co-Founder, Educabits
Educational Big data solutions
using AWS cloud
Big Data Developer, Trovit
Hadoop and MapReduce Framework
SEM keywords optimization
Big Data Architect & Tech Le
BillyMobile
Full architecture with Hadoop:
Kafka, Storm, Hive, HBase, Spark, D
Apache Kafka
#EUstr5
What is
Apache
Kafka?
• Publish - Subscribe
Message System
4#EUstr5
What is
Apache
Kafka?
• Publish - Subscribe
Message System
• Fast
• Scalable
• Durable
• Fault-tolerant
What makes it great?
5#EUstr5
What is Apache Kafka?
As a central point
Producer Producer Producer Producer
Kafka
Consumer Consumer Consumer Consumer
6#EUstr5
What is Apache Kafka?
A lot of different connectors
Apache
Storm
Apache
Spark
My Java App Logger
Kafka
Apache
Storm
Apache
Spark
My Java App
Monitoring
Tool
7#EUstr5
Kafka
Terminology
Topic: A feed of messages
Producer: Processes that publish
messages to a topic
Consumer: Processes that
subscribe to topics and process the
feed of published messages
Broker: Each server of a kafka
cluster that holds, receives and
sends the actual data
8#EUstr5
Kafka Topic Partitions
0 1 2 3 4 5 6Partition 0
Partition 1
Partition 2
0 1 2 3 4 5 6
0 1 2 3 4 5 6
7 8 9
7 8
Topic:
Old New
writes
9#EUstr5
Kafka Topic Partitions
0 1 2 3 4 5 6Partition 0 7 8 9
Old New
1
0
1
1
1
2
1
3
1
4
1
5
Producer
writes
Consumer A
(offset=6)
Consumer B
(offset=12)
reads reads
10#EUstr5
Kafka Topic Partitions
0 1 2 3 4 5 6P0
P1
P2
0 1 2 3 4 5 6
0 1 2 3 4 5 6
7 8 9
7 8
0 1 2 3 4 5 6P3
P4
P5
0 1 2 3 4 5 6
0 1 2 3 4 5 6
7 8 9
7 8
0 1 2 3 4 5 6P6
P7
P8
0 1 2 3 4 5 6
0 1 2 3 4 5 6
7 8 9
7 8
Broker 1 Broker 2 Broker 3
Consumers &
Producers
11#EUstr5
Kafka Topic Partitions
0 1 2 3 4 5 6P0
P1
P2
0 1 2 3 4 5 6
0 1 2 3 4 5 6
7 8 9
7 8
0 1 2 3 4 5 6P3
P4
P5
0 1 2 3 4 5 6
0 1 2 3 4 5 6
7 8 9
7 8
0 1 2 3 4 5 6P6
P7
P8
0 1 2 3 4 5 6
0 1 2 3 4 5 6
7 8 9
7 8
Broker 1 Broker 2 Broker 3
Consumers &
Producers
More
Storage
More
Parallelism
12#EUstr5
Kafka Semantics
In short: consumer
delivery semantics are
up to you, not Kafka
• Kafka doesn’t store the
state of the consumers*
• It just sends you what
you ask for (topic,
partition, offset, length)
• You have to take care of
your state
13#EUstr5
Apache Kafka Timeline
may-2016nov-2015nov-2013nov-2012
New
Producer
New
Consumer
Security
Kafka Streams
Apache
Incubator
Project
0.7 0.8 0.9 0.10
14#EUstr5
Apache
Spark Streaming
#EUstr5
• Process streams of data
• Micro-batching approach
What is
Apache
Spark
Streaming?
16#EUstr5
• Process streams of data
• Micro-batching approach
• Same API as Spark
• Same integrations as Spark
• Same guarantees &
semantics as Spark
What makes it great?
What is
Apache
Spark
Streaming?
17#EUstr5
What is Apache Spark Streaming?
Relying on the same Spark Engine: “same syntax” as batch jobs
https://spark.apache.org/docs/latest/streaming-programming-guide.html18
How does it work?
• Discretized Streams
https://spark.apache.org/docs/latest/streaming-programming-guide.html19
How does it work?
• Discretized Streams
https://spark.apache.org/docs/latest/streaming-programming-guide.html20
How does it work?
21https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
How does it work?
22https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
Spark
Streaming
Semantics
As in Spark:
• Not guarantee exactly-once
semantics for output actions
• Any side-effecting output
operations may be repeated
• Because of node failure, process
failure, etc.
So, be careful when outputting to
external sources
Side effects
23#EUstr5
Spark Streaming
Kafka Integration
#EUstr5
Spark Streaming Kafka Integration Timeline
dec-2016jul-2016jan-2016sep-2015jun-2015mar-2015dec-2014sep-2014
Fault Tolerant
WAL
+
Python API
Direct
Streams
+
Python API
Improved
Streaming UI
Metadata in
UI (offsets)
+
Graduated
Direct
Receivers Native Kafka
0.10
(experimental)
1.1 1.2 1.3 1.4 1.5 1.6 2.0 2.1
25#EUstr5
Kafka Receiver (≤ Spark 1.1)
Executor
Driver
Launch jobs
on data
Continuously receive
data using
High Level API
Update offsets in
ZooKeeper
Receiver
26#EUstr5
Kafka Receiver with WAL (Spark 1.2)
HDFS
Executor
Driver
Launch jobs
on data
Continuously receive
data using
High Level API
Update offsets in
ZooKeeper
WAL
Receiver
27#EUstr5
Application
Driver
Executor
Spark
Context
Jobs
Computation
checkpointed
Receiver
Input
stream
Block
metadata
Block
metadata
written
to log
Block data
written both
memory + log
Streaming
Context
Kafka Receiver with WAL (Spark 1.2)
28#EUstr5
Kafka Receiver with WAL (Spark 1.2)
Restarted
Driver
Restarted
Executor
Restarted
Spark
Context
Relaunch
Jobs
Restart
computation
from info in
checkpoints Restarted
Receiver
Resend
unacked data
Recover
Block
metadata
from log
Recover Block
data from log
Restarted
Streaming
Context
29#EUstr5
Kafka Receiver with WAL (Spark 1.2)
HDFS
Executor
Driver
Launch jobs
on data
Continuously receive
data using
High Level API
Update offsets in
ZooKeeper
WAL
Receiver
30#EUstr5
Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
Executor
Driver
31#EUstr5
Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
Executor
Driver 1. Query latest offsets
and decide offset ranges
for batch
32#EUstr5
Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
Executor
1. Query latest offsets
and decide offset ranges
for batch
2. Launch jobs
using offset
ranges
Driver
topic1, p1,
(2000, 2100)
topic1, p2,
(2010, 2110)
topic1, p3,
(2002, 2102)
33#EUstr5
Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
Executor
1. Query latest offsets
and decide offset ranges
for batch
2. Launch jobs
using offset
ranges
Driver
topic1, p1,
(2000, 2100)
topic1, p2,
(2010, 2110)
topic1, p3,
(2002, 2102)
3. Reads data using
offset ranges in jobs
using Simple API
34#EUstr5
Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
Executor
Driver
2. Launch jobs
using offset
ranges
3. Reads data using
offset ranges in jobs
using Simple API
1. Query latest offsets
and decide offset ranges
for batchtopic1, p1,
(2000, 2100)
topic1, p2,
(2010, 2110)
topic1, p3,
(2002, 2102)
35#EUstr5
Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
Executor
Driver
2. Launch jobs
using offset
ranges
3. Reads data using
offset ranges in jobs
using Simple API
1. Query latest offsets
and decide offset ranges
for batchtopic1, p1,
(2000, 2100)
topic1, p2,
(2010, 2110)
topic1, p3,
(2002, 2102)
36#EUstr5
Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
Executor
Driver
2. Launch jobs
using offset
ranges
3. Reads data using
offset ranges in jobs
using Simple API
1. Query latest offsets
and decide offset ranges
for batch
37#EUstr5
Direct Kafka
API benefits
• No WALs or Receivers
• Allows end-to-end exactly-
once semantics pipelines *
* updates to downstream systems should be
idempotent or transactional
• More fault-tolerant
• More efficient
• Easier to use.
38#EUstr5
Spark Streaming UI improvements (Spark 1.4)39
Kafka Metadata (offsets) in UI (Spark 1.5)40
What about Spark 2.0+ and
new Kafka Integration?
This is why we are here, right?
41#EUstr5
Spark 2.0+ new Kafka Integration
spark-streaming-kafka-0-8 spark-streaming-kafka-0-10
Broker Version 0.8.2.1 or higher 0.10.0 or higher
Api Stability Stable Experimental
Language Support Scala, Java, Python Scala, Java
Receiver DStream Yes No
Direct DStream Yes Yes
SSL / TLS Support No Yes
Offset Commit Api No Yes
Dynamic Topic Subscription No Yes
42#EUstr5
What’s really
New with this
New Kafka
Integration?
• New Consumer API
* Instead of Simple API
• Location Strategies
• Consumer Strategies
• SSL / TLS
• No Python API :(
43#EUstr5
Location Strategies
• New consumer API will pre-fetch messages into buffers
• So, keep cached consumers into executors
• It’s better to schedule partitions on the host with
appropriate consumers
44#EUstr5
Location Strategies
- PreferConsistent
Distribute partitions evenly across available executors
- PreferBrokers
If your executors are on the same hosts as your Kafka brokers
- PreferFixed
Specify an explicit mapping of partitions to hosts
45#EUstr5
Consumer Strategies
• New consumer API has a number of different
ways to specify topics, some of which require
considerable post-object-instantiation setup.
• ConsumerStrategies provides an abstraction
that allows Spark to obtain properly configured
consumers even after restart from checkpoint.
46#EUstr5
Consumer Strategies
- Subscribe subscribe to a fixed collection of topics
- SubscribePattern use a regex to specify topics of
interest
- Assign specify a fixed collection of partitions
● Overloaded constructors to specify the starting
offset for a particular partition.
● ConsumerStrategy is a public class that you can47#EUstr5
SSL/TTL encryption
• New consumer API supports SSL
• Only applies to communication between Spark
and Kafka brokers
• Still responsible for separately securing Spark
inter-node communication
48#EUstr5
How to use
New Kafka
Integration on
Spark 2.0+
Scala Example Code
Basic usage
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "broker01:9092,broker02:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "stream_group_id",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
49#EUstr5
How to use
New Kafka
Integration on
Spark 2.0+
Java Example Code
Getting metadata
stream .foreachRD D { rdd = >
valoffsetRanges = rdd.asInstanceO f[HasO ffsetRanges]
.offsetRanges
rdd.foreachPartition { iter= >
valosr:O ffsetRange = offsetRanges(
TaskContext.get.partitionId)
//getany needed data from the offsetrange
valtopic = osr.topic
valkafkaPartitionId = osr.partition
valbegin = osr.from O ffset
valend = osr.untilO ffset
}
}
50#EUstr5
RDDTopic
Kafka or Spark RDD Partitions?
Kafka Spark
51
1
2
3
4
1
2
3
4
RDDTopic
Kafka or Spark RDD Partitions?
Kafka Spark
52
1
2
3
4
1
2
3
4
How to use
New Kafka
Integration on
Spark 2.0+
Java Example Code
Getting metadata
stream .foreachRD D { rdd = >
valoffsetRanges = rdd.asInstanceO f[HasO ffsetRanges]
.offsetRanges
rdd.foreachPartition { iter= >
valosr:O ffsetRange = offsetRanges(
TaskContext.get.partitionId)
//getany needed data from the offsetrange
valtopic = osr.topic
valkafkaPartitionId = osr.partition
valbegin = osr.from O ffset
valend = osr.untilO ffset
}
}
53#EUstr5
How to use
New Kafka
Integration on
Spark 2.0+
Java Example Code
Store offsets in Kafka itself:
Commit API
stream .foreachRD D { rdd = >
valoffsetRanges = rdd.asInstanceO f[HasO ffsetRanges]
.offsetRanges
//D O YO UR STUFF w ith DATA
stream .asInstanceO f[CanCom m itO ffsets]
.com m itAsync(offsetRanges)
}
}
54#EUstr5
Kafka + Spark Semantics
- At most once
- At least once
- Exactly once
55#EUstr5
Kafka + Spark
Semantics
• We don’t want duplicates
• Not worth the hassle of ensuring that
messages don’t get lost
• Example: Sending statistics over UDP
1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false
(the default)
3. Set Kafka param auto.offset.reset
to “largest”
4. Set Kafka param
enable.auto.commit to true
At most once
56#EUstr5
Kafka + Spark
Semantics
• This will mean you lose messages on
restart
• At least they shouldn’t get replayed.
• Test this carefully if it’s actually
important to you that a message never
gets repeated, because it’s not a
common use case.
At most once
57#EUstr5
Kafka + Spark
Semantics
• We don’t want to loose any record
• We don’t care about duplicates
• Example: Sending internal alerts on
relative rare occurrences on the stream
1. Set spark.task.maxFailures > 1000
2. Set Kafka param auto.offset.reset
to “smallest”
3. Set Kafka param
enable.auto.commit to false
At least once
58#EUstr5
Kafka + Spark
Semantics
• Don’t be silly! Do NOT replay your whole
log on every restart…
• Manually commit the offsets when you
are 100% sure records are processed
• If this is “too hard” you’d better have a
relative short retention log
• Or be REALLY ok with duplicates. For
example, you are outputting to an
external system that handles duplicates
for you (HBase)
At least once
59#EUstr5
Kafka + Spark
Semantics
• We don’t want to loose any record
• We don’t want duplicates either
• Example: Storing stream in data
warehouse
1. We need some kind of idempotent writes,
or whole-or-nothing writes (transactions)
2. Only store offsets EXACTLY after writing
data
3. Same parameters as at least once
Exactly once
60#EUstr5
Kafka + Spark
Semantics
• Probably the hardest to achieve right
• Still some small chance of failure if your
app fails just between writing data and
committing offsets… (but REALLY small)
Exactly once
61#EUstr5
Apache Kafka
Apacke Spark
at Billy Mobile
62
15Brecords monthly
35T
Bweekly retention
log
6Kevents/second
x4growth/year
Our use
cases
• Input events from Kafka
• Enrich events with some
external data sources
• Finally store it to Hive
We do NOT want duplicates
We do NOT want to lose events
ETL to Data
Warehouse
63
Our use
cases
• Hive is not transactional
• Neither idempotent writes
• Writing files to HDFS is “atomic”
(whole or nothing)
• A relation 1:1 from each
partition-batch to file in HDFS
• Store to ZK the current state of
the batch
• Store to ZK offsets of last
finished batch
ETL to Data
Warehouse
64
Our use
cases
• Input events from Kafka
• Periodically load batch-
computed model
• Detect when an offer stops
converting (or too much)
• We do not care about losing
some events (on restart)
• We always need to process the
“real-time” stream
Anomalies detector
65
Our use
cases
• It’s useless to detect anomalies
on a lagged stream!
• Actually it could be very bad
• Always restart stream on latest
offsets
• Restart with “fresh” state
Anomalies detector
66
Our use
cases
• Input events from Kafka
• Almost no processing
• Store it to HBase
– (has idempotent writes)
• We do not care about duplicates
• We can NOT lose a single event
Store to
Entity Cache
67
Our use
cases
• Since HBase has idempotent writes,
we can write events multiple times
without hassle
• But, we do NOT start with earliest
offsets…
– That would be 7 days of redundant writes…!!!
• We store offsets of last finished batch
• But obviously we might re-write some
events on restart or failure
Store to
Entity Cache
68
Lessons
Learned
• Do NOT use checkpointing
– Not recoverable across code upgrades
– Do your own checkpointing
• Track offsets yourself
– In general, more reliable:
HDFS, ZK, RMDBS...
• Memory usually is an issue
– You don’t want to waste it
– Adjust batchDuration
– Adjust maxRatePerPartition
69
Further
Improvements
• Dynamic Allocation
spark.dynamicAllocation.enabled vs
spark.streaming.dynamicAllocation.enabled
https://issues.apache.org/jira/browse/SPARK-12133
But no reference in docs...
• Graceful shutdown
• Structured Streaming
70
Thank you very much!
Questions?
@joanvr
joanviladrosa
joan.viladrosa@billymob.com

More Related Content

What's hot

Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinFlink Forward
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Spark Summit
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Spark Summit
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...Spark Summit
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings MeetupGwen (Chen) Shapira
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingSpark Summit
 
Internals of Speeding up PySpark with Arrow
 Internals of Speeding up PySpark with Arrow Internals of Speeding up PySpark with Arrow
Internals of Speeding up PySpark with ArrowDatabricks
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryDatabricks
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDatabricks
 
State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)Nicolas Poggi
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Spark Summit
 

What's hot (20)

Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
 
Internals of Speeding up PySpark with Arrow
 Internals of Speeding up PySpark with Arrow Internals of Speeding up PySpark with Arrow
Internals of Speeding up PySpark with Arrow
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
 
State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
 

Similar to Spark Streaming and Kafka Integration

Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraSpark Summit
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration StoryJoan Viladrosa Riera
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsTimothy Spann
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Timothy Spann
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafkaconfluent
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...Athens Big Data
 
[March sn meetup] apache pulsar + apache nifi for cloud data lake
[March sn meetup] apache pulsar + apache nifi for cloud data lake[March sn meetup] apache pulsar + apache nifi for cloud data lake
[March sn meetup] apache pulsar + apache nifi for cloud data lakeTimothy Spann
 
introductiontoapachekafka-201102140206.pdf
introductiontoapachekafka-201102140206.pdfintroductiontoapachekafka-201102140206.pdf
introductiontoapachekafka-201102140206.pdfTarekHamdi8
 
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...DataWorks Summit
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksAndrii Gakhov
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyondXiao Li
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
 

Similar to Spark Streaming and Kafka Integration (20)

Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
 
Kafka Explainaton
Kafka ExplainatonKafka Explainaton
Kafka Explainaton
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
[March sn meetup] apache pulsar + apache nifi for cloud data lake
[March sn meetup] apache pulsar + apache nifi for cloud data lake[March sn meetup] apache pulsar + apache nifi for cloud data lake
[March sn meetup] apache pulsar + apache nifi for cloud data lake
 
introductiontoapachekafka-201102140206.pdf
introductiontoapachekafka-201102140206.pdfintroductiontoapachekafka-201102140206.pdf
introductiontoapachekafka-201102140206.pdf
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyond
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 

More from Big Data Spain

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data Spain
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Big Data Spain
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017Big Data Spain
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Big Data Spain
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Big Data Spain
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Big Data Spain
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...Big Data Spain
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Big Data Spain
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Big Data Spain
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Big Data Spain
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Big Data Spain
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...Big Data Spain
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Big Data Spain
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...Big Data Spain
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Big Data Spain
 

More from Big Data Spain (20)

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
 

Recently uploaded

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 

Recently uploaded (20)

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 

Spark Streaming and Kafka Integration

  • 1.
  • 2. Joan Viladrosa, Billy Mobile Apache Spark Streaming + Kafka 0.10: An Integration Story #EUstr5
  • 3. About me Joan Viladrosa Riera @joanvr joanviladrosa joan.viladrosa@billymob.com 2#EUstr5 Degree In Computer Science Advanced Programming Techniques System Interfaces and Integration Co-Founder, Educabits Educational Big data solutions using AWS cloud Big Data Developer, Trovit Hadoop and MapReduce Framework SEM keywords optimization Big Data Architect & Tech Le BillyMobile Full architecture with Hadoop: Kafka, Storm, Hive, HBase, Spark, D
  • 5. What is Apache Kafka? • Publish - Subscribe Message System 4#EUstr5
  • 6. What is Apache Kafka? • Publish - Subscribe Message System • Fast • Scalable • Durable • Fault-tolerant What makes it great? 5#EUstr5
  • 7. What is Apache Kafka? As a central point Producer Producer Producer Producer Kafka Consumer Consumer Consumer Consumer 6#EUstr5
  • 8. What is Apache Kafka? A lot of different connectors Apache Storm Apache Spark My Java App Logger Kafka Apache Storm Apache Spark My Java App Monitoring Tool 7#EUstr5
  • 9. Kafka Terminology Topic: A feed of messages Producer: Processes that publish messages to a topic Consumer: Processes that subscribe to topics and process the feed of published messages Broker: Each server of a kafka cluster that holds, receives and sends the actual data 8#EUstr5
  • 10. Kafka Topic Partitions 0 1 2 3 4 5 6Partition 0 Partition 1 Partition 2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Topic: Old New writes 9#EUstr5
  • 11. Kafka Topic Partitions 0 1 2 3 4 5 6Partition 0 7 8 9 Old New 1 0 1 1 1 2 1 3 1 4 1 5 Producer writes Consumer A (offset=6) Consumer B (offset=12) reads reads 10#EUstr5
  • 12. Kafka Topic Partitions 0 1 2 3 4 5 6P0 P1 P2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P3 P4 P5 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P6 P7 P8 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Broker 1 Broker 2 Broker 3 Consumers & Producers 11#EUstr5
  • 13. Kafka Topic Partitions 0 1 2 3 4 5 6P0 P1 P2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P3 P4 P5 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P6 P7 P8 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Broker 1 Broker 2 Broker 3 Consumers & Producers More Storage More Parallelism 12#EUstr5
  • 14. Kafka Semantics In short: consumer delivery semantics are up to you, not Kafka • Kafka doesn’t store the state of the consumers* • It just sends you what you ask for (topic, partition, offset, length) • You have to take care of your state 13#EUstr5
  • 15. Apache Kafka Timeline may-2016nov-2015nov-2013nov-2012 New Producer New Consumer Security Kafka Streams Apache Incubator Project 0.7 0.8 0.9 0.10 14#EUstr5
  • 17. • Process streams of data • Micro-batching approach What is Apache Spark Streaming? 16#EUstr5
  • 18. • Process streams of data • Micro-batching approach • Same API as Spark • Same integrations as Spark • Same guarantees & semantics as Spark What makes it great? What is Apache Spark Streaming? 17#EUstr5
  • 19. What is Apache Spark Streaming? Relying on the same Spark Engine: “same syntax” as batch jobs https://spark.apache.org/docs/latest/streaming-programming-guide.html18
  • 20. How does it work? • Discretized Streams https://spark.apache.org/docs/latest/streaming-programming-guide.html19
  • 21. How does it work? • Discretized Streams https://spark.apache.org/docs/latest/streaming-programming-guide.html20
  • 22. How does it work? 21https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
  • 23. How does it work? 22https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
  • 24. Spark Streaming Semantics As in Spark: • Not guarantee exactly-once semantics for output actions • Any side-effecting output operations may be repeated • Because of node failure, process failure, etc. So, be careful when outputting to external sources Side effects 23#EUstr5
  • 26. Spark Streaming Kafka Integration Timeline dec-2016jul-2016jan-2016sep-2015jun-2015mar-2015dec-2014sep-2014 Fault Tolerant WAL + Python API Direct Streams + Python API Improved Streaming UI Metadata in UI (offsets) + Graduated Direct Receivers Native Kafka 0.10 (experimental) 1.1 1.2 1.3 1.4 1.5 1.6 2.0 2.1 25#EUstr5
  • 27. Kafka Receiver (≤ Spark 1.1) Executor Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper Receiver 26#EUstr5
  • 28. Kafka Receiver with WAL (Spark 1.2) HDFS Executor Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper WAL Receiver 27#EUstr5
  • 30. Kafka Receiver with WAL (Spark 1.2) Restarted Driver Restarted Executor Restarted Spark Context Relaunch Jobs Restart computation from info in checkpoints Restarted Receiver Resend unacked data Recover Block metadata from log Recover Block data from log Restarted Streaming Context 29#EUstr5
  • 31. Kafka Receiver with WAL (Spark 1.2) HDFS Executor Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper WAL Receiver 30#EUstr5
  • 32. Direct Kafka Integration w/o Receivers or WALs (Spark 1.3) Executor Driver 31#EUstr5
  • 33. Direct Kafka Integration w/o Receivers or WALs (Spark 1.3) Executor Driver 1. Query latest offsets and decide offset ranges for batch 32#EUstr5
  • 34. Direct Kafka Integration w/o Receivers or WALs (Spark 1.3) Executor 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges Driver topic1, p1, (2000, 2100) topic1, p2, (2010, 2110) topic1, p3, (2002, 2102) 33#EUstr5
  • 35. Direct Kafka Integration w/o Receivers or WALs (Spark 1.3) Executor 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges Driver topic1, p1, (2000, 2100) topic1, p2, (2010, 2110) topic1, p3, (2002, 2102) 3. Reads data using offset ranges in jobs using Simple API 34#EUstr5
  • 36. Direct Kafka Integration w/o Receivers or WALs (Spark 1.3) Executor Driver 2. Launch jobs using offset ranges 3. Reads data using offset ranges in jobs using Simple API 1. Query latest offsets and decide offset ranges for batchtopic1, p1, (2000, 2100) topic1, p2, (2010, 2110) topic1, p3, (2002, 2102) 35#EUstr5
  • 37. Direct Kafka Integration w/o Receivers or WALs (Spark 1.3) Executor Driver 2. Launch jobs using offset ranges 3. Reads data using offset ranges in jobs using Simple API 1. Query latest offsets and decide offset ranges for batchtopic1, p1, (2000, 2100) topic1, p2, (2010, 2110) topic1, p3, (2002, 2102) 36#EUstr5
  • 38. Direct Kafka Integration w/o Receivers or WALs (Spark 1.3) Executor Driver 2. Launch jobs using offset ranges 3. Reads data using offset ranges in jobs using Simple API 1. Query latest offsets and decide offset ranges for batch 37#EUstr5
  • 39. Direct Kafka API benefits • No WALs or Receivers • Allows end-to-end exactly- once semantics pipelines * * updates to downstream systems should be idempotent or transactional • More fault-tolerant • More efficient • Easier to use. 38#EUstr5
  • 40. Spark Streaming UI improvements (Spark 1.4)39
  • 41. Kafka Metadata (offsets) in UI (Spark 1.5)40
  • 42. What about Spark 2.0+ and new Kafka Integration? This is why we are here, right? 41#EUstr5
  • 43. Spark 2.0+ new Kafka Integration spark-streaming-kafka-0-8 spark-streaming-kafka-0-10 Broker Version 0.8.2.1 or higher 0.10.0 or higher Api Stability Stable Experimental Language Support Scala, Java, Python Scala, Java Receiver DStream Yes No Direct DStream Yes Yes SSL / TLS Support No Yes Offset Commit Api No Yes Dynamic Topic Subscription No Yes 42#EUstr5
  • 44. What’s really New with this New Kafka Integration? • New Consumer API * Instead of Simple API • Location Strategies • Consumer Strategies • SSL / TLS • No Python API :( 43#EUstr5
  • 45. Location Strategies • New consumer API will pre-fetch messages into buffers • So, keep cached consumers into executors • It’s better to schedule partitions on the host with appropriate consumers 44#EUstr5
  • 46. Location Strategies - PreferConsistent Distribute partitions evenly across available executors - PreferBrokers If your executors are on the same hosts as your Kafka brokers - PreferFixed Specify an explicit mapping of partitions to hosts 45#EUstr5
  • 47. Consumer Strategies • New consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. • ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint. 46#EUstr5
  • 48. Consumer Strategies - Subscribe subscribe to a fixed collection of topics - SubscribePattern use a regex to specify topics of interest - Assign specify a fixed collection of partitions ● Overloaded constructors to specify the starting offset for a particular partition. ● ConsumerStrategy is a public class that you can47#EUstr5
  • 49. SSL/TTL encryption • New consumer API supports SSL • Only applies to communication between Spark and Kafka brokers • Still responsible for separately securing Spark inter-node communication 48#EUstr5
  • 50. How to use New Kafka Integration on Spark 2.0+ Scala Example Code Basic usage val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "broker01:9092,broker02:9092", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "stream_group_id", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) val topics = Array("topicA", "topicB") val stream = KafkaUtils.createDirectStream[String, String]( streamingContext, PreferConsistent, Subscribe[String, String](topics, kafkaParams) ) stream.map(record => (record.key, record.value)) 49#EUstr5
  • 51. How to use New Kafka Integration on Spark 2.0+ Java Example Code Getting metadata stream .foreachRD D { rdd = > valoffsetRanges = rdd.asInstanceO f[HasO ffsetRanges] .offsetRanges rdd.foreachPartition { iter= > valosr:O ffsetRange = offsetRanges( TaskContext.get.partitionId) //getany needed data from the offsetrange valtopic = osr.topic valkafkaPartitionId = osr.partition valbegin = osr.from O ffset valend = osr.untilO ffset } } 50#EUstr5
  • 52. RDDTopic Kafka or Spark RDD Partitions? Kafka Spark 51 1 2 3 4 1 2 3 4
  • 53. RDDTopic Kafka or Spark RDD Partitions? Kafka Spark 52 1 2 3 4 1 2 3 4
  • 54. How to use New Kafka Integration on Spark 2.0+ Java Example Code Getting metadata stream .foreachRD D { rdd = > valoffsetRanges = rdd.asInstanceO f[HasO ffsetRanges] .offsetRanges rdd.foreachPartition { iter= > valosr:O ffsetRange = offsetRanges( TaskContext.get.partitionId) //getany needed data from the offsetrange valtopic = osr.topic valkafkaPartitionId = osr.partition valbegin = osr.from O ffset valend = osr.untilO ffset } } 53#EUstr5
  • 55. How to use New Kafka Integration on Spark 2.0+ Java Example Code Store offsets in Kafka itself: Commit API stream .foreachRD D { rdd = > valoffsetRanges = rdd.asInstanceO f[HasO ffsetRanges] .offsetRanges //D O YO UR STUFF w ith DATA stream .asInstanceO f[CanCom m itO ffsets] .com m itAsync(offsetRanges) } } 54#EUstr5
  • 56. Kafka + Spark Semantics - At most once - At least once - Exactly once 55#EUstr5
  • 57. Kafka + Spark Semantics • We don’t want duplicates • Not worth the hassle of ensuring that messages don’t get lost • Example: Sending statistics over UDP 1. Set spark.task.maxFailures to 1 2. Make sure spark.speculation is false (the default) 3. Set Kafka param auto.offset.reset to “largest” 4. Set Kafka param enable.auto.commit to true At most once 56#EUstr5
  • 58. Kafka + Spark Semantics • This will mean you lose messages on restart • At least they shouldn’t get replayed. • Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case. At most once 57#EUstr5
  • 59. Kafka + Spark Semantics • We don’t want to loose any record • We don’t care about duplicates • Example: Sending internal alerts on relative rare occurrences on the stream 1. Set spark.task.maxFailures > 1000 2. Set Kafka param auto.offset.reset to “smallest” 3. Set Kafka param enable.auto.commit to false At least once 58#EUstr5
  • 60. Kafka + Spark Semantics • Don’t be silly! Do NOT replay your whole log on every restart… • Manually commit the offsets when you are 100% sure records are processed • If this is “too hard” you’d better have a relative short retention log • Or be REALLY ok with duplicates. For example, you are outputting to an external system that handles duplicates for you (HBase) At least once 59#EUstr5
  • 61. Kafka + Spark Semantics • We don’t want to loose any record • We don’t want duplicates either • Example: Storing stream in data warehouse 1. We need some kind of idempotent writes, or whole-or-nothing writes (transactions) 2. Only store offsets EXACTLY after writing data 3. Same parameters as at least once Exactly once 60#EUstr5
  • 62. Kafka + Spark Semantics • Probably the hardest to achieve right • Still some small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small) Exactly once 61#EUstr5
  • 63. Apache Kafka Apacke Spark at Billy Mobile 62 15Brecords monthly 35T Bweekly retention log 6Kevents/second x4growth/year
  • 64. Our use cases • Input events from Kafka • Enrich events with some external data sources • Finally store it to Hive We do NOT want duplicates We do NOT want to lose events ETL to Data Warehouse 63
  • 65. Our use cases • Hive is not transactional • Neither idempotent writes • Writing files to HDFS is “atomic” (whole or nothing) • A relation 1:1 from each partition-batch to file in HDFS • Store to ZK the current state of the batch • Store to ZK offsets of last finished batch ETL to Data Warehouse 64
  • 66. Our use cases • Input events from Kafka • Periodically load batch- computed model • Detect when an offer stops converting (or too much) • We do not care about losing some events (on restart) • We always need to process the “real-time” stream Anomalies detector 65
  • 67. Our use cases • It’s useless to detect anomalies on a lagged stream! • Actually it could be very bad • Always restart stream on latest offsets • Restart with “fresh” state Anomalies detector 66
  • 68. Our use cases • Input events from Kafka • Almost no processing • Store it to HBase – (has idempotent writes) • We do not care about duplicates • We can NOT lose a single event Store to Entity Cache 67
  • 69. Our use cases • Since HBase has idempotent writes, we can write events multiple times without hassle • But, we do NOT start with earliest offsets… – That would be 7 days of redundant writes…!!! • We store offsets of last finished batch • But obviously we might re-write some events on restart or failure Store to Entity Cache 68
  • 70. Lessons Learned • Do NOT use checkpointing – Not recoverable across code upgrades – Do your own checkpointing • Track offsets yourself – In general, more reliable: HDFS, ZK, RMDBS... • Memory usually is an issue – You don’t want to waste it – Adjust batchDuration – Adjust maxRatePerPartition 69
  • 71. Further Improvements • Dynamic Allocation spark.dynamicAllocation.enabled vs spark.streaming.dynamicAllocation.enabled https://issues.apache.org/jira/browse/SPARK-12133 But no reference in docs... • Graceful shutdown • Structured Streaming 70
  • 72. Thank you very much! Questions? @joanvr joanviladrosa joan.viladrosa@billymob.com