Apache Spark Streaming + Kafka 0.10: An Integration Story
Joan Viladrosa, Billy Mobile
About me
Degree in Computer Science
- Advanced Programming Techniques & System Interfaces and Integration
Co-Founder, Educabits
- Educational Big Data solutions using AWS cloud
Big Data Developer, Trovit
- Hadoop and MapReduce Framework
- SEM keywords optimization
Big Data Architect & Tech Lead
- Full architecture with Hadoop: Kafka, Storm, Hive, HBase, Spark, Druid, …
Joan Viladrosa Riera
Apache Kafka
What is Apache Kafka?
- Publish-subscribe message system
What makes it great?
- Fast
- Scalable
- Durable
- Fault-tolerant
What is Apache Kafka
As a central point
[diagram: four producers publishing to Kafka and four consumers reading from it]
What is Apache Kafka
A lot of different connectors
[diagram labels: My Java App, Logger]
Topic: A feed of messages
Producer: Processes that publish messages to a topic
Consumer: Processes that subscribe to topics and process the feed of published messages
Broker: Each server of a Kafka cluster that holds, receives and sends the actual data
Kafka Topic Partitions
[diagram: a topic split into Partition 0, Partition 1 and Partition 2; each partition is a sequence of messages numbered by offset (0, 1, 2, …), ordered from Old to New]
Kafka Topic Partitions
[diagram: Partition 0 with offsets 0 to 9, ordered from Old to New; Consumer A and Consumer B each read at their own offset]
Kafka Topic Partitions
[diagram: partitions spread across Broker 1, Broker 2 and Broker 3, three per broker (starting at P0, P3 and P6)]
Kafka Topic Partitions
[diagram: the same three-broker layout with consumers attached]
Kafka Semantics
In short: consumer delivery semantics are up to you, not Kafka
- Kafka doesn’t store the state of the consumers*
- It just sends you what you ask for (topic, partition, offset, length)
- You have to take care of your state
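A minimal sketch of that division of labour, assuming a simplified in-memory model of a single partition. `PartitionLog` and `TrackingConsumer` are hypothetical names made up for illustration; this is not the Kafka client API.

```scala
// Simplified in-memory model of one partition; not the Kafka client API.
final case class Message(offset: Long, value: String)

final class PartitionLog {
  private val messages = scala.collection.mutable.ArrayBuffer.empty[Message]

  // Appending assigns the next sequential offset and returns it.
  def append(value: String): Long = {
    val offset = messages.length.toLong
    messages += Message(offset, value)
    offset
  }

  // The log only answers "give me up to `length` messages starting at `offset`".
  def read(offset: Long, length: Int): Seq[Message] =
    messages.slice(offset.toInt, offset.toInt + length).toList
}

// The consumer, not the log, remembers how far it has read.
final class TrackingConsumer(log: PartitionLog, start: Long = 0L) {
  private var nextOffset = start

  def currentOffset: Long = nextOffset

  def poll(max: Int): Seq[String] = {
    val batch = log.read(nextOffset, max)
    nextOffset += batch.size
    batch.map(_.value)
  }
}
```

Two consumers built on the same log can sit at different offsets without the log knowing about either of them, which is why delivery semantics end up being decided on the consumer side.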
Apache Kafka Timeline
[timeline: versions 0.7, 0.8, 0.9, 0.10; label: Kafka Streams]
Apache Spark
What is Apache Spark Streaming?
- Process streams of data
- Micro-batching approach
What makes it great?
- Same API as Spark
- Same integrations as Spark
- Same guarantees & semantics as Spark
What is Apache Spark Streaming
Relying on the same Spark Engine: “same syntax” as batch jobs
How does it work?
- Discretized Streams
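The micro-batching idea behind discretized streams can be sketched in plain Scala, assuming a simplified model in which events carry a timestamp and are grouped into fixed-width intervals. `Event` and `MicroBatcher` are hypothetical names; Spark's DStream implementation is far more involved.

```scala
final case class Event(timestampMs: Long, payload: String)

object MicroBatcher {
  // Groups events into consecutive fixed-width time windows.
  // Each result pair is (window start in ms, payloads that fall in that window).
  def discretize(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, es) =>
        (batchIndex * batchIntervalMs, es.sortBy(_.timestampMs).map(_.payload))
      }
}
```

Each resulting group plays the role of one small batch, which is then processed with the same operations a regular batch job would use.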
Side effects
As in Spark:
- No guarantee of exactly-once semantics for output actions
- Any side-effecting output operations may be repeated
- Because of node failure, process failure, etc.
So, be careful when outputting to external sources
Spark Streaming
Kafka Integration
Spark Streaming Kafka Integration Timeline
[timeline over Spark 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0, 2.1; labels: Receivers, Fault Tolerant, Python API, Streaming UI, Metadata in UI (offsets), Native Kafka]
Kafka Receiver (≤ Spark 1.1)
- Continuously receive data using the High Level API
- Update offsets in ZooKeeper
- Launch jobs on data
Kafka Receiver with WAL (Spark 1.2)
- Continuously receive data using the High Level API
- Block data written to both memory + log
- Update offsets in ZooKeeper
- Launch jobs on data
On restart:
- Driver restarted from info in checkpoints
- Unacked data restarted from log
- Block data recovered from log
Direct Kafka Integration w/o Receiver or WAL (Spark 1.3)
1. Driver: query latest offsets and decide offset ranges for batch
2. Launch jobs using offset ranges
   topic1, p1, (2000, 2100)
   topic1, p2, (2010, 2110)
   topic1, p3, (2002, 2102)
3. Reads data using offset ranges in jobs using Simple API
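Step 1 above (deciding the offset ranges for a batch) can be sketched as follows. This is a simplified model, not Spark's code: `PartitionId`, `PlannedRange` and `BatchPlanner` are hypothetical stand-ins, and the optional per-partition cap is an assumption added for illustration.

```scala
final case class PartitionId(topic: String, partition: Int)

final case class PlannedRange(id: PartitionId, fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

object BatchPlanner {
  // fromOffset is inclusive, untilOffset is exclusive.
  // `consumed` holds the next offset to read per partition,
  // `latest` the newest offset reported by the brokers.
  def plan(
      consumed: Map[PartitionId, Long],
      latest: Map[PartitionId, Long],
      maxPerPartition: Long = Long.MaxValue
  ): Seq[PlannedRange] =
    consumed.toSeq
      .sortBy { case (id, _) => (id.topic, id.partition) }
      .map { case (id, from) =>
        val available = math.max(0L, latest.getOrElse(id, from) - from)
        PlannedRange(id, from, from + math.min(available, maxPerPartition))
      }
}
```

With the offsets from the slide (p1 at 2000 with latest 2100, and so on) this yields the three ranges shown above. Because each range is fully determined before any job runs, a failed task can re-read exactly the same data.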
Direct Kafka API benefits
- No WALs or Receivers
- Allows end-to-end exactly-once semantics pipelines *
- More fault-tolerant
- More efficient
- Easier to use
* updates to downstream systems should be idempotent or transactional
Spark Streaming UI improvements (Spark 1.4)
Kafka Metadata (offsets) in UI (Spark 1.5)
What about Spark 2.0+ and
new Kafka Integration?
This is why we are here, right?
Spark 2.0+ new Kafka Integration
                             spark-streaming-kafka-0-8    spark-streaming-kafka-0-10
Broker Version               0.8.2.1 or higher            0.10.0 or higher
Api Stability                Stable                       Experimental
Language Support             Scala, Java, Python          Scala, Java
Receiver DStream             Yes                          No
Direct DStream               Yes                          Yes
SSL / TLS Support            No                           Yes
Offset Commit Api            No                           Yes
Dynamic Topic Subscription   No                           Yes
What’s really new with this new Kafka Integration?
- New Consumer API (instead of Simple API)
- Location Strategies
- Consumer Strategies
- No Python API :(
Location Strategies
- New consumer API will pre-fetch messages into buffers
- So, keep cached consumers on executors
- It’s better to schedule partitions on the hosts that already have the appropriate consumers
Location Strategies
- PreferConsistent
Distribute partitions evenly across available executors
- PreferBrokers
If your executors are on the same hosts as your Kafka brokers
- PreferFixed
Specify an explicit mapping of partitions to hosts
Consumer Strategies
- New consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup.
- ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.
Consumer Strategies
- Subscribe: subscribe to a fixed collection of topics
- SubscribePattern: use a regex to specify topics of interest
- Assign: specify a fixed collection of partitions
● Overloaded constructors to specify the starting offset for a particular partition.
● ConsumerStrategy is a public class that you can extend.
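To get a feel for what a regex-based subscription selects, here is a small plain-Scala sketch. It assumes whole-name matching of each topic against the pattern; `TopicMatcher` and the topic names in the usage note are hypothetical, and the code does not call Spark or Kafka.

```scala
import java.util.regex.Pattern

object TopicMatcher {
  // Keeps the topics whose full name matches the pattern.
  def matching(pattern: Pattern, topics: Seq[String]): Seq[String] =
    topics.filter(topic => pattern.matcher(topic).matches())
}
```

For instance, `TopicMatcher.matching(Pattern.compile("events-.*"), Seq("events-clicks", "audit"))` keeps only `events-clicks`. A pattern subscription is what lets topics created later be picked up without redeploying the job.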
SSL/TLS encryption
- New consumer API supports SSL
- Only applies to communication between Spark and Kafka brokers
- Still responsible for separately securing Spark inter-node communication
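A hedged sketch of what the consumer parameters might look like with SSL enabled. The property names are standard Kafka client settings; the broker addresses, port, file paths and passwords are placeholders invented for this example.

```scala
object SecureKafkaParams {
  // Placeholder hosts, paths and passwords; replace with your own.
  val kafkaParams: Map[String, Object] = Map[String, Object](
    // assumes the brokers expose a TLS listener on port 9093
    "bootstrap.servers" -> "broker01:9093,broker02:9093",
    "security.protocol" -> "SSL",
    "ssl.truststore.location" -> "/some-directory/kafka.client.truststore.jks",
    "ssl.truststore.password" -> "changeit",
    "ssl.keystore.location" -> "/some-directory/kafka.client.keystore.jks",
    "ssl.keystore.password" -> "changeit",
    "ssl.key.password" -> "changeit"
  )
}
```

These entries would be merged with the usual deserializer and group settings before being handed to the consumer strategy.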
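How to use New Kafka Integration on Spark 2.0+
Scala Example Code
Basic Usage
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092,broker02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
// streamingContext is an already created StreamingContext
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))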
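How to use New Kafka Integration on Spark 2.0+
Scala Example Code
Getting Metadata
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // RDD partition i corresponds to offsetRanges(i)
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}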
Kafka or Spark RDD partitions?
Kafka Spark
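How to use New Kafka Integration on Spark 2.0+
Scala Example Code
Store offsets in Kafka itself: Commit API
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) // call only after outputs have completed
}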
Kafka + Spark Semantics
- At most once
- At least once
- Exactly once
Kafka + Spark
At most once
- We don’t want duplicates
- Not worth the hassle of ensuring that messages don’t get lost
- Example: Sending statistics over UDP
1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false (the default)
3. Set Kafka param auto.offset.reset to “latest” (“largest” in the old 0.8 consumer)
4. Set Kafka param enable.auto.commit to true
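The four settings above, written out as plain maps. This is only a sketch: `AtMostOnceSettings` is a hypothetical holder object, the Spark entries would normally be applied through `SparkConf.set`, and the offset reset value uses the 0.10 consumer spelling `latest` (the 0.8 consumer called it `largest`).

```scala
object AtMostOnceSettings {
  // Spark side: never retry a failed task, never run speculative copies.
  val sparkSettings: Map[String, String] = Map(
    "spark.task.maxFailures" -> "1",
    "spark.speculation" -> "false"
  )

  // Kafka side: start from the newest data and let the consumer commit on its own.
  val kafkaParams: Map[String, Object] = Map[String, Object](
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (true: java.lang.Boolean)
  )
}
```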
Kafka + Spark
At most once
- This will mean you lose messages on restart
- At least they shouldn’t get replayed.
- Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case
At least once
- We don’t want to loose any record
- We don’t care about duplicates
- Example: Sending internal alerts on
relative rare occurrences on the stream
1. Set spark.task.maxFailures > 1000
2. Set Kafka param auto.offset.reset
to “smallest”
3. Set Kafka param
to false
Kafka + Spark
At least once
- Don’t be silly! Do NOT replay your whole log on every restart…
- Manually commit the offsets when you are 100% sure records are processed
- If this is “too hard” you’d better have a relatively short log retention
- Or be REALLY ok with duplicates. For example, you are outputting to an external system that handles duplicates for you (HBase)
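Why the order of "process" and "commit" decides between at-most-once and at-least-once can be shown with a toy simulation. This is a simplified model with one batch and one crash; `DeliveryDemo` and its parameters are hypothetical and nothing here touches Spark or Kafka.

```scala
object DeliveryDemo {
  // Simulates a first attempt that crashes after emitting `crashAfter` records,
  // followed by a restart that resumes from the committed offset.
  // Returns everything the job emitted across both attempts.
  def run(records: Vector[String], commitBeforeProcessing: Boolean, crashAfter: Int): Vector[String] = {
    val output = Vector.newBuilder[String]
    var committed = 0 // offset kept outside the job (Kafka, ZooKeeper, a database, ...)

    // First attempt.
    if (commitBeforeProcessing) committed = records.length
    records.take(crashAfter).foreach(r => output += r)
    // Crash here: a process-then-commit job never reaches its commit.

    // Restart: resume from whatever offset was committed.
    records.drop(committed).foreach(r => output += r)
    committed = records.length

    output.result()
  }
}
```

Committing first loses the records that were not yet processed at the time of the crash. Committing last replays the whole batch, so records handled before the crash show up twice.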
Kafka + Spark
Exactly once
- We don’t want to lose any record
- We don’t want duplicates either
- Example: Storing the stream in a data warehouse
1. We need some kind of idempotent writes, or whole-or-nothing writes (transactions)
2. Only store offsets EXACTLY after writing
3. Same parameters as at least once
Kafka + Spark
Exactly once
- Probably the hardest to achieve right
- Still some small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small)
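One way to close that remaining gap is to store the results and the offsets in the same transaction, so there is no moment "between" the two. A minimal in-memory sketch, assuming a sink that can update both together; `TransactionalSink` is a hypothetical class standing in for a real transactional store.

```scala
final class TransactionalSink {
  private var rows = Vector.empty[String]
  private var committedUntil = 0L // next expected offset, stored together with the rows

  def storedRows: Vector[String] = synchronized(rows)
  def nextOffset: Long = synchronized(committedUntil)

  // All-or-nothing: rows and offset change together, or not at all.
  // Returns false for a replayed or out-of-order batch, which is ignored.
  def writeBatch(fromOffset: Long, batch: Seq[String]): Boolean = synchronized {
    if (fromOffset != committedUntil) false
    else {
      rows = rows ++ batch
      committedUntil = fromOffset + batch.size
      true
    }
  }
}
```

Replaying a batch after a restart becomes a no-op, because the stored offset already says that batch was written.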
Streaming +
at Billy Mobile
a story of love and fury
we rock it!
15B records monthly
35TB weekly retention log
Our use cases: ETL to Data Warehouse
- Input events from Kafka
- Enrich events with some external data sources
- Finally store it to Hive
- We do NOT want duplicates
- We do NOT want to lose events
Our use cases: ETL to Data Warehouse
- Hive is not transactional
- Nor does it offer idempotent writes
- Writing files to HDFS is “atomic” (whole or nothing)
- A 1:1 relation from each partition-batch to a file in HDFS
- Store to ZK the current state of the batch
- Store to ZK offsets of last finished batch
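The "one file per partition-batch" idea can be sketched as follows. This is an assumption-laden illustration, not the Billy Mobile code: it uses the local filesystem through `java.nio` as a stand-in for HDFS, and `BatchFileWriter` and its naming scheme are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object BatchFileWriter {
  // Deterministic name: the same topic, partition and offset range always map to the same file.
  def fileName(topic: String, partition: Int, fromOffset: Long, untilOffset: Long): String =
    s"$topic-$partition-$fromOffset-$untilOffset.txt"

  // Writes to a temporary file first, then renames it into place,
  // so readers see either the whole file or nothing.
  def write(dir: Path, name: String, lines: Seq[String]): Path = {
    val target = dir.resolve(name)
    if (Files.exists(target)) target // a previous attempt already finished this batch
    else {
      val tmp = Files.createTempFile(dir, name, ".tmp")
      Files.write(tmp, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
      Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
    }
  }
}
```

Because the file name is derived from the offset range, re-running a batch after a failure either finds the finished file or produces the same one again, which gives the whole-or-nothing behaviour Hive itself lacks.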
Our use cases: ETL to Data Warehouse
On failure:
- If an executor fails, just keep going (reschedule task)
  > spark.task.maxFailures = 1000
- If the driver fails (or restarts):
  - Load offsets and state from “current batch” if it exists and “finish” it (KafkaUtils.createRDD)
  - Continue Stream from last saved offsets
Our use cases: Anomaly Detection
- Input events from Kafka
- Periodically load batch-computed model
- Detect when an offer stops converting (or converts too much)
- We do not care about losing some events (on restart)
- We always need to process the “real-time” stream
Our use cases: Anomaly Detection
- It’s useless to detect anomalies on a lagged stream!
- Actually it could be very bad
- Always restart stream on latest offsets
- Restart with “fresh” state
Our use cases: Store it to Entity Cache
- Input events from Kafka
- Almost no processing
- Store it to HBase (has idempotent writes)
- We do not care about duplicates
- We can NOT lose a single event
Our use cases: Store it to Entity Cache
- Since HBase has idempotent writes, we can write
events multiple times without hassle
- But, we do NOT start with earliest offsets…
- That would be 7 days of redundant writes…!!!
- We store offsets of last finished batch
- But obviously we might re-write some events on restart
or failure
- Do NOT use checkpointing!
- Not recoverable across upgrades
- Do your own checkpointing
- Track offsets yourself
- Memory might be an issue
- You do not want to waste it...
- Adjust batchDuration
- Adjust maxRatePerPartition
- Dynamic Allocation
spark.dynamicAllocation.enabled vs
But no reference in docs...
- Graceful shutdown
- Structured Streaming
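For the batchDuration and maxRatePerPartition tuning mentioned above, a back-of-the-envelope calculation helps: `spark.streaming.kafka.maxRatePerPartition` is a per-second, per-partition limit, so the largest possible batch is that rate times the number of partitions times the batch duration. The helper below is a hypothetical sketch, and the average record size is an assumption you would have to measure.

```scala
object BatchSizing {
  // Upper bound on the number of records a single batch can contain.
  def maxRecordsPerBatch(maxRatePerPartition: Long, partitions: Int, batchDurationSeconds: Long): Long =
    maxRatePerPartition * partitions * batchDurationSeconds

  // Rough payload size of a full batch, given an assumed average record size in bytes.
  def approxBatchBytes(records: Long, avgRecordBytes: Long): Long =
    records * avgRecordBytes
}
```

For example, 1000 records per second per partition, 12 partitions and a 10 second batch give at most 120000 records per batch; at an assumed 500 bytes per record that is about 60 MB of raw payload to hold in memory.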
Thank you
very much!
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story

More Related Content

What's hot

Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collector
Muga Nishizawa
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
Gwen (Chen) Shapira
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
Gwen (Chen) Shapira
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
Gwen (Chen) Shapira
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Michael Noll
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
Gwen (Chen) Shapira
Stream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaStream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache Kafka
Abhinav Singh
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Michael Noll
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
Gwen (Chen) Shapira
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1
Knoldus Inc.
Kafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache KafkaKafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache Kafka
Eno Thereska
Pulsar Functions Deep Dive_Sanjeev kulkarni
Pulsar Functions Deep Dive_Sanjeev kulkarniPulsar Functions Deep Dive_Sanjeev kulkarni
Pulsar Functions Deep Dive_Sanjeev kulkarni
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Legacy Typesafe (now Lightbend)
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Jamie Grier
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
Kaufman Ng
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&PierreKafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1
Hugo Louro
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIA
La Cuisine du Web

What's hot (20)

Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collector
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
Stream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaStream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache Kafka
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1
Kafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache KafkaKafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache Kafka
Pulsar Functions Deep Dive_Sanjeev kulkarni
Pulsar Functions Deep Dive_Sanjeev kulkarniPulsar Functions Deep Dive_Sanjeev kulkarni
Pulsar Functions Deep Dive_Sanjeev kulkarni
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&PierreKafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIA

Viewers also liked

Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Michael Noll
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Michael Noll
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Adaryl "Bob" Wakefield, MBA
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
MapR Technologies
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
Sumant Tambe
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Till Rohrmann
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
Michael Noll
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Michael Noll
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and Frameworks
Matthias Niehoff
CWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / ClouderaCWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / Cloudera
Ibm watson
Ibm watsonIbm watson
Ibm watson
Vivek Mohan
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Data Con LA
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Cloudera, Inc.
Building the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time AnalyticsBuilding the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time Analytics
Softnix Security Data Lake
Softnix Security Data Lake Softnix Security Data Lake
Softnix Security Data Lake
Softnix Technology

Viewers also liked (20)

Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and Frameworks
CWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / ClouderaCWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / Cloudera
Ibm watson
Ibm watsonIbm watson
Ibm watson
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Building the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time AnalyticsBuilding the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time Analytics
Softnix Security Data Lake
Softnix Security Data Lake Softnix Security Data Lake
Softnix Security Data Lake


[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story

  • 1. Apache Spark Streaming + Kafka 0.10: An Integration Story Joan Viladrosa, Billy Mobile
  • 2. About me: Joan Viladrosa Riera (@joanvr, joanviladrosa) - Degree in Computer Science: Advanced Programming Techniques & System Interfaces and Integration - Co-Founder, Educabits: educational big data solutions using AWS cloud - Big Data Developer, Trovit: Hadoop and MapReduce framework, SEM keywords optimization - Big Data Architect & Tech Lead, BillyMobile: full architecture with Hadoop: Kafka, Storm, Hive, HBase, Spark, Druid, …
  • 4. What is Apache Kafka? - Publish-subscribe messaging system
  • 5. What is Apache Kafka? What makes it great? - Publish-subscribe messaging system - Fast - Scalable - Durable - Fault-tolerant
  • 6. What is Apache Kafka? [Diagram: Kafka as a central point between four producers and four consumers]
  • 7. What is Apache Kafka? A lot of different connectors [Diagram: Apache Storm, Apache Spark, My Java App and Logger feed into Kafka; Apache Storm, Apache Spark, My Java App and Monitoring Tool read from it]
  • 8. Kafka Terminology Topic: A feed of messages Producer: Processes that publish messages to a topic Consumer: Processes that subscribe to topics and process the feed of published messages Broker: Each server of a kafka cluster that holds, receives and sends the actual data
  • 9. Kafka Topic Partitions [Diagram: a topic with three partitions (Partition 0, Partition 1, Partition 2), each an ordered sequence of offsets running from old to new; writes are appended at the new end]
  • 10. Kafka Topic Partitions [Diagram: Partition 0 with offsets 0 to 15, old to new; the producer writes at the new end, while Consumer A (offset=6) and Consumer B (offset=12) each read from their own position]
  • 11. Kafka Topic Partitions [Diagram: partitions P0 to P8 spread across Broker 1, Broker 2 and Broker 3, serving consumers and producers]
  • 12. Kafka Topic Partitions [Diagram: partitions P0 to P8 spread across Broker 1, Broker 2 and Broker 3, serving consumers and producers] - More storage - More parallelism
  • 13. Kafka Semantics In short: consumer delivery semantics are up to you, not Kafka - Kafka doesn’t store the state of the consumers* - It just sends you what you ask for (topic, partition, offset, length) - You have to take care of your state
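The point that the consumer, not Kafka, owns its position can be sketched with a tiny in-memory model. This is an illustration only: `PartitionLog` and `TrackingConsumer` are hypothetical names, not part of the Kafka client API, and the log here ignores topics and retention.

```scala
// Hypothetical in-memory model of a single partition; NOT the Kafka client API.
final case class Record(offset: Long, value: String)

final class PartitionLog(values: Vector[String]) {
  // Return up to `length` records starting at `offset`.
  // The log itself keeps no per-consumer state.
  def fetch(offset: Long, length: Int): Vector[Record] =
    values.zipWithIndex
      .slice(offset.toInt, offset.toInt + length)
      .map { case (v, i) => Record(i.toLong, v) }
}

final class TrackingConsumer(log: PartitionLog, start: Long) {
  // The consumer is responsible for remembering where it is.
  private var position: Long = start

  def poll(max: Int): Vector[Record] = {
    val batch = log.fetch(position, max)
    position += batch.size
    batch
  }

  def currentOffset: Long = position
}

object KafkaSemanticsSketch {
  val log = new PartitionLog(Vector("a", "b", "c", "d", "e"))
  // Two independent consumers read the same log at different positions.
  val consumerA = new TrackingConsumer(log, start = 0L)
  val consumerB = new TrackingConsumer(log, start = 3L)
}
```

Because the position lives in the consumer, what happens to it on a crash (lost, persisted early, persisted late) is what decides the delivery semantics.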
  • 16. What is Apache Spark Streaming? - Process streams of data - Micro-batching approach
  • 17. What is Apache Spark Streaming? What makes it great? - Process streams of data - Micro-batching approach - Same API as Spark - Same integrations as Spark - Same guarantees & semantics as Spark
  • 18. What is Apache Spark Streaming Relying on the same Spark Engine: “same syntax” as batch jobs
  • 19. How does it work? - Discretized Streams
  • 21. How does it work?
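As a rough illustration of the micro-batching idea behind discretized streams, the sketch below cuts a sequence of timestamped events into fixed-width batches. It is plain Scala, not the Spark Streaming API; `Event`, `discretize` and the 1000 ms interval are assumptions made for the example.

```scala
// Hypothetical event type; timestamps are milliseconds since the stream started.
final case class Event(timestampMs: Long, payload: String)

object MicroBatchSketch {
  // Assign every event to the batch whose time window contains it.
  // The key is the batch index: 0 covers [0, batchMs), 1 covers [batchMs, 2*batchMs), ...
  def discretize(events: Seq[Event], batchMs: Long): Map[Long, Seq[Event]] =
    events.groupBy(e => e.timestampMs / batchMs)

  val events = Seq(Event(100, "a"), Event(900, "b"), Event(1200, "c"), Event(2500, "d"))
  val batches = discretize(events, batchMs = 1000)
}
```

Each batch can then be handed to ordinary batch code, which is why streaming jobs can reuse the same operations as batch jobs.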
  • 23. Spark Streaming Semantics: Side effects. As in Spark: - No guarantee of exactly-once semantics for output actions - Any side-effecting output operation may be repeated - Because of node failure, process failure, etc. So, be careful when outputting to external systems
  • 25. Spark Streaming Kafka Integration Timeline - Releases: 1.1 (Sep 2014), 1.2 (Dec 2014), 1.3 (Mar 2015), 1.4 (Jun 2015), 1.5 (Sep 2015), 1.6 (Jan 2016), 2.0 (Jul 2016), 2.1 (Dec 2016) - Milestones, in order: Receivers; Fault Tolerant WAL + Python API; Direct Streams + Python API; Improved Streaming UI; Metadata in UI (offsets) + Graduated Direct; Native Kafka 0.10 (experimental)
  • 26. Kafka Receiver (≤ Spark 1.1) [Diagram: Driver and Executor with a Receiver] - Receiver: continuously receive data using High Level API - Update offsets in ZooKeeper - Driver: launch jobs on data
  • 27. Kafka Receiver with WAL (Spark 1.2) [Diagram: Driver, Executor with a Receiver, and a WAL on HDFS] - Receiver: continuously receive data using High Level API - Update offsets in ZooKeeper - Driver: launch jobs on data
  • 28. Kafka Receiver with WAL (Spark 1.2) [Diagram: Application Driver (Streaming Context, Spark Context) and Executor (Receiver)] - Computation checkpointed - Jobs - Input stream - Block metadata written to log - Block data written to both memory + log
  • 29. Kafka Receiver with WAL (Spark 1.2) [Diagram: Restarted Driver (Restarted Streaming Context, Restarted Spark Context) and Restarted Executor (Restarted Receiver)] - Restart computation from info in checkpoints - Relaunch jobs - Resend unacked data - Recover block metadata from log - Recover block data from log
  • 30. Kafka Receiver with WAL (Spark 1.2) [Diagram: Driver, Executor with a Receiver, and a WAL on HDFS] - Receiver: continuously receive data using High Level API - Update offsets in ZooKeeper - Driver: launch jobs on data
  • 31. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [Diagram: Driver and Executor]
  • 32. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [Diagram: Driver and Executor] 1. Query latest offsets and decide offset ranges for batch
  • 33. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [Diagram: Driver and Executor] 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges: topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102)
  • 34. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [Diagram: Driver and Executor] 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges: topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102) 3. Read data using offset ranges in jobs using Simple API
  • 35. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [Diagram: Driver and Executor] 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges: topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102) 3. Read data using offset ranges in jobs using Simple API
  • 36. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [Diagram: Driver and Executor] 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges: topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102) 3. Read data using offset ranges in jobs using Simple API
  • 37. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [Diagram: Driver and Executor] 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges 3. Read data using offset ranges in jobs using Simple API
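Step 1 of the direct approach (deciding offset ranges for a batch) can be sketched in plain Scala. This is not Spark's implementation: `BatchRange` only mirrors the fields of Spark's `OffsetRange`, and `planBatch` is a hypothetical helper. The numbers are the ones shown on the slides.

```scala
// Local stand-in that mirrors the fields of Spark's OffsetRange; not the Spark class.
final case class BatchRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

object DirectStreamSketch {
  type TopicPartition = (String, Int)

  // For every partition, the next batch covers [last consumed offset, latest offset).
  // A partition with no saved position starts at the latest offset (empty range).
  def planBatch(
      lastConsumed: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long]): Seq[BatchRange] =
    latest.toSeq.sortBy(_._1).map { case ((topic, partition), until) =>
      val from = lastConsumed.getOrElse((topic, partition), until)
      BatchRange(topic, partition, from, until)
    }

  val lastConsumed: Map[TopicPartition, Long] =
    Map(("topic1", 1) -> 2000L, ("topic1", 2) -> 2010L, ("topic1", 3) -> 2002L)
  val latest: Map[TopicPartition, Long] =
    Map(("topic1", 1) -> 2100L, ("topic1", 2) -> 2110L, ("topic1", 3) -> 2102L)

  val ranges = planBatch(lastConsumed, latest)
}
```

Since each range is fully determined before the job starts, a failed task can re-read exactly the same records, which is what makes the batch replayable without a WAL.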
  • 38. Direct Kafka API benefits - No WALs or Receivers - Allows end-to-end exactly-once semantics pipelines (updates to downstream systems should be idempotent or transactional) - More fault-tolerant - More efficient - Easier to use
  • 39. Spark Streaming UI improvements (Spark 1.4)
  • 40. Kafka Metadata (offsets) in UI (Spark 1.5)
  • 41. What about Spark 2.0+ and new Kafka Integration? This is why we are here, right?
  • 42. Spark 2.0+ new Kafka Integration
                                   spark-streaming-kafka-0-8    spark-streaming-kafka-0-10
      Broker Version               […] or higher                0.10.0 or higher
      API Stability                Stable                       Experimental
      Language Support             Scala, Java, Python          Scala, Java
      Receiver DStream             Yes                          No
      Direct DStream               Yes                          Yes
      SSL / TLS Support            No                           Yes
      Offset Commit API            No                           Yes
      Dynamic Topic Subscription   No                           Yes
  • 43. What’s really New with this New Kafka Integration? - New Consumer API (instead of the Simple API) - Location Strategies - Consumer Strategies - SSL / TLS - No Python API :(
  • 44. Location Strategies - The new consumer API pre-fetches messages into buffers - So, keep cached consumers on executors - It’s better to schedule partitions on hosts that already have the appropriate consumers
  • 45. Location Strategies - PreferConsistent Distribute partitions evenly across available executors - PreferBrokers If your executors are on the same hosts as your Kafka brokers - PreferFixed Specify an explicit mapping of partitions to hosts
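The idea behind a consistent placement can be sketched as a deterministic mapping from partition to executor, so that a partition keeps landing where its cached consumer already lives. This is an illustration in plain Scala, not Spark's actual scheduling code; `preferredExecutor` and the executor names are hypothetical.

```scala
object LocationSketch {
  // Deterministically pick an executor for a partition. Sorting first makes the
  // result independent of the order in which executors happen to be listed.
  def preferredExecutor(partition: Int, executors: Seq[String]): String = {
    val sorted = executors.sorted
    sorted(Math.floorMod(partition, sorted.size))
  }

  val executors = Seq("exec-b", "exec-a", "exec-c")

  // Six partitions spread over three executors.
  val assignment = (0 until 6).groupBy(p => preferredExecutor(p, executors))
}
```

With a stable mapping, the pre-fetched buffers of a cached consumer are reused across batches instead of being rebuilt on another host.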
  • 46. Consumer Strategies - New consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. - ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.
  • 47. Consumer Strategies - Subscribe subscribe to a fixed collection of topics - SubscribePattern use a regex to specify topics of interest - Assign specify a fixed collection of partitions ● Overloaded constructors to specify the starting offset for a particular partition. ● ConsumerStrategy is a public class that you can extend.
  • 48. SSL/TLS encryption - New consumer API supports SSL - Only applies to communication between Spark and Kafka brokers - You are still responsible for separately securing Spark inter-node communication
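A minimal sketch of what the extra consumer parameters for an encrypted connection might look like. The keys are standard Kafka client settings; the broker address, port, file paths and passwords are placeholders assumed for the example.

```scala
object SslParamsSketch {
  // Added on top of the usual consumer params (deserializers, group.id, ...).
  // All values below are placeholders, not real endpoints or credentials.
  val kafkaParams = Map[String, Object](
    // point at the TLS listener of the brokers (placeholder host and port)
    "bootstrap.servers" -> "broker01:9093",
    "security.protocol" -> "SSL",
    "ssl.truststore.location" -> "/path/to/kafka.client.truststore.jks",
    "ssl.truststore.password" -> "changeit",
    "ssl.keystore.location" -> "/path/to/kafka.client.keystore.jks",
    "ssl.keystore.password" -> "changeit",
    "ssl.key.password" -> "changeit"
  )
}
```

These settings only cover the link between the Spark consumers and the brokers; traffic between Spark nodes is configured separately.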
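  • 49. How to use New Kafka Integration on Spark 2.0+ Scala Example Code: Basic Usage
      import org.apache.kafka.common.serialization.StringDeserializer
      import org.apache.spark.streaming.kafka010.KafkaUtils
      import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
      import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

      // streamingContext is an existing StreamingContext
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "broker01:9092,broker02:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "stream_group_id",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )
      val topics = Array("topicA", "topicB")
      val stream = KafkaUtils.createDirectStream[String, String](
        streamingContext,
        PreferConsistent,
        Subscribe[String, String](topics, kafkaParams)
      )
      stream.map(record => (record.key, record.value))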
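  • 50. How to use New Kafka Integration on Spark 2.0+ Scala Example Code: Getting Metadata
      import org.apache.spark.TaskContext
      import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

      stream.foreachRDD { rdd =>
        // offset ranges are available on the RDD produced directly by the stream
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreachPartition { iter =>
          // RDD partition i was read from offsetRanges(i)
          val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
          // get any needed data from the offset range
          val topic = osr.topic
          val kafkaPartitionId = osr.partition
          val begin = osr.fromOffset
          val end = osr.untilOffset
        }
      }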
  • 51. Kafka or Spark RDD partitions? [Diagram: partitions 1 to 4 of a Kafka topic shown side by side with partitions 1 to 4 of a Spark RDD]
  • 54. How to use New Kafka Integration on Spark 2.0+ Scala Example Code: Store offsets in Kafka itself (Commit API)
      import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

      stream.foreachRDD { rdd =>
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        // DO YOUR STUFF with DATA
        // commit only after the outputs above have completed
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
  • 55. - At most once - At least once - Exactly once Kafka + Spark Semantics
  • 56. Kafka + Spark Semantics At most once - We don’t want duplicates - Not worth the hassle of ensuring that messages don’t get lost - Example: Sending statistics over UDP 1. Set spark.task.maxFailures to 1 2. Make sure spark.speculation is false (the default) 3. Set Kafka param auto.offset.reset to “latest” (“largest” in the old 0.8 consumer) 4. Set Kafka param enable.auto.commit to true
  • 57. Kafka + Spark Semantics At most once - This will mean you lose messages on restart - At least they shouldn’t get replayed. - Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case.
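The at-most-once recipe can be written down as two small settings maps. This is a sketch under assumptions: the Spark keys would normally go into a `SparkConf` and the Kafka keys into the consumer params passed to the stream; `AtMostOnceSketch` itself is only a hypothetical holder, and "latest" is the value name used by the new (0.10) consumer.

```scala
object AtMostOnceSketch {
  // Spark side: never retry a failed task, never run speculative copies.
  val sparkSettings = Map[String, String](
    "spark.task.maxFailures" -> "1",
    "spark.speculation" -> "false"
  )

  // Kafka side: start from the newest records and let the consumer commit
  // offsets on its own, whether or not processing has finished.
  val kafkaParams = Map[String, Object](
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (true: java.lang.Boolean)
  )
}
```

Offsets may be committed before the records are processed, so a crash drops those records instead of replaying them.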
  • 58. Kafka + Spark Semantics At least once - We don’t want to lose any record - We don’t care about duplicates - Example: Sending internal alerts on relatively rare occurrences on the stream 1. Set spark.task.maxFailures > 1000 2. Set Kafka param auto.offset.reset to “earliest” (“smallest” in the old 0.8 consumer) 3. Set Kafka param enable.auto.commit to false
  • 59. Kafka + Spark Semantics At least once - Don’t be silly! Do NOT replay your whole log on every restart… - Manually commit the offsets when you are 100% sure records are processed - If this is “too hard”, you’d better have a relatively short log retention - Or be REALLY ok with duplicates. For example, you are outputting to an external system that handles duplicates for you (HBase)
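Why committing only after processing gives at-least-once (and therefore possible duplicates) can be shown with a small simulation. Everything here is a stand-in: `log` plays the Kafka partition, `committed` the offset store, and `output` a sink that does not deduplicate.

```scala
import scala.collection.mutable.ArrayBuffer

object AtLeastOnceSketch {
  val log: Vector[String] = Vector("e0", "e1", "e2", "e3", "e4", "e5")
  var committed: Long = 0L                // stand-in for the offset store
  val output = ArrayBuffer.empty[String]  // stand-in for a non-idempotent sink

  // Read a batch from the committed offset, write it out, and only then commit.
  // `crashBeforeCommit` simulates a failure between the write and the commit.
  def runBatch(size: Int, crashBeforeCommit: Boolean): Unit = {
    val from = committed.toInt
    val batch = log.slice(from, from + size)
    output ++= batch
    if (!crashBeforeCommit) committed = from.toLong + batch.size
  }

  runBatch(2, crashBeforeCommit = false) // e0, e1 written and committed
  runBatch(2, crashBeforeCommit = true)  // e2, e3 written, offsets NOT committed
  runBatch(2, crashBeforeCommit = false) // after restart: e2, e3 are written again
}
```

No record is skipped, but `e2` and `e3` reach the sink twice, which is only acceptable if the sink tolerates duplicates.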
  • 60. Kafka + Spark Semantics Exactly once - We don’t want to lose any record - We don’t want duplicates either - Example: Storing stream in data warehouse 1. We need some kind of idempotent writes, or all-or-nothing writes (transactions) 2. Only store offsets right after the data has been written 3. Same parameters as at least once
  • 61. Kafka + Spark Semantics Exactly once - Probably the hardest to achieve right - Still some small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small)
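One way to picture the transactional variant is to keep results and offsets in a single record that is replaced in one step, so they can never disagree. The sketch below is an in-memory simulation under that assumption; `Snapshot` and `store` stand in for a transactional database and are not a real storage API.

```scala
object ExactlyOnceSketch {
  // Results and the next offset to read live in ONE snapshot.
  // Replacing the snapshot is the "transaction": both change together or not at all.
  final case class Snapshot(nextOffset: Long, total: Long)
  @volatile var store: Snapshot = Snapshot(0L, 0L)

  val log: Vector[Long] = Vector(1L, 2L, 3L, 4L)

  def runBatch(size: Int, crashBeforeCommit: Boolean): Unit = {
    val snap = store
    val from = snap.nextOffset.toInt
    val batch = log.slice(from, from + size)
    val next = Snapshot(snap.nextOffset + batch.size, snap.total + batch.sum)
    // A crash before this line loses the whole batch result AND leaves the offset
    // untouched, so the batch is simply recomputed after restart.
    if (!crashBeforeCommit) store = next
  }

  runBatch(2, crashBeforeCommit = false) // total = 1 + 2
  runBatch(2, crashBeforeCommit = true)  // nothing stored
  runBatch(2, crashBeforeCommit = false) // total = 3 + 3 + 4
}
```

The final total counts every record exactly once even though one batch ran twice.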
  • 62. Spark Streaming + Kafka at Billy Mobile a story of love and fury
  • 63. Some Billy Insights (we rock it!) - 15B records monthly - 35TB weekly retention log - 6K events/second - x4 growth/year
  • 64. Our use cases: ETL to Data Warehouse - Input events from Kafka - Enrich events with some external data sources - Finally store it to Hive - We do NOT want duplicates - We do NOT want to lose events
  • 65. Our use cases: ETL to Data Warehouse - Hive is not transactional - Nor does it offer idempotent writes - Writing files to HDFS is “atomic” (all or nothing) - A 1:1 mapping from each partition-batch to a file in HDFS - Store the current state of the batch in ZK - Store the offsets of the last finished batch in ZK
  • 66. Our use cases: ETL to Data Warehouse On failure: - If an executor fails, just keep going (reschedule the task): spark.task.maxFailures = 1000 - If the driver fails (or restarts): - Load offsets and state from the “current batch”, if it exists, and “finish” it (KafkaUtils.createRDD) - Continue the stream from the last saved offsets
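The recovery procedure described above can be sketched as a small simulation. This is only a guess at the shape of the logic, not Billy Mobile's code: `zk` stands in for ZooKeeper, `hdfs` for one file per partition-batch, and all names are hypothetical.

```scala
import scala.collection.mutable

object EtlRecoverySketch {
  final case class Batch(id: Long, partition: Int, fromOffset: Long, untilOffset: Long)

  // Stand-in for the state kept in ZooKeeper.
  object zk {
    var currentBatch: Option[Batch] = None
    var lastFinishedOffset: Long = 0L
  }

  // Stand-in for HDFS: file name -> offsets contained in that file.
  val hdfs = mutable.Map.empty[String, Vector[Long]]

  def fileName(b: Batch): String = s"batch-${b.id}-part-${b.partition}"

  // One file per partition-batch: redoing a batch overwrites the same file,
  // so a retry cannot add duplicates.
  def process(b: Batch, crashAfterWrite: Boolean = false): Unit = {
    zk.currentBatch = Some(b)
    hdfs.update(fileName(b), (b.fromOffset until b.untilOffset).toVector)
    if (!crashAfterWrite) {
      zk.lastFinishedOffset = b.untilOffset
      zk.currentBatch = None
    }
  }

  // On driver restart: finish the in-flight batch (if any), then resume the stream.
  def recover(): Long = {
    zk.currentBatch.foreach(b => process(b))
    zk.lastFinishedOffset
  }

  process(Batch(1, 0, 0, 100))
  process(Batch(2, 0, 100, 200), crashAfterWrite = true) // driver dies here
  val resumeFrom = recover()
}
```

Because the in-flight batch is replayed with the same offsets into the same file, the warehouse ends up with each record once.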
  • 67. Our use cases: Anomaly Detection - Input events from Kafka - Periodically load batch-computed model - Detect when an offer stops converting (or converts too much) - We do not care about losing some events (on restart) - We always need to process the “real-time” stream
  • 68. Our use cases: Anomaly Detection - It’s useless to detect anomalies on a lagged stream! - Actually it could be very bad - Always restart stream on latest offsets - Restart with “fresh” state
  • 69. Our use cases: Store it to Entity Cache - Input events from Kafka - Almost no processing - Store it to HBase (has idempotent writes) - We do not care about duplicates - We can NOT lose a single event
  • 70. Our use cases: Store it to Entity Cache - Since HBase has idempotent writes, we can write events multiple times without hassle - But, we do NOT start with earliest offsets… - That would be 7 days of redundant writes…!!! - We store offsets of last finished batch - But obviously we might re-write some events on restart or failure
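How an idempotent sink absorbs the replays that follow a restart can be sketched as below. The mutable map is only a stand-in for an HBase table keyed by row id; `put`, `Event` and the sample data are assumptions made for the example.

```scala
import scala.collection.mutable

object EntityCacheSketch {
  final case class Event(id: String, value: Int)

  // Stand-in for an HBase table: row key -> value.
  val table = mutable.Map.empty[String, Int]

  // A keyed put: writing the same event again leaves the table unchanged.
  def put(e: Event): Unit = table.update(e.id, e.value)

  val log = Vector(Event("u1", 10), Event("u2", 20), Event("u3", 30))
  var lastFinished = 0 // offset after the last finished batch

  log.slice(0, 2).foreach(put)
  lastFinished = 2                        // batch 1 finished, offsets stored
  log.slice(2, 3).foreach(put)            // batch 2 written, crash before storing offsets
  log.slice(lastFinished, 3).foreach(put) // restart: batch 2 is written again
  lastFinished = 3
}
```

The replay starts from the last finished batch rather than from the earliest offset, so only a small tail of events is rewritten.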
  • 71. Lessons Learned - Do NOT use checkpointing! It is not recoverable across upgrades - Do your own checkpointing: track offsets yourself (ZK, HDFS, DB…) - Memory might be an issue, and you do not want to waste it: adjust batchDuration and maxRatePerPartition
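The effect of the two tuning knobs can be estimated with simple arithmetic, assuming the rate limit is expressed in records per second per partition (as with Spark's `spark.streaming.kafka.maxRatePerPartition` setting). The partition count, batch length and average record size below are made-up numbers for illustration.

```scala
object BatchSizingSketch {
  // Upper bound on the number of records a single batch can contain.
  def maxRecordsPerBatch(maxRatePerPartition: Long, partitions: Int, batchSeconds: Long): Long =
    maxRatePerPartition * partitions * batchSeconds

  // Rough memory footprint of one full batch, ignoring object overhead.
  def approxBytes(records: Long, avgRecordBytes: Long): Long =
    records * avgRecordBytes

  // 1000 records/s per partition, 9 partitions, 10-second batches
  val cap = maxRecordsPerBatch(maxRatePerPartition = 1000L, partitions = 9, batchSeconds = 10L)
  val bytes = approxBytes(cap, avgRecordBytes = 500L)
}
```

Shorter batches or a lower per-partition rate both shrink this bound, which is the lever for keeping a backlog after a restart from overwhelming executor memory.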
  • 72. Future Research - Dynamic Allocation: spark.dynamicAllocation.enabled vs spark.streaming.dynamicAllocation.enabled (but no reference in docs...) - Graceful shutdown - Structured Streaming