Apache Spark Streaming +
Kafka 0.10: An Integration Story
Joan Viladrosa, Billy Mobile
About me
Degree in Computer Science
Advanced Programming Techniques &
System Interfaces and Integration
Co-Founder, Educabits
Educational Big data solutions
using AWS cloud
Big Data Developer, Trovit
Hadoop and MapReduce Framework
SEM keywords optimization
Big Data Architect & Tech Lead
BillyMobile
Full architecture with Hadoop:
Kafka, Storm, Hive, HBase, Spark, Druid, …
Joan Viladrosa Riera
@joanvr
joanviladrosa
joan.viladrosa@billymob.com
Apache Kafka
What is
Apache
Kafka?
- Publish - Subscribe
Message System
What is
Apache
Kafka?
What makes it great?
- Publish - Subscribe
Message System
- Fast
- Scalable
- Durable
- Fault-tolerant
What is Apache Kafka
[Diagram: many producers publish to Kafka and many consumers read from it. Kafka acts as a central point.]
What is Apache Kafka
A lot of different connectors
[Diagram: Apache Storm, Apache Spark, a Java app, and a logger produce into Kafka; Apache Storm, Apache Spark, a Java app, and a monitoring tool consume from it.]
Terminology
Topic: A feed of messages
Producer: Processes that publish
messages to a topic
Consumer: Processes that
subscribe to topics and process the
feed of published messages
Broker: Each server of a Kafka
cluster that holds, receives, and
sends the actual data
Kafka Topic Partitions
[Diagram: a topic with three partitions of different lengths (offsets 0-9, 0-8, and 0-6); each partition is an ordered log, old messages on the left, new on the right, with writes appended at the end.]
Kafka Topic Partitions
[Diagram: Partition 0 with offsets 0-15; the producer writes at the new end, while Consumer A reads at offset=6 and Consumer B at offset=12, each tracking its own position independently.]
Kafka Topic Partitions
[Diagram: nine partitions (P0-P8) spread across Broker 1, Broker 2, and Broker 3, with consumers and producers talking to all three brokers.]
Kafka Topic Partitions
[Diagram: the same nine partitions across three brokers. More brokers means more storage; more partitions means more parallelism.]
Kafka Semantics
In short: consumer
delivery semantics are
up to you, not Kafka
- Kafka doesn’t store the
state of the consumers*
- It just sends you what
you ask for (topic,
partition, offset, length)
- You have to take care of
your state
Apache Kafka Timeline
- nov-2012: 0.7, Apache Incubator Project
- nov-2013: 0.8, New Producer
- nov-2015: 0.9, New Consumer + Security
- may-2016: 0.10, Kafka Streams
Apache Spark
Streaming
What is
Apache
Spark
Streaming?
- Process streams of data
- Micro-batching approach
What is
Apache
Spark
Streaming?
What makes it great?
- Process streams of data
- Micro-batching approach
- Same API as Spark
- Same integrations as Spark
- Same guarantees &
semantics as Spark
What is Apache Spark Streaming
Relying on the same Spark Engine: “same syntax” as batch jobs
https://spark.apache.org/docs/latest/streaming-programming-guide.html
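A minimal sketch of that point: a streaming word count written with the same operators you would use on batch RDDs (the socket source, host, and port are just illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("streaming-word-count")
val ssc = new StreamingContext(sparkConf, Seconds(5)) // micro-batches every 5 seconds

val lines = ssc.socketTextStream("localhost", 9999) // illustrative source
val counts = lines.flatMap(_.split(" "))            // same operators as batch RDDs
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()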
How does it work?
- Discretized Streams
https://spark.apache.org/docs/latest/streaming-programming-guide.html
How does it work?
https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
Spark
Streaming
Semantics
Side effects
As in Spark:
- No guarantee of exactly-once
semantics for output actions
- Any side-effecting output
operations may be repeated
- Because of node failure, process
failure, etc.
So, be careful when outputting to
external sources
Spark Streaming
Kafka Integration
Spark Streaming Kafka Integration Timeline
- sep-2014: Spark 1.1, Receivers
- dec-2014: Spark 1.2, Fault Tolerant WAL + Python API
- mar-2015: Spark 1.3, Direct Streams + Python API
- jun-2015: Spark 1.4, Improved Streaming UI
- sep-2015: Spark 1.5, Metadata in UI (offsets) + Graduated Direct
- jan-2016: Spark 1.6
- jul-2016: Spark 2.0, Native Kafka 0.10 (experimental)
- dec-2016: Spark 2.1
Kafka Receiver (≤ Spark 1.1)
[Diagram: a Receiver runs inside an Executor, continuously receiving data via the Kafka High Level API and updating offsets in ZooKeeper; the Driver launches jobs on the received data.]
Kafka Receiver with WAL (Spark 1.2)
[Diagram: same as before, but the Executor also writes received data to a Write-Ahead Log (WAL) on HDFS; the Receiver still uses the High Level API and updates offsets in ZooKeeper, and the Driver launches jobs on the data.]
Kafka Receiver with WAL (Spark 1.2)
[Diagram: in the Application Driver, the Spark Context runs jobs and the Streaming Context checkpoints the computation; block metadata from the Receiver's input stream is written to the log, and in the Executor the block data is written both to memory and to the log.]
Kafka Receiver with WAL (Spark 1.2)
Restarted Driver Restarted
Executor
Restarted
Spark
Context
Relaunch
Jobs
Restart
computation
from info in
checkpoints Restarted
Receiver
Resend
unacked data
Recover
Block
metadata
from log
Recover Block
data from log
Restarted
Streaming
Context
Direct Kafka Integration w/o Receiver or WAL (Spark 1.3)
[Diagram sequence, with no Receiver and no WAL:
1. The Driver queries the latest offsets and decides offset ranges for the batch, e.g. (topic1, p1, (2000, 2100)), (topic1, p2, (2010, 2110)), (topic1, p3, (2002, 2102)).
2. The Driver launches jobs using those offset ranges.
3. Executors read the data for their offset ranges directly from Kafka using the Simple API, and the cycle repeats for the next batch.]
Direct Kafka
API benefits
- No WALs or Receivers
- Allows end-to-end
exactly-once semantics
pipelines *
* updates to downstream systems should be
idempotent or transactional
- More fault-tolerant
- More efficient
- Easier to use.
Spark Streaming UI improvements (Spark 1.4)
Kafka Metadata (offsets) in UI (Spark 1.5)
What about Spark 2.0+ and
new Kafka Integration?
This is why we are here, right?
Spark 2.0+ new Kafka Integration
                             spark-streaming-kafka-0-8   spark-streaming-kafka-0-10
Broker Version               0.8.2.1 or higher           0.10.0 or higher
API Stability                Stable                      Experimental
Language Support             Scala, Java, Python         Scala, Java
Receiver DStream             Yes                         No
Direct DStream               Yes                         Yes
SSL / TLS Support            No                          Yes
Offset Commit API            No                          Yes
Dynamic Topic Subscription   No                          Yes
What’s really
New with this
New Kafka
Integration?
- New Consumer API
* Instead of Simple API
- Location Strategies
- Consumer Strategies
- SSL / TLS
- No Python API :(
Location Strategies
- The new consumer API pre-fetches messages into buffers
- So, keep cached consumers on the executors (instead of recreating them for each batch)
- It's better to schedule partitions on hosts that already have the
appropriate cached consumers
Location Strategies
- PreferConsistent
Distribute partitions evenly across available executors
- PreferBrokers
If your executors are on the same hosts as your Kafka brokers
- PreferFixed
Specify an explicit mapping of partitions to hosts
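A minimal sketch of how these strategies look in code (the host names and partition mapping are hypothetical):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// Default choice: spread partitions evenly across available executors
val consistent = LocationStrategies.PreferConsistent

// Only if your executors run on the same hosts as your Kafka brokers
val brokers = LocationStrategies.PreferBrokers

// Pin specific partitions to specific hosts (hypothetical mapping)
val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("topicA", 0) -> "exec-host-01",
  new TopicPartition("topicA", 1) -> "exec-host-02"
))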
Consumer Strategies
- New consumer API has a number of different
ways to specify topics, some of which require
considerable post-object-instantiation setup.
- ConsumerStrategies provides an abstraction
that allows Spark to obtain properly configured
consumers even after restart from checkpoint.
Consumer Strategies
- Subscribe: subscribe to a fixed collection of topics
- SubscribePattern: use a regex to specify topics of
interest
- Assign: specify a fixed collection of partitions
● Overloaded constructors to specify the starting offset
for a particular partition.
● ConsumerStrategy is a public class that you can extend.
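A minimal sketch of the three strategies (the topic names, regex, and partitions are illustrative; kafkaParams is the same map used with createDirectStream below):

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Fixed collection of topics
val subscribe = ConsumerStrategies.Subscribe[String, String](
  Array("topicA", "topicB"), kafkaParams)

// Regex over topic names; also picks up matching topics created later
val byPattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("topic[A-Z]"), kafkaParams)

// Fixed collection of partitions, bypassing consumer group rebalancing
val assign = ConsumerStrategies.Assign[String, String](
  Array(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1)),
  kafkaParams)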
SSL/TLS encryption
- New consumer API supports SSL
- Only applies to communication between Spark
and Kafka brokers
- You are still responsible for separately securing
Spark inter-node communication
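A sketch of the Kafka params this enables (the paths and passwords are placeholders):

val sslParams = Map[String, Object](
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/kafka.client.truststore.jks",
  "ssl.truststore.password" -> "truststore-password",
  "ssl.keystore.location" -> "/path/to/kafka.client.keystore.jks",
  "ssl.keystore.password" -> "keystore-password",
  "ssl.key.password" -> "key-password"
)

// merge into the kafkaParams map passed to the ConsumerStrategy
val secureKafkaParams = kafkaParams ++ sslParams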
How to use
New Kafka
Integration on
Spark 2.0+
Scala Example Code
Basic Usage
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092,broker02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))
How to use
New Kafka
Integration on
Spark 2.0+
Scala Example Code
Getting Metadata
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}
Kafka or Spark RDD partitions?
[Diagram: Kafka topic partitions 1-4 map 1:1 to Spark RDD partitions 1-4, which is why TaskContext.get.partitionId indexes the right OffsetRange, as long as no shuffle has changed the partitioning.]
How to use
New Kafka
Integration on
Spark 2.0+
Scala Example Code
Store offsets in Kafka itself:
Commit API
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // DO YOUR STUFF with DATA
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Kafka + Spark Semantics
- At most once
- At least once
- Exactly once
Kafka + Spark
Semantics
At most once
- We don’t want duplicates
- Not worth the hassle of ensuring that
messages don’t get lost
- Example: Sending statistics over UDP
1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false
(the default)
3. Set Kafka param auto.offset.reset
to "latest" ("largest" in the old consumer)
4. Set Kafka param enable.auto.commit
to true
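A minimal sketch of those four settings together (kafkaParams is the map from the Basic Usage example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "1")
  .set("spark.speculation", "false") // the default anyway

val atMostOnceParams = kafkaParams ++ Map[String, Object](
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (true: java.lang.Boolean)
)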
Kafka + Spark
Semantics
At most once
- This will mean you lose messages on
restart
- At least they shouldn’t get replayed.
- Test this carefully if it’s actually important
to you that a message never gets
repeated, because it’s not a common use
case.
Kafka + Spark
Semantics
At least once
- We don’t want to lose any record
- We don’t care about duplicates
- Example: Sending internal alerts on
relatively rare occurrences in the stream
1. Set spark.task.maxFailures > 1000
2. Set Kafka param auto.offset.reset
to "earliest" ("smallest" in the old consumer)
3. Set Kafka param enable.auto.commit
to false
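And the matching sketch for at-least-once (again reusing kafkaParams from the Basic Usage example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "1000")

val atLeastOnceParams = kafkaParams ++ Map[String, Object](
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)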
Kafka + Spark
Semantics
At least once
- Don’t be silly! Do NOT replay your whole
log on every restart…
- Manually commit the offsets when you
are 100% sure records are processed
- If this is “too hard”, you’d better have a
relatively short retention log
- Or be REALLY ok with duplicates. For
example, you are outputting to an
external system that handles duplicates
for you (HBase)
Kafka + Spark
Semantics
Exactly once
- We don’t want to lose any record
- We don’t want duplicates either
- Example: Storing stream in data
warehouse
1. We need some kind of idempotent writes,
or all-or-nothing writes (transactions)
2. Only store offsets EXACTLY after writing
data
3. Same parameters as at least once
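A sketch of the pattern: write the data first, then store the offsets, ideally both inside one transaction. Here withTransaction, tx.write, and tx.saveOffset are hypothetical helpers for whatever store you use:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val osr = offsetRanges(TaskContext.get.partitionId)
    withTransaction { tx => // hypothetical transactional store
      iter.foreach(record => tx.write(record.value))
      tx.saveOffset(osr.topic, osr.partition, osr.untilOffset)
    } // data and offsets commit together, or not at all
  }
}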
Kafka + Spark
Semantics
Exactly once
- Probably the hardest to achieve right
- Still some small chance of failure if your
app fails just between writing data and
committing offsets… (but REALLY small)
Spark
Streaming +
Kafka
at Billy Mobile
a story of love and fury
Some
Billy
Insights
we rock it!
15B records monthly
35TB weekly retention log
6K events/second
x4 growth/year
Our use cases: ETL to Data Warehouse
- Input events from Kafka
- Enrich events with some external data sources
- Finally store it to Hive
- We do NOT want duplicates
- We do NOT want to lose events
Our use cases: ETL to Data Warehouse
- Hive is not transactional
- Nor does it have idempotent writes
- Writing files to HDFS is “atomic” (all or nothing)
- So: a 1:1 relation from each partition-batch to a file in HDFS
- Store the current state of the batch to ZK
- Store the offsets of the last finished batch to ZK
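A sketch of that ZooKeeper bookkeeping using Apache Curator (the paths and serialization format are hypothetical):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.spark.streaming.kafka010.OffsetRange

val zk = CuratorFrameworkFactory.newClient(
  "zk01:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()

// serialize each range as topic,partition,fromOffset,untilOffset
def saveOffsets(path: String, ranges: Array[OffsetRange]): Unit = {
  val payload = ranges
    .map(o => s"${o.topic},${o.partition},${o.fromOffset},${o.untilOffset}")
    .mkString(";").getBytes("UTF-8")
  if (zk.checkExists().forPath(path) == null)
    zk.create().creatingParentsIfNeeded().forPath(path, payload)
  else
    zk.setData().forPath(path, payload)
}

// e.g. saveOffsets("/etl/current-batch", offsetRanges) when a batch starts,
// and saveOffsets("/etl/last-finished", offsetRanges) after the HDFS write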
Our use cases: ETL to Data Warehouse
On failure:
- If an executor fails, just keep going (reschedule the task)
> spark.task.maxFailures = 1000
- If the driver fails (or restarts):
- Load offsets and state from the “current batch” if it exists,
and “finish” it (KafkaUtils.createRDD)
- Continue Stream from last saved offsets
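A sketch of “finishing” the interrupted batch on restart; loadCurrentBatchFromZk is a hypothetical helper that reads back the state saved above:

import scala.collection.JavaConverters._
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

val recovered: Array[OffsetRange] = loadCurrentBatchFromZk() // hypothetical helper
val rdd = KafkaUtils.createRDD[String, String](
  sparkContext,
  kafkaParams.asJava, // createRDD takes a java.util.Map
  recovered,
  LocationStrategies.PreferConsistent
)
// process rdd exactly like a normal batch, mark it finished in ZK,
// then start the stream from the last saved offsets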
Our use cases: Anomalies Detection
- Input events from Kafka
- Periodically load batch-computed model
- Detect when an offer stops converting (or converts too much)
- We do not care about losing some events (on restart)
- We always need to process the “real-time” stream
Our use cases: Anomalies Detection
- It’s useless to detect anomalies on a lagged stream!
- Actually it could be very bad
- Always restart stream on latest offsets
- Restart with “fresh” state
Our use cases: Store it to Entity Cache
- Input events from Kafka
- Almost no processing
- Store it to HBase (has idempotent writes)
- We do not care about duplicates
- We can NOT lose a single event
Our use cases: Store it to Entity Cache
- Since HBase has idempotent writes, we can write
events multiple times without hassle
- But, we do NOT start with earliest offsets…
- That would be 7 days of redundant writes…!!!
- We store offsets of last finished batch
- But obviously we might re-write some events on restart
or failure
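A sketch of why those writes are idempotent: an HBase Put keyed by event id simply overwrites the same cell on replay (the table and column names are hypothetical):

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection()
val table = conn.getTable(TableName.valueOf("entity_cache")) // hypothetical table

def store(eventId: String, payload: Array[Byte]): Unit = {
  val put = new Put(Bytes.toBytes(eventId)) // row key = event id, so replays overwrite
  put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), payload)
  table.put(put)
}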
Lessons
Learned
- Do NOT use checkpointing!
- Not recoverable across upgrades
- Do your own checkpointing
- Track offsets yourself
- ZK, HDFS, DB…
- Memory might be an issue
- You do not want to waste it...
- Adjust batchDuration
- Adjust maxRatePerPartition
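A sketch of those two knobs (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "10000") // max records/sec per partition

val ssc = new StreamingContext(conf, Seconds(30)) // batchDuration: 30-second micro-batches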
Future
Research
- Dynamic Allocation
spark.dynamicAllocation.enabled vs
spark.streaming.dynamicAllocation.enabled
https://issues.apache.org/jira/browse/SPARK-12133
But no reference in docs...
- Graceful shutdown
- Structured Streaming
Thank you
very much!
Questions?
@joanvr
joanviladrosa
joan.viladrosa@billymob.com
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story

More Related Content

What's hot

Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
 
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collectorMuga Nishizawa
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings MeetupGwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupGwen (Chen) Shapira
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Michael Noll
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Stream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaStream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaAbhinav Singh
 
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Michael Noll
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Knoldus Inc.
 
Kafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache KafkaKafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache KafkaEno Thereska
 
Pulsar Functions Deep Dive_Sanjeev kulkarni
Pulsar Functions Deep Dive_Sanjeev kulkarniPulsar Functions Deep Dive_Sanjeev kulkarni
Pulsar Functions Deep Dive_Sanjeev kulkarniStreamNative
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformLegacy Typesafe (now Lightbend)
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksJamie Grier
 
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectKaufman Ng
 
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&PierreKafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&PierreStreamNative
 
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Hugo Louro
 
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIALa Cuisine du Web
 

What's hot (20)

Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collector
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Stream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaStream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache Kafka
 
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1
 
Kafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache KafkaKafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache Kafka
 
Pulsar Functions Deep Dive_Sanjeev kulkarni
Pulsar Functions Deep Dive_Sanjeev kulkarniPulsar Functions Deep Dive_Sanjeev kulkarni
Pulsar Functions Deep Dive_Sanjeev kulkarni
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&PierreKafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&Pierre
 
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1
 
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIA
 

Viewers also liked

Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo
 
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...Michael Noll
 
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Michael Noll
 
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Ontico
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataAdaryl "Bob" Wakefield, MBA
 
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)Ontico
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelinesSumant Tambe
 
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Till Rohrmann
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
 
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...Denodo
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksMatthias Niehoff
 
CWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / ClouderaCWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / ClouderaCapgemini
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineData Con LA
 
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Cloudera, Inc.
 
Building the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time AnalyticsBuilding the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time AnalyticsSingleStore
 

Viewers also liked (20)

Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
 
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...
 
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
 
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT Data
 
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
Реактивные микросервисы с Apache Kafka / Денис Иванов (2ГИС)
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
 
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and Frameworks
 
CWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / ClouderaCWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / Cloudera
 
Ibm watson
Ibm watsonIbm watson
Ibm watson
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
 
Building the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time AnalyticsBuilding the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time Analytics
 
Softnix Security Data Lake
Softnix Security Data Lake Softnix Security Data Lake
Softnix Security Data Lake
 

Similar to [Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story

Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraSpark Summit
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Big Data Spain
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis
 
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaKafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaGuido Schmutz
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Guido Schmutz
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Guido Schmutz
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
Python Kafka Integration: Developers Guide
Python Kafka Integration: Developers GuidePython Kafka Integration: Developers Guide
Python Kafka Integration: Developers GuideInexture Solutions
 
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Lightbend
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsSlim Baltagi
 
Connecting Apache Kafka With Mule ESB
Connecting Apache Kafka With Mule ESBConnecting Apache Kafka With Mule ESB
Connecting Apache Kafka With Mule ESBJitendra Bafna
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...Athens Big Data
 
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...Denodo
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaGuido Schmutz
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
Apache Kafka - A modern Stream Processing Platform
Apache Kafka - A modern Stream Processing PlatformApache Kafka - A modern Stream Processing Platform
Apache Kafka - A modern Stream Processing PlatformGuido Schmutz
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kai Wähner
 
Fast Streaming into Clickhouse with Apache Pulsar
Fast Streaming into Clickhouse with Apache PulsarFast Streaming into Clickhouse with Apache Pulsar
Fast Streaming into Clickhouse with Apache PulsarTimothy Spann
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Kai Wähner
 

Similar to [Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story (20)

Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
 
Kafka Explainaton
Kafka ExplainatonKafka Explainaton
Kafka Explainaton
 
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaKafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Python Kafka Integration: Developers Guide
Python Kafka Integration: Developers GuidePython Kafka Integration: Developers Guide
Python Kafka Integration: Developers Guide
 
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
 
Connecting Apache Kafka With Mule ESB
Connecting Apache Kafka With Mule ESBConnecting Apache Kafka With Mule ESB
Connecting Apache Kafka With Mule ESB
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Apache Kafka - A modern Stream Processing Platform
Apache Kafka - A modern Stream Processing PlatformApache Kafka - A modern Stream Processing Platform
Apache Kafka - A modern Stream Processing Platform
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)
 
Fast Streaming into Clickhouse with Apache Pulsar
Fast Streaming into Clickhouse with Apache PulsarFast Streaming into Clickhouse with Apache Pulsar
Fast Streaming into Clickhouse with Apache Pulsar
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 

Recently uploaded

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 

Recently uploaded (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story

  • 1. Apache Spark Streaming + Kafka 0.10: An Integration Story Joan Viladrosa, Billy Mobile
  • 2. About me Degree In Computer Science Advanced Programming Techniques & System Interfaces and Integration Co-Founder, Educabits Educational Big data solutions using AWS cloud Big Data Developer, Trovit Hadoop and MapReduce Framework SEM keywords optimization Big Data Architect & Tech Lead BillyMobile Full architecture with Hadoop: Kafka, Storm, Hive, HBase, Spark, Druid, … Joan Viladrosa Riera @joanvr joanviladrosa joan.viladrosa@billymob.com
  • 4. What is Apache Kafka? - Publish - Subscribe Message System
  • 5. What is Apache Kafka? What makes it great? - Publish - Subscribe Message System - Fast - Scalable - Durable - Fault-tolerant
  • 6. What is Apache Kafka Producer Producer Producer Producer Kafka Consumer Consumer Consumer Consumer As a central point
  • 7. What is Apache Kafka A lot of different connectors Apache Storm Apache Spark My Java App Logger Kafka Apache Storm Apache Spark My Java App Monitoring Tool
  • 8. Kafka Terminology Topic: A feed of messages Producer: Processes that publish messages to a topic Consumer: Processes that subscribe to topics and process the feed of published messages Broker: Each server of a kafka cluster that holds, receives and sends the actual data
  • 9. Kafka Topic Partitions 0 1 2 3 4 5 6Partition 0 Partition 1 Partition 2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Topic: Old New writes
  • 10. Kafka Topic Partitions 0 1 2 3 4 5 6Partition 0 7 8 9 Old New 1 0 1 1 1 2 1 3 1 4 1 5 Producer writes Consumer A (offset=6) Consumer B (offset=12) reads reads
  • 11. Kafka Topic Partitions 0 1 2 3 4 5 6P0 P1 P2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P3 P4 P5 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P6 P7 P8 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Broker 1 Broker 2 Broker 3 Consumers & Producers
  • 12. Kafka Topic Partitions 0 1 2 3 4 5 6P0 P1 P2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P3 P4 P5 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P6 P7 P8 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Broker 1 Broker 2 Broker 3 Consumers & Producers More Storage More Parallelism
  • 13. Kafka Semantics In short: consumer delivery semantics are up to you, not Kafka - Kafka doesn’t store the state of the consumers* - It just sends you what you ask for (topic, partition, offset, length) - You have to take care of your state
  • 16. What is Apache Spark Streaming? - Process streams of data - Micro-batching approach
  • 17. What is Apache Spark Streaming? What makes it great? - Process streams of data - Micro-batching approach - Same API as Spark - Same integrations as Spark - Same guarantees & semantics as Spark
  • 18. What is Apache Spark Streaming Relying on the same Spark Engine: “same syntax” as batch jobs https://spark.apache.org/docs/latest/streaming-programming-guide.html
  • 19. How does it work? - Discretized Streams https://spark.apache.org/docs/latest/streaming-programming-guide.html
  • 20. How does it work? - Discretized Streams https://spark.apache.org/docs/latest/streaming-programming-guide.html
  • 21. How does it work? https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
  • 22. How does it work? https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
  • 23. Spark Streaming Semantics Side effects As in Spark: - Not guarantee exactly-once semantics for output actions - Any side-effecting output operations may be repeated - Because of node failure, process failure, etc. So, be careful when outputting to external sources
  • 25. Spark Streaming Kafka Integration Timeline dec-2016jul-2016jan-2016sep-2015jun-2015mar-2015dec-2014sep-2014 Fault Tolerant WAL + Python API Direct Streams + Python API Improved Streaming UI Metadata in UI (offsets) + Graduated Direct Receivers Native Kafka 0.10 (experimental) 1.1 1.2 1.3 1.4 1.5 1.6 2.0 2.1
  • 26. Kafka Receiver (≤ Spark 1.1) Executor Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper Receiver
  • 27. Executor HDFS WAL Kafka Receiver with WAL (Spark 1.2) Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper Receiver
  • 28. Kafka Receiver with WAL (Spark 1.2) Application Driver Executor Spark Context Jobs Computation checkpointed Receiver Input stream Block metadata Block metadata written to log Block data written both memory + log Streaming Context
  • 29. Kafka Receiver with WAL (Spark 1.2) Restarted Driver Restarted Executor Restarted Spark Context Relaunch Jobs Restart computation from info in checkpoints Restarted Receiver Resend unacked data Recover Block metadata from log Recover Block data from log Restarted Streaming Context
  • 30. Executor HDFS WAL Kafka Receiver with WAL (Spark 1.2) Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper Receiver
  • 31. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [diagram: just a Driver and an Executor; no Receiver, no WAL]
  • 32. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [diagram] 1. The Driver queries the latest offsets and decides the offset ranges for the batch
  • 33. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [diagram] 2. The Driver launches jobs using those offset ranges, e.g. topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102)
  • 34. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [diagram] 3. Jobs on the Executor read the data for their offset ranges using the Simple API
  • 35. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [diagram: the same three steps, with each task reading its own partition's offset range]
  • 36. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [diagram: the same three steps]
  • 37. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) [diagram: the full cycle: 1. query latest offsets and decide offset ranges for the batch, 2. launch jobs using the offset ranges, 3. read data for the offset ranges in jobs using the Simple API]
  • 38. Direct Kafka API benefits - No WALs or Receivers - Allows end-to-end exactly-once pipelines * * updates to downstream systems should be idempotent or transactional - More fault-tolerant - More efficient - Easier to use
  • 39. Spark Streaming UI improvements (Spark 1.4)
  • 40. Kafka Metadata (offsets) in UI (Spark 1.5)
  • 41. What about Spark 2.0+ and new Kafka Integration? This is why we are here, right?
  • 42. Spark 2.0+ new Kafka Integration: spark-streaming-kafka-0-8 vs spark-streaming-kafka-0-10: Broker Version: 0.8.2.1 or higher vs 0.10.0 or higher; API Stability: Stable vs Experimental; Language Support: Scala, Java, Python vs Scala, Java; Receiver DStream: Yes vs No; Direct DStream: Yes vs Yes; SSL / TLS Support: No vs Yes; Offset Commit API: No vs Yes; Dynamic Topic Subscription: No vs Yes
  • 43. What’s really New with this New Kafka Integration? - New Consumer API * Instead of Simple API - Location Strategies - Consumer Strategies - SSL / TLS - No Python API :(
  • 44. Location Strategies - The new consumer API will pre-fetch messages into buffers - So, keep cached consumers on the executors - It's better to schedule partitions on the hosts that already have the appropriate consumers
  • 45. Location Strategies - PreferConsistent Distribute partitions evenly across available executors - PreferBrokers If your executors are on the same hosts as your Kafka brokers - PreferFixed Specify an explicit mapping of partitions to hosts
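A quick sketch of what the three strategies look like in code; LocationStrategies and its members are the real API, while the partition-to-host mapping in PreferFixed is a made-up example:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// default choice: spread partitions evenly across available executors
val consistent = LocationStrategies.PreferConsistent

// if your executors run on the same hosts as your Kafka brokers
val brokers = LocationStrategies.PreferBrokers

// pin specific partitions to specific hosts (hypothetical mapping)
val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("topicA", 0) -> "host-01",
  new TopicPartition("topicA", 1) -> "host-02"
))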
  • 46. Consumer Strategies - New consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. - ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.
  • 47. Consumer Strategies - Subscribe subscribe to a fixed collection of topics - SubscribePattern use a regex to specify topics of interest - Assign specify a fixed collection of partitions ● Overloaded constructors to specify the starting offset for a particular partition. ● ConsumerStrategy is a public class that you can extend.
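A sketch of the three strategies side by side; ConsumerStrategies and its methods are the real API, while the topic names and partitions are placeholders and kafkaParams is assumed to be a map like the one in the basic-usage example below:

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// assumed: the usual consumer config (bootstrap.servers, deserializers, group.id, ...)
val kafkaParams = Map[String, Object]("bootstrap.servers" -> "broker01:9092")

// fixed collection of topics
val subscribe = ConsumerStrategies
  .Subscribe[String, String](Array("topicA", "topicB"), kafkaParams)

// all topics matching a regex
val byPattern = ConsumerStrategies
  .SubscribePattern[String, String](Pattern.compile("topic.*"), kafkaParams)

// fixed collection of partitions, e.g. partitions 0 and 1 of topicA
val assign = ConsumerStrategies.Assign[String, String](
  Array(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1)),
  kafkaParams)

// overloads also take explicit starting offsets per partition, e.g.
// Subscribe[String, String](topics, kafkaParams,
//   Map(new TopicPartition("topicA", 0) -> 1234L))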
  • 48. SSL/TLS encryption - The new consumer API supports SSL - It only applies to communication between Spark and the Kafka brokers - You are still responsible for separately securing Spark inter-node communication
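SSL is enabled purely through the consumer configuration passed in kafkaParams; the keys below are standard Kafka consumer settings, and the paths and passwords are obviously placeholders:

// added on top of the usual kafkaParams map
val sslParams = Map[String, Object](
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/truststore.jks", // placeholder
  "ssl.truststore.password" -> "truststore-password",     // placeholder
  "ssl.keystore.location" -> "/path/to/keystore.jks",     // placeholder
  "ssl.keystore.password" -> "keystore-password",         // placeholder
  "ssl.key.password" -> "key-password"                    // placeholder
)

val secureKafkaParams = kafkaParams ++ sslParams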
  • 49. How to use New Kafka Integration on Spark 2.0+ Scala Example Code Basic Usage

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092,broker02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))
  • 50. How to use New Kafka Integration on Spark 2.0+ Scala Example Code Getting Metadata

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)

    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}
  • 51. Kafka or Spark RDD partitions? [diagram: a Kafka Topic with partitions 1–4 mapping one-to-one to a Spark RDD with partitions 1–4]
  • 52. Kafka or Spark RDD partitions? [diagram: the same one-to-one mapping; each Kafka partition becomes exactly one RDD partition, which is why the task's partition id can index into the offset ranges]
  • 53. How to use New Kafka Integration on Spark 2.0+ Scala Example Code Getting Metadata (same code as slide 50)
  • 54. How to use New Kafka Integration on Spark 2.0+ Scala Example Code Store offsets in Kafka itself: Commit API

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // DO YOUR STUFF with DATA

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
  • 55. Kafka + Spark Semantics - At most once - At least once - Exactly once
  • 56. Kafka + Spark Semantics At most once - We don't want duplicates - It's not worth the hassle of ensuring that messages don't get lost - Example: sending statistics over UDP 1. Set spark.task.maxFailures to 1 2. Make sure spark.speculation is false (the default) 3. Set Kafka param auto.offset.reset to "latest" ("largest" with the old consumer) 4. Set Kafka param enable.auto.commit to true
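A sketch of those four settings together; the config keys are the real Spark/Kafka names, and everything omitted is as in the basic-usage example:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "1") // a failed task fails the batch: no retry, no replay
  .set("spark.speculation", "false")  // the default; no speculative duplicate tasks

val kafkaParams = Map[String, Object](
  "auto.offset.reset" -> "latest",                  // on restart, skip anything missed
  "enable.auto.commit" -> (true: java.lang.Boolean) // offsets committed as you go
) // plus bootstrap.servers, deserializers, group.id as in the basic-usage example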
  • 57. Kafka + Spark Semantics At most once - This will mean you lose messages on restart - At least they shouldn’t get replayed. - Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case.
  • 58. Kafka + Spark Semantics At least once - We don't want to lose any record - We don't care about duplicates - Example: sending internal alerts on relatively rare occurrences on the stream 1. Set spark.task.maxFailures > 1000 2. Set Kafka param auto.offset.reset to "earliest" ("smallest" with the old consumer) 3. Set Kafka param enable.auto.commit to false
  • 59. Kafka + Spark Semantics At least once - Don't be silly! Do NOT replay your whole log on every restart… - Manually commit the offsets when you are 100% sure records are processed - If this is "too hard", you'd better have a relatively short retention log - Or be REALLY ok with duplicates. For example, you are outputting to an external system that handles duplicates for you (HBase)
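Putting the two at-least-once slides together, a sketch: read from the earliest uncommitted offsets and commit only after the output has succeeded. Here stream is the direct stream from the basic-usage example and saveToExternalSystem is a hypothetical placeholder for your own output code:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// kafkaParams: "auto.offset.reset" -> "earliest", "enable.auto.commit" -> false

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  saveToExternalSystem(rdd) // hypothetical; may be retried, so duplicates are possible

  // commit only once we are sure the records were processed; on restart we
  // re-read at most the last uncommitted batch, not the whole log
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}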
  • 60. Kafka + Spark Semantics Exactly once - We don't want to lose any record - We don't want duplicates either - Example: storing the stream in a data warehouse 1. We need some kind of idempotent writes, or all-or-nothing writes (transactions) 2. Only store offsets EXACTLY after writing data 3. Same parameters as at least once
  • 61. Kafka + Spark Semantics Exactly once - Probably the hardest to get right - There is still some small chance of failure if your app dies just between writing data and committing offsets… (but REALLY small)
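One way to read point 2 of the exactly-once recipe as code: write the data and its offsets in the same transaction against a transactional sink. This is only a sketch; db, withTransaction, writeRecords and saveOffsets are all made-up names standing in for your own store:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)

    // hypothetical transactional store: data and offsets commit atomically,
    // so a crash between writing data and saving offsets cannot happen
    db.withTransaction { tx =>
      tx.writeRecords(iter)
      tx.saveOffsets(osr.topic, osr.partition, osr.untilOffset)
    }
  }
}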
  • 62. Spark Streaming + Kafka at Billy Mobile a story of love and fury
  • 63. Some Billy Insights (we rock it!): 15B records monthly, 35TB weekly retention log, 6K events/second, x4 growth/year
  • 64. Our use cases: ETL to Data Warehouse - Input events from Kafka - Enrich events with some external data sources - Finally store it to Hive - We do NOT want duplicates - We do NOT want to lose events
  • 65. Our use cases: ETL to Data Warehouse - Hive is not transactional - and it has no idempotent writes - Writing files to HDFS is "atomic" (all or nothing) - So we keep a 1:1 relation from each partition-batch to a file in HDFS - We store to ZK the current state of the batch - We store to ZK the offsets of the last finished batch
  • 66. Our use cases: ETL to Data Warehouse On failure: - If an executor fails, just keep going (reschedule the task) > spark.task.maxFailures = 1000 - If the driver fails (or restarts): - Load offsets and state of the "current batch" if it exists, and "finish" it (KafkaUtils.createRDD) - Continue the stream from the last saved offsets
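A sketch of that driver-restart recovery: re-read exactly the interrupted batch with KafkaUtils.createRDD (a real API in spark-streaming-kafka-0-10) from the offsets saved in ZooKeeper, then resume the stream. loadOffsetsFromZk and finishBatch are hypothetical helpers, and sparkContext and kafkaParams are as in the basic-usage example:

import org.apache.spark.streaming.kafka010.{KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import scala.collection.JavaConverters._

// hypothetical helper: offset ranges of the interrupted "current batch"
val pending: Array[OffsetRange] = loadOffsetsFromZk("/etl/current-batch")

if (pending.nonEmpty) {
  // re-read exactly the interrupted batch as a plain RDD and "finish" it
  val rdd = KafkaUtils.createRDD[String, String](
    sparkContext, kafkaParams.asJava, pending, PreferConsistent)
  finishBatch(rdd) // hypothetical: write the 1:1 partition-batch files to HDFS
}

// then create the direct stream starting from the last finished offsets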
  • 67. Our use cases: Anomalies Detection - Input events from Kafka - Periodically load a batch-computed model - Detect when an offer stops converting (or converts too much) - We do not care about losing some events (on restart) - We always need to process the "real-time" stream
  • 68. Our use cases: Anomalies Detection - It's useless to detect anomalies on a lagged stream! - Actually, it could be very bad - Always restart the stream at the latest offsets - Restart with a "fresh" state
  • 69. Our use cases: Store it to Entity Cache - Input events from Kafka - Almost no processing - Store it to HBase (has idempotent writes) - We do not care about duplicates - We can NOT lose a single event
  • 70. Our use cases: Store it to Entity Cache - Since HBase has idempotent writes, we can write events multiple times without hassle - But we do NOT start from the earliest offsets… - That would be 7 days of redundant writes…!!! - We store the offsets of the last finished batch - But obviously we might re-write some events on restart or failure
  • 71. Lessons Learned - Do NOT use checkpointing! - It's not recoverable across upgrades - Do your own checkpointing - Track offsets yourself - in ZK, HDFS, a DB… - Memory might be an issue - You do not want to waste it... - Adjust batchDuration - Adjust maxRatePerPartition
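The two tuning knobs from the last bullets, as they appear in code; the config key is the real Spark property, and both values are placeholders to tune for your own load:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  // cap on messages per second read from EACH Kafka partition, so a
  // restart on old offsets cannot flood the first batches
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // placeholder value

// batchDuration: keep it large enough that processing time stays below it
val ssc = new StreamingContext(conf, Seconds(10)) // placeholder value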
  • 72. Future Research - Dynamic Allocation: spark.dynamicAllocation.enabled vs spark.streaming.dynamicAllocation.enabled https://issues.apache.org/jira/browse/SPARK-12133 But there is no reference in the docs... - Graceful shutdown - Structured Streaming