Ingest and Stream Processing - What will you choose?

© Cloudera, Inc. All rights reserved.
13 April 2016
Ted Malaska | Principal Solutions Architect @ Cloudera
Pat Patterson | Community Champion @ StreamSets

About Ted and Pat
Ted Malaska
• Principal Solutions Architect @ Cloudera
• Apache HBase SparkOnHBase Contributor
• Contact
  • ted.malaska@cloudera.com
  • @TedMalaska
Pat Patterson
• Community Champion @ StreamSets
• Formerly Developer Evangelist at Salesforce
• Contact
  • pat@streamsets.com
  • @metadaddy

Streaming Patterns
• Ingestion
• Low Millisecond Actions
• Near Real Time Complex Actions

Parts Of Streaming
Producer -> Kafka -> Engine -> Destination
Diagram labels (delivery guarantees along the pipeline, left to right):
• At Least Once
• Ordered
• Partitioned
• At Least Once
• Depends
• Depends

Destinations
• File Systems: example HDFS
  • Batch is good
  • Exactly once is possible only if a file is closed in a single ack
  • Good for Scans
• Solr
  • Everything is document based, which makes exactly once possible
  • Batch is still good
  • Good for Search Queries

Destinations
• NoSQL: example HBase
  • Everything has a row key, which makes exactly once possible for writes
  • Increments can be applied twice, so be careful
  • Good for gets and puts
• Kudu
  • Everything has a row key, which makes exactly once possible for writes
  • Good for gets, puts, and scans
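
The difference between keyed writes and increments under at-least-once delivery can be sketched in a few lines. This is a minimal illustration only: `RowStore` is a hypothetical in-memory stand-in, not the HBase or Kudu client API, and `deliver_at_least_once` is an assumed helper that simulates one redelivered message.

```python
class RowStore:
    """Hypothetical in-memory stand-in for a row-key store (not a real client API)."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, value):
        # Writing the same key and value again leaves the row unchanged.
        self.rows[row_key] = value

    def increment(self, row_key, amount):
        # Every call changes the stored value, so a replayed message double counts.
        self.rows[row_key] = self.rows.get(row_key, 0) + amount
        return self.rows[row_key]


def deliver_at_least_once(messages, duplicate_index):
    """Return the messages with the one at duplicate_index delivered twice."""
    return messages[: duplicate_index + 1] + messages[duplicate_index:]


events = [("order-1", 10), ("order-2", 5), ("order-3", 7)]
replayed = deliver_at_least_once(events, duplicate_index=1)

# Keyed puts: the duplicate overwrites the row with the same value.
put_store = RowStore()
for key, value in replayed:
    put_store.put(key, value)

# Increments: the duplicate is added a second time.
inc_store = RowStore()
for _key, value in replayed:
    inc_store.increment("total", value)

print(put_store.rows)
print(inc_store.rows["total"], "vs expected", sum(v for _, v in events))
```

The puts end in the same state as a clean run, while the increment total drifts by the value of the replayed message.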

Ingestion Destinations
• File Systems: example HDFS
  • Flume
  • Kafka Connect
• Solr
  • Flume
  • Any Streaming Engine

Ingestion Destinations
• NoSQL: example HBase
  • Flume
  • Any Streaming Engine: Storm and Spark Streaming Tested
• Kudu
  • Flume
  • Kafka Connect
  • Any Streaming Engine: Spark Streaming Tested

Tricks With Producers
• Send Source ID (requires Partitioning In Kafka)
  • Seq
  • UUID
  • UUID plus time
• Partition on SourceID
• Watch out for repartitions and partition failovers
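
One way to read the source ID plus sequence number trick is sketched below. It is an assumption-laden illustration, not Kafka client code: `Producer`, `DedupingConsumer`, `partition_for`, and `NUM_PARTITIONS` are hypothetical names, and the partitioner is a plain CRC32 hash rather than Kafka's own.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(source_id, num_partitions=NUM_PARTITIONS):
    # Stable hash, so one source always lands on the same partition
    # and its records keep their relative order.
    return zlib.crc32(source_id.encode("utf-8")) % num_partitions


class Producer:
    """Hypothetical producer that stamps each record with source ID and seq."""

    def __init__(self, source_id):
        self.source_id = source_id
        self.next_seq = 0

    def make_record(self, payload):
        record = {"source_id": self.source_id, "seq": self.next_seq, "payload": payload}
        self.next_seq += 1
        return partition_for(self.source_id), record


class DedupingConsumer:
    """Drops any record whose seq was already seen for that source."""

    def __init__(self):
        self.last_seq = {}
        self.accepted = []

    def handle(self, record):
        source_id, seq = record["source_id"], record["seq"]
        if seq <= self.last_seq.get(source_id, -1):
            return False  # duplicate from a retry or replay
        self.last_seq[source_id] = seq
        self.accepted.append(record["payload"])
        return True


producer = Producer("sensor-7")
sent = [producer.make_record(p) for p in ("a", "b", "c")]
partitions = {partition for partition, _ in sent}
log = [record for _, record in sent]

# Simulate at-least-once delivery: "b" and "c" arrive a second time.
redelivered = log + log[1:]
consumer = DedupingConsumer()
results = [consumer.handle(record) for record in redelivered]
print(results, consumer.accepted)
```

The check depends on per-source ordering, which is why the records are partitioned on the source ID. It also shows why repartitions and failovers need care: the `last_seq` state has to follow the partition to whichever consumer takes it over.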

Streaming Engines
• Consumer
• Flume, Kafka Connect
• Storm
• Spark Streaming
• Flink
• Kafka Streams

Consumer: Flume, Kafka Connect
• Simple and Works
• Low latency
• High throughput
• Interceptors
• Transformations
• Alerting
• Ingestion

Storm
• Old Gen
• Low latency
• Low throughput
• At least once
• Around forever
• Topology Based

Spark Streaming
• The Juggernaut
• Higher Latency
• High Throughput
• Exactly Once
• SQL
• MLlib
• Highly used
• Easy to Debug/Unit Test
• Easy to transition from Batch
• Flow Language
• 600 commits in a month and about 100 meetups

Spark Streaming
[Diagram: DStreams as a sequence of RDDs. For each batch (First Batch, Second Batch): Source -> Receiver -> RDD, then Filter -> Count -> Print in a single pass.]

Spark Streaming
[Diagram: stateful DStreams. For each batch (Pre-first Batch, First Batch, Second Batch): Source -> Receiver -> RDD partitions, then Filter -> Count in a single pass, feeding Stateful RDD 1 and Stateful RDD 2, then Print.]
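
The micro-batch flow in the diagrams (filter and count in a single pass per batch, with state carried from one batch to the next) can be imitated in plain Python. This is a conceptual sketch only and does not use the Spark API; `run_micro_batches` and its parameters are hypothetical names.

```python
def run_micro_batches(batches, keep, initial_state=None):
    """Process batches in order, carrying per-key counts between batches."""
    state = dict(initial_state or {})  # plays the role of the stateful RDD
    snapshots = []
    for batch in batches:
        # Single pass over the batch: filter and count together.
        batch_counts = {}
        for record in batch:
            if keep(record):
                batch_counts[record] = batch_counts.get(record, 0) + 1
        # Merge this batch's counts into the state left by the previous batch.
        for key, count in batch_counts.items():
            state[key] = state.get(key, 0) + count
        snapshots.append(dict(state))  # what "Print" would show for this batch
    return snapshots


batches = [
    ["error", "info", "error"],  # first batch
    ["warn", "error", "info"],   # second batch
]
snapshots = run_micro_batches(batches, keep=lambda level: level != "info")
for number, snapshot in enumerate(snapshots, start=1):
    print("batch", number, snapshot)
```

Each snapshot depends only on the previous state and the current batch, which is the property that makes a failed batch straightforward to recompute.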

Flink
• "I'm better than Spark, why doesn't anyone use me?"
• Very much like Spark but not as feature rich
• Lower Latency
• Micro Batch -> ABS
  • Asynchronous Barrier Snapshotting
• Flow Language
• ~1/6th the commits and meetups
• But Slim loves it

Flink - ABS
[Diagram series: operators, each with a buffer, receiving barriers from two inputs.]
• Step 1: Barrier 1A hit, Barrier 1B still behind
• Step 2: Both barriers hit
• Step 3: Barrier is combined and can move on; buffer can be flushed out
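
The barrier steps above can be sketched as a small simulation. It is a simplified model under stated assumptions, not Flink code: `AligningOperator` is a hypothetical class, the operator state is just a running sum, and only one checkpoint is in flight at a time.

```python
from collections import deque

BARRIER = "BARRIER"


class AligningOperator:
    """Hypothetical multi-input operator that aligns barriers before snapshotting."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.blocked = set()     # channels whose barrier has already arrived
        self.buffer = deque()    # records held back from blocked channels
        self.running_sum = 0     # operator state
        self.snapshots = []      # state captured at each aligned barrier
        self.emitted = []        # what flows downstream

    def _process(self, value):
        self.running_sum += value
        self.emitted.append(value)

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if self.blocked == self.channels:
                # Both barriers hit: snapshot state, forward one combined
                # barrier, then flush the buffered records.
                self.snapshots.append(self.running_sum)
                self.emitted.append(BARRIER)
                self.blocked.clear()
                pending = list(self.buffer)
                self.buffer.clear()
                for value in pending:
                    self._process(value)
        elif channel in self.blocked:
            # This channel is ahead of the barrier: hold its records back.
            self.buffer.append(item)
        else:
            self._process(item)


op = AligningOperator(["A", "B"])
op.receive("A", 1)
op.receive("B", 2)
op.receive("A", BARRIER)  # Barrier 1A hit, Barrier 1B still behind
op.receive("A", 10)       # buffered, belongs to the next checkpoint
op.receive("B", 3)        # still before Barrier 1B, processed normally
op.receive("B", BARRIER)  # both barriers hit
print(op.snapshots, op.emitted, op.running_sum)
```

The snapshot contains only the records that arrived before the barrier on every input, while the buffered record is processed after the combined barrier moves on.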

Kafka Streams
• The new Kid on the Block
• When you only have Kafka
• Low Latency
• High Throughput
• Interesting snapshot approach
• Very Young
• Flow Language

Summary about Engines
• Ingestion
  • Flume and Kafka Connect
• Super Real Time and Special
  • Consumer
• Counting, MLlib, SQL
  • Spark
• Maybe future and cool
  • Flink and Kafka Streams
• Odd man out
  • Storm

StreamSets Data Collector
Building a Higher Level Tool

Thank you!
