1© Cloudera, Inc. All rights reserved.
13 April 2016
Ted Malaska | Principal Solutions Architect @ Cloudera
Pat Patterson | Community Champion @ StreamSets
Ingest and Stream Processing -
What will you choose?
About Ted and Pat
Ted Malaska
• Principal Solutions Architect @ Cloudera
• Apache HBase SparkOnHBase Contributor
• Contact
• ted.malaska@cloudera.com
• @TedMalaska
Pat Patterson
• Community Champion @ StreamSets
• Formerly Developer Evangelist at Salesforce
• Contact
• pat@streamsets.com
• @metadaddy
Streaming Patterns
• Ingestion
• Low Millisecond Actions
• Near Real Time Complex Actions
Parts Of Streaming
Producer → Kafka → Engine → Destination
Parts Of Streaming
Producer → Kafka → Engine → Destination
[Diagram labels: At Least Once; Ordered; Partitioned; At Least Once; Depends; Depends]
Destinations
• File Systems: example HDFS
• Batch is good
• Exactly once is only possible if a file is closed in a single ack
• Good for Scans
• Solr
• Everything is document based, making exactly once possible
• Batch is still good
• Good for Search Queries
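One way to read the file system bullet above is that the close of the file is the single commit point for a batch. The sketch below is only an illustration of that idea, using the local file system as a stand-in for HDFS; the function name `write_batch_atomically` and the `batch-<id>.txt` naming are assumptions, not anything from the talk.

```python
import os
import tempfile


def write_batch_atomically(directory, batch_id, records):
    """Write one batch to a temp file, then publish it with a single rename."""
    final_path = os.path.join(directory, f"batch-{batch_id}.txt")
    if os.path.exists(final_path):
        # The batch was already published, so a replay is a no-op.
        return final_path
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        for record in records:
            tmp.write(record + "\n")
    # The rename is the single "ack": readers see the whole file or nothing.
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as d:
    write_batch_atomically(d, 7, ["a", "b"])
    write_batch_atomically(d, 7, ["a", "b"])  # replayed delivery of the same batch
    files = sorted(os.listdir(d))
    with open(os.path.join(d, files[0])) as f:
        contents = f.read()

print(files, repr(contents))
```

A replayed batch leaves exactly one file behind because the batch ID decides the final file name. A real pipeline would also have to record which offsets the batch covered.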
Destinations
• NoSQL: example HBase
• Everything has a row key, making exactly once possible for writes
• Increments can be applied twice, so be careful
• Good for gets and puts
• Kudu
• Everything has a row key, making exactly once possible for writes
• Good for gets, puts, and scans
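The difference between keyed writes and increments under at-least-once delivery can be sketched with a toy in-memory table. `ToyTable` is a hypothetical stand-in and not the HBase or Kudu client API; it only shows why a replayed put is harmless while a replayed increment is counted twice.

```python
class ToyTable:
    """In-memory stand-in for a row-key store (not a real client API)."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, value):
        # Overwrites the value under row_key, so applying it twice changes nothing.
        self.rows[row_key] = value

    def increment(self, row_key, amount):
        # Adds to the stored value, so applying it twice double counts.
        self.rows[row_key] = self.rows.get(row_key, 0) + amount


def replay_twice(table, apply):
    # At-least-once delivery: the same event may arrive more than once.
    event = {"key": "user42", "amount": 5}
    apply(table, event)
    apply(table, event)
    return table.rows[event["key"]]


put_result = replay_twice(ToyTable(), lambda t, e: t.put(e["key"], e["amount"]))
inc_result = replay_twice(ToyTable(), lambda t, e: t.increment(e["key"], e["amount"]))
print(put_result, inc_result)  # 5 10
```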
Ingestion Destinations
• File Systems: example HDFS
• Flume
• Kafka Connect
• Solr
• Flume
• Any Streaming Engine
Ingestion Destinations
• NoSQL: example HBase
• Flume
• Any Streaming Engine: Storm and Spark Streaming Tested
• Kudu
• Flume
• Kafka Connect
• Any Streaming Engine: Spark Streaming Tested
Tricks With Producers
• Send Source ID (requires Partitioning In Kafka)
• Seq
• UUID
• UUID plus time
• Partition on SourceID
• Watch out for repartitions and partition failovers
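The producer tricks above can be sketched as follows. This is a minimal illustration under stated assumptions: the message fields, the `NUM_PARTITIONS` value, and the `Producer` and `DedupingConsumer` classes are all hypothetical, and no real Kafka client is involved. Each message carries a source ID, a per-source sequence number, and a UUID; the partition is derived from the source ID so that one source's messages stay ordered in one partition.

```python
import hashlib
import uuid

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(source_id, num_partitions=NUM_PARTITIONS):
    # Stable hash so every message from one source lands in the same partition.
    digest = hashlib.md5(source_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class Producer:
    def __init__(self, source_id):
        self.source_id = source_id
        self.seq = 0

    def make_message(self, payload):
        self.seq += 1
        return {
            "source_id": self.source_id,
            "seq": self.seq,
            "uuid": str(uuid.uuid4()),
            "partition": partition_for(self.source_id),
            "payload": payload,
        }


class DedupingConsumer:
    """Drops replays by remembering the highest sequence number per source."""

    def __init__(self):
        self.last_seq = {}
        self.accepted = []

    def handle(self, message):
        source, seq = message["source_id"], message["seq"]
        if seq <= self.last_seq.get(source, 0):
            return False  # already seen: a duplicate from at-least-once delivery
        self.last_seq[source] = seq
        self.accepted.append(message["payload"])
        return True


producer = Producer("sensor-1")
m1 = producer.make_message("a")
m2 = producer.make_message("b")

consumer = DedupingConsumer()
for message in (m1, m1, m2):  # m1 is delivered twice
    consumer.handle(message)

print(consumer.accepted)  # ['a', 'b']
```

The sequence check relies on per-source ordering, which is why the messages are partitioned on source ID. After a repartition or partition failover, the `last_seq` state would have to move with the partition, which this sketch does not handle.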
Streaming Engines
• Consumer
• Flume, KafkaConnect
• Storm
• Spark Streaming
• Flink
• Kafka Streams
Consumer: Flume, KafkaConnect
• Simple and Works
• Low latency
• High throughput
• Interceptors
• Transformations
• Alerting
• Ingestion
Storm
• Old Gen
• Low latency
• Low throughput
• At least once
• Around forever
• Topology Based
Spark Streaming
• The Juggernaut
• Higher Latency
• High Throughput
• Exactly Once
• SQL
• MLlib
• Highly used
• Easy to Debug/Unit Test
• Easy to transition from Batch
• Flow Language
• 600 commits in a month and about 100 meetups
Spark Streaming
[Diagram: a DStream as a sequence of RDDs, one per batch. For the first and second batches: Source → Receiver → RDD, then Filter → Count → Print in a single pass.]
Spark Streaming
[Diagram: the same flow with state. Pre-first, first, and second batches: Source → Receiver → RDD partitions, then Filter → Count in a single pass, combined with a stateful RDD (Stateful RDD 1, Stateful RDD 2) before Print.]
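The micro-batch flow in the two Spark Streaming diagrams can be imitated in plain Python. This is a sketch of the idea only and does not use the Spark API; `run_micro_batches` and the sample records are hypothetical. Each batch is filtered and counted in one pass, and the running state from the previous batch is folded into a new state, much like the stateful RDD that each batch produces.

```python
def run_micro_batches(batches, predicate):
    """Filter and count each batch, carrying per-key counts across batches."""
    state = {}    # plays the role of the stateful RDD from the previous batch
    printed = []  # what each batch would "print"
    for batch in batches:
        filtered = [record for record in batch if predicate(record)]
        counts = {}
        for record in filtered:
            counts[record] = counts.get(record, 0) + 1
        # Build a new state from the old state plus this batch's counts.
        state = {
            key: state.get(key, 0) + counts.get(key, 0)
            for key in set(state) | set(counts)
        }
        printed.append((len(filtered), dict(state)))
    return printed


batches = [["error", "info", "error"], ["warn", "error"]]
result = run_micro_batches(batches, lambda record: record != "info")
for batch_count, running_state in result:
    print(batch_count, running_state)
```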
Flink
• "I'm better than Spark, why doesn't anyone use me?"
• Very much like Spark but not as feature rich
• Lower Latency
• Micro Batch -> ABS
• Asynchronous Barrier Snapshotting
• Flow Language
• ~1/6th the commits and meetups
• But Slim loves it
Flink - ABS
[Diagram: an Operator with its Buffer.]
Flink - ABS
[Diagram: two Operator/Buffer pairs. Labels: "Barrier 1A Hit", "Barrier 1B Still Behind".]
Flink - ABS
[Diagram: two Operator/Buffer pairs. Labels: "Both Barriers Hit", "Barrier 1A Hit", "Barrier 1B Still Behind".]
Flink - ABS
[Diagram: two Operator/Buffer pairs. Labels: "Both Barriers Hit", "Barrier is combined and can move on", "Buffer can be flushed out".]
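The barrier steps in the ABS diagrams can be sketched with a toy operator that has two inputs. This is a simplified illustration under stated assumptions, not Flink code: `AligningOperator`, the channel names, and the single-barrier handling are hypothetical. Once a barrier arrives on one input, later records from that input are buffered; when the barrier has arrived on every input, the operator snapshots its state, forwards one combined barrier, and flushes the buffer.

```python
BARRIER = "BARRIER"  # toy marker; real barriers carry a checkpoint ID


class AligningOperator:
    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.blocked = set()  # inputs whose barrier has already arrived
        self.buffer = []      # records held back from blocked inputs
        self.state = 0        # running count of processed records
        self.snapshots = []   # state captured at each completed barrier
        self.output = []      # what flows downstream

    def _process(self, record):
        self.state += 1
        self.output.append(record)

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                # Both barriers hit: snapshot, forward one barrier, flush the buffer.
                self.snapshots.append(self.state)
                self.output.append(BARRIER)
                pending, self.buffer = self.buffer, []
                self.blocked.clear()
                for record in pending:
                    self._process(record)
        elif channel in self.blocked:
            # This input is ahead of the barrier, so hold its records back.
            self.buffer.append(item)
        else:
            self._process(item)


op = AligningOperator(num_inputs=2)
op.receive("A", "a1")
op.receive("A", BARRIER)  # barrier 1A hit
op.receive("A", "a2")     # buffered: barrier 1B is still behind
op.receive("B", "b1")
op.receive("B", BARRIER)  # both barriers hit

print(op.snapshots, op.output)
```

The snapshot contains only the records that arrived before the barrier on each input (`a1` and `b1`), which is the point of aligning the barriers before moving on.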
Kafka Streams
• The new Kid on the Block
• When you only have Kafka
• Low Latency
• High Throughput
• Interesting snapshot approach
• Very Young
• Flow Language
Summary about Engines
• Ingestion
• Flume and KafkaConnect
• Super Real Time and Special
• Consumer
• Counting, MLlib, SQL
• Spark
• Maybe future and cool
• Flink and KafkaStreams
• Odd man out
• Storm
StreamSets Data Collector
Building a Higher Level Tool
Thank you!


Editor's Notes

  • #2 Apache Spark and Apache HBase are an ideal combination for low-latency processing, storage, and serving of entity data. Combining both distributed in-memory processing and non-relational storage enables new near-real-time enrichment use cases and improves the performance of existing workflows. In this talk, we will first describe batch in-memory applications that need to process HBase tables. You'll learn about the importance of data locality between Spark and HBase table data and the impact on performance. Next, we'll look at Spark Streaming applications that leverage HBase for storing state. The ability to update streaming state by key and/or windows enables an array of applications such as near real-time fraud detection. We will conclude with a discussion on current open challenges and future work.