Best practices for streaming applications
O’Reilly Webcast
June 21st/22nd, 2016
Mark Grover | @mark_grover | Software Engineer
Ted Malaska | @TedMalaska | Principal Solutions Architect
2
About the presenters
Ted Malaska
• Principal Solutions Architect at Cloudera
• Done Hadoop for 6 years
– Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner
Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
3
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
4
Goal
5
Understand common use-cases for streaming and their architectures
6
What is streaming?
7
When to stream, and when not to
• Real-time: constant low milliseconds & under
• Near real-time: low milliseconds to seconds, delay in case of failures
• Batch: 10s of seconds or more, re-run in case of failures
9
No free lunch
• Real-time: constant low milliseconds & under
• Near real-time: low milliseconds to seconds, delay in case of failures
• Batch: 10s of seconds or more, re-run in case of failures
“Difficult” architectures, lower latency (real-time end)
“Easier” architectures, higher latency (batch end)
10
Use-cases for
streaming
11
Use-case categories
• Ingestion
• Simple transformations
– Decision (e.g. Anomaly detection)
• Simple counts
– Lambda, etc.
• Advanced usage
– Machine Learning
– Windowing
12
Ingestion &
Transformations
13
What is ingestion?
[Diagram: Source systems -> Streaming engine -> Destination system]
14
But there are multiple sources
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Streaming engine -> Ingest -> Destination system]
15
But..
• Sources, sinks, ingestion channels may go down
• Sources, sinks producing/consuming at different rates (buffering)
• Regular maintenance windows may need to be scheduled
• You need a resilient message broker (pub/sub)
16
Need for a message broker
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system]
17
Kafka
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system]
18
Destination systems
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system]
Most common “destination” is a storage system
19
Architecture diagram with a broker
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Storage system]
20
Streaming engines
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Storage system]
Labels on the diagram: Kafka Connect, Apache Flume, Apache Beam (incubating)
21
Storage options
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Storage system]
Labels on the diagram: Kafka Connect, Apache Flume, Apache Beam (incubating)
22
Semantics
At most once, Exactly once, At least once
23
Semantic types
• At most once
– Not good for many cases
– Only where performance/SLA is more important than accuracy
• Exactly once
– Expensive to achieve but desirable
• At least once
– Easiest to achieve
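The difference between these semantics can be sketched in a few lines. This is a minimal simulation, not any engine's API: `deliver_at_least_once`, `IdempotentSink` and the event IDs are hypothetical names used only for illustration. It shows at-least-once delivery producing a duplicate on retry, and an idempotent sink keyed on event ID absorbing that duplicate so the stored result looks exactly-once.

```python
def deliver_at_least_once(events, fail_on_first_attempt):
    """Yield each event; re-send it when its first attempt was not acknowledged."""
    for event_id, payload in events:
        yield event_id, payload
        if event_id in fail_on_first_attempt:
            # The ack was lost, so the sender retries: the sink sees a duplicate.
            yield event_id, payload


class IdempotentSink:
    """Keeps one record per event ID, so replays overwrite instead of duplicating."""

    def __init__(self):
        self.records = {}

    def write(self, event_id, payload):
        self.records[event_id] = payload


events = [(1, "a"), (2, "b"), (3, "c")]
received = list(deliver_at_least_once(events, fail_on_first_attempt={2}))

sink = IdempotentSink()
for event_id, payload in received:
    sink.write(event_id, payload)

print(len(received))      # 4 deliveries: event 2 arrived twice
print(len(sink.records))  # 3 records: the duplicate was absorbed
```

At-most-once would be the opposite choice: never retry, and accept that event 2 may be lost.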
24
Review
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system]
25
Semantics of our architecture
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system]
Labels on the diagram: At least once (twice), Ordered, Partitioned, It depends (twice)
26
Transforming data
in flight
27
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect, Apache Flume) -> Push -> Storage system]
The streaming ingestion process can be used to do simple transformations
28
Ingestion and/or Transformation
1. Zero Transformation
– No transformation, plain ingest, no schema validation
– Keep the original format - SequenceFiles, Text, etc.
– Allows storing data that may have errors in the schema
2. Format Transformation
– Simply change the format of a field, for example
– Or convert to a structured format, e.g. Avro
– Which does schema validation
3. Enrichment Transformation
– Atomic
– Contextual
29
#3 - Enrichment transformations
Atomic
• Need to work with one event at a
time
• Mask a credit card number
• Add processing time or offset to the
record
Contextual
• Need to refer to external context
• Example - convert zip code to state,
by looking up a cache
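The two kinds of enrichment can be sketched as plain functions. This is an illustrative sketch only: the event fields, the `ZIP_TO_STATE` table and the function names are assumptions, not part of the original deck. The atomic steps need nothing but the event itself, while the contextual step needs a lookup held outside the event.

```python
import time

# Hypothetical lookup table standing in for a cache of reference data.
ZIP_TO_STATE = {"94105": "CA", "10001": "NY"}


def mask_card(event):
    """Atomic: needs only the event itself."""
    card = event["card_number"]
    masked = "*" * (len(card) - 4) + card[-4:]
    return {**event, "card_number": masked}


def add_processing_time(event, now=None):
    """Atomic: stamp the record with the time it was processed."""
    return {**event, "processed_at": time.time() if now is None else now}


def add_state(event, zip_to_state=ZIP_TO_STATE):
    """Contextual: needs a lookup outside the event."""
    return {**event, "state": zip_to_state.get(event["zip"], "UNKNOWN")}


event = {"card_number": "4111111111111111", "zip": "94105"}
enriched = add_state(add_processing_time(mask_card(event), now=0.0))
print(enriched)
```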
30
Atomic transformations
• Require no context
• All streaming engines support it
31
Contextual transformations
• Well supported by many streaming engines
• Need to store the context somewhere.
32
Where to store the context
1. Locally Broadcast Cached Dim Data
– Local to Process (On Heap, Off Heap)
– Local to Node (Off Process)
2. Partitioned Cache
– Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)
33
#1a - Locally broadcast cached data
Could be on heap or off heap
34
#1b - Off process cached data
Data is cached on the node, outside of the process, potentially in an external system like RocksDB
35
#2 - Partitioned cache data
Data is partitioned based on field(s) and then cached
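A partitioned cache can be sketched as one small cache per partition, with events routed by a stable hash of the partitioning field. The partition count, the `user` and `amount` fields and the function names below are assumptions for illustration; a real engine would do the routing as a shuffle.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Stable hash partitioning (Python's built-in hash() is salted per process)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# One cache per partition; each holds only the keys routed to it.
caches = [dict() for _ in range(NUM_PARTITIONS)]


def process(event):
    """Route the event to its partition and update that partition's cached state."""
    p = partition_for(event["user"])
    cache = caches[p]
    cache[event["user"]] = cache.get(event["user"], 0) + event["amount"]
    return p, cache[event["user"]]


for e in [{"user": "alice", "amount": 5}, {"user": "bob", "amount": 2},
          {"user": "alice", "amount": 7}]:
    process(e)
```

Because every event for a given key lands on the same partition, no partition ever needs another partition's cache.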
36
#3 - External fetch
Data fetched from external system
37
A combination (partitioned cache + external)
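One way to read the combination is a read-through cache: look in the local (per-partition) cache first and go to the external system only on a miss. The sketch below is an assumption about how that could look, with a plain dict standing in for an external store such as HBase; `ReadThroughCache` is a hypothetical name.

```python
class ReadThroughCache:
    """Local cache in front of an external store; fetch only on a miss."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable standing in for the external lookup
        self._local = {}
        self.external_calls = 0

    def get(self, key):
        if key not in self._local:
            self.external_calls += 1
            self._local[key] = self._fetch(key)
        return self._local[key]


# Hypothetical external store, modelled as a dict.
external_store = {"94105": "CA", "10001": "NY"}
cache = ReadThroughCache(lambda k: external_store.get(k))

states = [cache.get(z) for z in ["94105", "94105", "10001", "94105"]]
print(states, cache.external_calls)
```

In the combined pattern each partition would own one such cache, so repeated keys stay local and only new keys pay for the external round trip.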
38
Anomaly detection using contextual transformations
39
Storage systems
When to use which one?
40
Storage Considerations
• Throughput
• Access Patterns
– Scanning
– Indexed
– Reversed Indexed
• Transaction Level
– Record/Document
– File
41
File Level
• HDFS
• S3
42
NoSQL
• HBase
• Cassandra
• MongoDB
43
Search
• Solr
• Elasticsearch
44
NoSQL-SQL
• Kudu
45
Streaming engines
Comparison
46© Cloudera, Inc. All rights reserved.
Tricks With Producers
•Send Source ID (requires Partitioning In Kafka)
•Seq
•UUID
•UUID plus time
•Partition on SourceID
•Watch out for repartitions and partition fail overs
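These producer tricks can be sketched as a message envelope plus a consumer-side check. Everything named here (`Producer`, `dedupe`, the envelope fields, the partition count) is a hypothetical illustration, not a Kafka client API. The deduplication relies on all messages from one source arriving in order, which is why partitioning on the source ID matters.

```python
import uuid
import zlib


class Producer:
    """Tags each message with a source ID and a per-source sequence number."""

    def __init__(self, source_id, num_partitions):
        self.source_id = source_id
        self.num_partitions = num_partitions
        self.seq = 0

    def make_message(self, payload):
        self.seq += 1
        return {
            "source_id": self.source_id,
            "seq": self.seq,
            "uuid": str(uuid.uuid4()),
            # Partition on the source ID so one source always maps to one partition.
            "partition": zlib.crc32(self.source_id.encode()) % self.num_partitions,
            "payload": payload,
        }


def dedupe(messages):
    """Drop any message whose seq is not beyond the last one seen for its source."""
    last_seen = {}
    for m in messages:
        if m["seq"] > last_seen.get(m["source_id"], 0):
            last_seen[m["source_id"]] = m["seq"]
            yield m


p = Producer("sensor-1", num_partitions=8)
m1, m2 = p.make_message("x"), p.make_message("y")
replayed = [m1, m2, m2]          # m2 was re-sent after a retry
unique = list(dedupe(replayed))
```

A repartition or partition failover can break the ordering assumption, which is the caveat the last bullet points at.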
47
Streaming Engines
•Consumer
•Flume, KafkaConnect, Streaming Engine
•Storm
•Spark Streaming
•Flink
•Kafka Streams
48
Consumer: Flume, KafkaConnect
•Simple and Works
•Low latency
•High throughput
•Interceptors
•Transformations
•Alerting
•Ingestions
49
Consumer: Streaming Engines
•Not so great at HDFS Ingestion
•But great for record storage systems
•HBase
•Cassandra
•Kudu
•Solr
•Elasticsearch
50
Storm
•Old Gen
•Low latency
•Low throughput
•At least once
•Around forever
•Topology Based
51
Spark Streaming
•The Juggernaut
•Higher Latency
•High Throughput
•Exactly Once
•SQL
•MLlib
•Highly used
•Easy to Debug/Unit Test
•Easy to transition from Batch
•Flow Language
•600 commits in a month and about 100 meetups
52
Spark Streaming
[Diagram: each Source feeds a Receiver that produces an RDD per batch; the RDDs over time form a DStream. For each batch (first, second), Filter -> Count -> Print run in a single pass]
53
Spark Streaming
[Diagram: each Source feeds a Receiver that produces RDD partitions per batch; the RDDs over time form a DStream. For each batch (pre-first, first, second), Filter -> Count -> Print run in a single pass, and a stateful RDD is carried from one batch to the next (Stateful RDD 1 -> Stateful RDD 2)]
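The micro-batch model in these diagrams can be imitated in plain Python. This is a simulation of the idea only, not the Spark Streaming API: batches are cut by size rather than by time, and the `state` dict plays the role of the stateful RDD carried from one batch to the next. The log format and function names are assumptions.

```python
def micro_batches(records, batch_size):
    """Cut a stream into fixed-size batches (stand-in for time-based intervals)."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]


def run(records, batch_size):
    state = {}      # plays the role of the stateful RDD carried between batches
    per_batch = []
    for batch in micro_batches(records, batch_size):
        errors = [r for r in batch if r.startswith("ERROR")]   # Filter
        per_batch.append(len(errors))                          # Count
        for r in errors:                                       # Update state
            host = r.split()[1]
            state[host] = state.get(host, 0) + 1
    return per_batch, state


log = ["ERROR web1", "INFO web1", "ERROR web2", "ERROR web1", "INFO web2"]
per_batch, totals = run(log, batch_size=2)
print(per_batch)  # [1, 2, 0]
print(totals)     # {'web1': 2, 'web2': 1}
```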
54
Flink
•“I’m better than Spark, why doesn’t anyone use me?”
•Very much like Spark but not as feature rich
•Lower Latency
•Micro Batch -> ABS
•Asynchronous Barrier Snapshotting
•Flow Language
•~1/6th the commits and meetups
•But Slim loves it ☺
55
Flink - ABS
Operator
Buffer
56
Flink - ABS
[Diagram: operators, each with a buffer]
• Barrier 1A: hit
• Barrier 1B: still behind
57
Flink - ABS
[Diagram: operators, each with a buffer]
• Barrier 1A: hit
• Barrier 1B: still behind
• Both barriers hit
• Check point
58
Flink - ABS
[Diagram: operators, each with a buffer]
• Both barriers hit
• Check point
• Barrier is combined and can move on
• Buffer can be flushed out
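The barrier alignment shown across these slides can be simulated with a two-input operator. This is a toy model under stated assumptions, not Flink code: the operator just sums integers, inputs are read round-robin, and `BARRIER` and `run_operator` are hypothetical names. Records that arrive on a channel after its barrier are buffered until the other channel's barrier catches up; then the checkpoint is taken and the buffer is flushed.

```python
BARRIER = object()   # marker injected into each input channel


def run_operator(channel_a, channel_b):
    """Sum records from two channels, snapshotting the sum when both barriers arrive."""
    inputs = {"A": list(channel_a), "B": list(channel_b)}
    blocked = set()      # channels whose barrier has arrived
    buffer = []          # records held back from blocked channels
    total = 0
    checkpoint = None
    while inputs["A"] or inputs["B"]:
        for name in ("A", "B"):
            if not inputs[name]:
                continue
            item = inputs[name].pop(0)
            if item is BARRIER:
                blocked.add(name)
            elif name in blocked:
                buffer.append(item)      # post-barrier record: kept out of the snapshot
            else:
                total += item
            if blocked == {"A", "B"}:
                checkpoint = total       # both barriers hit: take the checkpoint
                total += sum(buffer)     # then flush the buffered records
                buffer.clear()
                blocked.clear()
    return checkpoint, total


# Barrier reaches channel A early; channel B is still behind.
checkpoint, total = run_operator([1, BARRIER, 10, 20], [2, 3, 4, BARRIER, 5])
print(checkpoint, total)
```

The checkpoint holds exactly the records that preceded the barrier on both channels (1 + 2 + 3 + 4), while the final total still includes everything.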
59
Kafka Streams
• The new Kid on the Block
• When you only have Kafka
• Low Latency
• High Throughput
• Not exactly once
• Very Young
• Flow Language
• Very different hardware profile than others
• Not widely supported
• Not widely used
• Worries about separation of concerns
60
Summary about Engines
• Ingestion
• Flume and KafkaConnect
• Super Real Time and Special
• Consumer
• Counting, MLlib, SQL
• Spark
• Maybe future and cool
• Flink and KafkaStreams
• Odd man out
• Storm
61
Abstractions
• Code abstractions: Beam
• SQL abstraction: SQL
• UI abstraction: StreamSets
Streaming Engines
62
Counting
63
Streaming and Counting
• Counting is easy, right?
• Back to “only once” (exactly once)
64
We started with Lambda
Pipe
Speed Layer
Batch Layer
Persist Results
Speed Results
Batch Results
Serving Layer
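The serving layer's job in this picture can be sketched as a merge of two views. The page names, counts and the `serve` function are made up for illustration, and the sketch assumes the speed view only covers events the last batch run has not yet processed.

```python
def serve(batch_view, speed_view):
    """Merge precomputed batch counts with counts from events the batch has not covered yet."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


batch_view = {"page_a": 100, "page_b": 40}   # from the last batch run
speed_view = {"page_a": 3, "page_c": 1}      # events since that run
result = serve(batch_view, speed_view)
print(result)
```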
65
Why did Streaming Suck?
• Increments with Cassandra
• Double increment
• No strong consistency
• Storm without Kafka
• Not only once
• Not at least once
• Batch would have to re-process EVERY record to remove dups
66
We have come a long way
• We don’t have to use Increments any more and we can have consistency
• HBase
• We can have state in our streaming platform
• Spark Streaming
• We don’t lose data
• Spark Streaming
• Kafka
• Other options
• Full universe of Deduping
• Again HBase with versions
67
Increments
68
Puts with State
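The contrast between increments and puts with state can be sketched with a replayed batch. This is an illustrative model only: plain dicts stand in for the store (for example HBase) and for the engine's checkpointed state, and the function names are hypothetical. An increment is applied again on every replay, while a put of the absolute value recomputed from checkpointed state writes the same value twice.

```python
def increments(store, batch):
    """Each delivery adds 1 to the stored counter, so a replay double counts."""
    for key in batch:
        store[key] = store.get(key, 0) + 1


def puts_with_state(store, state_before_batch, batch):
    """Recompute counts from the checkpointed state, then write absolute values."""
    state = dict(state_before_batch)
    for key in batch:
        state[key] = state.get(key, 0) + 1
    store.update(state)   # idempotent: writing the same values twice changes nothing
    return state


batch = ["clicks", "clicks", "views"]

inc_store = {}
increments(inc_store, batch)
increments(inc_store, batch)                    # batch replayed after a failure

put_store = {}
checkpoint = {}                                 # engine state before the batch
puts_with_state(put_store, checkpoint, batch)
puts_with_state(put_store, checkpoint, batch)   # same replay

print(inc_store)  # {'clicks': 4, 'views': 2}
print(put_store)  # {'clicks': 2, 'views': 1}
```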
69
Advanced
streaming
When to use which one?
70
Advanced Streaming
• Ad-hoc will identify value
• Ad-hoc will become batch
• The value will demand less latency on batch
• Batch will become Streaming
71
Advanced Streaming
• Requirements for Ideal Batch to Streaming frameworks
• Something that can span both paradigms
• Something that can use the tools of Ad-hoc
• SQL
• MLlib
• R
• Scala
• Java
• Development through a common IDE
• Debugging
• Unit Testing
• Common deployment model
72
Advanced Streaming
• In Spark Streaming
• A DStream is a sequence of RDDs, one per micro-batch interval
• If we can access RDDs in Spark Streaming
• We can convert to Vectors
• KMeans
• Principal component analysis
• We can convert to LabeledPoint
• NaiveBayes
• Random Forest
• Linear Support Vector Machines
• We can convert to DataFrames
• SQL
• R
73
Wrap-up
74
Our original goal
Understand common use-cases for streaming and their architectures
75
Common streaming use-cases
• Ingestion
– Transformation
• Counting
– Lambda, etc.
• Advanced streaming
76
Thank you!
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
@hadooparchbook
hadooparchitecturebook.com
77
Transformations with context

More Related Content

What's hot

Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
ConfluentInc1
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluent
confluent
 
Data Mesh
Data MeshData Mesh
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?
Kai Wähner
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
data-mesh-101.pptx
data-mesh-101.pptxdata-mesh-101.pptx
data-mesh-101.pptx
TarekHamdi8
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
Laurent Leturgez
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
Snowflake Computing
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
ScyllaDB
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Kai Wähner
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
Timothy Spann
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
Adam Doyle
 

What's hot (20)

Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluent
 
Data Mesh
Data MeshData Mesh
Data Mesh
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
data-mesh-101.pptx
data-mesh-101.pptxdata-mesh-101.pptx
data-mesh-101.pptx
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
 
Cloud Migration Workshop
Cloud Migration WorkshopCloud Migration Workshop
Cloud Migration Workshop
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 

Similar to Streaming architecture patterns

Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
DataWorks Summit
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Shravan (Sean) Pabba
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
DataWorks Summit/Hadoop Summit
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
Evans Ye
 
[발표자료] 오픈소스 Pacemaker 활용한 zabbix 이중화 방안(w/ Zabbix Korea Community)
[발표자료] 오픈소스 Pacemaker 활용한 zabbix 이중화 방안(w/ Zabbix Korea Community) [발표자료] 오픈소스 Pacemaker 활용한 zabbix 이중화 방안(w/ Zabbix Korea Community)
[발표자료] 오픈소스 Pacemaker 활용한 zabbix 이중화 방안(w/ Zabbix Korea Community)
동현 김
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
Piotr Pelczar
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
hadooparchbook
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
DataWorks Summit/Hadoop Summit
 
Quick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage ClusterQuick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage Cluster
Patrick Quairoli
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
Alex Moskvin
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
hadooparchbook
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
Pat Patterson
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
Guru Dharmateja Medasani
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
hadooparchbook
 
Apache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling UpApache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling Up
Sander Temme
 
haproxy-150423120602-conversion-gate01.pdf
haproxy-150423120602-conversion-gate01.pdfhaproxy-150423120602-conversion-gate01.pdf
haproxy-150423120602-conversion-gate01.pdf
PawanVerma628806
 
HAProxy
HAProxy HAProxy
HAProxy
Arindam Nayak
 

Similar to Streaming architecture patterns (20)

Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
[발표자료] 오픈소스 Pacemaker 활용한 zabbix 이중화 방안(w/ Zabbix Korea Community)
[발표자료] 오픈소스 Pacemaker 활용한 zabbix 이중화 방안(w/ Zabbix Korea Community) [발표자료] 오픈소스 Pacemaker 활용한 zabbix 이중화 방안(w/ Zabbix Korea Community)
[발표자료] 오픈소스 Pacemaker 활용한 zabbix 이중화 방안(w/ Zabbix Korea Community)
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Quick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage ClusterQuick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage Cluster
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Apache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling UpApache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling Up
 
haproxy-150423120602-conversion-gate01.pdf
haproxy-150423120602-conversion-gate01.pdfhaproxy-150423120602-conversion-gate01.pdf
haproxy-150423120602-conversion-gate01.pdf
 
HAProxy
HAProxy HAProxy
HAProxy
 

More from hadooparchbook

Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
hadooparchbook
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
hadooparchbook
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
hadooparchbook
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
hadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
hadooparchbook
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
hadooparchbook
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
hadooparchbook
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
hadooparchbook
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
hadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
hadooparchbook
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
hadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
hadooparchbook
 

Streaming architecture patterns

  • 1. Best practices for streaming applications. O’Reilly Webcast, June 21st/22nd, 2016. Mark Grover | @mark_grover | Software Engineer. Ted Malaska | @TedMalaska | Principal Solutions Architect
  • 2. About the presenters. Ted Malaska: • Principal Solutions Architect at Cloudera • Done Hadoop for 6 years – Worked with > 70 companies in 8 countries • Previously, lead architect at FINRA • Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark • Marvel fanboy, runner. Mark Grover: • Software Engineer at Cloudera, working on Spark • Committer on Apache Bigtop, PMC member on Apache Sentry (incubating) • Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
  • 3. About the book • @hadooparchbook • hadooparchitecturebook.com • github.com/hadooparchitecturebook • slideshare.com/hadooparchbook
  • 5. Understand common use-cases for streaming and their architectures
  • 7. When to stream, and when not to • Real-time: constant low milliseconds & under • Near real-time: low milliseconds to seconds, delay in case of failures • Batch: 10s of seconds or more, re-run in case of failures
  • 9. No free lunch • Real-time: constant low milliseconds & under • Near real-time: low milliseconds to seconds, delay in case of failures • Batch: 10s of seconds or more, re-run in case of failures • “Difficult” architectures, lower latency (real-time end) versus “easier” architectures, higher latency (batch end)
  • 11. Use-case categories • Ingestion • Simple transformations – Decision (e.g. anomaly detection) • Simple counts – Lambda, etc. • Advanced usage – Machine learning – Windowing
  • 13. What is ingestion? Source systems → Streaming engine → Destination system
  • 14. But there are multiple sources: Source Systems 1, 2 and 3 → Ingest → Streaming engine → Destination system
  • 15. But... • Sources, sinks and ingestion channels may go down • Sources and sinks produce/consume at different rates (buffering) • Regular maintenance windows may need to be scheduled • You need a resilient message broker (pub/sub)
  • 16. Need for a message broker: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming engine → Push → Destination system
  • 17. Kafka: Source Systems 1, 2 and 3 → Ingest → Message broker (Kafka) → Extract → Streaming engine → Push → Destination system
  • 18. Destination systems: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming engine → Push → Destination system. Most common “destination” is a storage system
  • 19. Architecture diagram with a broker: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming engine → Push → Storage system
  • 20. Streaming engines: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming engine → Push → Storage system. Engines shown: Kafka Connect, Apache Flume, Apache Beam (incubating)
  • 21. Storage options: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming engine (Kafka Connect, Apache Flume, Apache Beam (incubating)) → Push → Storage system
  • 22. Semantics: at most once, exactly once, at least once
  • 23. Semantic types • At most once – Not good for many cases – Only where performance/SLA is more important than accuracy • Exactly once – Expensive to achieve but desirable • At least once – Easiest to achieve
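The difference between at most once and at least once often comes down to whether a consumer commits its position before or after processing a record. The sketch below is a toy simulation of that ordering, not any broker's API: `run_consumer`, the in-memory `log`, and the single simulated crash are all hypothetical names and assumptions introduced for illustration.

```python
def run_consumer(log, commit_before_processing, crash_on_offset):
    """Consume `log` from the committed offset, crashing once at `crash_on_offset`.

    Returns the (offset, value) pairs that reached the destination.
    """
    committed = 0      # next offset to read after a (re)start
    delivered = []     # what the destination actually received
    crashed = False
    while committed < len(log):
        offset = committed
        try:
            if commit_before_processing:
                # At most once: commit first, so a crash here loses the record.
                committed = offset + 1
                if offset == crash_on_offset and not crashed:
                    crashed = True
                    raise RuntimeError("crash after commit, before processing")
                delivered.append((offset, log[offset]))
            else:
                # At least once: process first, so a crash here replays the record.
                delivered.append((offset, log[offset]))
                if offset == crash_on_offset and not crashed:
                    crashed = True
                    raise RuntimeError("crash after processing, before commit")
                committed = offset + 1
        except RuntimeError:
            continue   # restart from the last committed offset
    return delivered


log = ["a", "b", "c"]
at_most_once = run_consumer(log, commit_before_processing=True, crash_on_offset=1)
at_least_once = run_consumer(log, commit_before_processing=False, crash_on_offset=1)
# An idempotent destination keyed by offset collapses the replayed record.
effectively_once = dict(at_least_once)
```

With a crash at offset 1, the commit-first run drops record `b`, the process-first run delivers it twice, and writing at-least-once output into a store keyed by offset leaves exactly one copy of each record.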
  • 24. Review: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming engine → Push → Destination system
  • 25. Semantics of our architecture: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming engine → Push → Destination system. Labels on the diagram: At least once, At least once, Ordered, Partitioned, It depends, It depends
  • 27. Streaming architecture for ingestion: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming ingestion process (Kafka Connect, Apache Flume) → Push → Storage system. The streaming ingestion process can be used to do simple transformations
  • 28. Ingestion and/or transformation 1. Zero transformation – No transformation, plain ingest, no schema validation – Keep the original format: SequenceFiles, text, etc. – Allows storing data that may have errors in the schema 2. Format transformation – Simply change the format of a field, for example – Structured format, e.g. Avro – Which does schema validation 3. Enrichment transformation – Atomic – Contextual
  • 29. #3 - Enrichment transformations. Atomic • Need to work with one event at a time • Mask a credit card number • Add processing time or offset to the record. Contextual • Need to refer to external context • Example: convert zip code to state, by looking up a cache
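The two enrichment styles can be sketched with plain functions. This is a minimal illustration under assumptions: the event fields (`card`, `zip`), the `ZIP_TO_STATE` dictionary standing in for the lookup cache, and the function names are all hypothetical and not taken from the slides.

```python
import time

# Hypothetical lookup cache; in practice this context lives in one of the
# cache layouts discussed on the following slides.
ZIP_TO_STATE = {"94105": "CA", "10001": "NY"}


def mask_card(number):
    """Keep only the last four digits of a card number."""
    digits = number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - 4) + digits[-4:]


def atomic_enrich(event, offset, now=None):
    """Atomic: needs nothing beyond the single event (and its offset)."""
    out = dict(event)
    out["card"] = mask_card(event["card"])
    out["offset"] = offset
    out["processed_at"] = time.time() if now is None else now
    return out


def contextual_enrich(event, cache):
    """Contextual: needs external context, here a zip-to-state cache."""
    out = dict(event)
    out["state"] = cache.get(event["zip"], "UNKNOWN")
    return out


event = {"card": "4111 1111 1111 1111", "zip": "94105"}
enriched = contextual_enrich(atomic_enrich(event, offset=42, now=0.0), ZIP_TO_STATE)
```

The atomic step can run anywhere with no shared state, which is why every engine supports it; the contextual step is only as fast and as fresh as the cache it reads.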
  • 30. Atomic transformations • Require no context • All streaming engines support them
  • 31. Contextual transformations • Well supported by many streaming engines • Need to store the context somewhere
  • 32. Where to store the context 1. Locally Broadcast Cached Dim Data – Local to process (on heap, off heap) – Local to node (off process) 2. Partitioned cache – Shuffle to move new data to partitioned cache 3. External fetch data (e.g. HBase, Memcached)
  • 33. #1a - Locally broadcast cached data: could be on heap or off heap
  • 34. #1b - Off-process cached data: data is cached on the node, outside of the process, potentially in an external system like RocksDB
  • 35. #2 - Partitioned cache data: data is partitioned based on field(s) and then cached
  • 36. #3 - External fetch: data fetched from an external system
  • 37. A combination (partitioned cache + external)
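One way to picture the combined layout is a cache split by key, where each partition falls back to an external fetch on a miss. The sketch below is an assumption-laden toy: `PartitionedCache`, the `fetch_external` callback, and the CRC32-based partitioner are hypothetical choices, and a plain dictionary stands in for a store such as HBase or Memcached.

```python
import zlib


class PartitionedCache:
    """Per-partition caches that fall back to an external fetch on a miss."""

    def __init__(self, num_partitions, fetch_external):
        self.partitions = [dict() for _ in range(num_partitions)]
        self.fetch_external = fetch_external   # hypothetical external lookup
        self.external_calls = 0

    def partition_for(self, key):
        # Stable hash so the same key always lands on the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def lookup(self, key):
        cache = self.partitions[self.partition_for(key)]
        if key not in cache:
            self.external_calls += 1
            cache[key] = self.fetch_external(key)
        return cache[key]


external_store = {"94105": "CA", "10001": "NY"}   # stand-in for the external system
cache = PartitionedCache(num_partitions=4, fetch_external=external_store.get)
first = cache.lookup("94105")
second = cache.lookup("94105")   # served from the partition's cache
```

Because each key is owned by exactly one partition, no partition has to hold the full context, and the external system is only hit on the first lookup of a key. Eviction and invalidation are left out of the sketch.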
  • 38. Anomaly detection using contextual transformations
  • 39. Storage systems: when to use which one?
  • 40. Storage considerations • Throughput • Access patterns – Scanning – Indexed – Reverse indexed • Transaction level – Record/document – File
  • 46. © Cloudera, Inc. All rights reserved. Tricks with producers • Send source ID (requires partitioning in Kafka) • Seq • UUID • UUID plus time • Partition on SourceID • Watch out for repartitions and partition failovers
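The producer tricks above can be combined into a simple deduplication scheme: stamp every message with a source ID and a sequence number, partition on the source ID so each source's messages stay in order, and have the consumer drop anything it has already seen. The sketch is illustrative only; `partition_for`, `DedupingConsumer`, and the message fields are hypothetical, and it assumes per-source ordering holds, which is exactly what repartitions and partition failovers can break.

```python
import zlib


def partition_for(source_id, num_partitions):
    """Route every message from one source to the same partition."""
    return zlib.crc32(source_id.encode("utf-8")) % num_partitions


class DedupingConsumer:
    """Drops replays by remembering the highest sequence number per source."""

    def __init__(self):
        self.last_seq = {}
        self.accepted = []

    def handle(self, message):
        source, seq = message["source_id"], message["seq"]
        if seq <= self.last_seq.get(source, -1):
            return False          # duplicate or replayed message
        self.last_seq[source] = seq
        self.accepted.append(message)
        return True


consumer = DedupingConsumer()
stream = [
    {"source_id": "sensor-1", "seq": 0, "value": 10},
    {"source_id": "sensor-1", "seq": 1, "value": 11},
    {"source_id": "sensor-1", "seq": 1, "value": 11},   # producer retry
    {"source_id": "sensor-2", "seq": 0, "value": 99},
]
results = [consumer.handle(m) for m in stream]
```

A UUID per message supports the same idea without relying on ordering, at the cost of remembering every ID seen within some time window.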
  • 47. Streaming engines • Consumer • Flume, Kafka Connect, streaming engine • Storm • Spark Streaming • Flink • Kafka Streams
  • 48. Consumer: Flume, Kafka Connect • Simple and works • Low latency • High throughput • Interceptors • Transformations • Alerting • Ingestions
  • 49. Consumer: streaming engines • Not so great at HDFS ingestion • But great for record storage systems • HBase • Cassandra • Kudu • Solr • Elasticsearch
  • 50. Storm • Old gen • Low latency • Low throughput • At least once • Around forever • Topology based
  • 51. Spark Streaming • The juggernaut • Higher latency • High throughput • Exactly once • SQL • MLlib • Highly used • Easy to debug/unit test • Easy to transition from batch • Flow language • 600 commits in a month and about 100 meetups
  • 52. Spark Streaming (diagram): in each batch a source feeds a receiver, which produces the RDDs that make up a DStream; a single pass then applies Filter, Count and Print. Shown for a first batch and a second batch
  • 53. Spark Streaming (diagram): source, receiver, RDD partitions and a single pass of Filter, Count and Print, shown for a pre-first batch, a first batch and a second batch, with stateful RDDs (Stateful RDD 1, Stateful RDD 2) carried from one batch to the next
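The micro-batch model in these two diagrams can be imitated in plain Python: cut the stream into fixed intervals, run filter and count over each batch, and carry a running total between batches as state. This is a sketch of the idea only and does not use Spark; `micro_batches`, `run`, and the sample events are hypothetical, and unlike a real engine the toy skips intervals that contain no events.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into batches of `interval` seconds."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    # Simplification: intervals with no events produce no batch at all.
    return [batches[k] for k in sorted(batches)]


def run(events, interval, predicate):
    """Per batch: filter, count, then fold the count into carried state."""
    state = 0            # plays the role of the stateful RDD
    results = []
    for batch in micro_batches(events, interval):
        count = sum(1 for value in batch if predicate(value))
        state += count
        results.append((count, state))
    return results


events = [(0.1, "error"), (0.5, "ok"), (1.2, "error"), (1.9, "error"), (2.5, "ok")]
per_batch = run(events, interval=1, predicate=lambda v: v == "error")
for count, total in per_batch:
    print(count, total)
```

Each tuple pairs the count for one batch with the running total so far, which is the same split the diagrams draw between the per-batch pass and the state handed to the next batch.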
  • 54. Flink • “I’m better than Spark, why doesn’t anyone use me?” • Very much like Spark but not as feature rich • Lower latency • Micro batch -> ABS • Asynchronous Barrier Snapshotting • Flow language • ~1/6th the commits and meetups • But Slim loves it ☺
  • 55. Flink - ABS (diagram): an operator with its input buffer
  • 56. Flink - ABS (diagram): operator and buffers; barrier 1A hit, barrier 1B still behind
  • 57. Flink - ABS (diagram): both barriers hit; checkpoint
  • 58. Flink - ABS (diagram): both barriers hit and checkpoint taken; the barrier is combined and can move on; the buffer can be flushed out
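The barrier sequence in these four diagrams can be traced with a small two-input operator. The sketch below is a simplified model of barrier alignment, not Flink code: `AligningOperator`, its channel names, and the running-sum state are hypothetical, and it assumes only one checkpoint is in flight at a time.

```python
class AligningOperator:
    """Two-input operator that aligns checkpoint barriers before snapshotting."""

    def __init__(self, inputs):
        self.inputs = set(inputs)
        self.blocked = set()     # inputs whose barrier has already arrived
        self.buffer = []         # records held back from blocked inputs
        self.state = 0           # operator state: a running sum
        self.snapshots = {}      # checkpoint id -> state at the barrier
        self.output = []

    def on_record(self, channel, value):
        if channel in self.blocked:
            self.buffer.append(value)    # arrived after the barrier: hold it
        else:
            self._process(value)

    def on_barrier(self, channel, checkpoint_id):
        self.blocked.add(channel)
        if self.blocked == self.inputs:  # both barriers hit
            self.snapshots[checkpoint_id] = self.state
            self.output.append(("barrier", checkpoint_id))  # barrier moves on
            self.blocked.clear()
            pending, self.buffer = self.buffer, []
            for value in pending:        # buffer can be flushed out
                self._process(value)

    def _process(self, value):
        self.state += value
        self.output.append(("record", self.state))


op = AligningOperator(["A", "B"])
op.on_record("A", 1)
op.on_barrier("A", checkpoint_id=1)   # barrier 1A hit, 1B still behind
op.on_record("A", 10)                 # buffered, not yet part of the state
op.on_record("B", 2)                  # B is still before its barrier
op.on_barrier("B", checkpoint_id=1)   # both barriers hit: snapshot and flush
```

The snapshot records a state of 3, which covers exactly the records that arrived ahead of the barrier on each input; the buffered record is applied only after the checkpoint, so a restore from it would neither lose nor double-count that record.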
  • 59. Kafka Streams • The new kid on the block • When you only have Kafka • Low latency • High throughput • Not exactly once • Very young • Flow language • Very different hardware profile than others • Not widely supported • Not widely used • Worries about separation of concerns
  • 60. Summary about engines • Ingestion – Flume and Kafka Connect • Super real-time and special – Consumer • Counting, MLlib, SQL – Spark • Maybe future and cool – Flink and Kafka Streams • Odd man out – Storm
  • 61. Abstractions (on top of streaming engines) • Code abstractions: Beam • SQL abstraction: SQL • UI abstraction: StreamSets
  • 63. Streaming and counting • Counting is easy, right? • Back to only once
  • 64. We started with Lambda: Pipe → Speed layer (speed results) and Batch layer (batch results) → Persist results → Serving layer
  • 65. Why did streaming suck? • Increments with Cassandra – Double increment – No strong consistency • Storm without Kafka – Not only once – Not at least once • Batch would have to re-process EVERY record to remove dups
  • 66. We have come a long way • We don’t have to use increments any more and we can have consistency – HBase • We can have state in our streaming platform – Spark Streaming • We don’t lose data – Spark Streaming – Kafka – Other options • Full universe of deduping – Again HBase with versions
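The double-increment problem, and why idempotent writes avoid it, can be shown with two small counters. This is a sketch under assumptions: plain dictionaries stand in for the store, and the function names and event IDs are hypothetical; it does not use the HBase or Cassandra APIs.

```python
def count_with_increments(deliveries):
    """Increment per delivery: a replayed event is counted again."""
    counts = {}
    for _event_id, key in deliveries:
        counts[key] = counts.get(key, 0) + 1
    return counts


def count_with_idempotent_puts(deliveries):
    """One cell per (key, event id): rewriting the same cell changes nothing."""
    cells = {}
    for event_id, key in deliveries:
        cells[(key, event_id)] = 1
    counts = {}
    for key, _event_id in cells:
        counts[key] = counts.get(key, 0) + 1
    return counts


# At-least-once delivery: event e2 arrives twice after a retry.
deliveries = [("e1", "page"), ("e2", "page"), ("e2", "page")]
incremented = count_with_increments(deliveries)
deduplicated = count_with_idempotent_puts(deliveries)
```

The increment-based counter reports 3 for two real events, while the put-based one reports 2, because the retry overwrites a cell that already exists instead of adding to a total.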
  • 70. Advanced streaming • Ad-hoc will identify value • Ad-hoc will become batch • The value will demand less latency on batch • Batch will become streaming
  • 71. Advanced streaming • Requirements for ideal batch-to-streaming frameworks – Something that can span both paradigms – Something that can use the tools of ad-hoc: SQL, MLlib, R, Scala, Java – Development through a common IDE: debugging, unit testing – Common deployment model
  • 72. Advanced streaming • In Spark Streaming – A DStream is a collection of RDDs with respect to micro-batch intervals • If we can access RDDs in Spark Streaming – We can convert to Vectors: KMeans, principal component analysis – We can convert to LabeledPoint: NaiveBayes, random forest, linear support vector machines – We can convert to DataFrames: SQL, R
  • 74. Our original goal: understand common use-cases for streaming and their architectures
  • 75. Common streaming use-cases • Ingestion – Transformation • Counting – Lambda, etc. • Advanced streaming
  • 76. Thank you! Mark Grover | @mark_grover • Ted Malaska | @TedMalaska • @hadooparchbook • hadooparchitecturebook.com