Architectural Patterns for Streaming Applications
Strata+Hadoop World, Singapore – December 02, 2015
tiny.cloudera.com/streaming-singapore
tiny.cloudera.com/streaming-singapore-questions
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
2
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
Questions? tiny.cloudera.com/streaming-singapore-questions
3
About the presenters
Ted Malaska
• Principal Solutions Architect at Cloudera
• Done Hadoop for 6 years
– Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner
Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
4
Goal
5
Understand common use-cases for streaming and their architectures
6
What is streaming?
7
When to stream, and when not to
• We are looking for an SLA sweet spot
• Multiple milliseconds to seconds
• Not minutes
• Not constant low milliseconds or under
• Doesn’t come for free
8
Use-cases for streaming
9
Use-case categories
• Ingestion
– Transformation
– Decision (e.g. Anomaly detection)
• Simple counts
– Lambda, etc.
• Advanced usage
– Machine Learning
– Windowing
10
Ingestion
11
What is ingestion?
[Diagram: Source systems -> Ingest -> Destination system]
12
But there are multiple sources
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Destination system]
13
But..
• Sources, sinks, ingestion channels may go down
• Sources and sinks may be producing/consuming at different rates
• Regular maintenance windows may need to be scheduled
• We need a resilient message broker
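The points above can be illustrated with a minimal in-memory sketch of what a broker provides: producers append at their own pace, and a consumer reads by offset, so it can fall behind or go down and later resume where it left off. `BrokerSketch`, `Topic`, `publish` and `poll` are hypothetical names for illustration only, not the API of Kafka or any other broker, and a real broker adds durability and replication that this sketch omits.

```scala
object BrokerSketch {
  // In-memory stand-in for a broker topic (hypothetical, for illustration).
  final case class Topic(messages: Vector[String] = Vector.empty) {
    // Producers append without waiting for any consumer.
    def publish(message: String): Topic = copy(messages = messages :+ message)

    // Consumers pull up to `max` messages starting at `offset`
    // and get back the offset to resume from next time.
    def poll(offset: Int, max: Int): (Vector[String], Int) = {
      val batch = messages.slice(offset, offset + max)
      (batch, offset + batch.size)
    }
  }
}
```

Because the consumer owns its offset, a slow or restarted sink only needs to remember the last offset it processed.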
14
Need for a message broker
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Destination system]
15
Kafka
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Destination system]
16
But ‘queue’ doesn’t ‘push’
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process -> Push -> Storage system]
17
Streaming data ingestion process
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect, Apache Flume) -> Push -> Storage system]
18
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect, Apache Flume) -> Push -> Storage system]
19
Transforming data in flight
20
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect, Apache Flume) -> Push -> Storage system]
21
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect, Apache Flume; can be used to do simple transformations) -> Push -> Storage system]
22
Two types of transformations
Atomic
• Need to work with one event at a time
• Example: mask a credit card number
With context
• Need to refer to external context
• Example: convert zip code to state by looking up a cache
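The two kinds of transformation can be sketched in plain Scala, independent of Flume, Kafka Connect or Spark Streaming. This is only an illustration: the object name, the masking rule (keep the last four digits) and the contents of the lookup map are assumptions, with the map standing in for a real zip-code cache.

```scala
object Transformations {
  // Atomic: needs only the event itself.
  // Masks every digit except the last four (an assumed masking rule).
  def maskCardNumber(cardNumber: String): String = {
    val digits = cardNumber.filter(_.isDigit)
    ("*" * math.max(0, digits.length - 4)) + digits.takeRight(4)
  }

  // With context: needs data outside the event. This small map is an
  // illustrative stand-in for an external zip-code cache.
  val zipToStateCache: Map[String, String] =
    Map("94105" -> "CA", "10001" -> "NY")

  // Returns None when the zip code is not in the cache.
  def zipToState(zip: String): Option[String] = zipToStateCache.get(zip)
}
```

The atomic function can run anywhere an event passes through, while the contextual one needs its cache to be available to whichever process does the lookup.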
23
Atomic transformations
• Require no context
• Can be done simply within Flume interceptors, Kafka Connect or Spark Streaming
24
Flume Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
25
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect, Apache Flume; can be used to do simple transformations) -> Push -> Storage system]
26
Transformations with context
27
Exactly once, at least once, at most once
(In the context of data ingestion)
28
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Copycat, Apache Flume; can be used to do simple transformations) -> Push -> Storage system]
29
Semantic types
• At most once
– Not good for many cases
– Only where performance/SLA is more important than accuracy
• Exactly once
– Expensive to achieve but desirable
• At least once
– Easiest to achieve
30
Categories of storage systems
“Puts” based
• Can be re-inserted without side effects, since a re-inserted record will have a duplicate key
“Appends” based
• Cannot be re-inserted without side effects
31
How to achieve exactly once?
• For “puts” based storage systems
– At least once is enough (keys have to be unique though, i.e. a primary key)
– Re-inserted records will have duplicate keys
– Will simply overwrite the existing record with the same value
• For “appends” based storage systems (e.g. HDFS)
– Still easiest to do at least once
– Need to de-duplicate before processing
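The difference between the two kinds of storage under at-least-once delivery can be sketched with in-memory stand-ins. The names below (`ExactlyOnce`, `Event`, `putAll`, `appendAll`, `dedupe`) are hypothetical, and the map and vector only imitate a keyed store and an append-only log; they are not the APIs of HBase or HDFS.

```scala
object ExactlyOnce {
  final case class Event(key: String, value: String)

  // "Puts"-based store: a keyed write overwrites any earlier copy,
  // so a redelivered event leaves the store unchanged.
  def putAll(store: Map[String, String], events: Seq[Event]): Map[String, String] =
    events.foldLeft(store)((s, e) => s + (e.key -> e.value))

  // "Appends"-based store: every delivery adds a record, duplicates included.
  def appendAll(log: Vector[Event], events: Seq[Event]): Vector[Event] =
    log ++ events

  // De-duplicate an append log by key before processing, keeping the first copy.
  def dedupe(log: Seq[Event]): Seq[Event] = {
    val (_, unique) = log.foldLeft((Set.empty[String], Vector.empty[Event])) {
      case ((seen, out), e) =>
        if (seen(e.key)) (seen, out) else (seen + e.key, out :+ e)
    }
    unique
  }
}
```

With unique keys, at-least-once delivery into the keyed store gives the same end state as exactly-once, while the append log needs the extra de-duplication pass.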
32
Anomaly detection systems
33
[Diagram: anomaly detection architecture. Components shown: Flume sources with interceptors (rules), Flume channel and Flume sink; Kafka carrying events and alerts; HBase and/or memory store in Hadoop Cluster I for fetching and updating profiles/rules; NRT/stream processing with Spark Streaming, adjusting NRT stats; Hadoop Cluster II with storage (HDFS, HBase) and batch processing (Impala, Map/Reduce, Spark) for automated and manual analytical adjustments, pattern detection and batch-time adjustments; reporting]
34
Counting
35
Streaming and Counting
• Counting is easy, right?
• Back to only once (i.e. exactly once)
36
We started with Lambda
[Diagram: Lambda architecture with a Pipe feeding a Speed Layer and a Batch Layer, persisted Speed Results and Batch Results, and a Serving Layer]
37
Why did Streaming Suck?
• Increments with Cassandra
– Double increment
– No strong consistency
• Storm without Kafka
– Not only once
– Not at least once
• Batch would have to re-process EVERY record to remove dups
38
We have come a long way
• We don’t have to use Increments any more and we can have consistency
– HBase
• We can have state in our streaming platform
– Spark Streaming
• We don’t lose data
– Spark Streaming
– Kafka
– Other options
• Full universe of Deduping
– Again HBase with versions
39
Increments
40
Puts with State
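The contrast between the "Increments" and "Puts with State" approaches can be sketched with plain maps. This is an assumption-laden illustration, not the deck's implementation: `Counting`, `applyIncrements`, `runningTotals` and `applyPuts` are hypothetical names, and the maps stand in for a counter store and for state held by the streaming job.

```scala
object Counting {
  // Increment style: add a micro-batch's counts to whatever is stored.
  // Replaying the same batch adds its counts a second time.
  def applyIncrements(store: Map[String, Long], batch: Map[String, Long]): Map[String, Long] =
    batch.foldLeft(store) { case (s, (key, n)) =>
      s + (key -> (s.getOrElse(key, 0L) + n))
    }

  // Put-with-state style, step 1: the streaming job keeps running totals
  // as its own state and folds each batch into them.
  def runningTotals(state: Map[String, Long], batch: Map[String, Long]): Map[String, Long] =
    applyIncrements(state, batch)

  // Put-with-state style, step 2: write absolute totals to the store.
  // Replaying the same put rewrites the same numbers.
  def applyPuts(store: Map[String, Long], totals: Map[String, Long]): Map[String, Long] =
    store ++ totals
}
```

Writing absolute values makes the store update idempotent, which is why a retried write no longer produces a double increment.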
41
Advanced Streaming
• Ad-hoc will identify value
• Ad-hoc will become batch
• The value will demand less latency on batch
• Batch will become Streaming
42
Advanced Streaming
• Requirements for ideal batch-to-streaming frameworks
– Something that can span both paradigms
– Something that can use the tools of ad-hoc: SQL, MLlib, R, Scala, Java
– Development through a common IDE
– Debugging
– Unit testing
– Common deployment model
43
Spark Streaming Example
import org.apache.spark.{SparkConf, SparkContext}

// Batch word count; `path` is the location of the input text file
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// RDDs have no print(); collect the counts to the driver and print them
wordCounts.collect().foreach(println)
44
Spark Streaming Example
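import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Streaming word count over 1-second micro-batches
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the job is stopped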
45
Advanced usage
46
Advanced Streaming
• In Spark Streaming, a DStream is a sequence of RDDs, one per micro-batch interval
• If we can access RDDs in Spark Streaming, we can convert to:
– Vectors: KMeans, principal component analysis
– LabeledPoint: NaiveBayes, Random Forest, linear Support Vector Machines
– DataFrames: SQL, R
47
Wrap-up
48
Understand common use-cases for streaming and their architectures
Our original goal
49
Common streaming use-cases
• Ingestion
– Transformation
– Decision (e.g. Anomaly detection)
• Simple counts
– Lambda, etc.
• Advanced usage
– Machine Learning
– Windowing
50
Free books!
• Book signings
– Wednesday (today), 5:30 PM at O’Reilly booth
– Thursday (tomorrow), 3:15 PM at Cloudera booth
• Please leave us a review!
51
Stay in touch!
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
@hadooparchbook
tiny.cloudera.com/streaming-singapore
tiny.cloudera.com/streaming-singapore-questions
hadooparchitecturebook.com

 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Product School
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsExpeed Software
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2DianaGray10
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...CzechDreamin
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...Elena Simperl
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesThousandEyes
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsPaul Groth
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...Product School
 

Recently uploaded (20)

Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 

Architectural Patterns for Streaming Applications

  • 1. Architectural Patterns for Streaming applications Strata+Hadoop World, Singapore – December 02, 2015 tiny.cloudera.com/streaming-singapore tiny.cloudera.com/streaming-singapore-questions Mark Grover | @mark_grover Ted Malaska | @TedMalaska
  • 2. About the book • @hadooparchbook • hadooparchitecturebook.com • github.com/hadooparchitecturebook • slideshare.com/hadooparchbook Questions? tiny.cloudera.com/streaming-singapore-questions
  • 3. About the presenters Ted Malaska: • Principal Solutions Architect at Cloudera • Done Hadoop for 6 years – Worked with > 70 companies in 8 countries • Previously, lead architect at FINRA • Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark • Marvel fan boy, runner Mark Grover: • Software Engineer at Cloudera, working on Spark • Committer on Apache Bigtop, PMC member on Apache Sentry (incubating) • Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
  • 5. Understand common use-cases for streaming and their architectures
  • 7. When to stream, and when not to • We are looking for an SLA sweet spot • Multiple milliseconds to seconds • Not minutes • Not constant low milliseconds or under • Doesn’t come for free
  • 9. Use-case categories • Ingestion – Transformation – Decision (e.g. Anomaly detection) • Simple counts – Lambda, etc. • Advanced usage – Machine Learning – Windowing
  • 11. What is ingestion? [Diagram: Source systems → Ingest → Destination system]
  • 12. But there are multiple sources [Diagram: Source Systems 1, 2 and 3 → Ingest → Destination system]
  • 13. But... • Sources, sinks, ingestion channels may go down • Sources and sinks may be producing/consuming at different rates • Regular maintenance windows may need to be scheduled • We need a resilient message broker
  • 14. Need for a message broker [Diagram: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Destination system]
  • 15. Kafka [Diagram: Source Systems 1, 2 and 3 → Ingest → Message broker (Kafka) → Extract → Destination system]
  • 16. But a ‘queue’ doesn’t ‘push’ [Diagram: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming ingestion process → Push → Storage system]
  • 17. Streaming data ingestion process [Diagram: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming ingestion process (Kafka Connect or Apache Flume) → Push → Storage system]
  • 21. Streaming architecture for ingestion [Diagram: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming ingestion process (Kafka Connect or Apache Flume) → Push → Storage system. Annotation: the streaming ingestion process can be used to do simple transformations]
  • 22. Two types of transformations Atomic • Need to work with only one event at a time • Example: mask a credit card number With context • Need to refer to external context • Example: convert a zip code to a state by looking it up in a cache
  • 23. Atomic transformations • Require no context • Can simply be done within Flume interceptors, Kafka Connect or Spark Streaming
  • 24. Flume Interceptors • Mask fields • Validate information against an external source • Extract fields • Modify data format • Filter or split events
  • 26. Transformations with context
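To make the two kinds of transformation from slide 22 concrete, here is a minimal plain-Scala sketch. It is not Flume interceptor or Kafka Connect code; the object name, the function names and the in-memory `zipToState` map (standing in for the external cache) are all assumptions made for illustration.

```scala
object Transformations {
  // Atomic transformation: depends only on the event itself.
  // Masks every digit except the last four and leaves separators untouched.
  def maskCardNumber(card: String): String = {
    val totalDigits = card.count(_.isDigit)
    var seen = 0
    card.map { c =>
      if (c.isDigit) {
        seen += 1
        if (seen <= totalDigits - 4) '*' else c
      } else c
    }
  }

  // Transformation with context: needs data that lives outside the event.
  // zipToState is a hypothetical stand-in for an external lookup cache.
  def stateForZip(zip: String, zipToState: Map[String, String]): Option[String] =
    zipToState.get(zip)
}
```

The atomic function can run anywhere an event passes through, while the contextual one forces a decision about where the lookup data lives and how it is kept fresh.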
  • 27. Exactly once, at least once, at most once (In the context of data ingestion)
  • 28. Streaming architecture for ingestion [Diagram: Source Systems 1, 2 and 3 → Ingest → Message broker → Extract → Streaming ingestion process (Copycat, the earlier name for Kafka Connect, or Apache Flume) → Push → Storage system. Annotation: the streaming ingestion process can be used to do simple transformations]
  • 29. Semantic types • At most once – Not good for many cases – Only where performance/SLA is more important than accuracy • Exactly once – Expensive to achieve but desirable • At least once – Easiest to achieve
  • 30. Categories of storage systems “Puts”-based • Records can be re-inserted without side effects, since a re-inserted record has the same key “Appends”-based • Records cannot be re-inserted without creating duplicates
  • 31. How to achieve exactly once? • For “puts”-based storage systems – At least once is enough (keys have to be unique though, i.e. a primary key) – Re-inserted records will have duplicate keys – Will simply overwrite the existing record with the same value • For “appends”-based storage systems (e.g. HDFS) – Still easiest to do at least once – Need to de-duplicate before processing
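The difference between the two storage categories can be sketched in plain Scala. This is an in-memory model under stated assumptions, not HBase or HDFS client code: `Event`, the function names and the use of a `Map` and a `Vector` as stand-ins for the stores are hypothetical.

```scala
object DeliverySemantics {
  // A hypothetical event carrying a unique key (for example a primary key).
  final case class Event(key: String, value: String)

  // "Puts"-based store: writing the same key again overwrites the record,
  // so at-least-once delivery ends up looking like exactly once.
  def writeWithPuts(delivered: Seq[Event]): Map[String, String] =
    delivered.foldLeft(Map.empty[String, String]) { (store, e) =>
      store.updated(e.key, e.value)
    }

  // "Appends"-based store: every delivery adds a record, duplicates included.
  def writeWithAppends(delivered: Seq[Event]): Vector[Event] =
    delivered.toVector

  // De-duplicate appended records by key before processing,
  // keeping the first copy of each key.
  def deduplicate(appended: Seq[Event]): Vector[Event] = {
    val (_, kept) =
      appended.foldLeft((Set.empty[String], Vector.empty[Event])) {
        case ((seen, out), e) =>
          if (seen(e.key)) (seen, out) else (seen + e.key, out :+ e)
      }
    kept
  }
}
```

Feeding both writers the same redelivered sequence shows why at least once is enough for a keyed store, while an append-only store needs the extra de-duplication pass.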
  • 33. [Architecture diagram; labels: Hadoop Cluster I, Hadoop Cluster II, Flume (Source) with Interceptor (Rules), Flume Channel, Flume (Sink), Kafka (Events, Alerts/Events), HBase and/or Memory Store, Fetching & Updating Profiles/Rules, NRT/Stream Processing, Spark Streaming, Adjusting NRT stats, Storage (HDFS, HBase), Batch Processing (Impala, Map/Reduce, Spark), Automated & Manual Analytical Adjustments and Pattern detection, Batch Time Adjustments, Reporting]
  • 35. Streaming and Counting • Counting is easy, right? • Back to exactly once
  • 36. We started with Lambda [Diagram: a Pipe feeding a Speed Layer and a Batch Layer; Persist Results; Speed Results and Batch Results combined in a Serving Layer]
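As a rough illustration of the serving layer in the Lambda diagram, the sketch below merges per-key counts from the two layers. It assumes, purely for illustration, that the speed results cover only events the batch layer has not yet processed; `LambdaServing` and `merge` are hypothetical names.

```scala
object LambdaServing {
  // Combine batch results with speed results for every key seen by either layer.
  // Assumption: the two inputs never count the same event twice.
  def merge(batch: Map[String, Long], speed: Map[String, Long]): Map[String, Long] =
    (batch.keySet ++ speed.keySet).map { key =>
      key -> (batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L))
    }.toMap
}
```

If that assumption fails, for instance because the speed layer replays events, the merged counts drift, which is the problem the next slide describes.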
  • 37. Why did Streaming Suck? • Increments with Cassandra – Double increment – No strong consistency • Storm without Kafka – Not exactly once – Not at least once • Batch would have to re-process EVERY record to remove dups
  • 38. We have come a long way • We don’t have to use increments any more and we can have consistency – HBase • We can have state in our streaming platform – Spark Streaming • We don’t lose data – Spark Streaming – Kafka • Other options – Full universe of de-duping – Again HBase with versions
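The double-increment problem, and why moving away from increments helps, can be modelled in a few lines of plain Scala. This is only an in-memory sketch: `Event`, `applyIncrements`, `applyPuts` and `counts` are hypothetical names, and the `Map` of id sets merely stands in for a keyed store that keeps every event id it has seen.

```scala
object ReplaySafeCounting {
  // A hypothetical event with a unique id and the key being counted.
  final case class Event(id: String, key: String)

  // Increment-style update: replaying the same events counts them again.
  def applyIncrements(counts: Map[String, Long], events: Seq[Event]): Map[String, Long] =
    events.foldLeft(counts) { (c, e) =>
      c.updated(e.key, c.getOrElse(e.key, 0L) + 1L)
    }

  // Put-style update: record each event id under its key.
  // Replaying an event re-adds an id that is already present, so nothing changes.
  def applyPuts(seen: Map[String, Set[String]], events: Seq[Event]): Map[String, Set[String]] =
    events.foldLeft(seen) { (s, e) =>
      s.updated(e.key, s.getOrElse(e.key, Set.empty[String]) + e.id)
    }

  // The count for a key is the number of distinct event ids stored under it.
  def counts(seen: Map[String, Set[String]]): Map[String, Long] =
    seen.map { case (key, ids) => key -> ids.size.toLong }
}
```

Applying the same batch twice doubles the increment-based count but leaves the put-based count unchanged, at the cost of storing every event id.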
  • 41. Advanced Streaming • Ad-hoc analysis will identify value • Ad-hoc will become batch • The value will demand less latency on batch • Batch will become streaming
  • 42. Advanced Streaming • Requirements for ideal batch-to-streaming frameworks – Something that can span both paradigms – Something that can use the tools of ad-hoc analysis: SQL, MLlib, R, Scala, Java – Development through a common IDE: debugging, unit testing – Common deployment model
  • 43. Spark Streaming Example
      import org.apache.spark.{SparkConf, SparkContext}

      // Batch word count; `path` is the input location, defined elsewhere
      val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
      val sc = new SparkContext(conf)
      val lines = sc.textFile(path, 2)
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      // RDDs have no print(); collect the results to the driver and print them
      wordCounts.collect().foreach(println)
  • 44. Spark Streaming Example
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // Streaming word count over 1-second micro-batches
      val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
      val ssc = new StreamingContext(conf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()             // start receiving and processing data
      ssc.awaitTermination()  // keep the application running
  • 46. Advanced Streaming • In Spark Streaming – A DStream is a collection of RDDs, one per micro-batch interval • If we can access RDDs in Spark Streaming – We can convert to Vectors: KMeans, Principal component analysis – We can convert to LabeledPoint: NaiveBayes, Random Forest, Linear Support Vector Machines – We can convert to DataFrames: SQL, R
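The idea that a stream is handled as a sequence of small batches, each processed like an ordinary collection, can be modelled without Spark. The sketch below is plain Scala, not the DStream API: `Event`, `toBatches` and `meanPerBatch` are hypothetical names, and it assumes each event's timestamp is its arrival time.

```scala
object MicroBatches {
  // A hypothetical event with an arrival time in milliseconds and a numeric value.
  final case class Event(timestampMs: Long, value: Double)

  // Group events into consecutive intervals of batchMs,
  // keyed by the start time of each interval.
  def toBatches(events: Seq[Event], batchMs: Long): Map[Long, Seq[Event]] =
    events.groupBy(e => (e.timestampMs / batchMs) * batchMs)

  // Per-batch processing, analogous to working on one RDD at a time:
  // here, the mean value of the events in each interval.
  def meanPerBatch(batches: Map[Long, Seq[Event]]): Map[Long, Double] =
    batches.map { case (start, es) => start -> es.map(_.value).sum / es.size }
}
```

Once each interval is an ordinary collection, any batch-style computation can be applied to it, which is the point the slide makes about reusing Vectors, LabeledPoint and DataFrames on the RDDs inside a DStream.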
  • 48. Our original goal: Understand common use-cases for streaming and their architectures
  • 49. Common streaming use-cases • Ingestion – Transformation – Decision (e.g. Anomaly detection) • Simple counts – Lambda, etc. • Advanced usage – Machine Learning – Windowing
  • 50. Free books! • Book signings – Wednesday (today), 5:30 PM at O’Reilly booth – Thursday (tomorrow), 3:15 PM at Cloudera booth • Please leave us a review!
  • 51. Stay in touch! Mark Grover | @mark_grover Ted Malaska | @TedMalaska @hadooparchbook tiny.cloudera.com/streaming-singapore tiny.cloudera.com/streaming-singapore-questions hadooparchitecturebook.com