Have your Cake and Eat it Too - Architecture for Batch and Real-time processing

Have Your Cake and Eat It Too
Architectures for Batch and Stream Processing

2
Stuff We’ll Talk About
• Why do we need both streams and batches?
• Why is it a problem?
• Stream-Only Patterns (i.e. Kappa Architecture)
• Lambda-Architecture Technologies
– SummingBird
– Apache Spark
– Apache Flink
– Bring-your-own-framework

3
About Me
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap
©2014 Cloudera, Inc. All rights reserved.

4
Why Streaming and Batch

5
Batch Processing
• Store data somewhere
• Read large chunks of data
• Do something with data
• Sometimes store results

6
Batch Examples
• Analytics
• ETL / ELT
• Training machine learning models
• Recommendations

7
Stream Processing
• Listen to incoming events
• Do something with each event
• Maybe store events / results

8
Stream Processing Examples
• Anomaly detection, alerts
• Monitoring, SLAs
• Operational intelligence
• Analytics, dashboards
• ETL

9
Streaming & Batch
Alerts
Monitoring, SLAs
Operational Intelligence
Risk Analysis
Anomaly detection
Analytics
ETL

10
Four Categories
• Streams Only
• Batch Only
• Can be done in both (ETL, some analytics)
• Must be done in both

11
ETL
Most stream processing projects I see involve a few simple transformations.
• Currency conversion
• JSON to Avro
• Field extraction
• Joining a stream to a static data set
• Aggregate on window
• Identifying change in trend
• Document indexing
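A few of the transformations listed above can be sketched with plain Scala collections. This is only an illustration: the `Purchase` event type, the `ratesToUsd` lookup table and its values, and the window size are all hypothetical and do not come from the talk. The same functions work whether the `Iterator` is backed by a file (batch) or by a socket or queue (stream).

```scala
// Hypothetical event type, used only for illustration.
final case class Purchase(userId: String, currency: String, amount: BigDecimal, minute: Int)

object SimpleTransformations {
  // Static data set joined to the stream: exchange rates to USD (made-up values).
  val ratesToUsd: Map[String, BigDecimal] =
    Map("USD" -> BigDecimal(1), "EUR" -> BigDecimal(2))

  // Currency conversion via a join to the static data set.
  // Events with an unknown currency yield None and are dropped by the caller.
  def toUsd(p: Purchase): Option[Purchase] =
    ratesToUsd.get(p.currency).map(rate => p.copy(currency = "USD", amount = p.amount * rate))

  // Aggregate on window: total USD amount per tumbling window of `windowMinutes`.
  def totalsPerWindow(events: Iterator[Purchase], windowMinutes: Int): Map[Int, BigDecimal] =
    events.flatMap(p => toUsd(p).iterator).foldLeft(Map.empty[Int, BigDecimal]) { (acc, p) =>
      val window = p.minute / windowMinutes
      acc.updated(window, acc.getOrElse(window, BigDecimal(0)) + p.amount)
    }
}
```

Each step looks at one event at a time and keeps only a small running total, which is why this kind of ETL fits either processing style.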

12
Batch || Streaming
Batch:
• Efficient:
– Lower CPU utilization
– Better network and disk throughput
– Fewer locks and waits
• Easier administration
• Easier integration with RDBMS
• Existing expertise
• Existing tools
Streaming:
• Real-time information

13
The Problem

14
We Like
• Efficiency
• Scalability
• Fault Tolerance
• Recovery from errors
• Experimenting with different approaches
• Debuggers
• Cookies

15
But…
We don’t like maintaining two applications that do the same thing

16
Do we really need to maintain the same app twice?
Yes, because:
• We are not sure about requirements
• We sometimes need to re-process with very high efficiency
Not really:
• Different apps for batch and streaming
• Can re-process with streams
• Can error-correct with streams
• Can maintain one code-base for batches and streams

17
Stream-Only Patterns (Kappa Architecture)

18
DWH Example
[Diagram: data from the OLTP DB and from sensors and logs is loaded into the DWH by two applications, App 1 (stream processing) and App 2 (occasional load). DWH contents: partitioned fact table, real-time fact tables, dimensions, views, aggregates.]

19
We need to fix older data
[Diagram: log offsets 0 to 13. Labels: Streaming App v1, Streaming App v2, Real-Time Table, Replacement Partition, Partitioned Fact Table.]

20
[Diagram, next step: log offsets 0 to 13. Labels: Streaming App v1, Streaming App v2, Real-Time Table, Replacement Partition, Partitioned Fact Table.]

21
[Diagram, final step: log offsets 0 to 13. Only Streaming App v2 and the Real-Time Table are shown.]
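The reprocessing idea behind these slides (fix older data by running a new version of the streaming app over the retained log) can be sketched as follows. This is a minimal model under stated assumptions, not the speaker's implementation: the log is a plain `Vector`, the "tables" are immutable maps, and the `v1` bug (case-sensitive counting) is invented for the example.

```scala
object Reprocess {
  // Hypothetical retained log: the position in the Vector plays the role of the offset.
  type Log = Vector[String]
  type Table = Map[String, Long]

  // Streaming App v1 (invented bug): counts keys case-sensitively.
  def v1(table: Table, event: String): Table =
    table.updated(event, table.getOrElse(event, 0L) + 1L)

  // Streaming App v2: same logic with the bug fixed.
  def v2(table: Table, event: String): Table = {
    val key = event.toLowerCase
    table.updated(key, table.getOrElse(key, 0L) + 1L)
  }

  // Replay the log from `fromOffset` through one version of the app,
  // building a fresh table from scratch.
  def replay(log: Log, fromOffset: Int)(app: (Table, String) => Table): Table =
    log.drop(fromOffset).foldLeft(Map.empty[String, Long])(app)
}
```

A possible usage: keep serving the table built by `v1`, run `replay(log, 0)(Reprocess.v2)` into a replacement table, and switch readers to the replacement once `v2` has caught up with the end of the log. After the switch, `v1` and its table can be retired.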

22
Lambda-Architecture Technologies

23
WordCount in Scala
source.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_+_)
.print()

25
MapReduce was great because…
Very simple abstraction:
- Map
- Shuffle
- Reduce
- Type-safe
And it has simpler abstractions on top.

26
SummingBird
• Multi-stage MapReduce
• Runs on Hadoop, Spark, Storm
• Very easy to combine batch and streaming results

27
API
• Platform – Storm, Scalding, Spark…
• Producer.source(Platform) <- get data
• Producer – collection of events
• Transformations – map, filter, merge, leftJoin (lookup)
• Output – write(sink), sumByKey(store)
• Store – contains aggregate for each key, and reduce operation

28
Associative Reduce
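Why an associative reduce matters can be shown with plain Scala maps: if the per-key reduce is associative, partial aggregates computed by a batch job and by a streaming job can be merged in any grouping and still give the same answer. The sketch below is an illustration only and does not use the Summingbird API; `Counts`, `merge`, and `count` are hypothetical names.

```scala
object AssociativeReduce {
  type Counts = Map[String, Long]

  // Merge two per-key aggregates by summing values. Addition is associative,
  // so partial results can be grouped in any way before merging.
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  // Count words by merging one single-entry aggregate per word.
  def count(words: Seq[String]): Counts =
    words.foldLeft(Map.empty[String, Long])((acc, w) => merge(acc, Map(w -> 1L)))
}
```

With this property, a store can hold the batch result for older data and fold in streaming results for recent data, without caring about the order in which the pieces arrive.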

29
WordCount SummingBird
def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.sumByKey(store)
val stormTopology = Storm.remote("stormName").plan(wordCount)
val hadoopJob = Scalding("scaldingName").plan(wordCount)

31
First, there was the RDD
• Spark is its own execution engine
• With high-level API
• RDDs are sharded collections
• Can be mapped, reduced, grouped, filtered, etc.

32
Spark Streaming
[Diagram: a DStream shown as a series of RDDs, one per batch interval (pre-first batch, first batch, second batch). In each interval: Source -> Receiver -> RDD, then a single pass of Filter -> Count -> Print.]

33
Spark Streaming
[Diagram: the same DStream pipeline with state. In each batch interval (pre-first batch, first batch, second batch): Source -> Receiver -> RDD, then a single pass of Filter -> Count. The counts feed stateful RDDs (Stateful RDD 1, Stateful RDD 2), which are then printed.]

34
Compared to SummingBird
Differences:
• Micro-batches
• Completely new execution model
• Real joins
• Reduce is not limited to monoids
• Spark Streaming has a richer API
• Summingbird can aggregate batch and stream to one dataset
• Spark Streaming runs in a debugger
Similarities:
• Almost the same code will run in batch and streams
• Use of Scala
• Use of functional programming concepts

35
Spark Example
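import org.apache.spark.{SparkConf, SparkContext}
// An application name is required, or SparkContext fails to start.
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2) // path: input location, defined elsewhere
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// RDDs have no print(); collect the results to the driver and print them there.
wordCounts.collect().foreach(println)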

36
Spark Streaming Example
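import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// An application name is required, or the context fails to start.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1)) // one-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _) // counts within each micro-batch
wordCounts.print()
ssc.start()
ssc.awaitTermination() // keep running until the stream is stopped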

38
Execution Model
You don’t want to know.

39
Flink vs SparkStreaming
Differences:
• Flink is event-by-event streaming; events go through a pipeline
• Spark Streaming has good integration with HBase as a state store
• “checkpoint barriers”
• Optimization based on strong typing
• Flink is newer than Spark Streaming, so there is less production experience
Similarities:
• Very similar APIs
• Built-in stream-specific operators (windows)
• Exactly-once guarantees through checkpoints of offsets and state (Flink is limited to small state for now)

40
WordCount Batch
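import org.apache.flink.api.scala._
val env = ExecutionEnvironment.getExecutionEnvironment
val text = getTextDataSet(env) // helper that loads the input, defined elsewhere
// Split on runs of non-word characters and drop empty tokens.
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("Wordcount Example")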

41
WordCount Streaming
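// socketTextStream is defined on the streaming environment, not the batch one.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream(host, port) // host, port: defined elsewhere
// Split on runs of non-word characters and drop empty tokens.
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("Wordcount Example")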

43
If the requirements are simple…

44
How difficult is it to parallelize transformations?
Simple transformations are simple.

45
Just add Kafka
Kafka is a reliable data source
You can read:
• Batches
• Microbatches
• Streams
Also allows for re-partitioning
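The point that one replayable log can be read as a batch, as micro-batches, or as a stream can be sketched without any Kafka dependency. In the example below a `Vector` stands in for a single partition, and `transform` is a placeholder for a simple transformation. Both are assumptions made for illustration, and no real Kafka client API is shown.

```scala
object ReadModes {
  // Hypothetical stand-in for one partition: an ordered, replayable sequence of events.
  val log: Vector[Int] = Vector(3, 1, 4, 1, 5, 9, 2, 6)

  // The same simple transformation is applied in every mode.
  def transform(x: Int): Int = x * 2

  // Batch: read everything retained so far in one pass.
  def asBatch: Vector[Int] = log.map(transform)

  // Micro-batches: read fixed-size chunks and process each chunk as a small batch.
  def asMicroBatches(size: Int): Vector[Int] =
    log.grouped(size).flatMap(chunk => chunk.map(transform)).toVector

  // Stream: handle one event at a time as it is read.
  def asStream: Vector[Int] = {
    val out = Vector.newBuilder[Int]
    log.iterator.foreach(event => out += transform(event))
    out.result()
  }
}
```

Because the transformation is applied per event, all three modes produce the same output. The choice between them is then a latency and throughput decision rather than a change in application logic.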

46
Cluster management
• Managing cluster resources used to be difficult
• Now:
– YARN
– Mesos
– Docker
– Kubernetes

47
So your app should…
• Allocate resources and track tasks with YARN / Mesos
• Read from Kafka (however often you want)
• Do simple transformations
• Write to Kafka / HBase
• How difficult can it possibly be?

48
Parting Thoughts

49
Good engineering lessons
• DRY – do you really need same code twice?
• Error correction is critical
• Reliability guarantees are critical
• Debuggers are really nice
• Latency / Throughput trade-offs
• Use existing expertise
• Stream processing is about patterns
