Trivento summercamp fast data 9/9/2016

What is Fast Data & How the SMACK
stack plays a major role in implementing
a Fast Data strategy
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
De oude Prodentfabriek
Trivento Summercamp 2016
Amersfoort

Introduction
Introduction: Who Am I?

Agenda
A bit of history: Big Data Processing
What is Fast Data?
Batch Systems vs Streaming Systems & Streaming Evolution
Event Log, Message Queues
A Fast Data Architecture
Example Application

Data Processing
Batch processing: processing done on a bounded dataset.
Stream processing (streaming): processing done on an unbounded dataset.
Data items are pushed or pulled.
Two categories of systems: batch vs streaming systems.

Big Data - The story
Internet scale apps moved data size from Gigabytes to Petabytes.
Once upon a time there were traditional RDBMS like Oracle and Data
Warehouses but volume, velocity and variety changed the game.

MapReduce was a major breakthrough (Google published the seminal paper in
2004).
The Nutch project already had an implementation in 2005.
In 2006 it became a subproject of Lucene under the name Hadoop.
In 2008 Yahoo brought Hadoop to production on a 10K-core cluster. The same year it
became a top-level Apache project.
Hadoop is good for batch processing.

Word Count example - Inverted Index.
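[Figure: MapReduce word count. Documents (doc1, doc2, ... doc300, ...) are divided into Split 1 ... Split N. Map emits (word, 1) pairs such as (w1,1), (w20,1), (w41,1). Shuffle groups the pairs by word, e.g. (w1, (1,1,1…)) and (w41, (1,1,…)). Reduce sums each group into a count per word, e.g. (w1, 13).]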
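The map, shuffle and reduce steps of the word count can be sketched with plain Scala collections. This is only an in-memory illustration, not Hadoop code; the object and method names (WordCountSketch, mapPhase, shuffle, reducePhase) are made up for this sketch.

```scala
object WordCountSketch {
  // Map phase: emit a (word, 1) pair for every word in a document.
  def mapPhase(doc: String): Seq[(String, Int)] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: group all emitted counts by word.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: sum the grouped counts of each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // The whole job: map every document, shuffle, then reduce.
  def wordCount(docs: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(docs.flatMap(mapPhase)))
}
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data over the network; here all three steps run in one process.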

Giuseppe DeCandia et al., "Dynamo: Amazon's Highly Available Key-value
Store", changed the database world in 2007.
NoSQL databases, along with general-purpose systems like Hadoop, solve problems
that cannot be solved with traditional RDBMSs.
Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs
over more powerful CPUs.

There is a major shift in the industry as batch processing is not enough any
more.
Batch jobs usually take hours if not days to complete; in many applications that
is not acceptable.

The trend now is near-real-time computation, which implies streaming
algorithms and needs new semantics: Fast Data (data in motion) & Big
Data (data at rest) at the same time.
The enterprise needs to get smarter: all major players across industries
use ML on top of massive datasets to make better decisions.
Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530
https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg

OpsClarity report:
92% plan to increase their investment in stream processing applications in the
next year
79% plan to reduce or eliminate investment in batch processing
32% use real time analysis to power core customer-facing applications
44% agreed that it is tedious to correlate issues across the pipeline
68% identified lack of experience and underlying complexity of new data
frameworks as their barrier to adoption
http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html

Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html

In the OpsClarity report:
● Apache Kafka is the most popular broker technology (ingestion queue).
● HDFS is the most used data sink.
● Apache Spark is the most popular data processing tool.

Big Data Landscape
Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png

Big Data System
A Big Data System must have at least the following components at its core:
A distributed file system (DFS) such as S3 or HDFS, or a distributed database system (DDS).
A distributed data processing tool such as Spark or MapReduce.
Tools and services to manage the previous systems.

Batch Systems - The Hadoop Ecosystem
YARN (Yet Another Resource Negotiator) was deployed in production at Yahoo in
March 2013.
The same year Cloudera, the dominant Hadoop vendor, embraced Spark as the
next-generation replacement for MapReduce.
Image: Lightbend Inc.

Hadoop clusters, the gold standard for big data from ~2008 to the present (started back in 2005).
Strengths:
Lowest CapEx system for Big Data.
Excellent for ingesting and integrating diverse datasets.
Flexible: from classic analytics (aggregations and data warehousing) to machine learning.

Weaknesses:
Complex administration.
YARN can’t manage all distributed services.
MapReduce has poor performance, a difficult programming model, and doesn’t support stream
processing.

Analyzing Infinite Data Streams
What does it mean to run a SQL query on an unbounded dataset?
How should I deal with late data?
What kind of time measurement should I use: event time, processing time or
ingestion time?
How accurate are computations on unbounded datasets compared with bounded datasets?
Which algorithms suit streaming computations?

Two cases for processing:
Single event processing: event transformation, trigger an alarm on an error event
Event aggregations: summary statistics, group-by, join and similar queries. For example
compute the average temperature for the last 5 minutes from a sensor data stream.

Event aggregation introduces the concept of windowing with respect to the notion of time
selected:
Event time (the time that events happen): Important for most use cases where context and
correctness matter at the same time. Example: billing applications, anomaly detection.
Processing time (the time they are observed during processing): Use cases where I only care
about what I process in a window. Example: accumulated clicks on a page per second.
System Arrival or Ingestion time (the time that events arrived at the streaming system).
Ideally event time = processing time. Reality is: there is skew.

Windows come in different flavors:
Tumbling windows: discretize a stream into non-overlapping windows.
Sliding windows: slide over the stream of data, so consecutive windows may overlap.
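A minimal sketch of how events could be assigned to tumbling and sliding windows by event time, using plain Scala collections rather than a streaming engine. The names (WindowSketch, Reading, tumblingStart, slidingStarts) are hypothetical, and event times are assumed to be non-negative milliseconds.

```scala
object WindowSketch {
  final case class Reading(sensorId: String, value: Double, eventTime: Long)

  // Tumbling windows: every event belongs to exactly one window [start, start + size).
  def tumblingStart(eventTime: Long, size: Long): Long =
    eventTime - (eventTime % size)

  // Sliding windows: an event belongs to every window of length `size`
  // whose start is a multiple of `slide` and which still covers the event time.
  def slidingStarts(eventTime: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = eventTime - (eventTime % slide)
    Iterator.iterate(lastStart)(_ - slide).takeWhile(_ > eventTime - size).toList
  }

  // Event aggregation: average value per sensor and tumbling window.
  def averagePerTumblingWindow(readings: Seq[Reading], size: Long): Map[(String, Long), Double] =
    readings
      .groupBy(r => (r.sensorId, tumblingStart(r.eventTime, size)))
      .map { case (key, rs) => (key, rs.map(_.value).sum / rs.size) }
}
```

With a size of 10 and a slide of 5, an event at time 7 falls into the windows starting at 5 and at 0, which is exactly the overlap that distinguishes sliding from tumbling windows.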

Watermarks: indicate that no elements with a timestamp older than or equal to the
watermark timestamp should arrive for the specific window of data.
Triggers: decide when the window is evaluated or purged.
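The interplay of watermarks and triggers can be illustrated with a toy operator that buffers events in tumbling event-time windows and evaluates a window once a watermark passes its last timestamp. This is a simplified sketch under assumed names (WatermarkSketch, TumblingWindowOperator); real engines offer richer trigger and lateness policies.

```scala
object WatermarkSketch {
  final case class Event(value: Double, eventTime: Long)

  // Buffers events into tumbling event-time windows of length `size`.
  final class TumblingWindowOperator(size: Long) {
    private var currentWatermark = Long.MinValue
    private var buffers = Map.empty[Long, Vector[Event]] // window start -> buffered events

    // Returns false when the event is late, i.e. its window has already been fired.
    def onEvent(e: Event): Boolean = {
      val start = e.eventTime - (e.eventTime % size)
      if (start + size - 1 <= currentWatermark) false
      else {
        buffers = buffers.updated(start, buffers.getOrElse(start, Vector.empty) :+ e)
        true
      }
    }

    // Trigger: evaluate and purge every window whose last timestamp is <= the watermark.
    // Returns (window start, max value) for each fired window, ordered by time.
    def onWatermark(watermark: Long): List[(Long, Double)] = {
      currentWatermark = math.max(currentWatermark, watermark)
      val (ready, pending) =
        buffers.partition { case (start, _) => start + size - 1 <= currentWatermark }
      buffers = pending
      ready.toList.sortBy(_._1).map { case (start, es) => (start, es.map(_.value).max) }
    }
  }
}
```

Dropping late events is only one possible policy; it is chosen here because it keeps the sketch short.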

Given the advances in streaming we can:
Trade off latency against cost and accuracy
In certain use cases replace batch processing with streaming

Recent advances in streaming are a result of pioneering work:
MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
The Dataflow Model: A Practical Approach to Balancing Correctness,
Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data
Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp.
1792-1803.

Apache Beam is the open source successor of Google’s Dataflow.
It is becoming the standard API for streaming and provides the advanced semantics
that current streaming applications need.

Streaming Systems Architecture
The user provides a graph of computations through a high-level API; data
flows along the edges of this graph. Each vertex is an operator which executes
a user-defined computation. For example: stream.map().keyBy()...
Operators can run in multiple instances and preserve state (unlike batch
processing where we have immutable datasets).
State can be persisted and restored in the presence of failures.
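As a rough illustration of operator state that survives failures, the sketch below keeps a running count per key and exposes snapshot and restore operations. The names (StatefulOperatorSketch, CountPerKey) are invented for this example; a real engine would write the snapshot to durable storage and coordinate it across operators.

```scala
object StatefulOperatorSketch {
  // A keyed operator whose state is the number of elements seen so far per key.
  final class CountPerKey {
    private var counts = Map.empty[String, Long]

    // Process one element and emit the updated count for its key.
    def process(key: String): (String, Long) = {
      val updated = counts.getOrElse(key, 0L) + 1
      counts = counts.updated(key, updated)
      (key, updated)
    }

    // Snapshot: an immutable copy of the state that could be persisted.
    def snapshot(): Map[String, Long] = counts

    // Restore: replace the state with a previously taken snapshot after a failure.
    def restore(state: Map[String, Long]): Unit = counts = state
  }
}
```

A fresh operator instance that restores the last snapshot continues counting from the snapshot's values rather than from zero.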

Analyzing Infinite Data Streams - Flink Example
sealed trait SensorType { def stype: String }
case object TemperatureSensor extends SensorType { val stype = "TEMP" }
case object HumiditySensor extends SensorType { val stype = "HUM" }
case class SensorData(var sensorId: String, var value: Double, var sensorType: SensorType, timestamp: Long)
https://github.com/skonto/trivento-summercamp-2016

The Source
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.watermark.Watermark
import scala.util.Random

// Emits simulated readings for `numberOfSensors` sensors, stamped in event time.
// numberOfElements = -1 means "run until cancelled".
class SensorDataSource(val sensorType: SensorType, val numberOfSensors: Int,
    val watermarkTag: Int, val numberOfElements: Int = -1) extends SourceFunction[SensorData] {
  final val serialVersionUID = 1L
  @volatile var isRunning = true
  var counter = 1
  var timestamp = 0L // event time in milliseconds, starts at 0
  val randomGen = Random
  require(numberOfSensors > 0)
  require(numberOfElements >= -1)

  // Base value around which the simulated readings fluctuate.
  lazy val initialReading: Double = sensorType match {
    case TemperatureSensor => 27.0
    case HumiditySensor => 0.75
  }

  override def run(ctx: SourceContext[SensorData]): Unit = {
    // Unbounded when numberOfElements == -1, otherwise stop after that many elements.
    val counterCondition: Int => Boolean =
      if (numberOfElements == -1) _ => isRunning
      else limit => isRunning && counter <= limit

    while (counterCondition(numberOfElements)) {
      Thread.sleep(10) // send sensor data every 10 milliseconds
      val dataId = randomGen.nextInt(numberOfSensors) + 1
      val data = SensorData(dataId.toString,
        initialReading + randomGen.nextGaussian() / initialReading, sensorType, timestamp)
      ctx.collectWithTimestamp(data, timestamp)
      timestamp = timestamp + 1
      // emit a watermark every `watermarkTag` elements
      if (timestamp % watermarkTag == 0) {
        ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
      }
      counter = counter + 1
    }
  }

  override def cancel(): Unit = {
    // No cleanup needed
    isRunning = false
  }
}

The Job
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object SensorSimple {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // set default env parallelism for all operators
    env.setParallelism(2)
    // windows are driven by the timestamps and watermarks emitted by the source
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val numberOfSensors = 2
    val watermarkTag = 10
    val numberOfElements = 1000
    val sensorDataStream =
      env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))
    sensorDataStream.writeAsText("inputData.txt")
    // tumbling event-time windows of 10 ms per sensor
    val windowedKeyed = sensorDataStream
      .keyBy(data => data.sensorId)
      .timeWindow(Time.milliseconds(10))
    windowedKeyed.max("value")
      .writeAsText("outputMaxValue.txt")
    windowedKeyed.apply(new SensorAverage())
      .writeAsText("outputAverage.txt")
    env.execute("Sensor Data Simple Statistics")
  }
}

// Averages the readings of one sensor within one window.
class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
  def apply(key: String, window: TimeWindow, input: Iterable[SensorData], out: Collector[SensorData]): Unit = {
    if (input.nonEmpty) {
      val average = input.map(_.value).sum / input.size
      out.collect(input.head.copy(value = average))
    }
  }
}

Operators run the operations defined by the graph of
the streaming computation. Example operators:
KeyBy, Map, FlatMap, etc.
The previous example runs two instances of the same operator
(parallelism 2).
[Figure: elements with timestamps 0 to 9 are spread, out of order, across Operator 1 and Operator 2, followed by Watermark 1 (10); Watermark N carries timestamp 10*N. A time axis (0, 1, 2, ... 22...) is divided into window 1, window 2, and so on; the outputs are labelled file1 and file2.]

Streaming vs Batch Systems
Metric | Batch | Streaming
Data size per job | TB to PB | MB to TB (in flight)
Time between data arrival and processing | Many minutes to hours | Microseconds to minutes
Job execution times | Minutes to hours | Microseconds to minutes

Event Log as the Core Abstraction
Logging is everywhere:
Write-ahead log (WAL) in databases for durability.
The distributed log can be seen as the data structure which models the problem of consensus:
the problem of making multiple machines all do the same thing reduces to the problem of
implementing a consistent distributed log. The log feeds the processes their input (state-machine
replication model).
Real-time streaming implementations use logs as a natural means of recording events as they
are processed according to the computation graph. This assists in implementing consistency
algorithms like Asynchronous Barrier Snapshotting (ABS) in Apache Flink and other functionality for all major streaming engines like
Google Dataflow, Apache Storm, Apache Samza.
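The state-machine replication idea can be sketched in a few lines: if every replica applies the same ordered log to a deterministic state machine, all replicas end in the same state. The command types and names below (ReplicatedLogSketch, Put, Remove, replay) are hypothetical and chosen only for illustration.

```scala
object ReplicatedLogSketch {
  sealed trait Command
  final case class Put(key: String, value: Int) extends Command
  final case class Remove(key: String) extends Command

  // Deterministic state machine: the next state depends only on the current state and the command.
  def applyCommand(state: Map[String, Int], cmd: Command): Map[String, Int] = cmd match {
    case Put(k, v) => state.updated(k, v)
    case Remove(k) => state - k
  }

  // A replica rebuilds its state by applying log entries in order, starting from `fromState`.
  def replay(log: Seq[Command], fromState: Map[String, Int] = Map.empty): Map[String, Int] =
    log.foldLeft(fromState)(applyCommand)
}
```

A replica that recovers from a saved state only has to replay the log suffix written after that state was taken, which is the same principle a streaming engine relies on when it restores a checkpoint and re-reads its input.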

Event Log as the Core Abstraction
The event log enables architecture patterns such as:
Event Sourcing (ES)
Command Query Responsibility Segregation (CQRS)

Message Queues as the Integration Tool
FIFO data structures, the natural way to process logs.
Organise user data in topics; each topic has its own queue.
Benefits
Decouples consumers from producers
Arbitrary number of producers and consumers are supported
Easy to use
Kafka is the most popular implementation for Big Data Systems.

Message Queues - Kafka
Kafka is a “...distributed, partitioned, replicated commit log service”
No deletion of messages on read, allows replay of data.
Production tested at LinkedIn at scale.
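To make "no deletion on read" and replay concrete, here is a toy in-memory log with per-consumer offsets. It is not Kafka's API; TopicLogSketch, TopicLog and Consumer are names assumed for this sketch, and a single partition without replication is modelled.

```scala
object TopicLogSketch {
  // An append-only, in-memory log for a single topic partition.
  final class TopicLog[A] {
    private var entries = Vector.empty[A]

    // Append a message and return its offset.
    def append(msg: A): Long = {
      entries = entries :+ msg
      entries.size - 1L
    }

    // Read up to `max` messages starting at `offset`; reading never removes anything.
    def read(offset: Long, max: Int): Vector[A] =
      entries.slice(offset.toInt, offset.toInt + max)
  }

  // Each consumer tracks its own offset, so consumers do not affect each other.
  final class Consumer[A](log: TopicLog[A]) {
    private var offset = 0L

    def poll(max: Int): Vector[A] = {
      val batch = log.read(offset, max)
      offset += batch.size
      batch
    }

    // Rewind to an earlier offset to replay data.
    def seek(to: Long): Unit = offset = to
  }
}
```

Because the log keeps every message, a second consumer or a rewound consumer sees the full history, which is what decouples producers from consumers.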

Delivery/Processing Semantics
In distributed systems failure is part of the game. What semantics can I achieve for message delivery?
at-most-once delivery: for each message sent, that message is delivered zero or one times; messages may be lost but not duplicated.
at-least-once delivery: for each message sent, potentially multiple attempts are made at delivering it,
such that at least one succeeds; messages may be duplicated but not lost.
exactly-once delivery: for each message sent, exactly one delivery is made to the recipient; the
message can neither be lost nor duplicated.
In theory it is impossible to have exactly-once delivery.
In practice we might care more about exactly-once state changes on top of at-least-once delivery. Example:
keeping state at some operator of the streaming graph.
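One common way to obtain exactly-once state changes on top of at-least-once delivery is to make the state update idempotent, for instance by remembering which message ids were already applied. The sketch below assumes every message carries a unique id; the names (DeliverySemanticsSketch, Message, State, applyOnce) are illustrative only.

```scala
object DeliverySemanticsSketch {
  final case class Message(id: Long, amount: Int)

  // Operator state: the running total plus the ids that were already applied.
  final case class State(total: Int, seen: Set[Long])

  // Idempotent update: a redelivered message (same id) leaves the state unchanged.
  def applyOnce(state: State, msg: Message): State =
    if (state.seen.contains(msg.id)) state
    else State(state.total + msg.amount, state.seen + msg.id)

  // Fold a sequence of deliveries, possibly containing duplicates, into the final state.
  def process(deliveries: Seq[Message]): State =
    deliveries.foldLeft(State(0, Set.empty[Long]))(applyOnce)
}
```

Keeping the set of seen ids forever is not practical in a long-running job; bounding it, for example per window or per checkpoint, is left out to keep the sketch small.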

The SMACK Stack
Technologies which, combined, deliver high-performing streaming systems:
Spark
Mesos
Akka
Cassandra
Kafka

A Fast Data Architecture using the SMACK stack
Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

Adding ML support to the IoT Application
Anomaly detection
Voice interface
Image classification
Recommendations
Automatic tuning of the IoT environment

What about Lambda Architecture?
Eventually we want to replace it; it is more of a traditional model.
Problems
Hard to maintain
Duplication of code & systems
Special systems for unifying views
In certain cases we can replace it with streaming based architectures.

Streaming Implementations Status
Apache Spark: Structured Streaming in v2 starts the improvement of the
streaming engine. It is still based on micro-batches, but event-time support was
added.
Apache Flink: SQL API supported from v0.9 onwards. Important features are still
on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.

Picking the Right Tool for Streaming
Criteria to choose:
Processing semantics (strong consistency is needed for correctness)
Latency guarantees
Deployment / Operation
Ecosystem built around it
Complex event processing (CEP)
Batch & Streaming API support
Community & Support

Picking the Right Tool for Streaming
Some tips
Pick Flink if you need sub-second latency and Beam support.
Pick Spark Streaming for its integration with the Spark ML libraries; its micro-batch mode is ideal for
training models and it has mature deployment capabilities.
Pick Gearpump for materializing Akka Streams in a distributed fashion.
Pick Kafka Streams for low-level, simple transformations of Kafka messages (it is a distributed
solution out of the box). Check the Confluent Platform for many useful tools around Kafka.

References
Watermarks: Time and progress in streaming dataflow and beyond: Big Data Conference - Strata + Hadoop World, May 31 - June 3, 2016, London, UK
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Executive Summary: Data Growth, Business Opportunities, and the IT Imperatives | The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things
Asynchronous Distributed Snapshots for Distributed Dataflows | the morning paper
State machine replication - Wikipedia, the free encyclopedia
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing | the morning paper
The Log: What every software engineer should know about real-time data's unifying abstraction | LinkedIn Engineering
Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
How Apache Flink™ enables new streaming applications – data Artisans
