Lambda Architecture And
Beyond
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
Trivento Summercamp 2016
Amersfoort
De oude Prodentfabriek
Introduction
Introduction: Who Am I?
Agenda
A bit of history of Big Data Processing
Batch Systems vs Streaming Systems
What is Lambda Architecture?
Advantages, Disadvantages?
Use cases
Data Lakes, Data Silos etc...
Implementing Lambda Architecture, ML support, Implementation Tips
Beyond the Lambda Architecture (Kappa, FastData, Zeta etc)
Last warning...
Data Processing
Batch processing: processing done on a bounded dataset.
Stream Processing (Streaming): processing done on an unbounded dataset.
Data items are pushed or pulled.
Two categories of systems: batch vs streaming systems.
Big Data - The story
Internet scale apps moved data size from Gigabytes to Petabytes.
Once upon a time there were traditional RDBMS like Oracle and Data
Warehouses but volume, velocity and variety changed the game.
Big Data - The story
MapReduce was a major breakthrough (Google published the seminal paper in 2004).
The Nutch project already had an implementation in 2005.
In 2006 it became a subproject of Lucene under the name Hadoop.
In 2008 Yahoo brought Hadoop to production on a 10K-core cluster. The same year it became a top-level Apache project.
Hadoop is good for batch processing.
Big Data - The story
Word Count example - Inverted Index.
[Diagram: input splits 1..N of documents (doc1, doc2, ...) feed the MAP phase, which emits pairs such as (w1,1), (w20,1), (w41,1). The Shuffle groups the pairs by key, e.g. (w1, (1,1,1…)) and (w41, (1,1,…)). The REDUCE phase emits a count per word, e.g. (w1, 13).]
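The word count flow on this slide can be sketched in plain Scala, without Hadoop. This is only an in-memory illustration of the map, shuffle and reduce phases; the object and method names (WordCountSketch, mapPhase, shuffle, reducePhase) are made up for this example and are not part of any Hadoop API.

```scala
object WordCountSketch {
  // MAP: emit a (word, 1) pair for every word in one document.
  def mapPhase(doc: String): Seq[(String, Int)] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // SHUFFLE: group all emitted pairs by key, e.g. (w1, (1, 1, 1)).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // REDUCE: sum the grouped ones into a final count per word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // The whole job: map every document, shuffle, then reduce.
  def wordCount(docs: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(docs.flatMap(doc => mapPhase(doc))))
}
```

In a real cluster each split is mapped on a different node and the shuffle moves data over the network; here everything runs in one process.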
Big Data - The story
Giuseppe DeCandia et al., “Dynamo: Amazon's Highly Available Key-value Store” changed the database world in 2007.
NoSQL databases, along with general systems like Hadoop, solve problems that cannot be solved with traditional RDBMSs.
Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs over more powerful CPUs.
Big Data - The story
There is a major shift in the industry as batch processing is not enough any more.
Batch jobs usually take hours if not days to complete; in many applications that is not acceptable.
Big Data - The story
The trend now is near-real-time computation, which implies streaming algorithms and needs new semantics. Fast Data (data in motion) & Big Data (data at rest) at the same time.
The enterprise needs to get smarter: all major players across industries use ML on top of massive datasets to make better decisions.
Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530
https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg
Big Data - The story
OpsClarity report:
92% plan to increase their investment in stream processing applications in the
next year
79% plan to reduce or eliminate investment in batch processing
32% use real time analysis to power core customer-facing applications
44% agreed that it is tedious to correlate issues across the pipeline
68% identified lack of experience and underlying complexity of new data
frameworks as their barrier to adoption
http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
Big Data - The story
Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
Big Data - The story
In the OpsClarity report:
● Apache Kafka is the most popular broker technology (ingestion queue)
● HDFS the most used data sink
● Apache Spark is the most popular data processing tool.
Big Data Landscape
Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png
Big Data System
A Big Data System must have at least the following components at its core:
DFS - a distributed file system (like S3, HDFS) or a distributed database system (DDS).
A distributed data processing tool, like Spark, Hadoop etc.
Tools and services to manage the previous systems.
Big Data System - Layered View
A Big Data System has at least an infrastructure layer and application layer.
Big Data System Design Considerations / Problems
Data Locality
Data Versioning
Code change
Resource allocation
Deployment/Operation
Integration
Backup/Failover Strategy
Scaling Strategy
Big Data System Quality
A Big Data System should be:
fault-tolerant
easy to debug
generic enough
scalable
extensible
able to support ad-hoc queries
high throughput
able to support low latency reads/writes
Big Data and Immutable Data
Immutable data provide the following benefits:
Fault tolerance to human error (you can always replay history and fix things).
Simplicity: no index is needed for retrieval and update; you just append newly arrived data.
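A minimal sketch of the idea, assuming a toy page-view log held in memory: data is only appended, and any derived view is a pure function of the whole log, so it can be recomputed after a bug fix. The names (ImmutableLogSketch, PageView, viewsPerPage) are hypothetical and chosen only for this illustration.

```scala
object ImmutableLogSketch {
  // A raw fact: never updated in place once recorded.
  final case class PageView(user: String, page: String)

  // The master dataset is append-only: appending returns a new, longer log.
  def append(log: Vector[PageView], event: PageView): Vector[PageView] =
    log :+ event

  // Any view is a pure function of the whole log, so it can be recomputed
  // ("replayed") after a code fix instead of being patched in place.
  def viewsPerPage(log: Vector[PageView]): Map[String, Int] =
    log.groupBy(_.page).map { case (page, views) => (page, views.size) }
}
```

If viewsPerPage turns out to be wrong, fixing the function and running it again over the same log repairs the view; no stored record has to be modified.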
Big Data System - Delivery/Processing Semantics
In distributed systems failure is part of the game. What semantics can I achieve for message delivery?
at-most-once delivery: for each message sent, that message is delivered zero or one times; messages may be lost but not duplicated.
at-least-once delivery: for each message sent, potentially multiple attempts are made at delivering it, such that at least one succeeds; messages may be duplicated but not lost.
exactly-once delivery: for each message sent, exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.
In theory it is impossible to have exactly-once delivery.
In practice we might care more about exactly-once state changes on top of at-least-once delivery. Example: keeping state at some operator of the streaming graph.
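One common way to get exactly-once state changes on top of at-least-once delivery is to make the state update idempotent. The sketch below assumes every message carries a unique id and that the operator can remember the ids it has already applied; Message, State and applyMessage are hypothetical names, not taken from any streaming framework.

```scala
object IdempotentCounterSketch {
  final case class Message(id: Long, amount: Int)

  // Operator state: the running total plus the ids already applied.
  final case class State(total: Int, seen: Set[Long])

  // Applying the same message twice leaves the state unchanged, so
  // redelivered duplicates do not change the result.
  def applyMessage(state: State, msg: Message): State =
    if (state.seen.contains(msg.id)) state
    else State(state.total + msg.amount, state.seen + msg.id)

  // Fold a (possibly duplicated) sequence of deliveries into the state.
  def process(messages: Seq[Message]): State =
    messages.foldLeft(State(0, Set.empty[Long]))((s, m) => applyMessage(s, m))
}
```

Remembering every id forever does not scale; real systems bound this set (for example per checkpoint or per offset range), which is outside the scope of this sketch.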
Batch Systems - The Hadoop Ecosystem
YARN (Yet Another Resource Negotiator) was deployed in production at Yahoo in March 2013.
The same year Cloudera, the dominant Hadoop vendor, embraced Spark as the next-generation replacement for MapReduce.
Image: Lightbend Inc.
Batch Systems - The Hadoop Ecosystem
Hadoop clusters, the gold standard for big data from ~2008 to the present.
Strengths:
Lowest CapEx system for Big Data.
Excellent for ingesting and integrating diverse datasets.
Flexible: from classic analytics (aggregations and data warehousing) to machine learning.
Batch Systems - The Hadoop Ecosystem
Weaknesses:
Complex administration.
YARN can’t manage all distributed services.
MapReduce has poor performance and a difficult programming model, and doesn’t support stream processing.
Analyzing Infinite Data Streams
What does it mean to run a SQL query on an unbounded data set?
How should I deal with the late data I see?
What kind of time measurement should I use? Event time, processing time or ingestion time?
Accuracy of computations on bounded datasets vs on unbounded datasets.
Algorithms for streaming computations?
Analyzing Infinite Data Streams
Two cases for processing:
Single event processing: event transformation, trigger an alarm on an error event
Event aggregations: summary statistics, group-by, join and similar queries. For example
compute the average temperature for the last 5 minutes from a sensor data stream.
Analyzing Infinite Data Streams
Event aggregation introduces the concept of windowing with respect to the notion of time selected:
Event time (the time that events happen): important for most use cases where context and correctness matter at the same time. Example: billing applications, anomaly detection.
Processing time (the time events are observed during processing): use cases where I only care about what I process in a window. Example: accumulated clicks on a page per second.
System arrival or ingestion time (the time that events arrive at the streaming system).
Ideally event time = processing time. In reality there is skew.
Analyzing Infinite Data Streams
Windows come in different flavors:
Tumbling windows discretize a stream into non-overlapping windows.
Sliding windows slide over the stream of data, so consecutive windows may overlap.
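The difference between the two flavors shows up in how a timestamp is assigned to windows. The sketch below is plain Scala, not the Flink API, and assumes non-negative timestamps and windows aligned at time 0; tumblingStart and slidingStarts are hypothetical helper names.

```scala
object WindowAssignSketch {
  // Tumbling: each timestamp belongs to exactly one window [start, start + size).
  def tumblingStart(timestamp: Long, size: Long): Long =
    timestamp - (timestamp % size)

  // Sliding: windows of length `size` start every `slide` time units, so one
  // timestamp can fall into several overlapping windows.
  def slidingStarts(timestamp: Long, size: Long, slide: Long): Seq[Long] = {
    val lastStart = timestamp - (timestamp % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > timestamp - size && start >= 0)
      .toList
  }
}
```

For example, with size 10 and slide 5 an element at time 17 belongs to the windows starting at 15 and at 10, while a tumbling window of size 10 places it only in the window starting at 10.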
Analyzing Infinite Data Streams
Watermarks: a watermark indicates that no more elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data.
Triggers: decide when the window is evaluated or purged.
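How a watermark and a trigger cooperate can be sketched with a small in-memory model. It assumes tumbling event-time windows and a simplified trigger that fires a window once the watermark has reached the window end, then purges it; real engines such as Flink have more refined rules. All names (WatermarkSketch, Windows, onWatermark) are hypothetical.

```scala
object WatermarkSketch {
  final case class Event(timestamp: Long, value: Double)

  // Buffered events per window start; windows are [start, start + size).
  final case class Windows(size: Long, buffers: Map[Long, Vector[Event]])

  // Assign an event to its tumbling window and buffer it.
  def add(w: Windows, e: Event): Windows = {
    val start = e.timestamp - (e.timestamp % w.size)
    val buffered = w.buffers.getOrElse(start, Vector.empty[Event]) :+ e
    w.copy(buffers = w.buffers.updated(start, buffered))
  }

  // Trigger: on a watermark, evaluate every window that ends at or before the
  // watermark, emit (windowStart, max value), and purge its buffer.
  def onWatermark(w: Windows, watermark: Long): (Windows, Seq[(Long, Double)]) = {
    val (ready, pending) =
      w.buffers.partition { case (start, _) => start + w.size <= watermark }
    val results = ready.toSeq.sortBy(_._1).map { case (start, es) =>
      (start, es.map(_.value).reduce((a, b) => math.max(a, b)))
    }
    (w.copy(buffers = pending), results)
  }
}
```

Events that arrive after their window was purged would be late data; this sketch does not handle them.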
Analyzing Infinite Data Streams
Given the advances in streaming we can:
Trade-off latency with cost and accuracy
In certain use-cases replace batch processing with streaming
Analyzing Infinite Data Streams
Recent advances in streaming are a result of pioneering work:
MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803.
Analyzing Infinite Data Streams
Apache Beam is the open source successor of Google’s Dataflow.
It is becoming the standard API for streaming. It provides the advanced semantics needed by current streaming applications.
Streaming Systems Architecture
The user provides a graph of computations through a high-level API, where data flows along the edges of this graph. Each vertex is an operator that executes a user-defined computation. For example: stream.map().keyBy()...
Operators can run in multiple instances and preserve state (unlike batch processing where we have immutable datasets).
State can be persisted and restored in the presence of failures.
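What "state can be persisted and restored" means for a single operator can be illustrated with a toy counting operator. This is a hedged sketch in plain Scala, not how Flink or Spark implement checkpoints; CountingOperator, snapshot and restore are names invented for the example.

```scala
object OperatorStateSketch {
  // A stateful operator instance: counts events per key.
  final class CountingOperator(initial: Map[String, Long] = Map.empty) {
    private var counts: Map[String, Long] = initial

    // Process one event and return the updated count for its key.
    def process(key: String): Long = {
      val updated = counts.getOrElse(key, 0L) + 1L
      counts = counts.updated(key, updated)
      updated
    }

    // Snapshot: an immutable copy of the state that could be written
    // to durable storage.
    def snapshot(): Map[String, Long] = counts
  }

  // Restore: after a failure, build a fresh instance from the last snapshot.
  def restore(snapshot: Map[String, Long]): CountingOperator =
    new CountingOperator(snapshot)
}
```

Events processed after the last snapshot must be replayed from the source after a restore, which is where the delivery semantics discussed earlier come in.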
Analyzing Infinite Data Streams - Flink Example
sealed trait SensorType { def stype: String }
case object TemperatureSensor extends SensorType { val stype = "TEMP" }
case object HumiditySensor extends SensorType { val stype = "HUM" }
case class SensorData(var sensorId: String, var value: Double, var sensorType: SensorType, timestamp: Long)
https://github.com/skonto/trivento-summercamp-2016
Analyzing Infinite Data Streams - Flink Example
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.watermark.Watermark
import scala.util.Random

class SensorDataSource(val sensorType: SensorType, val numberOfSensors: Int,
                       val watermarkTag: Int, val numberOfElements: Int = -1) extends SourceFunction[SensorData] {
  final val serialVersionUID = 1L
  @volatile var isRunning = true
  var counter = 1
  var timestamp = 0
  val randomGen = Random
  require(numberOfSensors > 0)
  require(numberOfElements >= -1)

  // Baseline reading around which the generated values fluctuate
  lazy val initialReading: Double = {
    sensorType match {
      case TemperatureSensor => 27.0
      case HumiditySensor => 0.75
    }
  }

  override def run(ctx: SourceContext[SensorData]): Unit = {
    // -1 means "run until cancelled"; otherwise stop after numberOfElements elements
    val counterCondition = {
      if (numberOfElements == -1) {
        x: Int => isRunning
      } else {
        x: Int => isRunning && counter <= x
      }
    }
    while (counterCondition(numberOfElements)) {
      Thread.sleep(10) // send sensor data every 10 milliseconds
      val dataId = randomGen.nextInt(numberOfSensors) + 1
      val data = SensorData(dataId.toString, initialReading + Random.nextGaussian() / initialReading, sensorType, timestamp)
      ctx.collectWithTimestamp(data, timestamp) // event time starts at 0, in milliseconds
      timestamp = timestamp + 1
      if (timestamp % watermarkTag == 0) { // emit a watermark every watermarkTag elements
        ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
      }
      counter = counter + 1
    }
  }

  override def cancel(): Unit = {
    // No cleanup needed
    isRunning = false
  }
}
The Source
Analyzing Infinite Data Streams - Flink Example
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object SensorSimple {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // set default env parallelism for all operators
    env.setParallelism(2)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val numberOfSensors = 2
    val watermarkTag = 10
    val numberOfElements = 1000

    val sensorDataStream =
      env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))
    sensorDataStream.writeAsText("inputData.txt")

    // key by sensor and group into 10 ms event-time windows
    val windowedKeyed = sensorDataStream
      .keyBy(data => data.sensorId)
      .timeWindow(Time.milliseconds(10))

    windowedKeyed.max("value")
      .writeAsText("outputMaxValue.txt")
    windowedKeyed.apply(new SensorAverage())
      .writeAsText("outputAverage.txt")

    env.execute("Sensor Data Simple Statistics")
  }
}

// Emits one record per key and window, carrying the average value of that window
class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
  def apply(key: String, window: TimeWindow, input: Iterable[SensorData], out: Collector[SensorData]): Unit = {
    if (input.nonEmpty) {
      val average = input.map(_.value).sum / input.size
      out.collect(input.head.copy(value = average))
    }
  }
}
The Job
Analyzing Infinite Data Streams - Flink Example
Operators run the operations defined by the graph of the streaming computation. Example operators: KeyBy, Map, FlatMap etc.
[Diagram: two instances of the same operator (Operator 1, Operator 2) with parallelism 2, as in the previous example. Elements with timestamps 0 to 9 are spread across the two instances. Watermark 1 is emitted at time 10 and Watermark N at time 10*N. The time axis (0, 1, 2, ... 22...) is divided into window 1 and window 2; the outputs are labelled file1 and file2.]
Streaming vs Batch Systems
Metric                                   | Batch                 | Streaming
Data size per job                        | TB to PB              | MB to TB (in flight)
Time between data arrival and processing | Many minutes to hours | Microseconds to minutes
Job execution times                      | Minutes to hours      | Microseconds to minutes
World of Patterns
Pattern (in general): a perceptible regularity or a template (Wikipedia).
Software Patterns: a well-defined, reusable solution to a commonly occurring problem in software design, e.g. Template Method, Singleton etc.
Software Architecture Patterns: an architectural pattern is a general, reusable solution to a commonly occurring problem in software architecture within a given context (Wikipedia), e.g. client-server, n-tier.
World of Patterns
Software Architecture vs Software Design.
We use them everywhere but… they are not a silver bullet. Why?
Software Architecture before Lambda Architecture
Many definitions for software architecture.
“Architecture: ⟨system⟩ fundamental concepts or properties of a system in its environment embodied in its elements,
relationships, and in the principles of its design and evolution”. (ISO/IEC/IEEE 42010).
“Software architecture refers to the fundamental structures of a software system, the discipline of creating such
structures, and the documentation of these structures. These structures are needed to reason about the software
system.” Wikipedia
“It is about structure and vision”. Software architecture for developers, Simon Brown.
“The highest-level breakdown of a system into its parts; the decisions that are hard to change; there are multiple
architectures in a system; what is architecturally significant can change over a system's lifetime; and, in the end,
architecture boils down to whatever the important stuff is.” Patterns of Enterprise Application Architecture, Martin Fowler
Software Architecture is important
Architectural decisions are decisions that have non-local consequences and serve specific goals, e.g. to achieve a performance goal like high throughput, I decide to use buffering within my system.
Architectural decisions are important for your in-house project, or for your proposal if you are a consultant.
Sound Architecture Principles: Why Do I Need It?
Scalability/Elasticity
Extensibility: requirements will change; expect that
Minimized costs
Security awareness
Well designed APIs for integration
Well-tested, don’t go to production and cross fingers.
Follow common sense...
At the end of the day expect to throw everything out of the window under some
circumstances. Business matters the most.
Example: Non-functional requirements changed since load is huge and you are
becoming successful, maybe you are the next Facebook.
Software Architecture is important
...because there is a high cost to not making specific decisions or not making them early enough.
Software Architecture is important
How about the wrong decisions?
Image: http://www.awesomeinventions.com/wp-content/uploads/2014/10/balcony.jpg
Software Architecture is important
Many more benefits where architecture is present:
A documented architecture assists communication
Guides implementation imposing constraints
Assists in technology decisions
Assists in cost and time estimation
Influences the structure of your organization and vice versa
Software Architecture LifeCycle
Steps:
Architectural Requirements
Architectural Design
Architectural Documentation
Architectural Evaluation / Implementation
Lambda Architecture - Intro
“Computing arbitrary functions on an arbitrary dataset in real time is a daunting
problem. There is no single tool that provides a complete solution. Instead,
you have to use a variety of tools and techniques to build a complete Big Data
system. The lambda architecture solves the problem of computing arbitrary
functions on arbitrary data in real time by decomposing the problem into three
layers: the batch layer, the serving layer, and the speed layer.”
Nathan Marz and James Warren, Big Data: Principles and best practices
of scalable real-time data systems, Manning Publications.
Photo: https://images-na.ssl-images-amazon.com/images/I/51Bd93AGuOL._SX258_BO1,204,203,200_.jpg
Lambda Architecture - Cont’d (1/5)
Image: http://lambda-architecture.net/img/la-overview_small.png
Batch Layer: perfect accuracy, indexed batch views
Serving Layer: random access query support based on batch & real-time views
Speed Layer: process real-time streams, provides real-time views, lower
accuracy
Master dataset: append-only, immutable set of raw data
Lambda Architecture - Cont’d (2/5)
Example components for each part:
Batch layer: Hadoop
Batch Output Indexing: Druid, Impala etc
Speed Output Indexing: Druid, Cassandra, HBase etc
Speed processing: Spark, Flink etc
Lambda Architecture - Cont’d (3/5)
Basic functions:
batch view = function (all data) <- high latency, high throughput
realtime view = function (realtime view, new data) <- low latency, low
throughput
query = function (batch view, realtime view ) <- eventual accuracy
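The three functions above can be written down almost literally. The following is a sketch under simplifying assumptions: the views are page-click counts held in memory, and the realtime view only covers data not yet absorbed by the latest batch view. Click, batchView, realtimeView and query are illustrative names, not code from the book.

```scala
object LambdaViewsSketch {
  final case class Click(page: String, timestamp: Long)

  // batch view = function(all data): recomputed from scratch over the
  // master dataset (high latency, high throughput).
  def batchView(allData: Seq[Click]): Map[String, Long] =
    allData.groupBy(_.page).map { case (page, cs) => (page, cs.size.toLong) }

  // realtime view = function(realtime view, new data): incremental update
  // for each newly arrived element (low latency).
  def realtimeView(view: Map[String, Long], newData: Click): Map[String, Long] =
    view.updated(newData.page, view.getOrElse(newData.page, 0L) + 1L)

  // query = function(batch view, realtime view): merge both views at read time.
  def query(batch: Map[String, Long], realtime: Map[String, Long], page: String): Long =
    batch.getOrElse(page, 0L) + realtime.getOrElse(page, 0L)
}
```

When a new batch view is published, the realtime view has to be reset to the data that arrived after the batch cut-off, otherwise clicks would be counted twice.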
Lambda Architecture - Cont’d (4/5)
Key Properties:
Eventual Accuracy
The batch layer is always behind in time and continuously produces batch outputs. Whenever a new batch output is available, it replaces the latest one. Eventually the batch layer catches up with the speed layer.
Complexity Isolation
Lambda Architecture - Cont’d (5/5)
Advantages:
Immutable data.
Reprocessing takes care of code changes, human error etc.
Disadvantages:
Operating and maintaining two different systems (batch & streaming) is hard.
Programming in two different paradigms makes the code base complex.
What about Data Lakes?
A data lake accumulates data from different applications.
It does not transform data in any way.
Access from multiple users, no data silos, data is not hidden in special
systems.
No schema is attached to the data; only raw data is stored. We apply a schema when we read the data (schema-on-read).
Includes structured, semi-structured, and unstructured data
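Schema-on-read can be illustrated with a few raw lines that are stored untouched and only interpreted when a reader needs them. This is a toy sketch: the comma-separated layout, the sample lines and the names (SchemaOnReadSketch, Reading, parse) are assumptions made for the example.

```scala
object SchemaOnReadSketch {
  // The lake stores raw lines exactly as they arrived; no schema is
  // enforced on write.
  val rawLines: Seq[String] = Seq("s1,27.5", "s2,not-a-number", "s1,28.0")

  // The schema one particular reader cares about.
  final case class Reading(sensorId: String, value: Double)

  // The schema is applied only when reading; lines that do not fit it are
  // skipped by this reader but stay in the lake for others.
  def parse(line: String): Option[Reading] =
    line.split(",") match {
      case Array(id, v) => scala.util.Try(Reading(id, v.toDouble)).toOption
      case _            => None
    }

  def read(lines: Seq[String]): Seq[Reading] =
    lines.flatMap(line => parse(line).toList)
}
```

A different consumer could read the same raw lines with a different schema, which is the flexibility the slide refers to.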
Data Lakes Categories
Data reservoirs: Governed accumulation of data for later use. Data are secured
and go under the process of ingestion, cleansing, profiling and indexing.
Exploratory lakes: Accumulation of data without governance for ad-hoc analysis
by data scientists et al to gain insights.
Analytical lakes: Ingest your data to feed data pipelines for analytics.
Data Lakes vs Data Warehouse
A data lake can replace a data warehouse in scenarios where that makes sense.
            | Data Lake                                        | Data Warehouse
Schema      | Schema on-read                                   | Schema on-write
Users       | Data scientists, people who need ad hoc analysis | Business analysts
Data        | Structured, semi-structured, unstructured        | Rigid structure
Flexibility | High, reprocessing is easy.                      | Low, tied to business processes.
Data Lakes usually fail!
Most projects fail... you have been warned! Your next data lake can become a big data swamp.
Image: http://www.sharenator.com/Demotivationals_pt_3_P/
Data Lakes extended with a Lambda Architecture
You can always use your Lambda Architecture on top of a data lake if that makes sense. A data lake can be your DFS with specific services built around it, like metadata management. It can make things easy, especially when you start small and try to figure out what you need.
It can be very simple, where you use the batch layer for loading the data from a source for streaming only. No presentation layer is needed.
How about Kafka?
Azure Data Lake
Image: https://azure.microsoft.com/en-us/solutions/data-lake/
How about Data Silos?
Separate containers of data.
The big data platform or the big data system at hand should unify business
information, development teams and data in a business useful way.
Think about a scenario with microservices, event sourcing and analytics.
Use Cases
Yahoo
Netflix
Flickr
Flickr’s Use case - The Problem
Magic View feature: a computer vision pipeline generates a set of computer vision tags, and reverse indexes are created per user along with aggregated tag info.
Initially batch only; then a streaming layer was added for a live experience.
Backfills were needed because of photos missed by the streaming layer (approximation errors) and because of code changes.
Backfills via streaming were slow due to the nature of the read-modify-write (RMW) access pattern.
Flickr’s Use case - Solution
Result = Combiner(Query(data))
Implementing The Lambda Architecture
SMACK stack based Lambda Architecture:
[Diagram components: Mesos, Spark, HDFS, Spark or Flink, Kafka, Cassandra, Query app, Akka-driven apps, user.]
Machine Learning Support for Lambda Architecture
Build a model and serve it. Simple models vs complex models.
Spark for model building and Flink for model serving.
Parameter servers:
https://issues.apache.org/jira/browse/SPARK-6932
https://github.com/rjagerman/glint
http://parameterserver.org/
http://www.petuum.com/bosen.html
https://github.com/JohnLangford/vowpal_wabbit/wiki
Real World Implementation Tips
JVM-based technologies like Cassandra and Kafka need correct GC settings.
Monitoring is a must. Cassandra, Kafka etc. provide JMX interfaces to get the counter values you need. You need to know and understand which are useful to monitor closely.
It is not wise to co-locate everything; you need to be careful about each component's requirements. For example, ZooKeeper should run on its own box, but if co-located it should have its own high-speed volume assigned for its commit log.
Vendors publish specific requirements for production, which stem from experience using the technology in production.
Real World Implementation Tips
OS settings.
Misuse of technologies. Example: Kafka is not a database.
Design decisions. Example: Time series data on Cassandra.
Data locality and data movement. Example: Kafka rebalance.
Logging. How do I monitor my job? Log correlation?
For batch processing you need a flexible orchestration tool like:
https://github.com/apache/incubator-airflow
Within your data center vs across data centers. On cloud: availability zones vs regions.
Beyond the Lambda Architecture
Kappa Architecture (2014)
Zeta Architecture (2015)
IoT-A Architecture (2010- 2013)
Butterfly Architecture (~2015)
Fast Data architecture (~2016)
Kappa Architecture
Introduced by Jay Kreps, the co-creator of Apache Kafka and CEO of Confluent in 2014.
See https://www.oreilly.com/ideas/questioning-the-lambda-architecture
The Lambda Architecture is good, but keeping two layers in sync is a lot of work and hard to achieve in practice:
“The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be universally agreed on by everyone doing it.”
Batch processing is a subset of stream processing. Different technologies want to take advantage of this fact and provide a holistic solution:
Flink, http://data-artisans.com/batch-is-a-special-case-of-streaming/
Spark, https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Kappa Architecture
1. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows
for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days.
2. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the
beginning of the retained data, but direct this output data to a new output table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job, and delete the old output table.
Reprocessing is done only when the code changes.
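The four steps can be mimicked with in-memory collections standing in for the retained log and the output tables. This is a toy model, not Kafka code: the Log type, the two job versions and switchTable are hypothetical names, and "caught up" is simplified to "the new table has one row per log entry".

```scala
object KappaReprocessSketch {
  // Step 1: the retained log of raw events (stands in for a Kafka topic
  // with a long enough retention).
  type Log = Vector[Int]

  // Two versions of the same streaming job; v2 is the changed code.
  def jobV1(event: Int): Int = event * 2
  def jobV2(event: Int): Int = event * 2 + 1

  // Step 2: a second job instance replays the log from the beginning and
  // writes to a new output table.
  def reprocess(log: Log, job: Int => Int): Vector[Int] = log.map(job)

  // Steps 3 and 4: once the new table has caught up with the log, the
  // application reads from it and the old table can be dropped.
  def switchTable(log: Log, oldTable: Vector[Int], newTable: Vector[Int]): Vector[Int] =
    if (newTable.size == log.size) newTable else oldTable
}
```

While the second job is still replaying, readers keep using the old table, so the switch is invisible to the application apart from the changed results.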
Image: https://dmgpayxepw99m.cloudfront.net/kappa-61d0afc292912b61ce62517fa2bd4309.png
Kappa Architecture Pros & Cons
Pros:
● Develop and maintain only one streaming system.
● Reprocessing only when code changes.
Cons:
● Need temp storage for the reprocessing streaming job.
Kappa Architecture - When to use?
● Algorithms of streaming and batch processing are the same.
● Batch and real-time outputs can be the same.
Zeta Architecture
Introduced by MapR for supporting as-it-happens business (March 2015).
Goals:
Exploit all existing hardware in the data center.
Backup and disaster recovery support for real-time continuity.
Tolerance for human mistakes.
End-to-end security.
Support Google-scale systems.
Zeta Architecture - Components
Seven pluggable components:
Distributed File System: All applications write here.
Real-time Data Storage: Needed for high-speed business applications.
Pluggable Compute Model / Execution Engine: Different needs need
different engines.
Deployment / Container Management: Allows for a common way to deploy
resources.
Zeta Architecture - Components
Seven pluggable components:
Solution Architecture: Focuses on solving a specific business problem.
Enterprise Applications: Used to drive the architecture. Now they are
realized via existing components.
Dynamic and Global Resource Management: Allows dynamic allocation of
resources which fits the business needs each time.
Zeta Architecture
Components and reference applications
Image: https://www.mapr.com/zeta-architecture
Zeta Architecture Example
Images: https://www.mapr.com/zeta-architecture
IoT-A Architecture
Targets IoT applications; proposed by Michael Hausenblas (MapR, Mesosphere) in 2015.
IoT leads to a Big Data architecture because:
High volume of data from sensors.
Data arrive in time-series or other formats.
Data are generated at high speed and the business needs real-time processing.
IoT-A Architecture
Basic Architecture:
Message Queue / Stream Processing block (MQ/SP)
DB: A real-time DB for indexing sensor data. Low Latency.
DFS: The distributed file system where batch jobs can be run and batch
reports can be created.
IoT-A Architecture
http://iot-a.info/
IoT-A Architecture - Implementation Technologies
http://iot-a.info/
Butterfly Architecture
● Introduced by Milind Bhandarkar (Pivotal).
● The weak point of the Lambda architecture lies in the distributed file system, which cannot serve all layers.
● They propose the use of memory technologies other than DRAM (like storage class memory) to implement an efficient object storage engine.
● They use different abstractions compared to the files or dirs of a DFS: datasets, dataframes, eventstreams.
[Diagram labels: mutable, immutable, unmanaged, managed, log, publish, Data frames, Data sets, Storage, ETL.]
Butterfly image: http://sketch2draw.com/wp-content/uploads/2013/05/butterfly_thumb.jpg
A Fast Data Architecture
Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
Example IoT Application
Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
Streaming Implementations Status
Apache Spark: Structured Streaming in v2 starts the improvement of the streaming engine. It is still based on micro-batches, but event-time support was added.
Apache Flink: SQL API supported from v0.9 onwards. Important features are still on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.
Picking the Right Tool for Streaming
Criteria to choose:
Processing semantics (strong consistency is needed for correctness)
Latency guarantees
Deployment / Operation
Ecosystem built around it
Complex event processing (CEP)
Batch & Streaming API support
Community & Support
Picking the Right Tool for Streaming
Some tips:
Pick Flink if you need sub-second latency and Beam support.
Pick Spark Streaming for its integration with the Spark ML libraries; its micro-batch mode is ideal for training models, and it has mature deployment capabilities.
Pick Gearpump for materializing Akka Streams in a distributed fashion.
Pick Kafka Streams for low-level, simple transformations of Kafka messages (it is a distributed solution out of the box). Check the Confluent Platform for many useful tools around Kafka.
Questions?
Thank you!
References
Books:
Bhushan Lakhe, Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL. ISBN 9781484212882.
Humberto Cervantes, Rick Kazman, Designing Software Architectures: A Practical Approach (SEI Series in Software Engineering). ISBN 9780134390789.
Nathan Marz, James Warren, Big Data: Principles and best practices of scalable realtime data systems. ISBN 9781617290343.
References - Cont’d
Web resources/Articles:
Questioning the Lambda Architecture - O'Reilly Media
Structured Streaming In Apache Spark | Databricks Blog
The world beyond batch: Streaming 101 - O'Reilly Media
The world beyond batch: Streaming 102 - O'Reilly Media
Data Centric Enterprise | MapR
Why local state is a fundamental primitive in stream processing - O'Reilly Media
Data processing architectures – Lambda and Kappa - Ericsson Research Blog
2016 State of Fast Data Survey | OpsClarity
Zeta Architecture | MapR
Is Big Data Still a Thing? (The 2016 Big Data Landscape) – Matt Turck

More Related Content

What's hot

Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Big Data Spain
 
TPC-H analytics' scenarios and performances on Hadoop data clouds
Trivento summercamp masterclass 9/9/2016

  • 1. Lambda Architecture And Beyond. Stavros Kontopoulos, Senior Software Engineer @ Lightbend, M.Sc. Trivento Summercamp 2016, Amersfoort, De oude Prodentfabriek.
  • 3. Agenda: A bit of history of Big Data processing; Batch systems vs streaming systems; What is Lambda Architecture?; Advantages, disadvantages?; Use cases; Data lakes, data silos etc.; Implementing Lambda Architecture, ML support, implementation tips; Beyond the Lambda Architecture (Kappa, FastData, Zeta etc.)
  • 5. Data Processing. Batch processing: processing done on a bounded dataset. Stream processing (streaming): processing done on an unbounded dataset. Data items are pushed or pulled. Two categories of systems: batch vs streaming systems.
  • 6. Big Data - The story Internet scale apps moved data size from Gigabytes to Petabytes. Once upon a time there were traditional RDBMS like Oracle and Data Warehouses but volume, velocity and variety changed the game. 6
  • 7. Big Data - The story. MapReduce was a major breakthrough (Google published the seminal paper in 2004). The Nutch project already had an implementation in 2005. In 2006 it became a subproject of Lucene with the name Hadoop. In 2008 Yahoo brought Hadoop to production with a 10K-core cluster; the same year it became a top-level Apache project. Hadoop is good for batch processing.
  • 8. Big Data - The story. Word Count example - Inverted Index. [Figure: input splits (Split 1 ... Split N, holding doc1, doc2, ..., doc300) go through MAP, which emits pairs such as (w1,1), (w20,1), (w41,1); Shuffle groups them per word, e.g. (w1, (1,1,1…)), (w41, (1,1,…)); REDUCE sums each group, e.g. (w1, 13).]
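The map, shuffle and reduce phases of the word count on this slide can be sketched in plain Scala collections. This is only an in-memory illustration, not Hadoop code; the object name `WordCountSketch` and the tokenisation rule (lower-case, split on non-word characters) are assumptions made for the example.

```scala
object WordCountSketch {
  // MAP: emit a (word, 1) pair for every word found in one input split.
  def mapPhase(split: Seq[String]): Seq[(String, Int)] =
    split.flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty).map(w => (w, 1))

  // SHUFFLE: group the pairs by word so that each word sees all of its counts.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // REDUCE: sum the grouped counts per word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // Each inner Seq plays the role of one split handled by one mapper.
  def wordCount(splits: Seq[Seq[String]]): Map[String, Int] =
    reducePhase(shuffle(splits.flatMap(mapPhase)))
}
```

In a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network; here all three phases run in one process.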
  • 9. Big Data - The story. Giuseppe DeCandia et al., "Dynamo: Amazon's highly available key-value store" changed the database world in 2007. NoSQL databases, along with general systems like Hadoop, solve problems that cannot be solved with traditional RDBMSs. Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs over more powerful CPUs.
  • 10. Big Data - The story. There is a major shift in the industry as batch processing is not enough any more. Batch jobs usually take hours, if not days, to complete; in many applications that is not acceptable.
  • 11. Big Data - The story. The trend now is near-real-time computation, which implies streaming algorithms and needs new semantics. Fast Data (data in motion) and Big Data (data at rest) at the same time. The enterprise needs to get smarter: all major players across industries use ML on top of massive datasets to make better decisions. Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530 https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg
  • 12. Big Data - The story. OpsClarity report: 92% plan to increase their investment in stream processing applications in the next year; 79% plan to reduce or eliminate investment in batch processing; 32% use real-time analysis to power core customer-facing applications; 44% agreed that it is tedious to correlate issues across the pipeline; 68% identified lack of experience and underlying complexity of new data frameworks as their barrier to adoption. Source: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
  • 13. Big Data - The story. Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
  • 14. Big Data - The story 14 In OpsClarity report: ● Apache Kafka is the most popular broker technology (ingestion queue) ● HDFS the most used data sink ● Apache Spark is the most popular data processing tool.
  • 15. Big Data Landscape 15 Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png
  • 16. Big Data System. A Big Data System must have at least the following components at its core: a DFS (Distributed File System, like S3 or HDFS) or a distributed database system (DDS); a distributed data processing tool like Spark, Hadoop etc.; tools and services to manage the previous systems.
  • 17. Big Data System - Layered View A Big Data System has at least an infrastructure layer and application layer. 17
  • 18. Big Data System Design Considerations / Problems: data locality, data versioning, code change, resource allocation, deployment/operation, integration, backup/failover strategy, scaling strategy.
  • 19. Big Data System Quality. A Big Data System should be: fault-tolerant, easy to debug, generic enough, scalable, extensible, able to support ad-hoc queries, high throughput, able to support low-latency reads/writes.
  • 20. Big Data and Immutable Data. Immutable data provide the following benefits. Fault-tolerance to human error: you can always replay history and fix things. Simplicity: no index is needed for retrieval and update; you just append newly arrived data.
  • 21. Big Data System - Delivery/Processing Semantics. In distributed systems failure is part of the game. What semantics can I achieve for message delivery? At-most-once delivery: each message sent is delivered zero or one times. At-least-once delivery: for each message sent, potentially multiple attempts are made at delivering it, such that at least one succeeds; messages may be duplicated but not lost. Exactly-once delivery: for each message sent, exactly one delivery is made to the recipient; the message can be neither lost nor duplicated. In theory it is impossible to have exactly-once delivery. In practice we might care more about exactly-once state changes combined with at-least-once delivery. Example: keeping state at some operator of the streaming graph.
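One common way to get exactly-once state changes on top of at-least-once delivery is to make the state update idempotent. The sketch below assumes every message carries a unique `id` (an assumption, not something the slide states) and keeps the set of already applied ids next to the running total; the names `Message`, `State` and `applyOnce` are illustrative only.

```scala
object ExactlyOnceStateSketch {
  // Hypothetical message shape: a unique id plus a payload to add to the total.
  final case class Message(id: Long, amount: Int)

  // Operator state: the running total and the ids that were already applied.
  final case class State(total: Int, seen: Set[Long])

  // Apply a message only if its id has not been applied before (idempotent update).
  def applyOnce(state: State, msg: Message): State =
    if (state.seen.contains(msg.id)) state
    else State(state.total + msg.amount, state.seen + msg.id)

  // At-least-once delivery may hand us duplicates; the fold ignores them.
  def process(deliveries: Seq[Message]): State =
    deliveries.foldLeft(State(0, Set.empty[Long]))(applyOnce)
}
```

A production system would have to bound or expire the `seen` set and persist it together with the total; the sketch leaves that out.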
  • 22. Batch Systems - The Hadoop Ecosystem 22 Yarn (Yet Another Resource Negotiator) deployed in production at Yahoo in March 2013. Same year Cloudera, the dominant Hadoop vendor, embraced Spark as the next-generation replacement for MapReduce. Image: Lightbend Inc.
  • 23. Batch Systems - The Hadoop Ecosystem Hadoop clusters, the gold standard for big data from ~2008 to the present. Strengths: Lowest CapEx system for Big Data. Excellent for ingesting and integrating diverse datasets. Flexible: from classic analytics (aggregations and data warehousing) to machine learning. 23
  • 24. Batch Systems - The Hadoop Ecosystem. Weaknesses: Complex administration. YARN can’t manage all distributed services. MapReduce has poor performance, a difficult programming model, and doesn’t support stream processing.
  • 25. Analyzing Infinite Data Streams. What does it mean to run a SQL query on an unbounded dataset? How should I deal with late data? What kind of time measurement should I use: event time, processing time or ingestion time? How accurate are computations on unbounded datasets compared to bounded ones? Which algorithms exist for streaming computations?
  • 26. Analyzing Infinite Data Streams 26 Two cases for processing: Single event processing: event transformation, trigger an alarm on an error event Event aggregations: summary statistics, group-by, join and similar queries. For example compute the average temperature for the last 5 minutes from a sensor data stream.
  • 27. Analyzing Infinite Data Streams. Event aggregation introduces the concept of windowing with respect to the notion of time selected. Event time (the time that events happen): important for most use cases where context and correctness matter at the same time. Example: billing applications, anomaly detection. Processing time (the time events are observed during processing): use cases where I only care about what I process in a window. Example: accumulated clicks on a page per second. System arrival or ingestion time (the time that events arrived at the streaming system). Ideally event time = processing time. In reality there is skew.
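A small sketch of why the chosen notion of time matters: the same events land in different windows depending on whether they are bucketed by event time or by processing time. The `Click` type, its field names and the sample timestamps used in the usage note are invented for illustration.

```scala
object TimeNotionSketch {
  // Hypothetical event carrying both timestamps (same unit, e.g. milliseconds).
  final case class Click(eventTime: Long, processingTime: Long)

  // Count events per fixed-size window, using whichever timestamp `time` selects.
  def countsBy(clicks: Seq[Click], size: Long)(time: Click => Long): Map[Long, Int] =
    clicks.groupBy(c => time(c) / size * size).map { case (start, cs) => (start, cs.size) }

  // Skew: how far processing time lags behind event time, at worst.
  def maxSkew(clicks: Seq[Click]): Long =
    clicks.map(c => c.processingTime - c.eventTime).max
}
```

For the clicks `Click(1, 2)`, `Click(9, 12)` and `Click(11, 13)` with windows of size 10, bucketing by event time puts two clicks in the window starting at 0, while bucketing by processing time puts only one there.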
  • 28. Analyzing Infinite Data Streams 28 Windows come in different flavors: Tumbling windows discretize a stream into non-overlapping windows. Sliding Windows: slide over the stream of data.
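The two window flavors can be sketched as plain window-assignment functions: a tumbling window assigns each timestamp to exactly one window, while sliding windows can assign it to several overlapping ones. The sketch assumes non-negative timestamps and windows aligned at 0; all names are illustrative.

```scala
object WindowSketch {
  // Start of the single tumbling window of length `size` that contains `ts`.
  def tumblingStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Starts (newest first) of all sliding windows of length `size`,
  // advancing by `slide`, that contain `ts`.
  def slidingStarts(ts: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = ts - (ts % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > ts - size && start >= 0)
      .toList
  }

  // Event aggregation: average value per tumbling window of (timestamp, value) events.
  def averagePerWindow(events: Seq[(Long, Double)], size: Long): Map[Long, Double] =
    events.groupBy { case (ts, _) => tumblingStart(ts, size) }
      .map { case (start, evs) => (start, evs.map(_._2).sum / evs.size) }
}
```

With `size = 10` and `slide = 5`, timestamp 12 belongs to the sliding windows starting at 10 and at 5, but only to the tumbling window starting at 10.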
  • 29. Analyzing Infinite Data Streams. Watermarks indicate that no elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data. Triggers decide when the window is evaluated or purged.
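The interplay of watermarks and triggers can be sketched without any streaming engine. The lateness rule follows the definition on this slide (an element is late if its timestamp is older than or equal to the current watermark). The firing rule, that a window `[start, end)` may be evaluated once the watermark reaches `end - 1`, is an assumption chosen for the example; engines differ in the details.

```scala
object WatermarkSketch {
  sealed trait Element
  final case class Event(ts: Long) extends Element
  final case class Mark(ts: Long) extends Element // a watermark flowing in the stream

  // Late: the timestamp is not newer than the current watermark.
  def isLate(eventTs: Long, watermark: Long): Boolean = eventTs <= watermark

  // Trigger (assumed rule): fire a window [start, end) once the watermark
  // has reached the last timestamp the window can contain.
  def canFire(windowEnd: Long, watermark: Long): Boolean = watermark >= windowEnd - 1

  // Walk the stream in arrival order and split event timestamps into (onTime, late).
  def partition(stream: Seq[Element]): (Seq[Long], Seq[Long]) = {
    val init = (Long.MinValue, Vector.empty[Long], Vector.empty[Long])
    val (_, onTime, late) = stream.foldLeft(init) {
      case ((wm, ok, lt), Mark(ts)) => (math.max(wm, ts), ok, lt)
      case ((wm, ok, lt), Event(ts)) =>
        if (isLate(ts, wm)) (wm, ok, lt :+ ts) else (wm, ok :+ ts, lt)
    }
    (onTime, late)
  }
}
```

What happens to late elements (drop them, update an already emitted result, send them to a side channel) is a policy decision that the sketch leaves open.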
  • 30. Analyzing Infinite Data Streams. Given the advances in streaming we can: trade off latency against cost and accuracy; in certain use cases, replace batch processing with streaming.
  • 31. Analyzing Infinite Data Streams. Recent advances in streaming are a result of pioneering work: "MillWheel: Fault-Tolerant Stream Processing at Internet Scale", VLDB 2013; "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing", Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803.
  • 32. Analyzing Infinite Data Streams. Apache Beam is the open source successor of Google’s Dataflow. It is becoming the standard API for streaming. It provides the advanced semantics required by current streaming applications.
  • 33. Streaming Systems Architecture. The user provides a graph of computations through a high-level API, where data flows on the edges of this graph. Each vertex is an operator which executes a user operation (computation). For example: stream.map().keyBy()... Operators can run in multiple instances and preserve state (unlike batch processing, where we have immutable datasets). State can be persisted and restored in the presence of failures.
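A toy version of a stateful operator whose state can be persisted and restored, as described on this slide. This is not how any particular engine implements checkpoints; `CountOperator`, `snapshot` and `restore` are names invented for the sketch.

```scala
object StatefulOperatorSketch {
  // A keyed counting operator; `counts` is the state it preserves between elements.
  final class CountOperator(private var counts: Map[String, Long] = Map.empty) {
    // Process one element for `key` and return the updated count for that key.
    def process(key: String): Long = {
      val next = counts.getOrElse(key, 0L) + 1L
      counts = counts.updated(key, next)
      next
    }

    // Persist the state: the map is immutable, so returning it is a safe snapshot.
    def snapshot(): Map[String, Long] = counts
  }

  // After a failure, a fresh operator instance resumes from the last snapshot.
  def restore(snapshot: Map[String, Long]): CountOperator = new CountOperator(snapshot)
}
```

Elements processed after the snapshot have to be replayed from the source after a restore, which is where the delivery semantics discussed earlier come into play.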
  • 34. Analyzing Infinite Data Streams - Flink Example (https://github.com/skonto/trivento-summercamp-2016)

        sealed trait SensorType { def stype: String }

        case object TemperatureSensor extends SensorType { val stype = "TEMP" }

        case object HumiditySensor extends SensorType { val stype = "HUM" }

        case class SensorData(var sensorId: String, var value: Double,
                              var sensorType: SensorType, timestamp: Long)
  • 35. Analyzing Infinite Data Streams - Flink Example: The Source (https://github.com/skonto/trivento-summercamp-2016)

        import org.apache.flink.streaming.api.functions.source.SourceFunction
        import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
        import org.apache.flink.streaming.api.watermark.Watermark
        import scala.util.Random

        class SensorDataSource(val sensorType: SensorType, val numberOfSensors: Int,
                               val watermarkTag: Int, val numberOfElements: Int = -1)
            extends SourceFunction[SensorData] {
          final val serialVersionUID = 1L
          @volatile var isRunning = true
          var counter = 1
          var timestamp = 0
          val randomGen = Random

          require(numberOfSensors > 0)
          require(numberOfElements >= -1)

          // Base reading around which the generated values fluctuate.
          lazy val initialReading: Double = {
            sensorType match {
              case TemperatureSensor => 27.0
              case HumiditySensor => 0.75
            }
          }

          override def run(ctx: SourceContext[SensorData]): Unit = {
            // numberOfElements == -1 means: emit until cancelled.
            val counterCondition = {
              if (numberOfElements == -1) {
                x: Int => isRunning
              } else {
                x: Int => isRunning && counter <= x
              }
            }
            while (counterCondition(numberOfElements)) {
              Thread.sleep(10) // send sensor data every 10 milliseconds
              val dataId = randomGen.nextInt(numberOfSensors) + 1
              val data = SensorData(dataId.toString,
                initialReading + Random.nextGaussian() / initialReading, sensorType, timestamp)
              ctx.collectWithTimestamp(data, timestamp) // time starts at 0 in milliseconds
              timestamp = timestamp + 1
              if (timestamp % watermarkTag == 0) { // emit a watermark every watermarkTag elements
                ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
              }
              counter = counter + 1
            }
          }

          override def cancel(): Unit = {
            // No cleanup needed
            isRunning = false
          }
        }
  • 36. Analyzing Infinite Data Streams - Flink Example: The Job (https://github.com/skonto/trivento-summercamp-2016)

        import org.apache.flink.streaming.api.TimeCharacteristic
        import org.apache.flink.streaming.api.scala._
        import org.apache.flink.streaming.api.scala.function.WindowFunction
        import org.apache.flink.streaming.api.windowing.time.Time
        import org.apache.flink.streaming.api.windowing.windows.TimeWindow
        import org.apache.flink.util.Collector

        object SensorSimple {
          def main(args: Array[String]): Unit = {
            val env = StreamExecutionEnvironment.getExecutionEnvironment
            // set default env parallelism for all operators
            env.setParallelism(2)
            env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

            val numberOfSensors = 2
            val watermarkTag = 10
            val numberOfElements = 1000

            val sensorDataStream = env.addSource(
              new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))
            sensorDataStream.writeAsText("inputData.txt")

            // Key by sensor and group into 10 ms event-time windows.
            val windowedKeyed = sensorDataStream
              .keyBy(data => data.sensorId)
              .timeWindow(Time.milliseconds(10))

            windowedKeyed.max("value")
              .writeAsText("outputMaxValue.txt")

            windowedKeyed.apply(new SensorAverage())
              .writeAsText("outputAverage.txt")

            env.execute("Sensor Data Simple Statistics")
          }
        }

        // Emits one record per key and window carrying the average value.
        class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
          def apply(key: String, window: TimeWindow, input: Iterable[SensorData],
                    out: Collector[SensorData]): Unit = {
            if (input.nonEmpty) {
              val average = input.map(_.value).sum / input.size
              out.collect(input.head.copy(value = average))
            }
          }
        }
  • 37. Analyzing Infinite Data Streams - Flink Example. Operators run the operations defined by the graph of the streaming computation. Example operators: KeyBy, Map, FlatMap etc. The previous example runs two instances of the same operator (parallelism 2). [Figure: Operator 1 and Operator 2 receive elements along a time axis (0, 1, 2, ... 22); Watermark 1 is emitted at timestamp 10 and Watermark N at 10*N; the elements fall into window 1 and window 2, and the outputs go to file1 and file2.]
  • 38. Streaming vs Batch Systems (metric: batch / streaming). Data size per job: TB to PB / MB to TB (in flight). Time between data arrival and processing: many minutes to hours / microseconds to minutes. Job execution times: minutes to hours / microseconds to minutes.
  • 39. World of Patterns. A pattern (in general) is a perceptible regularity or a template (Wikipedia). Software patterns: a well-defined, reusable solution to a commonly occurring problem in software design, e.g. Template Method, Singleton etc. Software architecture patterns: an architectural pattern is a general, reusable solution to a commonly occurring problem in software architecture within a given context (Wikipedia), e.g. client-server, n-tier.
  • 40. World of Patterns Software Architecture vs Software Design. We use them everywhere but… they are not a silver bullet. Why? 40
  • 41. Software Architecture before Lambda Architecture Many definitions for software architecture. “Architecture: ⟨system⟩ fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution”. (ISO/IEC/IEEE 42010). “Software architecture refers to the fundamental structures of a software system, the discipline of creating such structures, and the documentation of these structures. These structures are needed to reason about the software system.” Wikipedia “It is about structure and vision”. Software architecture for developers, Simon Brown. “The highest-level breakdown of a system into its parts; the decisions that are hard to change; there are multiple architectures in a system; what is architecturally significant can change over a system's lifetime; and, in the end, architecture boils down to whatever the important stuff is.” Patterns of Enterprise Application Architecture, Martin Fowler 41
  • 42. Software Architecture is important Architectural decisions are decisions that have non-local consequences and they serve specific goals eg. in order to achieve a performance goal like high throughput I decided to use buffering within my system. Architectural decisions are important for your in-house project or your proposal if you are a consultant. 42
  • 43. Sound Architecture Principles: Why Do I Need Them? Scalability/elasticity. Extensibility: requirements will change, expect that. Minimized costs. Security awareness. Well-designed APIs for integration. Well-tested: don’t go to production and cross your fingers.
  • 44. Follow common sense... At the end of the day expect to throw everything out of the window under some circumstances. Business matters the most. Example: Non-functional requirements changed since load is huge and you are becoming successful, maybe you are the next Facebook. 44
  • 45. Software Architecture is important ...because there is a high cost of not making specific decisions, or of not making them early enough.
  • 46. Software Architecture is important. How about the wrong decisions? Image: http://www.awesomeinventions.com/wp-content/uploads/2014/10/balcony.jpg
  • 47. Software Architecture is important. Many more benefits where architecture is present: a documented architecture assists communication; guides implementation by imposing constraints; assists in technology decisions; assists in cost and time estimation; influences the structure of your organization and vice versa.
  • 48. Software Architecture LifeCycle. Steps: Architectural Requirements, Architectural Design, Architectural Documentation, Architectural Evaluation / Implementation.
  • 49. Lambda Architecture - Intro “Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system. The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.” 49 Nathan Marz and James Warren, Big Data: Principles and best practices of scalable real-time data systems, Manning Publications. Photo: https://images-na.ssl-images-amazon.com/images/I/51Bd93AGuOL._SX258_BO1,204,203,200_.jpg
  • 50. Lambda Architecture - Cont’d (1/5). Image: http://lambda-architecture.net/img/la-overview_small.png. Batch Layer: perfect accuracy, indexed batch views. Serving Layer: random access query support based on batch and real-time views. Speed Layer: processes real-time streams, provides real-time views, lower accuracy. Master dataset: append-only, immutable set of raw data.
  • 51. Lambda Architecture - Cont’d (2/5). Example components for each part. Batch layer: Hadoop. Batch output indexing: Druid, Impala etc. Speed output indexing: Druid, Cassandra, HBase etc. Speed processing: Spark, Flink etc.
  • 52. Lambda Architecture - Cont’d (3/5). Basic functions: (1) batch view = function(all data), with high latency and high throughput; (2) realtime view = function(realtime view, new data), with low latency and low throughput; (3) query = function(batch view, realtime view), with eventual accuracy.
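The three basic functions can be written down almost literally. The sketch below uses page-view counting as the computation; `PageView`, `View` and the in-memory collections are assumptions standing in for the master dataset and the two view stores.

```scala
object LambdaViewsSketch {
  final case class PageView(url: String, ts: Long)
  type View = Map[String, Long]

  // batch view = function(all data): recomputed from scratch over the master dataset.
  def batchView(master: Seq[PageView]): View =
    master.groupBy(_.url).map { case (url, views) => (url, views.size.toLong) }

  // realtime view = function(realtime view, new data): an incremental update per event.
  def realtimeView(current: View, event: PageView): View =
    current.updated(event.url, current.getOrElse(event.url, 0L) + 1L)

  // query = function(batch view, realtime view): merge both views at read time.
  def query(batch: View, realtime: View, url: String): Long =
    batch.getOrElse(url, 0L) + realtime.getOrElse(url, 0L)
}
```

The realtime view must only cover events that the latest batch view has not absorbed yet; otherwise the query would count them twice. Expiring realtime entries once a new batch view arrives is what keeps the merged answer eventually accurate.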
  • 53. Lambda Architecture - Cont’d (4/5). Key properties. Eventual accuracy: the batch layer is always behind in time and continuously produces batch outputs. Whenever a new batch output is available, it replaces the previous one. Eventually the batch layer catches up with the speed layer. Complexity isolation.
  • 54. Lambda Architecture - Cont’d (5/5). Advantages: Immutable data. Reprocessing takes care of code changes, human error etc. Disadvantages: Operating and maintaining two different systems (batch and streaming) is hard. Programming in two different paradigms makes the code base complex.
  • 55. What about Data Lakes? A data lake accumulates data from different applications. It does not transform data in any way. It allows access from multiple users: no data silos, data is not hidden in special systems. There is no schema following the data, only raw data; we apply a schema when we read the data. It includes structured, semi-structured, and unstructured data.
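Schema-on-read can be illustrated in a few lines: the lake keeps raw lines untouched, and a schema is applied only by the reader that needs one. The sample lines and the `Reading` type below are made up for the example.

```scala
object SchemaOnReadSketch {
  // The lake stores raw lines as they arrive; nothing is validated on write.
  val rawLake: Seq[String] = Seq(
    "sensor-1,TEMP,27.3",
    "free-form log line without structure",
    "sensor-2,HUM,0.71"
  )

  // One possible schema; a different reader could apply another one to the same data.
  final case class Reading(sensorId: String, kind: String, value: Double)

  // The schema is applied at read time; lines that do not fit it are skipped.
  def readAs(line: String): Option[Reading] =
    line.split(",") match {
      case Array(id, kind, v) => scala.util.Try(Reading(id, kind, v.toDouble)).toOption
      case _ => None
    }

  def readings: Seq[Reading] = rawLake.flatMap(line => readAs(line))
}
```

With schema-on-write, as in a data warehouse, the unstructured line would have been rejected or transformed at load time instead.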
  • 56. Data Lakes Categories Data reservoirs: Governed accumulation of data for later use. Data are secured and go under the process of ingestion, cleansing, profiling and indexing. Exploratory lakes: Accumulation of data without governance for ad-hoc analysis by data scientists et al to gain insights. Analytical lakes: Ingest your data to feed data pipelines for analytics. 56
  • 57. Data Lakes vs Data Warehouse. A data lake can be a replacement for a data warehouse in several scenarios, when that makes sense. Comparison (data lake / data warehouse). Schema: schema on-read / schema on-write. Users: data scientists, people who need ad hoc analysis / business analysts. Data: structured, semi-structured, unstructured / rigid structure. Flexibility: high, reprocessing is easy / low, tied to business processes.
  • 58. Data Lakes usually fail! Most projects fail... you have been warned! Your next data lake can become a big data swamp. Image: http://www.sharenator.com/Demotivationals_pt_3_P/
  • 59. Data Lakes extended with a Lambda Architecture. You can always use your Lambda Architecture on top of a data lake if that makes sense. A data lake can be your DFS with specific services built around it, like metadata management. It can make things easy, especially when you start small and try to figure out what you need. It can be very simple: you use the batch layer only for loading the data from a source for streaming. No presentation layer is needed. How about Kafka?
  • 60. Azure Data Lake 60 Image: https://azure.microsoft.com/en-us/solutions/data-lake/
  • 61. How about Data Silos? Separate containers of data. The big data platform or the big data system at hand should unify business information, development teams and data in a business useful way. Think about a scenario with microservices, event sourcing and analytics. 61
  • 63. Flickr’s Use case - The Problem. Magic View feature: a computer vision pipeline generates a set of computer vision tags, and reverse indexes are created per user along with aggregated tag info. Initially it was batch only; then a streaming layer was added for a live experience. Backfills were needed because of photos missed by the streaming layer (approximation errors) and because of code changes. Backfills via streaming were slow due to the nature of the RMW (read-modify-write) access pattern.
  • 64. Flickr’s Use case - Solution 64 Result = Combiner(Query(data))
  • 65. Implementing The Lambda Architecture. SMACK-stack-based Lambda Architecture. [Figure components: Mesos, Spark, HDFS, Spark or Flink, Kafka, Cassandra, query app, Akka-driven apps, user.]
  • 66. Machine Learning Support for Lambda Architecture. Build a model and serve it. Simple models vs complex models. Spark for model building and Flink for model serving. Parameter servers: https://issues.apache.org/jira/browse/SPARK-6932 https://github.com/rjagerman/glint http://parameterserver.org/ http://www.petuum.com/bosen.html https://github.com/JohnLangford/vowpal_wabbit/wiki
  • 67. Real World Implementation Tips. JVM-based technologies like Cassandra and Kafka need correct GC settings. Monitoring is a must. Cassandra, Kafka etc. provide JMX interfaces to get the counter values you need; you need to know and understand which ones are useful to monitor closely. It is not wise to co-locate everything; you need to be careful about each component's requirements. For example, ZooKeeper should run on its own box, but if co-located it should have its own high-speed volume assigned for its commit log. Vendors publish specific requirements for production, which stem from experience using the technology in production.
  • 68. Real World Implementation Tips. OS settings. Misused technologies. Example: Kafka is not a database. Design decisions. Example: time series data on Cassandra. Data locality and data movement. Example: Kafka rebalance. Logging: how do I monitor my job? Log correlation? For batch processing you need a flexible orchestration tool like https://github.com/apache/incubator-airflow. Within your data center vs across data centers. On cloud: availability zones vs regions.
  • 69. Beyond the Lambda Architecture: Kappa Architecture (2014), Zeta Architecture (2015), IoT-A Architecture (2010-2013), Butterfly Architecture (~2015), Fast Data architecture (~2016).
  • 70. Kappa Architecture. Introduced in 2014 by Jay Kreps, co-creator of Apache Kafka and CEO of Confluent. See https://www.oreilly.com/ideas/questioning-the-lambda-architecture. The Lambda architecture is good, but keeping two layers in sync is a lot of work and hard to achieve in practice: “The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be universally agreed on by everyone doing it.” Batch processing is a subset of stream processing. Different technologies want to take advantage of this fact and provide a holistic solution: Flink, http://data-artisans.com/batch-is-a-special-case-of-streaming/ and Spark, https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  • 71. Kappa Architecture 1. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days. 2. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table. 3. When the second job has caught up, switch the application to read from the new table. 4. Stop the old version of the job, and delete the old output table. Re-processing is done only when code changes. 71 Image: https://dmgpayxepw99m.cloudfront.net/kappa-61d0afc292912b61ce62517fa2bd4309.png
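The four reprocessing steps can be mimicked with in-memory stand-ins: a `Seq` plays the retained log, a `Map` plays an output table, and `Serving` records which table the application currently reads. All names and the summing job logic are assumptions made for illustration; no Kafka API is involved.

```scala
object KappaReprocessSketch {
  type Table = Map[String, Int]

  // Which output table the application reads, plus all tables that currently exist.
  final case class Serving(active: String, tables: Map[String, Table]) {
    def read(key: String): Option[Int] = tables(active).get(key)
  }

  // A streaming job: consume the retained log from the beginning and build a table.
  def runJob(log: Seq[(String, Int)], logic: Int => Int): Table =
    log.foldLeft(Map.empty[String, Int]) { case (table, (key, value)) =>
      table.updated(key, table.getOrElse(key, 0) + logic(value))
    }

  def reprocess(s: Serving, log: Seq[(String, Int)], newName: String,
                newLogic: Int => Int): Serving = {
    // Step 2: a second job instance replays the log into a new output table.
    val withNew = s.tables + (newName -> runJob(log, newLogic))
    // Steps 3 and 4: switch readers to the new table, then drop the old one.
    Serving(newName, withNew - s.active)
  }
}
```

The sketch runs the second job to completion before switching; with a live stream the switch happens once the new job has caught up with the head of the log, and until then both tables exist side by side, which is the temporary storage cost of this approach.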
  • 72. Kappa Architecture Pros & Cons 72 Pros: ● Develop and maintain only one streaming system. ● Reprocessing only when code changes. Cons: ● Need temp storage for the reprocessing streaming job.
  • 73. Kappa Architecture - When to use? 73 ● Algorithms of streaming and batch processing are the same. ● Batch and real-time outputs can be the same.
  • 74. Zeta Architecture. Introduced by MapR for supporting as-it-happens business (March 2015). Goals: exploit all existing hardware in the data center; back-up and disaster recovery support for real-time continuity; tolerance for human mistakes; end-to-end security; support for Google-scale systems.
  • 75. Zeta Architecture - Components Seven pluggable components: Distributed File System: All applications write here. Real-time Data Storage: Needed for high-speed business applications. Pluggable Compute Model / Execution Engine: Different needs need different engines. Deployment / Container Management: Allows for a common way to deploy resources. 75
  • 76. Zeta Architecture - Components Seven pluggable components: Solution Architecture: Focuses on solving a specific business problem. Enterprise Applications: Used to drive the architecture. Now they are realized via existing components. Dynamic and Global Resource Management: Allows dynamic allocation of resources which fits the business needs each time. 76
  • 77. Zeta Architecture Components and reference applications 77 Image: https://www.mapr.com/zeta-architecture
  • 78. Zeta Architecture Example. Images: https://www.mapr.com/zeta-architecture
  • 79. IoT-A Architecture Targets IoT applications proposed by Michael Hausenblas (MapR, Mesosphere) 2015. IoT leads to a Big Data architecture because: High volume of data from sensors Time-Series format of data or other type of formats. Data are generated at high-speed and business needs real-time processing. 79
  • 80. IoT-A Architecture. Basic architecture. MQ/SP: the message queue / stream processing block. DB: a real-time DB for indexing sensor data, with low latency. DFS: the distributed file system where batch jobs can be run and batch reports can be created.
  • 82. IoT-A Architecture - Implementation Technologies 82 http://iot-a.info/
  • 83. Butterfly Architecture ● Introduced by Milind Bhandarkar (Pivotal). ● The weak point of the Lambda architecture lies in the distributed file system, which cannot serve all layers. ● They propose the use of memory technologies other than DRAM (like storage class memory) to implement an efficient object storage engine. ● They use different abstractions compared to the files or dirs of a DFS: datasets, dataframes, eventstreams. [Figure labels: mutable, immutable, unmanaged, managed, log, publish, data frames, data sets, storage, ETL.] Butterfly image: http://sketch2draw.com/wp-content/uploads/2013/05/butterfly_thumb.jpg
  • 84. A Fast Data Architecture. Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
  • 85. Example IoT Application. Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
  • 86. Streaming Implementations Status 86 Apache Spark: Structured Streaming in v2 starts the improvement of the streaming engine. Still based on micro-batches but event-time support was added. Apache Flink: SQL API supported from v0.9 and on. Still important features are on the roadmap: scaling streaming jobs, mesos support, dynamic allocation.
  • 87. Picking the Right Tool for Streaming. Criteria to choose: processing semantics (strong consistency is needed for correctness); latency guarantees; deployment / operation; the ecosystem built around it; complex event processing (CEP); batch and streaming API support; community and support.
  • 88. Picking the Right Tool for Streaming. Some tips: Pick Flink if you need sub-second latency and Beam support. Pick Spark Streaming for its integration with the Spark ML libraries; its micro-batch mode is ideal for training models, and it has mature deployment capabilities. Pick Gearpump for materializing Akka Streams in a distributed fashion. Pick Kafka Streams for low-level, simple transformations of Kafka messages (it is a distributed solution out of the box). Check the Confluent Platform for many useful tools around Kafka.
  • 90. References. Books: Bhushan Lakhe, Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL, ISBN 9781484212882. Humberto Cervantes and Rick Kazman, Designing Software Architectures: A Practical Approach (SEI Series in Software Engineering), ISBN 9780134390789. Nathan Marz and James Warren, Big Data: Principles and best practices of scalable realtime data systems, ISBN 9781617290343.
  • 91. References - Cont’d. Web resources/articles: Questioning the Lambda Architecture - O'Reilly Media; Structured Streaming In Apache Spark | Databricks Blog; The world beyond batch: Streaming 101 - O'Reilly Media; The world beyond batch: Streaming 102 - O'Reilly Media; Data Centric Enterprise | MapR; Why local state is a fundamental primitive in stream processing - O'Reilly Media; Data processing architectures – Lambda and Kappa - Ericsson Research Blog; 2016 State of Fast Data Survey | OpsClarity; Zeta Architecture | MapR; Is Big Data Still a Thing? (The 2016 Big Data Landscape) – Matt Turck