Trivento summercamp masterclass 9/9/2016

Lambda Architecture and Beyond
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
Trivento Summercamp 2016
Amersfoort
De oude Prodentfabriek

Introduction
Introduction: Who Am I?

Agenda
A bit of history of Big Data Processing
Batch Systems vs Streaming Systems
What is Lambda Architecture?
Advantages, Disadvantages?
Use cases
Data Lakes, Data Silos etc...
Implementing Lambda Architecture, ML support, Implementation Tips
Beyond the Lambda Architecture (Kappa, FastData, Zeta etc)

Data Processing
Batch processing: processing done on a bounded dataset.
Stream processing (streaming): processing done on an unbounded dataset.
Data items are pushed or pulled.
Two categories of systems: batch vs streaming systems.

Big Data - The story
Internet scale apps moved data size from Gigabytes to Petabytes.
Once upon a time there were traditional RDBMS like Oracle and Data
Warehouses but volume, velocity and variety changed the game.

MapReduce was a major breakthrough (Google published the seminal paper in
2004).
The Nutch project already had an implementation in 2005.
In 2006 it became a subproject of Lucene under the name Hadoop.
In 2008 Yahoo brought Hadoop to production on a 10K-core cluster. The same year it
became a top-level Apache project.
Hadoop is good for batch processing.

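Word Count example - Inverted Index.
[Figure: MapReduce word count. The documents (doc1, doc2, ..., doc300) are divided into Split 1 to Split N. MAP emits (word, 1) pairs such as (w1,1), (w20,1), (w41,1). Shuffle groups the pairs by word: (w1, (1,1,1…)), (w41, (1,1,…)). REDUCE sums each group, e.g. (w1, 13).]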
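The three phases of the word count can be sketched on plain Scala collections. This is a single-machine illustration only, not Hadoop code; the names MapReduceSketch, mapPhase, shuffle and reducePhase are made up for the example. An inverted index follows the same shape if the map phase emits (word, docId) instead of (word, 1).

```scala
object MapReduceSketch {
  // Map phase: emit a (word, 1) pair for every word in a document.
  def mapPhase(doc: String): Seq[(String, Int)] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: group all emitted values by key (the word).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: sum the grouped counts of each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // In a real cluster each phase runs distributed over the splits.
  def wordCount(docs: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(docs.flatMap(mapPhase)))

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("big data is big", "fast data")))
}
```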

Giuseppe DeCandia et al., "Dynamo: Amazon's Highly Available Key-value
Store" changed the database world in 2007.
NoSQL databases, along with general systems like Hadoop, solve problems
that cannot be solved with traditional RDBMSs.
Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs
over more powerful CPUs.

There is a major shift in the industry as batch processing is not enough any
more.
Batch jobs usually take hours if not days to complete; in many applications that
is not acceptable.

The trend now is near-real time computation which implies streaming
algorithms and needs new semantics. Fast Data (data in motion) & Big
Data (data at rest) at the same time.
The enterprise needs to get smarter: all major players across industries
use ML on top of massive datasets to make better decisions.
Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530
https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg

OpsClarity report:
92% plan to increase their investment in stream processing applications in the
next year
79% plan to reduce or eliminate investment in batch processing
32% use real time analysis to power core customer-facing applications
44% agreed that it is tedious to correlate issues across the pipeline
68% identified lack of experience and underlying complexity of new data
frameworks as their barrier to adoption
http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html

Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html

In the OpsClarity report:
● Apache Kafka is the most popular broker technology (ingestion queue)
● HDFS the most used data sink
● Apache Spark is the most popular data processing tool.

Big Data Landscape
Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png

Big Data System
A Big Data System must have at least the following components at its core:
DFS: a distributed file system (such as S3 or HDFS) or a distributed database system (DDS).
A distributed data processing tool such as Spark or Hadoop.
Tools and services to manage the previous systems.

Big Data System - Layered View
A Big Data System has at least an infrastructure layer and an application layer.

Big Data System Design Considerations / Problems
Data Locality
Data Versioning
Code change
Resource allocation
Deployment/Operation
Integration
Backup/Failover Strategy
Scaling Strategy

Big Data System Quality
A Big Data System should be:
fault-tolerant
easy to debug
generic enough
scalable
extensible
able to support ad-hoc queries
high throughput
able to support low latency reads/writes

Big Data and Immutable Data
Immutable data provide the following benefits:
Fault-tolerance to human error (you can always replay history and fix things)
Simplicity: no index is needed for retrieval and update; newly arrived data is simply appended.

Big Data System - Delivery/Processing Semantics
In distributed systems failure is part of the game. What semantics can I achieve for message delivery?
at-most-once delivery: for each message sent, that message is delivered zero or one times.
at-least-once delivery: for each message sent potentially multiple attempts are made at delivering it,
such that at least one succeeds; messages may be duplicated but not lost.
exactly-once delivery: for each message sent exactly one delivery is made to the recipient; the
message can neither be lost nor duplicated.
In theory it is impossible to have exactly-once delivery.
In practice we might care more about exactly-once state changes on top of at-least-once delivery. Example:
keeping state at some operator of the streaming graph.
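One common way to get exactly-once state changes on top of at-least-once delivery is to make the state update idempotent. The sketch below assumes every message carries a unique id (an assumption, not something the slides specify) and remembers which ids were already applied; Message, State and applyMessage are illustrative names.

```scala
object IdempotentCounter {
  // Assumption: the sender attaches a unique id to every message.
  final case class Message(id: Long, amount: Int)

  // Operator state: the running total plus the ids already applied.
  final case class State(total: Int = 0, seen: Set[Long] = Set.empty)

  // Applying a message twice has the same effect as applying it once.
  def applyMessage(state: State, msg: Message): State =
    if (state.seen.contains(msg.id)) state
    else State(state.total + msg.amount, state.seen + msg.id)

  def main(args: Array[String]): Unit = {
    // Message 2 is redelivered, which at-least-once delivery allows.
    val delivered = Seq(Message(1, 10), Message(2, 5), Message(2, 5), Message(3, 1))
    val finalState = delivered.foldLeft(State())(applyMessage)
    println(finalState.total)
  }
}
```

In a real system the set of seen ids has to be bounded, for example by tracking offsets per partition instead of individual ids.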

Batch Systems - The Hadoop Ecosystem
YARN (Yet Another Resource Negotiator) was deployed in production at Yahoo in
March 2013.
The same year Cloudera, the dominant Hadoop vendor, embraced Spark as the
next-generation replacement for MapReduce.
Image: Lightbend Inc.

Hadoop clusters, the gold standard for big data from ~2008 to the present.
Strengths:
Lowest CapEx system for Big Data.
Excellent for ingesting and integrating diverse datasets.
Flexible: from classic analytics (aggregations and data warehousing) to machine learning.

Weaknesses:
Complex administration.
YARN can’t manage all distributed services.
MapReduce has poor performance and a difficult programming model, and it doesn't support stream
processing.

Analyzing Infinite Data Streams
What does it mean to run a SQL query on an unbounded dataset?
How should I deal with late data?
What kind of time measurement should I use? Event-time, Processing time or
Ingestion time?
Accuracy of computations on bounded datasets vs on unbounded datasets
Algorithms for streaming computations?

Two cases for processing:
Single event processing: event transformation, trigger an alarm on an error event
Event aggregations: summary statistics, group-by, join and similar queries. For example
compute the average temperature for the last 5 minutes from a sensor data stream.

Event aggregation introduces the concept of windowing with respect to the chosen
notion of time:
Event time (the time that events happen): Important for most use cases where context and
correctness matter at the same time. Example: billing applications, anomaly detection.
Processing time (the time they are observed during processing): Use cases where I only care
about what I process in a window. Example: accumulated clicks on a page per second.
System Arrival or Ingestion time (the time that events arrived at the streaming system).
Ideally event time = processing time. In reality there is skew.

Windows come in different flavors:
Tumbling windows discretize a stream into non-overlapping windows.
Sliding windows slide over the stream of data, so consecutive windows can overlap.
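The difference between the two flavors shows up in how a timestamp is assigned to windows. The following is a plain Scala sketch, not a streaming engine API; Window, tumbling and sliding are illustrative names, and timestamps are assumed to be non-negative.

```scala
object WindowAssignment {
  // A window covers the half-open interval [start, end).
  final case class Window(start: Long, end: Long)

  // Tumbling: every timestamp falls into exactly one window of length `size`.
  def tumbling(ts: Long, size: Long): Window = {
    val start = ts - (ts % size)
    Window(start, start + size)
  }

  // Sliding: a new window of length `size` starts every `slide` units,
  // so one timestamp can belong to several overlapping windows.
  def sliding(ts: Long, size: Long, slide: Long): List[Window] = {
    val lastStart = ts - (ts % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(_ > ts - size)
      .map(start => Window(start, start + size))
      .toList
  }
}
```

For example, with size 10 and slide 5 the timestamp 7 falls into both [5, 15) and [0, 10), while a tumbling window of size 10 places it only in [0, 10).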

Watermarks indicate that no elements with a timestamp older than or equal to the
watermark timestamp should arrive for the specific window of data.
Triggers decide when the window is evaluated or purged.
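To make the interplay concrete, here is a minimal single-threaded sketch of event-time tumbling windows that are evaluated once a watermark passes their last timestamp. It is a simplification with an assumed trigger (fire and purge on watermark) and with late elements dropped; Event, Watermark and run are illustrative names, not the classes of any streaming library.

```scala
object WatermarkSketch {
  sealed trait StreamElement
  final case class Event(timestamp: Long, value: Double) extends StreamElement
  final case class Watermark(timestamp: Long) extends StreamElement

  // Returns (windowStart, max value) for every window fired by a watermark.
  def run(input: Seq[StreamElement], size: Long): Seq[(Long, Double)] = {
    var open = Map.empty[Long, Vector[Double]] // window start -> buffered values
    var currentWatermark = Long.MinValue
    val fired = Seq.newBuilder[(Long, Double)]
    input.foreach {
      case Event(ts, v) =>
        val start = ts - (ts % size)
        // An element is late if its window was already evaluated; drop it.
        if (start + size - 1 > currentWatermark)
          open = open.updated(start, open.getOrElse(start, Vector.empty) :+ v)
      case Watermark(wm) =>
        currentWatermark = math.max(currentWatermark, wm)
        // Trigger: evaluate and purge every window whose last timestamp is covered.
        val (ready, pending) =
          open.partition { case (start, _) => start + size - 1 <= currentWatermark }
        ready.toSeq.sortBy(_._1).foreach { case (start, vs) => fired += ((start, vs.max)) }
        open = pending
    }
    fired.result()
  }
}
```

Production engines usually offer more options here, such as an allowed lateness or triggers that fire early and update results later.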

Given the advances in streaming we can:
Trade off latency against cost and accuracy
In certain use cases replace batch processing with streaming

Recent advances in streaming are the result of pioneering work:
MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
The Dataflow Model: A Practical Approach to Balancing Correctness,
Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data
Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp.
1792-1803

Apache Beam is the open-source successor of Google's Dataflow.
It is becoming the standard API for streaming and provides the advanced semantics
needed by current streaming applications.

Streaming Systems Architecture
The user provides a graph of computations through a high-level API where data
flows on the edges of this graph. Each vertex is an operator which executes
a user operation (computation). For example: stream.map().keyBy()...
Operators can run in multiple instances and preserve state (unlike batch
processing where we have immutable datasets).
State can be persisted and restored in the presence of failures.
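The idea of operator state that survives failures can be sketched without any framework. The snapshot/restore mechanism below is a deliberately naive stand-in for real checkpointing, and RunningSum, snapshot and restore are illustrative names.

```scala
object StatefulOperatorSketch {
  // A keyed running-sum operator; `state` is what a checkpoint must capture.
  final class RunningSum(private var state: Map[String, Double] = Map.empty) {
    def process(key: String, value: Double): Double = {
      val updated = state.getOrElse(key, 0.0) + value
      state = state.updated(key, updated)
      updated
    }

    // Persist: hand out an immutable copy of the current state.
    def snapshot(): Map[String, Double] = state
  }

  // Restore: start a fresh operator instance from the last snapshot.
  def restore(snap: Map[String, Double]): RunningSum = new RunningSum(snap)
}
```

After a failure the operator resumes from the last snapshot, and the elements processed since that snapshot have to be replayed from the source.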

Analyzing Infinite Data Streams - Flink Example
sealed trait SensorType { def stype: String }
case object TemperatureSensor extends SensorType { val stype = "TEMP" }
case object HumiditySensor extends SensorType { val stype = "HUM" }
case class SensorData(var sensorId: String, var value: Double, var sensorType: SensorType, timestamp: Long)
https://github.com/skonto/trivento-summercamp-2016

import scala.util.Random
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.watermark.Watermark

class SensorDataSource(val sensorType: SensorType, val numberOfSensors: Int,
                       val watermarkTag: Int, val numberOfElements: Int = -1) extends SourceFunction[SensorData] {
  final val serialVersionUID = 1L
  @volatile var isRunning = true
  var counter = 1
  var timestamp = 0
  val randomGen = Random
  require(numberOfSensors > 0)
  require(numberOfElements >= -1)

  // starting value around which the simulated readings fluctuate
  lazy val initialReading: Double = {
    sensorType match {
      case TemperatureSensor => 27.0
      case HumiditySensor => 0.75
    }
  }

  override def run(ctx: SourceContext[SensorData]): Unit = {
    // numberOfElements == -1 means: run until cancelled
    val counterCondition = {
      if (numberOfElements == -1) {
        x: Int => isRunning
      } else {
        x: Int => isRunning && counter <= x
      }
    }
    while (counterCondition(numberOfElements)) {
      Thread.sleep(10) // send sensor data every 10 milliseconds
      val dataId = randomGen.nextInt(numberOfSensors) + 1
      val data = SensorData(dataId.toString, initialReading + Random.nextGaussian()/initialReading, sensorType, timestamp)
      ctx.collectWithTimestamp(data, timestamp) // time starts at 0 in millisecs
      timestamp = timestamp + 1
      if (timestamp % watermarkTag == 0) { // watermark should be mod 0
        ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
      }
      counter = counter + 1
    }
  }

  override def cancel(): Unit = {
    // No cleanup needed
    isRunning = false
  }
}
The Source

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object SensorSimple {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // set default env parallelism for all operators
    env.setParallelism(2)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val numberOfSensors = 2
    val watermarkTag = 10
    val numberOfElements = 1000
    val sensorDataStream =
      env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))
    sensorDataStream.writeAsText("inputData.txt")
    // key by sensor and group into 10 ms event-time windows
    val windowedKeyed = sensorDataStream
      .keyBy(data => data.sensorId)
      .timeWindow(Time.milliseconds(10))
    windowedKeyed.max("value")
      .writeAsText("outputMaxValue.txt")
    windowedKeyed.apply(new SensorAverage())
      .writeAsText("outputAverage.txt")
    env.execute("Sensor Data Simple Statistics")
  }
}

// emits one record per key and window carrying the average value
class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
  def apply(key: String, window: TimeWindow, input: Iterable[SensorData], out: Collector[SensorData]): Unit = {
    if (input.nonEmpty) {
      val average = input.map(_.value).sum / input.size
      out.collect(input.head.copy(value = average))
    }
  }
}
The Job

[Figure: two instances of the same operator (Operator 1, Operator 2) consume timestamped elements along a time axis 0, 1, 2, ... 22. Watermark 1 is emitted at timestamp 10 and Watermark N at 10*N, delimiting window 1 and window 2. Outputs: file1, file2.]
Operators run the operations defined by the graph of
the streaming computation. Example operators:
KeyBy, Map, FlatMap etc.
Two instances of the same operator with parallelism
2 (previous example).

Streaming vs Batch Systems
Metric                                     Batch                    Streaming
Data size per job                          TB to PB                 MB to TB (in flight)
Time between data arrival and processing   Many minutes to hours    Microseconds to minutes
Job execution times                        Minutes to hours         Microseconds to minutes

World of Patterns
Pattern (in general) … is a perceptible regularity or a template (Wikipedia).
Software Patterns: a well-defined, reusable solution to a commonly occurring
problem in software design, e.g. Template Method, Singleton etc.
Software Architecture Patterns: An architectural pattern is a general, reusable
solution to a commonly occurring problem in software architecture within a
given context (Wikipedia), e.g. client-server, n-tier.

World of Patterns
Software Architecture vs Software Design.
We use them everywhere but… they are not a silver bullet. Why?

Software Architecture before Lambda Architecture
Many definitions for software architecture.
“Architecture: ⟨system⟩ fundamental concepts or properties of a system in its environment embodied in its elements,
relationships, and in the principles of its design and evolution”. (ISO/IEC/IEEE 42010).
“Software architecture refers to the fundamental structures of a software system, the discipline of creating such
structures, and the documentation of these structures. These structures are needed to reason about the software
system.” Wikipedia
“It is about structure and vision”. Software architecture for developers, Simon Brown.
“The highest-level breakdown of a system into its parts; the decisions that are hard to change; there are multiple
architectures in a system; what is architecturally significant can change over a system's lifetime; and, in the end,
architecture boils down to whatever the important stuff is.” Patterns of Enterprise Application Architecture, Martin Fowler

Software Architecture is important
Architectural decisions are decisions that have non-local consequences and
serve specific goals, e.g. in order to achieve a performance goal like high
throughput I decide to use buffering within my system.
Architectural decisions are important for your in-house project or your proposal if
you are a consultant.

Sound Architecture Principles: Why Do I Need Them?
Scalability/Elasticity
Extensibility: requirements will change, expect that
Minimized costs
Security awareness
Well-designed APIs for integration
Well-tested: don't go to production and cross your fingers.

Follow common sense...
At the end of the day expect to throw everything out of the window under some
circumstances. Business matters the most.
Example: Non-functional requirements changed since load is huge and you are
becoming successful, maybe you are the next Facebook.

...because there is a high cost of not making specific decisions or not making them
early enough.

How about the wrong decisions?
Image: http://www.awesomeinventions.com/wp-content/uploads/2014/10/balcony.jpg

Many more benefits where architecture is present:
A documented architecture assists communication
Guides implementation by imposing constraints
Assists in technology decisions
Assists in cost and time estimation
Influences the structure of your organization and vice versa

Software Architecture LifeCycle
Steps:
Architectural Requirements
Architectural Design
Architectural Documentation
Architectural Evaluation / Implementation

Lambda Architecture - Intro
“Computing arbitrary functions on an arbitrary dataset in real time is a daunting
problem. There is no single tool that provides a complete solution. Instead,
you have to use a variety of tools and techniques to build a complete Big Data
system. The lambda architecture solves the problem of computing arbitrary
functions on arbitrary data in real time by decomposing the problem into three
layers: the batch layer, the serving layer, and the speed layer.”
Nathan Marz and James Warren, Big Data: Principles and best practices
of scalable real-time data systems, Manning Publications.
Photo: https://images-na.ssl-images-amazon.com/images/I/51Bd93AGuOL._SX258_BO1,204,203,200_.jpg

Lambda Architecture - Cont’d (1/5)
Image: http://lambda-architecture.net/img/la-overview_small.png
Batch Layer: perfect accuracy, indexed batch views
Serving Layer: random access query support based on batch & real-time views
Speed Layer: process real-time streams, provides real-time views, lower
accuracy
Master dataset: append-only, immutable set of raw data

Example components for each part:
Batch layer: Hadoop
Batch Output Indexing: Druid, Impala etc
Speed Output Indexing: Druid, Cassandra, HBase etc
Speed processing: Spark, Flink etc

Basic functions:
batch view = function (all data) <- high latency, high throughput
realtime view = function (realtime view, new data) <- low latency, low throughput
query = function (batch view, realtime view) <- eventual accuracy
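These three functions can be sketched for a simple page-view counter. The example is illustrative only: PageView and the object name are made up, and the realtime view is assumed to cover only the events that the latest batch view has not absorbed yet.

```scala
object LambdaSketch {
  final case class PageView(url: String, timestamp: Long)

  // batch view = function(all data): recomputed from the whole master dataset.
  def batchView(masterDataset: Seq[PageView]): Map[String, Long] =
    masterDataset.groupBy(_.url).map { case (url, views) => (url, views.size.toLong) }

  // realtime view = function(realtime view, new data): incremental update per event.
  def realtimeView(view: Map[String, Long], event: PageView): Map[String, Long] =
    view.updated(event.url, view.getOrElse(event.url, 0L) + 1L)

  // query = function(batch view, realtime view): merge both views at read time.
  def query(batch: Map[String, Long], realtime: Map[String, Long], url: String): Long =
    batch.getOrElse(url, 0L) + realtime.getOrElse(url, 0L)
}
```

Once a new batch view has absorbed the recent events, the corresponding part of the realtime view is discarded, which is where the eventual accuracy comes from.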

Key Properties:
Eventual Accuracy
The batch layer is always behind in time and continuously produces batch outputs. Whenever a
new batch output is available, it replaces the latest one. Eventually the batch layer catches
up with the speed layer.
Complexity Isolation

Advantages:
Immutable data.
Reprocessing takes care of code changes, human error etc.
Disadvantages:
Operating and maintaining two different systems (batch & streaming) is hard.
Programming in two different paradigms makes the code base complex.

What about Data Lakes?
A data lake accumulates data from different applications.
It does not transform data in any way.
Access from multiple users, no data silos, data is not hidden in special
systems.
There is no schema following the data, only raw data. We apply a schema
when we read the data.
Includes structured, semi-structured, and unstructured data.

Data Lakes Categories
Data reservoirs: Governed accumulation of data for later use. Data are secured
and go under the process of ingestion, cleansing, profiling and indexing.
Exploratory lakes: Accumulation of data without governance for ad-hoc analysis
by data scientists et al to gain insights.
Analytical lakes: Ingest your data to feed data pipelines for analytics.

Data Lakes vs Data Warehouse
A data lake can replace a data warehouse in scenarios where that
makes sense.
              Data Lake                                          Data Warehouse
Schema        Schema on-read                                     Schema on-write
Users         Data scientists, people who need ad hoc analysis   Business analysts
Data          Structured, semi-structured, unstructured          Rigid structure
Flexibility   High, reprocessing is easy.                        Low, tied to business processes.

Data Lakes usually fail!
Most projects fail... you have been warned! Your next data lake can become
a big data swamp.
Image: http://www.sharenator.com/Demotivationals_pt_3_P/

Data Lakes extended with a Lambda Architecture
You can always use your Lambda Architecture on top of a data lake if that
makes sense. A data lake can be your DFS with specific services built
around it, like metadata management. It can make things easy especially
when you start small and try to figure out what you need.
It can be very simple where you use the batch layer for loading the data
from a source for streaming only. No presentation layer is needed.
How about Kafka?

Azure Data Lake
Image: https://azure.microsoft.com/en-us/solutions/data-lake/

How about Data Silos?
Separate containers of data.
The big data platform or the big data system at hand should unify business
information, development teams and data in a business useful way.
Think about a scenario with microservices, event sourcing and analytics.

Use Cases
Yahoo
Netflix
Flickr

Flickr’s Use case - The Problem
Magic View feature: a computer vision pipeline generates a set of
computer vision tags, and reverse indexes are created per user along
with aggregated tag info.
Initially batch only; then a streaming layer was added for a live experience.
Backfills were needed because of photos missed by the streaming layer
(approximation errors) and because of code changes.
Backfills via streaming were slow due to the nature of the read-modify-write (RMW) access pattern.

Flickr’s Use case - Solution
Result = Combiner(Query(data))

Implementing The Lambda Architecture
SMACK stack based Lambda Architecture:
[Figure: SMACK-based Lambda Architecture. Components shown: Mesos, Spark on HDFS, Spark or Flink, Kafka, Cassandra, a query app, Akka-driven apps, and the user.]

Machine Learning Support for Lambda Architecture
Build a model and serve it. Simple models vs complex models.
Spark for model building and Flink for model serving.
Parameter servers:
https://issues.apache.org/jira/browse/SPARK-6932
https://github.com/rjagerman/glint
http://parameterserver.org/
http://www.petuum.com/bosen.html
https://github.com/JohnLangford/vowpal_wabbit/wiki

Real World Implementation Tips
JVM-based technologies like Cassandra and Kafka need correct GC settings.
Monitoring is a must. Cassandra, Kafka etc. provide JMX interfaces to get the
counter values you need. You need to know and understand which are useful
to monitor closely.
It is not wise to co-locate everything; you need to be careful about
component requirements. For example, ZooKeeper should run on its own
box, but if co-located it should have its own high-speed volume assigned for its
commit log.
Vendors publish specific requirements for production, stemming from experience using
the technology in production.

Real World Implementation Tips
OS settings.
Misusing technologies. Example: Kafka is not a database.
Design decisions. Example: time series data on Cassandra.
Data locality and data movement. Example: Kafka rebalance.
Logging. How do I monitor my job? Log correlation?
For batch processing you need a flexible orchestration tool like:
https://github.com/apache/incubator-airflow
Within your data center vs across data centers. On cloud: availability zones
vs regions.

Beyond the Lambda Architecture
Kappa Architecture (2014)
Zeta Architecture (2015)
IoT-A Architecture (2010-2013)
Butterfly Architecture (~2015)
Fast Data architecture (~2016)

Kappa Architecture
Introduced by Jay Kreps, the co-creator of Apache Kafka and CEO of Confluent in 2014.
See https://www.oreilly.com/ideas/questioning-the-lambda-architecture
The Lambda architecture is good, but keeping two layers in sync is a lot of work and hard to achieve in practice.
“The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be
universally agreed on by everyone doing it.”
Batch processing is a subset of stream processing. Different technologies want to take advantage of this fact and provide a
holistic solution:
Flink, http://data-artisans.com/batch-is-a-special-case-of-streaming/
Spark, https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

Kappa Architecture
1. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows
for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days.
2. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the
beginning of the retained data, but direct this output data to a new output table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job, and delete the old output table.
Re-processing is done only when code changes.
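The four reprocessing steps can be sketched with in-memory stand-ins: a Vector plays the role of the retained log and immutable maps play the role of output tables. All names (KappaSketch, replay, Serving, reprocess) are hypothetical, and no Kafka API is involved.

```scala
object KappaSketch {
  // An output table maps a log offset to the derived value.
  type Table = Map[Long, Int]

  // Step 1 stand-in: `log` is the retained, replayable log. A job version
  // is a function applied to every event from the beginning of the log.
  def replay(log: Vector[Int], logic: Int => Int): Table =
    log.zipWithIndex.map { case (event, offset) => offset.toLong -> logic(event) }.toMap

  // Which table the application reads from, plus all existing tables.
  final case class Serving(live: String, tables: Map[String, Table])

  def reprocess(log: Vector[Int], serving: Serving,
                newName: String, newLogic: Int => Int): Serving = {
    // Step 2: a second job instance writes into a new output table.
    val withNew = serving.tables + (newName -> replay(log, newLogic))
    // Step 3: switch the application to the new table.
    // Step 4: stop the old job and delete the old output table.
    Serving(newName, withNew - serving.live)
  }
}
```

In a real deployment the second job runs concurrently with the first and the switch happens only once the new job has caught up with the head of the log.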
Image: https://dmgpayxepw99m.cloudfront.net/kappa-61d0afc292912b61ce62517fa2bd4309.png

Kappa Architecture Pros & Cons
Pros:
● Develop and maintain only one streaming system.
● Reprocessing only when code changes.
Cons:
● Need temp storage for the reprocessing streaming job.

Kappa Architecture - When to use?
● Algorithms of streaming and batch processing are the same.
● Batch and real-time outputs can be the same.

Zeta Architecture
Introduced by MapR for supporting as-it-happens business (March 2015).
Goals:
Exploit all existing hardware in the data center.
Back-up and disaster recovery support for real-time continuity
Tolerance for human mistakes
End-to-end security
Support Google-scale systems

Zeta Architecture - Components
Seven pluggable components:
Distributed File System: All applications write here.
Real-time Data Storage: Needed for high-speed business applications.
Pluggable Compute Model / Execution Engine: Different needs need
different engines.
Deployment / Container Management: Allows for a common way to deploy
resources.

Zeta Architecture - Components
Seven pluggable components:
Solution Architecture: Focuses on solving a specific business problem.
Enterprise Applications: Used to drive the architecture. Now they are
realized via existing components.
Dynamic and Global Resource Management: Allows dynamic allocation of
resources which fits the business needs each time.

Zeta Architecture
Components and reference applications
Image: https://www.mapr.com/zeta-architecture

Zeta Architecture Example
Images: https://www.mapr.com/zeta-architecture

IoT-A Architecture
Targets IoT applications; proposed by Michael Hausenblas (MapR, Mesosphere) in
2015.
IoT leads to a Big Data architecture because:
High volume of data from sensors.
Data comes as time series or in other formats.
Data are generated at high speed and the business needs real-time processing.

IoT-A Architecture
Basic Architecture:
Message Queue / Stream Processing block (MQ/SP)
DB: A real-time DB for indexing sensor data. Low Latency.
DFS: The distributed file system where batch jobs can be run and batch
reports can be created.

IoT-A Architecture
http://iot-a.info/

IoT-A Architecture - Implementation Technologies
http://iot-a.info/

Butterfly Architecture
● Introduced by Milind Bhandarkar (Pivotal).
● The weak point of the Lambda architecture lies in the distributed file system which cannot serve
all layers.
● They propose the use of different memory technologies than DRAM (like storage class memory)
to implement an efficient object storage engine.
● They use different abstractions compared to files or dirs of DFS: datasets, dataframes,
eventstreams.
[Figure: butterfly diagram. Labels: mutable / immutable, unmanaged / managed, log, publish, data frames, data sets, storage, ETL.]
Butterfly Image: http://sketch2draw.com/wp-content/uploads/2013/05/butterfly_thumb.jpg

A Fast Data Architecture
Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

Example IoT Application
Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

Streaming Implementations Status
Apache Spark: Structured Streaming in v2 starts the improvement of the
streaming engine. Still based on micro-batches but event-time support was
added.
Apache Flink: SQL API supported from v0.9 onwards. Important features are still
on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.

Picking the Right Tool for Streaming
Criteria to choose:
Processing semantics (strong consistency is needed for correctness)
Latency guarantees
Deployment / Operation
Ecosystem built around it
Complex event processing (CEP)
Batch & Streaming API support
Community & Support

Picking the Right Tool for Streaming
Some tips
Pick Flink if you need sub-second latency and Beam support.
Pick Spark Streaming for its integration with the Spark ML libraries, its micro-batch mode (ideal for
training models) and its mature deployment capabilities.
Pick Gearpump for materializing Akka Streams in a distributed fashion.
Pick Kafka Streams for low-level, simple transformations of Kafka messages (it is a distributed
solution out of the box). Check the Confluent Platform for many useful tools around Kafka.

References
Books:
Bhushan Lakhe, Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL. ISBN 9781484212882.
Humberto Cervantes, Rick Kazman, Designing Software Architectures: A Practical Approach (SEI Series in Software Engineering). ISBN 9780134390789.
Nathan Marz, James Warren, Big Data: Principles and best practices of scalable realtime data systems. ISBN 9781617290343.

References - Cont’d
Web resources/Articles:
Questioning the Lambda Architecture - O'Reilly Media
Structured Streaming In Apache Spark | Databricks Blog
The world beyond batch: Streaming 101 - O'Reilly Media
The world beyond batch: Streaming 102 - O'Reilly Media
Data Centric Enterprise | MapR
Why local state is a fundamental primitive in stream processing - O'Reilly Media
Data processing architectures – Lambda and Kappa - Ericsson Research Blog
2016 State of Fast Data Survey | OpsClarity
Zeta Architecture | MapR
Is Big Data Still a Thing? (The 2016 Big Data Landscape) – Matt Turck
