Big Data and Fast Data - big and fast combined, is it possible?

2013 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 
WELCOME Big Data and Fast Data
big and fast combined – is it
possible?
Guido Schmutz and Albert Blarer
24. April 2013

Guido Schmutz
•  Working for Trivadis for more than 16 years
•  Oracle ACE Director for Fusion Middleware and SOA
•  Co-Author of different books
•  Consultant, trainer and software architect for Java, Oracle, SOA and EDA
•  Member of Trivadis Architecture Board
•  Technology Manager @ Trivadis
•  More than 25 years of software development  
experience
•  Contact: guido.schmutz@trivadis.com
•  Blog: http://guidoschmutz.wordpress.com
•  Twitter: gschmutz


Trivadis – the company
With more than 600 IT and business experts on site with you.
•  11 Trivadis branch offices with more than 600 employees
•  200 service level agreements
•  More than 4,000 training participants
•  Research and development budget: CHF 5.0 million / EUR 4 million
•  Financially independent and sustainably profitable
•  Experience from more than 1,900 projects per year for more than 800 customers
•  Locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart
As of 12/2012

Credits
Nathan Marz
Author of "Big Data: Principles and best practices of scalable realtime data systems" (Manning)
Previously worked at BackType and Twitter
Creator of
•  Storm
•  Cascalog
•  ElephantDB

Agenda
1.  Big Data, what is it?
2.  Motivation
3.  The Lambda Architecture
4.  Implementing the Lambda Architecture
5.  Summary

Big Data Definition (Gartner et al)
Volume: tera-, peta-, exa-, zetta-, yottabytes and constantly growing
Velocity: “traditional” computing in an RDBMS is not scalable enough; we are looking for “linear scalability”
Variety: “Only … structured information is not enough”; “95% of produced data is unstructured”
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Veracity (IBM): information uncertainty
+ Time to action? Big Data + Event Processing = Fast Data

Big Data Emerging Technologies
§  MapReduce (e.g. Apache Hadoop)
§  Event Stream Processing & CEP (e.g. Storm or Esper)
§  New messaging systems (e.g. Apache Kafka)
§  Integration tools (e.g. Spring or Camus)
§  New database paradigms (e.g. NoSQL or NewSQL)
§  Data mining tools (e.g. Apache Mahout)
§  Data extraction and detection tools (e.g. Apache Tika)

Volume Development
[Figure: global data volume in exabytes (0 to 8000) and aggregate uncertainty in % (0 to 100) by year, 2005 to 2015. Data sources shown: sensors (“internet of things”); social media (video, audio, text); VoIP (Skype, MSN, ICQ, ...); enterprise data (data dictionary, ERD, ...).]

Velocity
§  Velocity requirement examples:
§  Recommendation Engine
§  Predictive Analytics
§  Marketing Campaign Analysis
§  Customer Retention and Churn Analysis
§  Social Graph Analysis
§  Capital Markets Analysis
§  Risk Management
§  Rogue Trading
§  Fraud Detection
§  Retail Banking
§  Network Monitoring
§  Research and Development

2.  Motivation

What is a data system?
•  A system that manages the storage and querying of data with a
lifetime measured in years encompassing every version of the
application to ever exist, every hardware failure and every human
mistake ever made.
•  A data system answers questions based on information that was
acquired in the past
•  Not all bits of information are equal
•  Some information is derived from other information

Desired Properties of a (Big) Data System
Robust and fault-tolerant
Low latency reads and updates
Scalable
General
Extensible
Allows ad hoc queries
Minimal maintenance
Debuggable

Typical problem in today’s architecture/systems
Bugs will be deployed to production over the lifetime of a data system
Operational mistakes will be made
Humans are part of the overall system
•  Just like hard disks, CPUs, memory, software
•  Design for human error like you design for any other fault
Examples of human error
•  Deploy a bug that increments counters by two instead of by one
•  Accidentally delete data from database
•  Accidental DoS on an important internal service
Worst two consequences: data loss or data corruption
As long as an error doesn't lose or corrupt good data, you can fix what went wrong
Lack of Human Fault Tolerance

Mutability
The U and D in CRUD
A mutable system updates the current state of the world
Mutable systems inherently lack human fault-tolerance
Easy to corrupt or lose data
Capturing change traditionally
Before the update:
Name    City
Guido   Berne
Albert  Zurich
After the update (the previous value is overwritten):
Name    City
Guido   Basel
Albert  Zurich

Immutability
An immutable system captures historical records of events
Each event happens at a particular time and is always true
Capturing change by storing events
Before the change:
Name    City    Timestamp
Guido   Berne   1.8.1999
Albert  Zurich  10.5.1988
After the change (a new event is appended, nothing is overwritten):
Name    City    Timestamp
Guido   Berne   1.8.1999
Albert  Zurich  10.5.1988
Guido   Basel   1.4.2013
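The difference between overwriting a row and storing events can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the names `Moved`, `current_city` and `city_history`, and the in-memory list standing in for the event store, are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Moved:
    """One immutable fact: a person lives in a city as of a given date."""
    name: str
    city: str
    timestamp: date


# Append-only event log (hypothetical in-memory stand-in for the event store).
events = [
    Moved("Guido", "Berne", date(1999, 8, 1)),
    Moved("Albert", "Zurich", date(1988, 5, 10)),
]

# Guido moves: a new event is appended, nothing is overwritten.
events.append(Moved("Guido", "Basel", date(2013, 4, 1)))


def current_city(log, name):
    """Derive the current city as the most recent event for that person."""
    own = [e for e in log if e.name == name]
    return max(own, key=lambda e: e.timestamp).city if own else None


def city_history(log, name):
    """The full history stays available because no event was ever updated."""
    ordered = sorted(log, key=lambda e: e.timestamp)
    return [e.city for e in ordered if e.name == name]
```

Because the old event is still in the log, a buggy derivation can be fixed and simply re-run; with an in-place update the earlier value would be gone.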

Immutability
Immutability greatly restricts the range of errors that can cause data loss or
data corruption
Vastly more human fault-tolerant
Much easier to reason about systems based on immutability
Conclusion: Your source of truth should always be immutable

What about traditional/today’s architectures? Source of truth is mutable!
Rather than build systems like this ...
[Diagram: applications (mobile, web, RIA, rich client) query a mutable database (RDBMS, NoSQL, NewSQL), which is the source of truth.]

A different kind of architecture with immutable source of truth
... why not build them like this?
[Diagram: immutable data is the source of truth; views on the data are derived from it and queried by the applications (mobile, web, RIA, rich client). Storage technologies shown: HDFS, NoSQL, NewSQL, RDBMS.]

How to create the views on the immutable data?
On the fly? [Diagram: the query goes through a view that reads the immutable data directly.]
Materialized, i.e. pre-computed? [Diagram: the query reads pre-computed views that were derived from the immutable data.]

Data = the most raw information
Data is information which is not derived from anywhere else
•  The most raw form of information
•  Data is the special information from which everything else is derived
Questions on data can be answered by running functions that take data
as input
The most general purpose data system can answer questions by running
functions that take the entire dataset as input
query = function (all data)
The lambda architecture provides a general purpose approach for
implementing arbitrary functions on arbitrary datasets

Data = the most raw information
Favorite Product List Changes (raw information => data):
1.2.13   Add iPAD 64GB
10.3.13  Add Sony RX-100
11.3.13  Add Canon GX-10
11.3.13  Remove Sony RX-100
12.3.13  Add Nikon S-100
14.4.13  Add BoseQC-15
15.4.13  Add MacBook Pro 15
20.4.13  Remove Canon GX-10
derive => Current Favorite Product List (information => derived):
iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15
derive => Current Product Count (information => derived): 4
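The idea "query = function(all data)" can be made concrete with the favorite product list. The sketch below is an assumption-laden illustration rather than code from the talk: `CHANGES`, `current_favorites` and `current_count` are hypothetical names, the product spelling "Canon GX-10" is normalized, and the list is assumed to be in chronological order.

```python
# Raw data: the immutable list of favorite-list changes.
# Each entry is ((year, month, day), action, product).
CHANGES = [
    ((2013, 2, 1), "add", "iPAD 64GB"),
    ((2013, 3, 10), "add", "Sony RX-100"),
    ((2013, 3, 11), "add", "Canon GX-10"),
    ((2013, 3, 11), "remove", "Sony RX-100"),
    ((2013, 3, 12), "add", "Nikon S-100"),
    ((2013, 4, 14), "add", "BoseQC-15"),
    ((2013, 4, 15), "add", "MacBook Pro 15"),
    ((2013, 4, 20), "remove", "Canon GX-10"),
]


def current_favorites(all_data):
    """query = function(all data): replay every change in the given order."""
    favorites = []
    for _when, action, product in all_data:
        if action == "add" and product not in favorites:
            favorites.append(product)
        elif action == "remove" and product in favorites:
            favorites.remove(product)
    return favorites


def current_count(all_data):
    """A second derived view, computed from the same raw data."""
    return len(current_favorites(all_data))
```

Both results are derived information: they can be thrown away and recomputed from the change list at any time.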

Big Data and Batch Processing
[Diagram: incoming data is stored as immutable data, from which a batch view is computed; the query reads the batch view.]
How to compute the batch views?
How to compute queries from the views?

Big Data and Batch Processing
[Diagram: two timelines ending at “now”. Top: fully processed data, followed by the last full batch period and the time for the batch job. Bottom: batch-processed data, followed by non-processed data up to “now”.]
§  Using only batch processing always leaves you with a portion of non-processed data.
Adapted from Ted Dunning (March 2012):
http://www.youtube.com/watch?v=7PcmbI5aC20
But we are not done yet …

Adding Real-Time Processing
[Diagram: incoming data feeds both the immutable data (from which batch views are computed) and a data stream (from which realtime views are computed); the query reads from both kinds of views.]
How to compute real-time views?
How to compute queries from the views?

Adding Real-Time Processing
Stream of immutable data:
1.2.13   Add iPAD 64GB
10.3.13  Add Sony RX-100
11.3.13  Add Canon GX-10
11.3.13  Remove Sony RX-100
12.3.13  Add Nikon S-100
14.4.13  Add BoseQC-15
15.4.13  Add MacBook Pro 15
20.4.13  Remove Canon GX-10
Now      Add Canon Scanner (arrives on the data stream)
compute => Views, read by the query:
Current Favorite Product List: iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15, Canon Scanner
Current Product Count: 5

Big Data and Real Time Processing
[Diagram: timeline ending at “now”. Batch processing (e.g. Hadoop) worked fine for the fully processed data; real time processing works for the last full batch period and the time for the batch job; together they form a blended view for the end user.]
Adapted from Ted Dunning (March 2012):
http://www.youtube.com/watch?v=7PcmbI5aC20

3.  The Lambda Architecture

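Lambda Architecture
[Diagram: incoming data is sent to the batch layer (immutable data, from which the batch view is computed) and to the speed layer (data stream, from which the realtime view is computed); the serving layer exposes the views to the query. The elements are labelled A to G.]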

Lambda Architecture
A.  All data is sent to both the batch layer and the speed layer.
B.  The master dataset is an immutable, append-only set of data.
C.  The batch layer pre-computes query functions from scratch; the results are called batch views. The batch layer constantly re-computes the batch views.
D.  Batch views are indexed and stored in a scalable database so that particular values can be retrieved very quickly. New batch views are swapped in when they are available.
E.  The speed layer compensates for the high latency of updates to the batch views in the serving layer.
F.  The speed layer uses fast incremental algorithms and read/write databases to produce real-time views.
G.  Queries are resolved by getting results from both batch and real-time views.
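Steps A to G can be traced in a toy, single-process sketch. It only counts events per key and keeps everything in memory; the class `LambdaCounter` and its method names are hypothetical and stand in for what would be separate systems in a real deployment.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture that counts events per key."""

    def __init__(self):
        self.master = []                # B: immutable, append-only master dataset
        self.batch_view = Counter()     # D: view produced by the last batch run
        self.realtime_view = Counter()  # F: incrementally updated view
        self._batched_upto = 0          # how many master records the batch view covers

    def ingest(self, key):
        # A: every event goes to both the batch layer and the speed layer.
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # C: recompute the batch view from scratch over all data.
        snapshot = list(self.master)
        self.batch_view = Counter(snapshot)  # D: swap in the new view
        self._batched_upto = len(snapshot)
        # E: the speed layer only has to cover data the batch has not seen yet.
        self.realtime_view = Counter(self.master[self._batched_upto:])

    def query(self, key):
        # G: merge the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view[key]
```

A query gives the same answer before and after a batch run; only the split between the two views changes.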

Layered Architecture
Batch Layer
•  Stores the immutable, constantly growing dataset
•  Computes arbitrary views from this dataset using Big Data technologies (can take hours)
•  Can always be recreated
Serving Layer
•  Responsible for indexing and exposing the pre-computed batch views so that they can be queried
•  Exposes the incremented real-time views
•  Merges the batch and the real-time views into a consistent result
Speed Layer
•  Computes the views from the constant stream of data it receives
•  Needed to compensate for the high latency of the batch layer
•  Incremental model; views are transient

4.  Implementing the Lambda Architecture

Lambda Architecture
[Diagram: incoming data goes to the batch layer (all data, precompute views, precomputed information; batch recompute) and to the speed layer (process stream, incremented information; realtime increment). The serving layer holds the batch views and the real time views, which are merged to answer a query.]
Source: Marz, N. & Warren, J. (2013) Big Data. Manning.

Lambda Architecture: one possible product/framework mapping
[Diagram: the Lambda Architecture components (batch layer, speed layer, serving layer, merge, query) annotated with one possible product/framework mapping.]

Implementing Batch Layer
Immutable Data
•  Append only
•  Normalized
•  Stores master copy of all data
Pre-computed information
•  Function that takes all data as input
query = function(all-data)
•  High Latency, Batch processing
•  Unrestrained computation
•  Horizontally scalable
[Diagram: in the batch layer, the batch views (precomputed information) are computed from the immutable data (all data) by a batch recompute and handed to the serving layer.]

Apache Hadoop HDFS
HDFS = the Hadoop Distributed File System
A distributed file storage system
Redundant storage
Designed to reliably store data using commodity hardware
Designed to expect hardware failures
Intended for large files
Designed for batch inserts
Batch Layer

Apache Hadoop MapReduce
§  Hadoop MapReduce is an open source implementation of the MapReduce framework.
§  MapReduce is
§  a programming model, introduced by Google, for processing large data sets in a distributed environment
§  the de facto standard for computing on huge amounts of data
§  an execution framework for organizing and performing such computations
[Diagram: a master node distributes the problem data to worker nodes 1 to 3 (MAP) and the partial results are combined into the solution data (REDUCE).]
Batch Layer
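The programming model can be illustrated without a cluster. The following single-process word count is a sketch of the map, shuffle and reduce steps only; it does not use the Hadoop API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real framework the map calls and the reduce calls run on different worker nodes; the functions themselves stay this simple because they share no state.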

Hadoop MapReduce Flow
Source: Bill Graham, Twitter Inc.
Batch Layer

Cascading
Application framework that lets Java developers easily build robust data analytics and data management applications on Apache Hadoop
Adds an abstraction layer over the Hadoop API
Core concepts of the Cascading API:
•  Pipe: a series of processing steps (parsing, looping, filtering, etc.) defining the data processing to be done
•  Flow: association of a pipe (or set of pipes) with a data source and a data sink
Batch Layer

Apache Pig
Apache Pig is a platform for analyzing large data sets
Key Properties
•  Ease of programming
•  Optimization opportunities
•  Extensibility
Batch Layer

Implementing Serving Layer for Batch Views
Needs a database that
•  Is batch-writable
•  Adds new information atomically
•  Has fast random reads
•  Is scalable
•  Is highly available
•  Can be optimized for storage
•  Can hold de-normalized information
•  Does not need random writes!
•  Can be a simple database
[Diagram: the batch layer computes the batch views (precomputed information) from the immutable data; the serving layer holds them as batch views.]
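The combination "batch-writable, atomic, fast random reads, no random writes" can be sketched as a store whose whole content is replaced in one step. This is an in-memory illustration under that assumption; `BatchViewStore` and its methods are hypothetical names, not the interface of any particular database.

```python
import threading


class BatchViewStore:
    """Read-only key/value view that is replaced wholesale after each batch run."""

    def __init__(self):
        self._view = {}
        self._lock = threading.Lock()

    def swap_in(self, new_view):
        # Batch write: build a complete copy first, then publish it in one step,
        # so readers see either the old view or the new one, never a mix.
        frozen = dict(new_view)
        with self._lock:
            self._view = frozen

    def get(self, key, default=None):
        # Fast random read; there is deliberately no single-key write method.
        return self._view.get(key, default)
```

Leaving out random writes is what keeps such a store simple: there is nothing to lock per key and nothing to compact.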

SploutSQL
Full SQL => unlike NoSQL
For BigData => unlike RDBMS
Web latency & throughput => unlike Apache Hive, Apache Drill
Why does it scale?
•  Data is partitioned
•  Partitions are distributed across nodes
•  Adding more nodes increases capacity
•  Generation does not impact serving
Serving Layer
Source: Datasalt.

Implementing Speed Layer 
Stream Processing
Continuous computation
Transactional
Storing a limited window of data
•  Compensating for the last few hours of data
All the complexity is isolated in the speed layer
•  If anything goes wrong, it's autocorrected by the next batch run
[Diagram: the speed layer processes the data stream (realtime increment) and derives the incremented information, i.e. the realtime views exposed by the serving layer.]

Apache Kafka
A high throughput distributed messaging system
Originated at LinkedIn
Sequential disk access
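Sequential access follows from treating the message store as a log that is only appended to and read forward from a position. The toy class below illustrates that idea in memory; it is not Kafka's API, and `AppendOnlyLog`, `append` and `read_from` are names assumed for this sketch.

```python
class AppendOnlyLog:
    """Toy message log: writes only append, reads scan forward from an offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Add a message at the end and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset, max_messages=10):
        """Return up to max_messages starting at offset, in write order."""
        return self._messages[offset:offset + max_messages]
```

A consumer that remembers the last offset it read can resume from there, which also makes it possible to replay older messages.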

Twitter Storm – the “real-time Hadoop”
§  Storm is a distributed and fault-tolerant real-time computing platform
§  Data flow model: data flows through a network of transformation entities
§  Key concepts
§  Tuple: ordered list of elements
§  Streams: unbounded sequence of tuples
§  Spouts: Source of streams
§  Bolts: Process tuples and create new streams
§  Topologies: directed graph of Spouts and Bolts
§  Use Cases
§  Stream Processing
§  Continuous Computation
§  Distributed RPC
[Diagram: in the speed layer, a SPOUT reads the problem data from the data source and passes it through BOLTs (“MAP”, “REDUCE”, “PERSIST”); the solution data ends up in the serving layer.]
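The key concepts (tuples, streams, spouts, bolts, topology) can be mimicked with Python generators. This is a conceptual sketch only: it runs in one process on a finite input, it is not the Storm API, and all function names are assumptions made for the example.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of a stream; emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream):
    """Bolt: keeps running counts and emits (word, count) after every update."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Topology: wire spout -> split bolt -> count bolt and keep the latest counts."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest
```

Unlike this sketch, a real topology never finishes: the stream is unbounded and the counts are updated for as long as tuples arrive.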

Twitter Trident
Higher level abstraction over Storm
Trident State
Grouped Stream
Functions, Filters
Aggregators
Query
Similar to Pig and Cascading
Speed Layer
Serving Layer

Implementing Serving Layer for Real-Time Views
Incremental updates are made available as real-time views
Requires a database that supports random reads and random writes
•  Relational, NoSQL or NewSQL (in-memory) databases can be used
•  Here we are typically not in the Big Data range
Results are only needed until the data has made it through the batch layer
Complexity isolation
[Diagram: realtime views (incremented information) are derived from the data stream and exposed as real time views.]

Cassandra
Fully distributed, no single-point-of-failure
Linearly scalable
Fault tolerant
Performant
Durable
Integrated caching
Tunable consistency
Serving Layer

Merge of Batch and Realtime Views
An interesting feature of Storm/Trident is the ability to execute distributed RPC (DRPC) calls in parallel
This can be used to implement the merge functionality when a query is executed
[Diagram: in the serving layer, a query is answered by merging the batch views and the realtime views.]
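The merge step can be sketched with the Python standard library: both views are asked in parallel and the partial results are combined. This stands in for the idea only and does not use Storm's DRPC; the view contents, the function names and the choice of addition as the merge function (which fits counts, not every kind of view) are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical views: page-view counts per URL.
BATCH_VIEW = {"/home": 1000, "/about": 40}
REALTIME_VIEW = {"/home": 12, "/news": 3}


def query_batch(key):
    """Look up the pre-computed count in the batch view."""
    return BATCH_VIEW.get(key, 0)


def query_realtime(key):
    """Look up the count accumulated since the last batch run."""
    return REALTIME_VIEW.get(key, 0)


def merged_query(key):
    """Ask both views in parallel and merge the partial results by adding them."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        batch = pool.submit(query_batch, key)
        realtime = pool.submit(query_realtime, key)
        return batch.result() + realtime.result()
```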

Summary – The lambda architecture
§  The Lambda Architecture
§  Can discard batch views and real-time views and recreate everything from
scratch
§  Mistakes corrected via re-computation
§  Data storage layer optimized independently from query resolution layer
§  Still in a very early …. But a very interesting idea!
-  Today a zoo of technologies is needed => Operations won't like it
§  Different query language for batch and real time
§  An abstraction over batch and speed layer needed
-  Cascading and Trident are already similar
§  Industry standards needed!

THANK YOU.
Trivadis AG
Guido Schmutz & Albert Blarer
Europa-Strasse 5 
CH-8095 Glattbrugg
info@trivadis.com 
www.trivadis.com
