Tobias Johansson
@ntjohansson
27/10/2016
Big data analytics
Einstürzenden Neudaten: Building an analytics engine from scratch
What is Valo
• Big data analytics engine
• Focusing on simplicity from a usage perspective
• Single process containing:
  • Time-series repository
  • Semi-structured repository
  • Execution engine
  • Etc.
• Written in Scala/C++/Lua
• REST based
PUT /streams/sensors/environment/air
{
  "sampleTime": { "type": "datetime" },
  "sensor":     { "type": "contributor" },
  "pollution":  { "type": "double" }
}
POST /streams/sensors/environment/air
{
  "sampleTime": "2016/10/27 15:13:00",
  "sensor":     "131e90ad-e32a",
  "pollution":  85.6
}
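As a rough illustration, a JVM client could build the two requests above with the JDK's `java.net.http` API. This is only a sketch: the base URL is a hypothetical placeholder (the slides name no host or port), and the requests are built but not sent.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

object StreamRequests {
  // Hypothetical base URL; the slides do not name a host or port.
  val baseUrl    = "http://localhost:8888"
  val streamPath = "/streams/sensors/environment/air"

  // Schema from the slide: one typed field per attribute of the stream.
  val schemaJson: String =
    """{
      |  "sampleTime": { "type": "datetime" },
      |  "sensor":     { "type": "contributor" },
      |  "pollution":  { "type": "double" }
      |}""".stripMargin

  // A single event that matches the schema above.
  val eventJson: String =
    """{
      |  "sampleTime": "2016/10/27 15:13:00",
      |  "sensor":     "131e90ad-e32a",
      |  "pollution":  85.6
      |}""".stripMargin

  // Builds (but does not send) a JSON request against the stream URI.
  private def jsonRequest(method: String, body: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(baseUrl + streamPath))
      .header("Content-Type", "application/json")
      .method(method, BodyPublishers.ofString(body))
      .build()

  val createStream: HttpRequest = jsonRequest("PUT", schemaJson)  // define the stream
  val publishEvent: HttpRequest = jsonRequest("POST", eventJson)  // append one event
}
```

Sending either request would be a call to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a running server.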
• Data friendly
POST /streams/sensors/environment/air
Content-Type: application/json
POST /streams/sensors/environment/air
Content-Type: application/cbor
POST /streams/sensors/environment/air
Content-Type: application/csv
POST /streams/sensors/environment/air
Content-Type: application/bson
Time-series Semi-structured
• Real-time and historical queries
Looks simple?
Trust me, it is not.
Dynamo style clustering and vector-clocks
Eventual consistency
Gossip protocols
Distributed algorithms
Distributed execution engine
Expression trees and runtime code generation
Query rewriting and optimization
Consistent hashing
Time-series repository
Semi-structured repository
Data atomicity
Back pressure
Elasticity
Advanced ML algorithms
IO
Actor systems
Data distribution
Cluster management
B+ trees
Query language
KV-store
REST API
Jump consistent hashing
Off-heap memory
Data formats
Distributed joins
Time semantics
Gap-filling
Statistical models
Distributed CRDTs
Transports
Real-time queries
Know your cluster
It will crash
• You need a cluster to run big data analytics, but it is built on:
  • Commodity hardware, which can fail
  • An unreliable network
• Issues:
  • Unreachable nodes
  • Dropped messages
  • Delayed messages
  • No response
  • Split network
    • Multiple working clusters
    • Mutable state is likely to diverge
• Accept these issues and don't try to fight them. Make life simpler by:
  • Not having a single point of failure
    • No leaders
    • No master/slave
    • No special nodes
  • Making it eventually consistent
    • Use CRDTs for sets, counters, etc.
    • Use vector clocks for configuration
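The two building blocks named in the last bullets can be sketched in a few lines. This is a textbook-style illustration, not Valo's implementation; the names `GCounter`, `VectorClock` and `Relation` are chosen here for the example.

```scala
object Convergence {
  // Grow-only counter (G-Counter CRDT): one slot per node, merged by pointwise max.
  // Merge is commutative, associative and idempotent, so replicas converge
  // no matter how often or in which order they exchange state.
  final case class GCounter(counts: Map[String, Long] = Map.empty) {
    def increment(node: String, by: Long = 1L): GCounter =
      GCounter(counts.updated(node, counts.getOrElse(node, 0L) + by))
    def value: Long = counts.values.sum
    def merge(other: GCounter): GCounter =
      GCounter((counts.keySet ++ other.counts.keySet).map { n =>
        n -> math.max(counts.getOrElse(n, 0L), other.counts.getOrElse(n, 0L))
      }.toMap)
  }

  sealed trait Relation
  case object Same       extends Relation
  case object Before     extends Relation
  case object After      extends Relation
  case object Concurrent extends Relation // neither side has seen the other's update

  // Vector clock: per-node logical counters used to order versions of a value.
  final case class VectorClock(ticks: Map[String, Long] = Map.empty) {
    def tick(node: String): VectorClock =
      VectorClock(ticks.updated(node, ticks.getOrElse(node, 0L) + 1L))
    def merge(other: VectorClock): VectorClock =
      VectorClock((ticks.keySet ++ other.ticks.keySet).map { n =>
        n -> math.max(ticks.getOrElse(n, 0L), other.ticks.getOrElse(n, 0L))
      }.toMap)
    def compare(other: VectorClock): Relation = {
      val nodes = ticks.keySet ++ other.ticks.keySet
      val less  = nodes.exists(n => ticks.getOrElse(n, 0L) < other.ticks.getOrElse(n, 0L))
      val more  = nodes.exists(n => ticks.getOrElse(n, 0L) > other.ticks.getOrElse(n, 0L))
      (less, more) match {
        case (false, false) => Same
        case (true, false)  => Before
        case (false, true)  => After
        case (true, true)   => Concurrent
      }
    }
  }
}
```

A `Concurrent` result is the interesting case after a network split: two sides changed the same configuration independently, and the conflict has to be resolved explicitly rather than silently overwritten.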
Know your data
• Do not treat all data the same
  • Time-series repository
    • CPU data, market data, ECG
  • Semi-structured repository
    • Log files, emails
  • KV repository
    • Configuration
• Unless you are Oracle or Microsoft, make your data immutable and append-only.
  • Streams are facts at points in time, and facts do not change
• Build properties into your data distribution policies. Properties that:
  • Maximise resilience
    • Avoid replicas on the same physical server rack
  • Optimise data locality
  • Minimise the number of data transfers required when adding/removing nodes
  • Deterministically tell where data lives in the cluster
    • Where does data for T0 to T1 sit in the cluster?
• Consistent hashing
  • Minimises number of data transfers in the cluster
• Time-based distribution
  • Distribute data in the cluster in second, minute, hour, day buckets
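One way to combine the two ideas is to turn a timestamp into a time bucket and place the bucket with jump consistent hashing (listed among the techniques earlier in the deck). The sketch below is an assumption about how such a policy could look, not Valo's actual placement code; `bucketId`, `nodeFor` and `nodesFor` are names invented for the example, and replication and rack awareness are left out.

```scala
object Placement {
  // Jump consistent hash (Lamping and Veach): maps a 64-bit key to a bucket
  // in [0, numBuckets). When a bucket is added, a key either stays where it
  // was or moves to the new bucket, which keeps data transfers small.
  def jumpHash(key: Long, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    var k = key
    var b = -1L
    var j = 0L
    while (j < numBuckets) {
      b = j
      k = k * 2862933555777941757L + 1L
      j = ((b + 1) * ((1L << 31).toDouble / ((k >>> 33) + 1).toDouble)).toLong
    }
    b.toInt
  }

  // Time-based bucketing: all events in the same window share one bucket id.
  def bucketId(epochSeconds: Long, bucketSeconds: Long): Long =
    Math.floorDiv(epochSeconds, bucketSeconds)

  // Node index (0 until numNodes) that owns the bucket containing this instant.
  def nodeFor(epochSeconds: Long, bucketSeconds: Long, numNodes: Int): Int =
    jumpHash(bucketId(epochSeconds, bucketSeconds), numNodes)

  // "Where does data for T0 to T1 sit?": hash every bucket in the range.
  def nodesFor(t0: Long, t1: Long, bucketSeconds: Long, numNodes: Int): Set[Int] =
    (bucketId(t0, bucketSeconds) to bucketId(t1, bucketSeconds))
      .map(jumpHash(_, numNodes))
      .toSet
}
```

Because placement is a pure function of the timestamp, the bucket size and the node count, any node can answer the T0-to-T1 question locally without asking a coordinator.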
[Figure: two Node / Segment grids, side by side, marking which segments each node (A to N) holds]
[Figure: the same two Node / Segment grids, with one mark each on nodes E, F and G emphasised in the left-hand grid]
Know your algos
from historical /streams/demo/infrastructure/cpu
select avg(kernel)
[Figure: the query drawn as a set of Avg operators]
Init: () -> β
Apply: β -> 'a list -> β
Reduce: β -> β -> β
Finalise: β -> 'r
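// Signatures only, as shown on the slide; method bodies are omitted.
// Likely mapping to the signatures above: reset = Init, apply = Apply,
// merge = Reduce, getResult = Finalise; save/restore serialise the state.
class AverageDouble {
  def apply(value: NamedDouble): Unit
  def reset(): Unit
  def merge(state: Parser): Unit
  def restore(state: Parser): Unit
  def getResult: NamedDouble
  def save(gen: Generator): Unit
}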
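The four signatures can be made concrete for an average. The following is a minimal sketch under the assumption that the state β is a (sum, count) pair; `AvgState` and `distributedAvg` are names invented here, and serialisation (`save`/`restore`) is left out.

```scala
object MergeableAverage {
  // State β: sum and count can be merged exactly, whereas averages cannot.
  final case class AvgState(sum: Double, count: Long)

  // Init: () -> β
  val init: AvgState = AvgState(0.0, 0L)

  // Apply: β -> 'a list -> β (fold a batch of local values into the state)
  def apply(state: AvgState, values: Seq[Double]): AvgState =
    AvgState(state.sum + values.sum, state.count + values.size)

  // Reduce: β -> β -> β (combine two partial states, e.g. from two nodes)
  def reduce(a: AvgState, b: AvgState): AvgState =
    AvgState(a.sum + b.sum, a.count + b.count)

  // Finalise: β -> 'r (no result when nothing was seen)
  def finalise(state: AvgState): Option[Double] =
    if (state.count == 0L) None else Some(state.sum / state.count)

  // Each partition is aggregated where its data lives; only the small
  // partial states are then merged into the final answer.
  def distributedAvg(partitions: Seq[Seq[Double]]): Option[Double] =
    finalise(partitions.map(apply(init, _)).foldLeft(init)(reduce))
}
```

Averaging the per-partition averages would give the wrong answer whenever partitions differ in size, which is why the state carries the sum and the count separately.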
Travelling algos
[Figure: Avg operators shown alongside a Node / Segment grid for nodes A to N]
from historical /streams/demo/infrastructure/itime
group by timeStamp window of 5 minutes every 5 minutes fill last, alpha
select alpha, timeStamp, last(a) as la
partition every 1 hour as implicit
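The windowing and gap-filling part of this query can be illustrated in isolation. The sketch below assumes that `window of 5 minutes every 5 minutes` means non-overlapping (tumbling) windows and that `fill last` repeats the previous window's result when a window has no events; both are readings of the query text, not confirmed semantics. Grouping by `alpha` and the hourly partitioning are left out, and `Event`, `WindowResult` and `lastPerWindow` are names invented for the example.

```scala
object GapFilling {
  final case class Event(epochSeconds: Long, a: Double)
  final case class WindowResult(windowStart: Long, la: Double, filled: Boolean)

  // last(a) per tumbling window; an empty window repeats the previous result.
  def lastPerWindow(events: Seq[Event], windowSeconds: Long): Seq[WindowResult] =
    if (events.isEmpty) Seq.empty
    else {
      val sorted = events.sortBy(_.epochSeconds)
      // Last value seen in each window that has at least one event.
      val lastByWindow: Map[Long, Double] =
        sorted
          .groupBy(e => Math.floorDiv(e.epochSeconds, windowSeconds))
          .map { case (w, es) => w -> es.last.a }
      val first = Math.floorDiv(sorted.head.epochSeconds, windowSeconds)
      val last  = Math.floorDiv(sorted.last.epochSeconds, windowSeconds)
      // Walk every window in order; the first one always has data,
      // so acc is never empty when a gap needs filling.
      (first to last).foldLeft(Vector.empty[WindowResult]) { (acc, w) =>
        lastByWindow.get(w) match {
          case Some(v) => acc :+ WindowResult(w * windowSeconds, v, filled = false)
          case None    => acc :+ WindowResult(w * windowSeconds, acc.last.la, filled = true)
        }
      }
    }
}
```

With 5-minute windows (`windowSeconds = 300`), events at 10 s, 200 s and 950 s produce four rows: the windows starting at 300 s and 600 s have no data and carry the value from the first window forward.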
Dynamo style clustering and vector-clocks
Eventual consistency
Gossip protocols
Distributed algorithms
Distributed execution engine
Expression trees and runtime code generation
Query rewriting and optimization
Consistent hashing
Time-series repository
Semi-structured repository
Data atomicity
Back pressure
Elasticity
Advanced ML algorithms
IO
Actor systems
Data distribution
Cluster management
B+ trees
Query language
KV-store
REST API
Jump consistent hashing
Off-heap memory
Data formats
Distributed joins
Time semantics
Gap-filling
Statistical models
Distributed CRDTs
Transports
Real-time queries
./valo
www.valo.io
Thank you
Meet us at the Startup Area
tobias@valo.io
@ntjohansson
Algos
MicroTickFrequency
MicroVolatility
OnlineMisraGries
Anomaly
Histogram
Bivar
Univar
Skyline
EMA
MovingKurtosis
MovingDerivative
RecursiveEMA
MovingVariance
MovingVariance
Average
Sum
Sum
TopK
Quantiles
What has brought us here today

"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson, Lead Developer at Valo.io
