Target Holding - Big Dikes and Big Data
1. Big Dikes and Big Data
12 November 2014
Big Data Groningen Meetup
Frens Jan Rumph
Michiel van der Ree
2. Target Holding
Big Data to Intelligence
Big Data Analytics is our key
competence
– using machine learning and
pattern recognition techniques to
extract value from large data sets
Founded in 2009 and founding
partner of Target
– Dutch Public-Private Cooperation on
Big Data, partners including IBM,
Oracle, Astron/Lofar, RUG, UMCG
Developed various innovative
algorithms and technologies which
we apply across multiple markets:
– Energy & Water management
– Media & Entertainment
– Healthy Ageing
– High Tech Systems
3. Target Holding
Big Data to Intelligence
Collect big data
– Domain specific data
– Web / public data & Social Media
– Sensor data
Enrich the data
– Feature extraction & machine learning
– Classification, ranking, forecasting,
segmentation, clustering, natural language
processing
Present & visualize
5. Big Dikes and Big Data
Stichting IJkdijk
IJkdijk
Field lab Livedijk (XL)
6. Big Dikes and Big Data
Reduced Time Series Representation
● Dijkgraaf (dike warden): I see something weird at sensor X..
– .. have I seen it before at sensor X? (my talk)
– .. have I seen it before at other sensors? (Frens Jan's talk)
● Query sensor's history by example
● Time series might be..
– .. too big to store
– .. too big to analyze
● Solution: reduced representation
● Seminal techniques:
– Piecewise linear approximation
– Symbolic aggregate approximation
7. Big Dikes and Big Data
Piecewise Linear Approximation
Basic idea:
– Represent time series as a sequence of
straight lines
– Lines can be connected (N/2 lines) or
disconnected (N/3 lines)
– High compression rates
– Segment as you like, dynamic lengths
Each line segment has:
• length
• left_height
• right_height
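To make the idea concrete, here is a minimal sketch of piecewise linear approximation in Python. It assumes fixed-width segments and a least-squares line fit per segment; the talk also mentions dynamic segment lengths, which this sketch does not implement.

```python
# Illustrative PLA sketch: fixed-width segments, least-squares line per segment.
def fit_line(ys):
    """Least-squares line fit; returns (left_height, right_height)."""
    n = len(ys)
    mx = (n - 1) / 2.0
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, my + slope * (n - 1 - mx)

def pla(series, width):
    """Represent a series as (length, left_height, right_height) segments."""
    out = []
    for i in range(0, len(series), width):
        seg = series[i:i + width]
        left, right = fit_line(seg)
        out.append((len(seg), left, right))
    return out
```

With connected segments one would store only one height per segment (N/2 values, as on the slide); the disconnected variant above stores both endpoints.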
8. Big Dikes and Big Data
Symbolic Aggregate Approximation
Basic idea:
– Segment using fixed frame width
– Converts numerical time series into an
equivalent symbolic representation
– String analysis techniques can be used for
analyzing time series
baabccbc
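A minimal SAX sketch in Python, producing strings like the "baabccbc" example above. It assumes an alphabet of size 4 with the standard Gaussian breakpoints; the frame width and alphabet used in the talk are not specified.

```python
import bisect
import statistics

# Equiprobable regions for a standard normal, alphabet size 4 (assumed).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax(series, frame):
    """z-normalize, average per fixed-width frame (PAA), map to symbols."""
    mu = statistics.fmean(series)
    sd = statistics.pstdev(series) or 1.0
    z = [(v - mu) / sd for v in series]
    symbols = []
    for i in range(0, len(z), frame):
        window = z[i:i + frame]
        paa = sum(window) / len(window)  # piecewise aggregate approximation
        symbols.append(ALPHABET[bisect.bisect(BREAKPOINTS, paa)])
    return "".join(symbols)
```

Because the breakpoints make each symbol equally likely under a normal distribution, string distances on the SAX word lower-bound distances on the original series, which is what makes string analysis techniques applicable.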
9. Big Dikes and Big Data
Symbolic Aggregate Approximation
10. Big Dikes and Big Data
Symbolic Aggregate Approximation
11. Big Dikes and Big Data
3S Representation
Segment Symbolic Shape-Based Representation
Basic idea:
– A time series is decomposed into
monotonic segments of variable
lengths
– Each segment is fitted to a
monotonic shape and therefore
represented as a symbol of an
alphabet
– Symbolic (like SAX), but the symbols also
capture shape and direction
12. Big Dikes and Big Data
3S Representation
Segment Symbolic Shape-Based Representation
Storing more than one symbol:
– "String matching" (Levenshtein, Hamming, etc.)
→ INFORMATION LOSS!
– Each segment is approximated by:
● μ + θ·xₙ + σ·f(xₙ)
– Physical meaning:
● μ → offset
● σ → amplitude
● θ → linear drift with regard to...
● f → shape
● N → length, # of data points
(f, μ, σ, θ, N)
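An illustrative sketch of extracting the numeric part of a 3S tuple from one segment, following the physical meanings above: μ as the mean, σ as the amplitude, θ as the least-squares slope, N as the length. Fitting the shape symbol f against an alphabet of monotonic shapes is omitted here; this is a simplification, not the talk's exact fitting procedure.

```python
def three_s_params(seg):
    """Extract (mu, sigma, theta, n) from one monotonic segment.

    mu = offset (mean), sigma = amplitude (std dev),
    theta = linear drift (least-squares slope), n = # of data points.
    The shape symbol f is not fitted in this simplified sketch.
    """
    n = len(seg)
    mu = sum(seg) / n
    sigma = (sum((v - mu) ** 2 for v in seg) / n) ** 0.5
    mx = (n - 1) / 2.0
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (v - mu) for x, v in enumerate(seg))
    theta = sxy / sxx if sxx else 0.0
    return mu, sigma, theta, n
```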
13. Big Dikes and Big Data
3S Representation
Fast and accurate matching:
– Euclidean distance between
segments
– In constant time, i.e. independent
of segment length N
● summation of polynomials
– Allows for different invariances:
● ignore μ → offset invariance
● ignore σ → amplitude invariance
● ignore θ → drift invariance
● ignore N → length invariance
[(f,μ,σ,θ,N)ᵢ]
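The constant-time property can be illustrated by comparing segments on their parameter tuples alone. The sketch below uses a plain Euclidean distance over the parameters, not the talk's exact polynomial-summation formula, and models invariances by simply ignoring parameters; the tuple layout and shape handling are assumptions.

```python
def seg_distance(a, b, ignore=()):
    """Distance between two 3S tuples (f, mu, sigma, theta, n) in O(1).

    Simplified Euclidean distance over the segment parameters. Passing a
    parameter name in `ignore` yields the corresponding invariance, e.g.
    ignore=("mu",) makes the match offset-invariant.
    """
    names = ("f", "mu", "sigma", "theta", "n")
    d2 = 0.0
    for name, x, y in zip(names, a, b):
        if name == "f":
            if x != y and "f" not in ignore:
                return float("inf")  # different shapes never match
        elif name not in ignore:
            d2 += (x - y) ** 2
    return d2 ** 0.5
```

Note that nothing here iterates over the N underlying data points: the cost is fixed per segment pair, which is what makes retrieval over long histories feasible.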
14. Big Dikes and Big Data
Time Series Retrieval
Fast, flexible and accurate matching using the 3S representation:
15. Big Dikes and Big Data
Time Series Retrieval
But what if you want to search
in the history of multiple sensors?
17. Storage and processing
with 3S representation
Use case: search time series by example
across the history of many sensors
● Storage and processing of sensor data hits the limits of
'traditional' database or file system based approaches
– at least for 'enough' sensors
● Technical dive into a distributed architecture:
– with distributed storage: Apache Cassandra
– with distributed processing: Apache Spark
(no guarantees; the ideal architecture highly depends on use case specifics ...)
18. Distributed Storage
● Distributed storage advantages:
– scalability: more nodes → more storage and I/O capacity
– availability: more nodes → make progress during failure
– reliability: more nodes → don't lose data on failure
● Many solutions available; at Target Holding we extensively use
Apache Cassandra (C*) for high volume data
– because, among other things, it scales well, is easy to operate, performs OK
on spinning disks (SSDs are not needed per se), and allows easier access to
data than a file system
– also the storage system of choice within DDSC
19. Distributed Processing
● Distributed processing advantages:
– scalability: more nodes → more compute capacity
– availability: more nodes → make progress during failure
– reliability: more nodes → don't lose data on failure
– with local processing being CPU bound instead of IO bound
● With C* as a starting point, the options are to build our own, use
Hadoop M/R, or use Apache Spark, which we are investigating
– because of its integration with C*, high level abstraction, rich tool set,
and stream processing capability
– and it's getting a lot of traction in the Hadoop ecosystem
– (adoption by Cloudera, Hortonworks, MapR, Apache Mahout, etc.)
20. Spark with Cassandra
● A typical Spark with Cassandra deployment co-locates
Spark Workers with Cassandra nodes:
image courtesy of DataStax
● Allows data locality: push-down filtering, transformations, etc.
21. Storage and processing
with 3S representation
● Distribution based on time series identifier, e.g. the sensor id
– or something which identifies the location of measurement, …
● Store the tuple <f, μ, σ, θ, N> together with a timestamp
– the full 3S time series for each sensor must fit completely on one node
● Goal: find the series of segments which are closest to
the example (simplified for presentation)
● Approach: produce a global top-k out of local top-k's
(applies also without the simplification)
22. Storage and processing
with 3S representation
● Locally find the best matches, then repeat globally
– Parallelizes and distributes most of the computation
– Limits IO to the communication of the local best matches
Parallel distributed execution:
1. Read segments, possibly restricted by sensor ids and time range
2. Group by sensor id
3. Create sliding window over the time series
4. Zip each window with the example
5. Calculate the Euclidean distance per segment pair and sum → distance for each window
6. Select the best local matches: take top k ordered by distance
7. Take the global top k ordered by distance
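The "global top-k out of local top-k's" step can be sketched in a few lines of Python. The partition layout and the (sensor_id, offset, distance) match tuples are illustrative names, not the talk's actual data model.

```python
import heapq

# Each partition (node) returns only its k best matches;
# the driver merges those small lists.
def local_top_k(matches, k):
    """Runs on each worker over its local matches."""
    return heapq.nsmallest(k, matches, key=lambda m: m[2])

def global_top_k(partitions, k):
    """Runs on the driver over the workers' local top-k lists."""
    locals_ = [local_top_k(p, k) for p in partitions]
    return heapq.nsmallest(k, (m for part in locals_ for m in part),
                           key=lambda m: m[2])
```

This is exact, not approximate: any match in the true global top k is also among the k best of its own partition, so it survives the local step; only k tuples per node cross the network.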
23. Storage and processing
with 3S representation
Parallel distributed execution:
● Worker nodes, each co-located with a C* node, read segments
and select the best local matches
● The Master coordinates the cluster
● The Application (aka driver) takes the top k ordered by distance
25. Apache Cassandra
● Key value store (with some enhancements)
● Based on Dynamo distribution and Big Table local storage
● Partitioned (distributed) map of
– sorted maps of
● primitives, structs
● maps, lists, sets
● counters (crdt)
warning …
● personal mental model …
● the truth is in the code …
● caveat emptor
26. CQL
● Cassandra Query Language helps with working with C*
/* 3s timeseries in CQL */
CREATE TABLE symbolic (
s text, -- sensor identifier
t timestamp, -- start of segment
o float, -- offset
a float, -- amplitude
d float, -- drift
f int, -- function / shape
l int, -- length (# of data points)
/* partition by sensor identifier,
order by timestamp */
PRIMARY KEY ((s), t)
)
27. Distribution & Consistency
● Partitioning based on hashed key in conjunction with positions
of node tokens.
image courtesy of DataStax
● Consistent replication when R + W > N
R = # nodes read from, W = # nodes written to, N = replication factor
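The quorum rule above fits in one line: reads observe the latest write exactly when the read set and write set must overlap in at least one replica.

```python
def is_consistent(r, w, n):
    """R + W > N quorum rule.

    r = # nodes read from, w = # nodes written to, n = replication factor.
    True when every read set must overlap every write set in >= 1 replica.
    """
    return r + w > n
```

For example, with a replication factor of 3, QUORUM reads plus QUORUM writes (2 + 2 > 3) are consistent, while ONE + ONE (1 + 1 > 3 is false) is not.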
29. Apache Spark
● Spark is a distributed computing platform with fairly
rich primitives operating on distributed data sets.
● Spark can be used with data from different data sources
– HDFS, Cassandra, Elasticsearch to name a few
● Spark has libraries for: SQL, graph processing and
machine learning
30. Operator graphs
● It allows execution of DAGs
of operators
– without using disk for
intermediary results
– employs pipelining
if possible
– (cyclic / iterative data
flows are executed as unrolled, acyclic graphs ...)
31. Architecture
● Applications which allocate CPUs and memory
on Worker Nodes coordinated by a Master
● Applications schedule Jobs which are DAGs of Tasks
● Tasks consume & produce Resilient Distributed Datasets
32. Expressive 'language'
● Spark is developed in Scala, supports Java and Python.
● I consider Spark expressive:
val wordCount = sc
.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
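For readers less familiar with Scala, the same pipeline in plain Python: flatMap → map → reduceByKey collapses to splitting and counting. `lines` stands in for the HDFS file.

```python
from collections import Counter

def word_count(lines):
    """Equivalent of the Spark snippet: split each line on spaces,
    emit (word, 1) pairs, and reduce by key via counting."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(" "))
    return counts
```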
33. Storage and processing
with 3S representation
● Locally find the best matches, then repeat globally
– Parallelizes and distributes most of the computation
– Limits IO to the communication of the local best matches
Parallel distributed execution:
1. Read segments, possibly restricted by sensor ids and time range
2. Group by sensor id
3. Create sliding window over the time series
4. Zip each window with the example
5. Calculate the Euclidean distance per segment pair and sum → distance for each window
6. Select the best local matches: take top k ordered by distance
7. Take the global top k ordered by distance
34. Algorithm in Spark
// Set up the context and broadcast the example
val conf = new SparkConf()
  .setAAA(...).setBBB(...) ...setZZZ(...)
val sc = new SparkContext(conf)
val example = sc.broadcast(Array(
  new Segment(...), ..., new Segment(...)
))
val k = 10
// Read segments
val segments = sc
  .cassandraTable(keyspace, table)
  .map(fromRow)
35. Algorithm in Spark
// Select the best local matches: distance for each window
val matches = segments.mapPartitions(
  _.toSeq
    .groupBy(seg => seg.s)  // group by sensor id
    .iterator
    .flatMap({
      case (s, segs) =>
        segs
          .sliding(example.value.length)  // sliding window over the time series
          .map(w => (
            s, w,
            w.zip(example.value)          // zip with the example
              .map(distance)              // euclidean distance per segment pair
              .map(math.abs)
              .sum                        // sum
          ))
    }))
36. Algorithm in Spark
// Take top k ordered by distance
val top = matches
  .takeOrdered(k)(Ordering.by(_._3))