Target Holding - Big Dikes and Big Data
1. Big Dikes and Big Data
12 November 2014
Big Data Groningen Meetup
Frens Jan Rumph
Michiel van der Ree
2. Target Holding
Big Data to Intelligence
Big Data Analytics is our key
competence
– using machine learning and
pattern recognition techniques to
extract value from large data sets
Founded in 2009 and founding
partner of Target
– Dutch Public-Private Cooperation on
Big Data, partners including IBM,
Oracle, Astron/Lofar, RUG, UMCG
Developed various innovative
algorithms and technologies which
we apply across multiple markets:
– Energy & Water management
– Media & Entertainment
– Healthy Ageing
– High Tech Systems
3. Target Holding
Big Data to Intelligence
Collect big data
– Domain specific data
– Web / public data & Social Media
– Sensor data
Enrich the data
– Feature extraction & machine learning
– Classification, ranking, forecasting,
segmentation, clustering, natural language
processing
Present & visualize
5. Big Dikes and Big Data
Stichting IJkdijk
IJkdijk
Field lab Livedijk (XL)
6. Big Dikes and Big Data
Reduced Time Series Representation
● Dijkgraaf (dike warden): I see something weird at sensor X..
– .. have I seen it before at sensor X? (my talk)
– .. have I seen it before at other sensors? (Frens Jan's talk)
● Query sensor's history by example
● Time series might be..
– .. too big to store
– .. too big to analyze
● Solution: reduced representation
● Seminal techniques:
– Piecewise linear approximation
– Symbolic aggregate approximation
7. Big Dikes and Big Data
Piecewise Linear Approximation
Basic idea:
– Represent time series as a sequence of
straight lines
– Lines can be connected (N/2 lines) or
disconnected (N/3 lines)
– High compression rates
– Segment as you like, dynamic lengths
Each line segment has:
• length
• left_height
• right_height
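To make the idea concrete, here is a minimal sketch of piecewise linear approximation in Python. It assumes fixed-width segments and a least-squares line fit per segment; the talk also mentions dynamic segment lengths, which this sketch does not implement.

```python
# Illustrative PLA sketch: fixed-width segments, least-squares line per segment.
def fit_line(ys):
    """Least-squares line fit; returns (left_height, right_height)."""
    n = len(ys)
    mx = (n - 1) / 2.0
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, my + slope * (n - 1 - mx)

def pla(series, width):
    """Represent a series as (length, left_height, right_height) segments."""
    out = []
    for i in range(0, len(series), width):
        seg = series[i:i + width]
        left, right = fit_line(seg)
        out.append((len(seg), left, right))
    return out
```

With connected segments one would store only one height per segment (N/2 values, as on the slide); the disconnected variant above stores both endpoints.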
8. Big Dikes and Big Data
Symbolic Aggregate Approximation
Basic idea:
– Segment using fixed frame width
– Converts numerical time series into an
equivalent symbolic representation
– String analysis techniques can be used for
analyzing time series
baabccbc
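A minimal SAX sketch in Python, producing strings like the "baabccbc" example above. It assumes an alphabet of size 4 with the standard Gaussian breakpoints; the frame width and alphabet used in the talk are not specified.

```python
import bisect
import statistics

# Equiprobable regions for a standard normal, alphabet size 4 (assumed).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax(series, frame):
    """z-normalize, average per fixed-width frame (PAA), map to symbols."""
    mu = statistics.fmean(series)
    sd = statistics.pstdev(series) or 1.0
    z = [(v - mu) / sd for v in series]
    symbols = []
    for i in range(0, len(z), frame):
        window = z[i:i + frame]
        paa = sum(window) / len(window)  # piecewise aggregate approximation
        symbols.append(ALPHABET[bisect.bisect(BREAKPOINTS, paa)])
    return "".join(symbols)
```

Because the breakpoints make each symbol equally likely under a normal distribution, string distances on the SAX word lower-bound distances on the original series, which is what makes string analysis techniques applicable.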
9. Big Dikes and Big Data
Symbolic Aggregate Approximation
10. Big Dikes and Big Data
Symbolic Aggregate Approximation
11. Big Dikes and Big Data
3S Representation
Segment Symbolic Shape-Based Representation
Basic idea:
– A time series is decomposed into
monotonic segments of variable
lengths
– Each segment is fitted to a
monotonic shape and therefore
represented as a symbol of an
alphabet
– Symbolic (like SAX), but the symbols also
capture shape and direction
12. Big Dikes and Big Data
3S Representation
Segment Symbolic Shape-Based Representation
Storing more than one symbol:
– "String matching" (Levenshtein, Hamming, etc.)
→ INFORMATION LOSS!
– Each segment is approximated by:
● μ + θ·xₙ + σ·f(xₙ)
– Physical meaning:
● μ → offset
● σ → amplitude
● θ → linear drift with regard to...
● f → shape
● N → length, # of data points
(f, μ, σ, θ, N)
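An illustrative sketch of extracting the numeric part of a 3S tuple from one segment, following the physical meanings above: μ as the mean, σ as the amplitude, θ as the least-squares slope, N as the length. Fitting the shape symbol f against an alphabet of monotonic shapes is omitted here; this is a simplification, not the talk's exact fitting procedure.

```python
def three_s_params(seg):
    """Extract (mu, sigma, theta, n) from one monotonic segment.

    mu = offset (mean), sigma = amplitude (std dev),
    theta = linear drift (least-squares slope), n = # of data points.
    The shape symbol f is not fitted in this simplified sketch.
    """
    n = len(seg)
    mu = sum(seg) / n
    sigma = (sum((v - mu) ** 2 for v in seg) / n) ** 0.5
    mx = (n - 1) / 2.0
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (v - mu) for x, v in enumerate(seg))
    theta = sxy / sxx if sxx else 0.0
    return mu, sigma, theta, n
```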
13. Big Dikes and Big Data
3S Representation
Fast and accurate matching:
– Euclidean distance between
segments
– In constant time, i.e. independent
of segment length N
● summation of polynomials
– Allows for different invariances:
● ignore μ → offset invariance
● ignore σ → amplitude invariance
● ignore θ → drift invariance
● ignore N → length invariance
[(f,μ,σ,θ,N)ᵢ]
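The constant-time property can be illustrated by comparing segments on their parameter tuples alone. The sketch below uses a plain Euclidean distance over the parameters, not the talk's exact polynomial-summation formula, and models invariances by simply ignoring parameters; the tuple layout and shape handling are assumptions.

```python
def seg_distance(a, b, ignore=()):
    """Distance between two 3S tuples (f, mu, sigma, theta, n) in O(1).

    Simplified Euclidean distance over the segment parameters. Passing a
    parameter name in `ignore` yields the corresponding invariance, e.g.
    ignore=("mu",) makes the match offset-invariant.
    """
    names = ("f", "mu", "sigma", "theta", "n")
    d2 = 0.0
    for name, x, y in zip(names, a, b):
        if name == "f":
            if x != y and "f" not in ignore:
                return float("inf")  # different shapes never match
        elif name not in ignore:
            d2 += (x - y) ** 2
    return d2 ** 0.5
```

Note that nothing here iterates over the N underlying data points: the cost is fixed per segment pair, which is what makes retrieval over long histories feasible.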
14. Big Dikes and Big Data
Time Series Retrieval
Fast, flexible and accurate matching using the 3S representation:
15. Big Dikes and Big Data
Time Series Retrieval
But what if you want to search
in the history of multiple sensors?
17. Storage and processing
with 3S representation
Use case: search time series by example
across the history of many sensors
● Storage and processing of sensor data hits the limits of
'traditional' database or file system based approaches
– at least for 'enough' sensors
● Technical dive into a distributed architecture:
– with distributed storage: Apache Cassandra
– with distributed processing: Apache Spark
(no guarantees; the ideal architecture highly depends on use case specifics ...)
18. Distributed Storage
● Distributed storage advantages:
– scalability: more nodes → more storage and I/O capacity
– availability: more nodes → make progress during failure
– reliability: more nodes → don't lose data on failure
● Many solutions available; at Target Holding we extensively use
Apache Cassandra (C*) for high volume data
– because, among other things, it scales well, is easy to operate, performs OK
on spinning disks (SSDs are not needed per se), and allows easier access to
data than a file system
– also the storage system of choice within DDSC
19. Distributed Processing
● Distributed processing advantages:
– scalability: more nodes → more compute capacity
– availability: more nodes → make progress during failure
– reliability: more nodes → don't lose data on failure
– with local processing being CPU bound instead of IO bound
● With C* as a starting point, the options are to build our own, use
Hadoop M/R, or use Apache Spark, which we are investigating
– because of its integration with C*, high level abstraction, rich tool set,
and stream processing capability
– and it's getting a lot of traction in the Hadoop ecosystem
– (adoption by Cloudera, Hortonworks, MapR, Apache Mahout, etc.)
20. Spark with Cassandra
● A typical Spark with Cassandra deployment co-locates
Spark Workers with Cassandra nodes:
image courtesy of DataStax
● Allows data locality: push-down filtering, transformations, etc.
21. Storage and processing
with 3S representation
● Distribution based on time series identifier, e.g. the sensor id
– or something which identifies the location of measurement, …
● Store the tuple <f, μ, σ, θ, N> together with a timestamp
– the full 3S time series for each sensor must fit completely on one node
● Goal: find the series of segments which are closest to
the example (simplified for presentation)
● Approach: produce a global top-k out of local top-k's
(applies also without the simplification)
22. Storage and processing
with 3S representation
● Locally find the best matches, then repeat globally
– Parallelizes and distributes most of the computation
– Limits IO to the communication of the local best matches
Parallel distributed execution:
1. Read segments, possibly restricted by sensor ids and time range
2. Group by sensor id
3. Create sliding window over the time series
4. Zip each window with the example
5. Calculate the Euclidean distance per segment pair and sum → distance for each window
6. Select the best local matches: take top k ordered by distance
7. Take the global top k ordered by distance
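The "global top-k out of local top-k's" step can be sketched in a few lines of Python. The partition layout and the (sensor_id, offset, distance) match tuples are illustrative names, not the talk's actual data model.

```python
import heapq

# Each partition (node) returns only its k best matches;
# the driver merges those small lists.
def local_top_k(matches, k):
    """Runs on each worker over its local matches."""
    return heapq.nsmallest(k, matches, key=lambda m: m[2])

def global_top_k(partitions, k):
    """Runs on the driver over the workers' local top-k lists."""
    locals_ = [local_top_k(p, k) for p in partitions]
    return heapq.nsmallest(k, (m for part in locals_ for m in part),
                           key=lambda m: m[2])
```

This is exact, not approximate: any match in the true global top k is also among the k best of its own partition, so it survives the local step; only k tuples per node cross the network.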
23. Storage and processing
with 3S representation
Parallel distributed execution:
● Worker nodes, each co-located with a C* node, read segments
and select the best local matches
● The Master coordinates the cluster
● The Application (aka driver) takes the top k ordered by distance
25. Apache Cassandra
● Key value store (with some enhancements)
● Based on Dynamo distribution and Big Table local storage
● Partitioned (distributed) map of
– sorted maps of
● primitives, structs
● maps, lists, sets
● counters (crdt)
warning …
● personal mental model …
● the truth is in the code …
● caveat emptor
26. CQL
● Cassandra Query Language helps with working with C*
/* 3s timeseries in CQL */
CREATE TABLE symbolic (
s text, -- sensor identifier
t timestamp, -- start of segment
o float, -- offset
a float, -- amplitude
d float, -- drift
f int, -- function / shape
l int, -- length (# of data points)
/* partition by sensor identifier,
order by timestamp */
PRIMARY KEY ((s), t)
)
27. Distribution & Consistency
● Partitioning based on hashed key in conjunction with positions
of node tokens.
image courtesy of DataStax
● Consistent replication when R + W > N
R = # nodes read from, W = # nodes written to, N = replication factor
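The quorum rule above fits in one line: reads observe the latest write exactly when the read set and write set must overlap in at least one replica.

```python
def is_consistent(r, w, n):
    """R + W > N quorum rule.

    r = # nodes read from, w = # nodes written to, n = replication factor.
    True when every read set must overlap every write set in >= 1 replica.
    """
    return r + w > n
```

For example, with a replication factor of 3, QUORUM reads plus QUORUM writes (2 + 2 > 3) are consistent, while ONE + ONE (1 + 1 > 3 is false) is not.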
29. Apache Spark
● Spark is a distributed computing platform with fairly
rich primitives operating on distributed data sets.
● Spark can be used with data from different data sources
– HDFS, Cassandra, Elasticsearch to name a few
● Spark has libraries for: SQL, graph processing and
machine learning
30. Operator graphs
● It allows execution of DAGs
of operators
– without using disk for
intermediary results
– employs pipelining
if possible
– (cyclic / iterative data
flows are executed as unrolled, acyclic graphs ...)
31. Architecture
● Applications which allocate CPUs and memory
on Worker Nodes coordinated by a Master
● Applications schedule Jobs which are DAGs of Tasks
● Tasks consume & produce Resilient Distributed Datasets
32. Expressive 'language'
● Spark is developed in Scala, supports Java and Python.
● I consider Spark expressive:
val wordCount = sc
.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
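For readers less familiar with Scala, the same pipeline in plain Python: flatMap → map → reduceByKey collapses to splitting and counting. `lines` stands in for the HDFS file.

```python
from collections import Counter

def word_count(lines):
    """Equivalent of the Spark snippet: split each line on spaces,
    emit (word, 1) pairs, and reduce by key via counting."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(" "))
    return counts
```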
33. Storage and processing
with 3S representation
● Locally find the best matches, then repeat globally
– Parallelizes and distributes most of the computation
– Limits IO to the communication of the local best matches
Parallel distributed execution:
1. Read segments, possibly restricted by sensor ids and time range
2. Group by sensor id
3. Create sliding window over the time series
4. Zip each window with the example
5. Calculate the Euclidean distance per segment pair and sum → distance for each window
6. Select the best local matches: take top k ordered by distance
7. Take the global top k ordered by distance
34. Algorithm in Spark
// Set up the context and broadcast the example
val conf = new SparkConf()
  .setAAA(...).setBBB(...) ...setZZZ(...)
val sc = new SparkContext(conf)
val example = sc.broadcast(Array(
  new Segment(...), ..., new Segment(...)
))
val k = 10
// Read segments
val segments = sc
  .cassandraTable(keyspace, table)
  .map(fromRow)
35. Algorithm in Spark
// Select the best local matches: distance for each window
val matches = segments.mapPartitions(
  _.toSeq
    .groupBy(seg => seg.s)  // group by sensor id
    .iterator
    .flatMap({
      case (s, segs) =>
        segs
          .sliding(example.value.length)  // sliding window over the time series
          .map(w => (
            s, w,
            w.zip(example.value)          // zip with the example
              .map(distance)              // euclidean distance per segment pair
              .map(math.abs)
              .sum                        // sum
          ))
    }))
36. Algorithm in Spark
// Take top k ordered by distance
val top = matches
  .takeOrdered(k)(Ordering.by(_._3))