Big Dikes and Big Data 
12 November 2014 
Big Data Groningen Meetup 
Frens Jan Rumph 
Michiel van der Ree
Target Holding 
Big Data to Intelligence 
Big Data Analytics is our key 
competence 
– using machine learning and 
pattern recognition techniques to 
extract value from large data sets 
Founded in 2009 and founding 
partner of Target 
– Dutch Public-Private Cooperation on 
Big Data, partners including IBM, 
Oracle, Astron/Lofar, RUG, UMCG 
Developed various innovative 
algorithms and technology which 
we apply across multiple markets: 
– Energy & Water management 
– Media & Entertainment 
– Healthy Ageing 
– High Tech Systems
Collect big data 
– Domain specific data 
– Web / public data & Social Media 
– Sensor data 
Enrich the data 
– Feature extraction & machine learning 
– Classification, ranking, forecasting, 
segmentation, clustering, natural language 
processing 
Present & visualize
3S timeseries representation
Stichting IJkdijk 
IJkdijk 
Field lab Livedijk (XL)
Reduced Time Series Representation 
● Dike reeve (dijkgraaf): I see something weird at sensor X.. 
– .. have I seen it before at sensor X? (my talk) 
– .. have I seen it before at other sensors? (Frens Jan's talk) 
● Query sensor's history by example 
● Time series might be.. 
– .. too big to store 
– .. too big to analyze 
● Solution: reduced representation 
● Seminal techniques: 
– Piecewise linear approximation 
– Symbolic aggregate approximation
Piecewise Linear Approximation 
Basic idea: 
– Represent time series as a sequence of 
straight lines 
– Lines can be connected (N/2 lines) or 
disconnected (N/3 lines) 
– High compression rates 
– Segment as you like, dynamic lengths 
Each connected line segment 
has 
• length 
• left_height 
Each disconnected line segment 
has 
• length 
• left_height 
• right_height
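A minimal sketch of the disconnected variant, under simplifying assumptions that are not from the slides: segments have a fixed width (the slides allow dynamic lengths), and each line simply joins the first and last point of its chunk instead of being fitted by regression. The names `Pla`, `Line`, `approximate` and `reconstruct` are made up for illustration.

```scala
object Pla {
  // One disconnected line segment: number of points covered plus both end heights.
  final case class Line(length: Int, leftHeight: Double, rightHeight: Double)

  // Cut the series into chunks of at most `width` points and keep only each chunk's end points.
  def approximate(series: Seq[Double], width: Int): Vector[Line] =
    series.grouped(width).map(c => Line(c.length, c.head, c.last)).toVector

  // Rebuild an approximate series by linear interpolation inside each segment.
  def reconstruct(lines: Seq[Line]): Seq[Double] =
    lines.flatMap { l =>
      if (l.length == 1) Seq(l.leftHeight)
      else (0 until l.length).map { i =>
        l.leftHeight + (l.rightHeight - l.leftHeight) * i / (l.length - 1)
      }
    }
}
```

Each chunk of `width` points shrinks to three numbers, which is where the compression comes from; a series that is not piecewise linear is only reconstructed approximately.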
Symbolic Aggregate Approximation 
Basic idea: 
– Segment using fixed frame width 
– Converts numerical time series into an 
equivalent symbolic representation 
– String analysis techniques can be used for 
analyzing time series 
baabccbc
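The idea can be sketched in a few lines of plain Scala. This is an illustration under assumptions, not the implementation behind the slides: the series is z-normalized, averaged per fixed-width frame, and mapped to a three-letter alphabet using the breakpoints -0.43 and 0.43, which split a standard normal distribution into three roughly equiprobable regions. The object and function names are made up.

```scala
object Sax {
  // Breakpoints splitting a standard normal distribution into three equiprobable regions.
  val breakpoints: Vector[Double] = Vector(-0.43, 0.43)

  // Rescale to zero mean and unit (population) standard deviation.
  def zNormalize(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.length
    val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.length)
    if (std == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - mean) / std)
  }

  // Piecewise aggregate approximation: the mean of every frame of `width` points.
  def paa(xs: Seq[Double], width: Int): Seq[Double] =
    xs.grouped(width).map(f => f.sum / f.length).toVector

  // One symbol per frame: 'a' below the first breakpoint, 'b' between, 'c' above.
  def symbolize(xs: Seq[Double], width: Int): String =
    paa(zNormalize(xs), width)
      .map(v => ('a' + breakpoints.count(_ <= v)).toChar)
      .mkString
}
```

For example, `Sax.symbolize(Seq(0.0, 0.0, 5.0, 5.0, 10.0, 10.0), 2)` gives one symbol for each of the three frames.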
Segment Symbolic Shape-Based Representation
Basic idea: 
– A time series is decomposed into 
monotonic segments of variable 
length 
– Each segment is fitted to a 
monotonic shape and therefore 
represented as a symbol of an 
alphabet 
– Symbolic (like SAX), but the symbols also 
capture shape and direction 
3S Representation 
Segment Symbolic Shape-Based Representation
Storing more than one symbol: 
– “String Matching” (Levenshtein, Hamming, etc.) 
→ INFORMATION LOSS! 
– Each segment is approximated by: 
● μ + σ θ f(xn) 
– Physical meaning: 
● μ → offset, 
● σ → amplitude, 
● θ → linear drift with regard to... 
● f → shape 
● N → length, # of data points 
3S Representation 
(f,μ,σ,θ,N) 
Segment Symbolic Shape-Based Representation
Fast and accurate matching: 
– Euclidean distance between 
segments 
– In constant time, i.e. independent 
of segment length N 
● summation of polynomials 
– Allows for different invariances: 
→ μ (offset) 
→ σ (amplitude) 
→ N (length) 
→ θ (drift) 
3S Representation 
[(f,μ,σ,θ,N)_i] 
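Why a "summation of polynomials" gives constant time can be illustrated with the simplest possible shape. The sketch below is an assumption-laden stand-in, not the 3S distance itself: both segments are straight lines a + b·n sampled at n = 0 .. N-1, so the squared Euclidean distance is a sum of a quadratic in n, which has a closed form. The names are hypothetical; `bruteForce` exists only to check the closed form.

```scala
object SegmentDistance {
  // Squared Euclidean distance between a1 + b1*n and a2 + b2*n for n = 0 until length,
  // using closed forms for the sums of n and n^2 instead of a loop over the points.
  def closedForm(a1: Double, b1: Double, a2: Double, b2: Double, length: Int): Double = {
    val n = length.toDouble
    val da = a1 - a2
    val db = b1 - b2
    val sumN = n * (n - 1) / 2                // 0 + 1 + ... + (N-1)
    val sumN2 = (n - 1) * n * (2 * n - 1) / 6 // 0^2 + 1^2 + ... + (N-1)^2
    da * da * n + 2 * da * db * sumN + db * db * sumN2
  }

  // Reference implementation that visits every point; cost grows with the length.
  def bruteForce(a1: Double, b1: Double, a2: Double, b2: Double, length: Int): Double =
    (0 until length).map { n =>
      val d = (a1 - a2) + (b1 - b2) * n
      d * d
    }.sum
}
```

The closed form does the same amount of arithmetic for N = 4 as for N = 4 million, which is the property the slide relies on.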
Time Series Retrieval
Fast, flexible and accurate matching using the 3S representation:
But what if you want to search 
in the history of multiple sensors?
Storage and processing 
with 3S representation
Use case: search time series by example 
across the time series of many sensors 
● Storage and processing of sensor data hits the limits of 
'traditional' database or file system based approaches 
– given 'enough' sensors. 
● Technical dive into a distributed architecture: 
– with distributed storage: Apache Cassandra 
– with distributed processing: Apache Spark 
(no guarantees; the ideal architecture depends heavily on use case specifics ...)
Distributed Storage 
● Distributed storage advantages: 
– scalability: more nodes → more storage and I/O capacity 
– availability: more nodes → make progress during failure 
– reliability: more nodes → don't lose data on failure 
● Many solutions are available; at Target Holding we extensively use 
Apache Cassandra (C*) for high volume data 
– because, among other things, it scales well, is easy to operate, performs OK on 
disks (SSDs are not strictly needed), and allows easier access to data than a file 
system 
– it is also the storage system of choice within DDSC
Distributed Processing 
● Distributed processing advantages: 
– scalability: more nodes → more compute capacity 
– availability: more nodes → make progress during failure 
– reliability: more nodes → don't lose data on failure 
– with local processing being CPU bound instead of I/O bound 
● With C* as a starting point, the options are to build our own, use 
Hadoop M/R, or use Apache Spark, which we are investigating 
– because of its integration with C*, high level abstraction, rich tool set, 
stream processing capability 
– and it's getting a lot of traction in the Hadoop ecosystem 
– (adoption by Cloudera, Hortonworks, MapR, Apache Mahout, etc.)
Spark with Cassandra 
● A typical Spark with Cassandra deployment collocates 
Spark Workers with Cassandra nodes: 
[Figure: Spark Workers collocated with Cassandra nodes; image courtesy of DataStax] 
● Allows data locality: push down filtering, transformations, etc.
Storage and processing 
with 3S representation 
● Distribution based on timeseries identifier, e.g. the sensor id 
– or something which identifies the location of measurement, … 
● Store the tuple <f, μ, σ, θ, N> together with a timestamp 
– the full 3S time series for each sensor must fit completely on one node 
● Goal: Find series of segments which are closest to 
the example (simplified for presentation) 
● Approach: Produce a global top-k out of local top-k's 
(applies also without simplification)
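The "global top-k out of local top-k's" approach can be sketched with plain Scala collections standing in for the nodes. Everything here is illustrative: `Match`, `localTopK` and `globalTopK` are made-up names, and each inner sequence plays the role of the candidate matches found on one node.

```scala
object TopK {
  // A candidate match: which sensor it came from and how far it is from the example.
  final case class Match(sensor: String, distance: Double)

  // Best k matches among the candidates available on one node.
  def localTopK(matches: Seq[Match], k: Int): Seq[Match] =
    matches.sortBy(_.distance).take(k)

  // Only k matches per node are collected; the final top k is selected from those.
  def globalTopK(nodes: Seq[Seq[Match]], k: Int): Seq[Match] =
    localTopK(nodes.flatMap(localTopK(_, k)), k)
}
```

This is correct because any match in the global top k must also be in the top k of the node it lives on, so nothing is lost by discarding the rest locally. At most k matches per node have to travel over the network.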
Storage and processing 
with 3S representation 
● Locally find the best matches, then repeat globally 
– Parallelizes and distributes most of the computation 
– Limits IO to the communication of the local best matches 
● Read segments 
– possibly restricted by sensor ids and time range 
● Select best local matches 
– Group by sensor id 
– Create sliding window over time series 
– Calculate distance for each window 
  • Zip with example 
  • Calculate Euclidean distance per segment pair 
  • Sum 
– Take top k ordered by distance 
● Take top k ordered by distance 
(the first two steps run as parallel distributed execution)
Storage and processing 
with 3S representation 
[Diagram: Workers, each collocated with a C* node, read segments and select the best local matches (parallel distributed execution); the Application (aka driver) takes the top k ordered by distance; the Master coordinates the cluster.]
Apache Cassandra
● Key-value store (with some enhancements) 
● Based on Dynamo-style distribution and Bigtable-style local storage 
● Partitioned (distributed) map of 
– sorted maps of 
● primitives, structs 
● maps, lists, sets 
● counters (CRDT) 
warning … 
● personal mental model … 
● the truth is in the code … 
● caveat emptor
CQL 
● Cassandra Query Language helps with working with C* 
/* 3S time series in CQL */ 
CREATE TABLE symbolic ( 
    s text,      -- sensor identifier 
    t timestamp, -- start of segment 
    o float,     -- offset (μ) 
    a float,     -- amplitude (σ) 
    d float,     -- drift (θ) 
    f int,       -- function / shape 
    l int,       -- length (number of data points) 
    /* partition by sensor identifier, 
       order by timestamp */ 
    PRIMARY KEY ((s), t) 
);
Distribution & Consistency 
● Partitioning based on hashed key in conjunction with positions 
of node tokens. 
[Figure: partitioning by node tokens; image courtesy of DataStax] 
● Consistent replication when R + W > N 
R = # nodes read from, W = # nodes written to, N = replication factor
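Both ideas on this slide fit in a small sketch. It is a toy model under stated assumptions: the token values, node names and the `hashCode`-based `token` function are invented for illustration (Cassandra uses its own partitioner, Murmur3 by default, and a far larger token range), and replication to further nodes on the ring is left out.

```scala
import scala.collection.immutable.TreeMap

object Ring {
  // Hypothetical node tokens on a ring of size 100, each mapped to the node that owns it.
  val tokens: TreeMap[Int, String] =
    TreeMap(10 -> "node-a", 35 -> "node-b", 60 -> "node-c", 85 -> "node-d")

  // Stand-in hash of a partition key onto 0..99; not Cassandra's real partitioner.
  def token(key: String): Int = java.lang.Math.floorMod(key.hashCode, 100)

  // The owner is the first node whose token is >= the key's token, wrapping around the ring.
  def ownerOf(t: Int): String = {
    val it = tokens.iteratorFrom(t)
    if (it.hasNext) it.next()._2 else tokens.head._2
  }

  // With r + w > n, every read set overlaps every write set in at least one replica.
  def readOverlapsWrite(r: Int, w: Int, n: Int): Boolean = r + w > n
}
```

With a replication factor of 3, reading and writing at quorum (R = W = 2) satisfies the condition, while R = W = 1 does not.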
Apache Spark
● Spark is a distributed computing platform with fairly 
rich primitives operating on distributed data sets. 
● Spark can be used with data from different data sources 
– HDFS, Cassandra, Elasticsearch to name a few 
● Spark has libraries for: SQL, graph processing and 
machine learning
Operator graphs 
● It allows execution of DAGs 
of operators 
– without using disk for 
intermediate results 
– employs pipelining 
where possible 
– (cyclic / iterative data 
flows are unrolled, not truly cyclic ...)
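Pipelining can be illustrated on a single machine with a plain Scala `Iterator`; this is an analogy, not Spark code, and the names are made up. Each element flows through both operators before the next element is read, so no intermediate collection is built between the `map` and the `filter`.

```scala
import scala.collection.mutable.ListBuffer

object Pipelining {
  // Returns the result plus a log of the order in which the operators ran.
  def run(input: Seq[Int]): (List[Int], List[String]) = {
    val log = ListBuffer.empty[String]
    val result = input.iterator
      .map { x => log += s"double $x"; x * 2 }    // first operator
      .filter { x => log += s"filter $x"; x > 2 } // second operator, fed one element at a time
      .toList
    (result, log.toList)
  }
}
```

The log shows the two operators interleaving per element rather than the first operator finishing over the whole input before the second starts.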
Architecture 
● Applications allocate CPUs and memory 
on Worker Nodes, coordinated by a Master 
● Applications schedule Jobs, which are DAGs of Tasks 
● Tasks consume & produce Resilient Distributed Datasets (RDDs)
Expressive 'language' 
● Spark is developed in Scala and also supports Java and Python. 
● I consider Spark expressive: 
// Word count: split lines into words, pair each word with 1, sum per word 
val wordCount = sc 
  .textFile("hdfs://...") 
  .flatMap(line => line.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey(_ + _)
Algorithm in Spark 
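import org.apache.spark.{SparkConf, SparkContext} 
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext 

val conf = new SparkConf() 
  .setAAA(...).setBBB(...) ...setZZZ(...) 
val sc = new SparkContext(conf) 

// The example to search for, broadcast once to every worker 
val example = sc.broadcast(Array( 
  new Segment(...), ..., new Segment(...) 
)) 
val k = 10 

// Read segments: one Segment per Cassandra row 
val segments = sc 
  .cassandraTable(keyspace, table) 
  .map(fromRow) 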
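Algorithm in Spark 
// Select best local matches: for every sensor, slide a window the size of the 
// example over its segments and calculate the distance for each window 
val matches = segments.mapPartitions( 
  _.toSeq                                           // Iterator has no groupBy 
    .groupBy(seg => seg.s)                          // group by sensor id 
    .iterator                                       // mapPartitions must return an Iterator 
    .flatMap({ 
      case (s, segs) => 
        segs 
          .sliding(example.value.length)            // sliding window over time series 
          .filter(_.length == example.value.length) // skip series shorter than the example 
          .map(w => ( 
            s, w, 
            w.zip(example.value)                    // zip with example 
              .map(distance)                        // Euclidean distance per segment pair 
              .map(math.abs) 
              .sum                                  // sum 
          )) 
    })) 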
Algorithm in Spark 
// Take top k ordered by distance (the third element of each match) 
val top = matches 
  .takeOrdered(k)(Ordering.by(_._3)) 
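To try the same pipeline without a cluster, the steps can be mirrored on plain Scala collections. This sketch rests on assumptions: the `Segment` here carries only a sensor id, offset and amplitude, and `distance` is a hypothetical per-segment Euclidean distance over those two values, not the 3S distance from the talk. Windows are reported by their start index instead of their contents.

```scala
object LocalSearch {
  // Simplified stand-in for a 3S segment (shape, drift and length are left out).
  final case class Segment(sensor: String, offset: Double, amplitude: Double)

  // Hypothetical per-segment distance: Euclidean distance over (offset, amplitude).
  def distance(a: Segment, b: Segment): Double =
    math.sqrt(math.pow(a.offset - b.offset, 2) + math.pow(a.amplitude - b.amplitude, 2))

  // Returns (sensor, window start index, summed distance) for the k closest windows.
  def search(segments: Seq[Segment], example: Seq[Segment], k: Int): Seq[(String, Int, Double)] =
    segments
      .groupBy(_.sensor)                          // group by sensor id
      .toSeq
      .flatMap { case (sensor, segs) =>
        segs.sliding(example.length)              // sliding window over time series
          .filter(_.length == example.length)     // drop a window shorter than the example
          .zipWithIndex
          .map { case (window, start) =>
            val total = window.zip(example).map { case (a, b) => distance(a, b) }.sum
            (sensor, start, total)
          }
      }
      .sortBy(_._3)                               // take top k ordered by distance
      .take(k)
}
```

Segments are assumed to arrive in time order per sensor, as they would when read from a table clustered by timestamp.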
Questions?

More Related Content

What's hot

DASH Locality Hierarchies (PADAL'16)
DASH Locality Hierarchies (PADAL'16)DASH Locality Hierarchies (PADAL'16)
DASH Locality Hierarchies (PADAL'16)Menlo Systems GmbH
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaSpark Summit
 
Introduction to Hadoop : A bird eye's view | Abhishek Mukherjee
Introduction to Hadoop : A bird eye's view | Abhishek MukherjeeIntroduction to Hadoop : A bird eye's view | Abhishek Mukherjee
Introduction to Hadoop : A bird eye's view | Abhishek MukherjeeFinTechopedia
 
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Spark Summit
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...Big Data Spain
 
Finalprojectpresentation
FinalprojectpresentationFinalprojectpresentation
FinalprojectpresentationSANTOSH WAYAL
 
Meet Hadoop Family: part 2
Meet Hadoop Family: part 2Meet Hadoop Family: part 2
Meet Hadoop Family: part 2caizer_x
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit
 
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkSpark Summit
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processinghuguk
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programsjani shaik
 
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defenseLarge-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defenseAapo Kyrölä
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingJonathan Dursi
 

What's hot (20)

DASH Locality Hierarchies (PADAL'16)
DASH Locality Hierarchies (PADAL'16)DASH Locality Hierarchies (PADAL'16)
DASH Locality Hierarchies (PADAL'16)
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
 
Introduction to Hadoop : A bird eye's view | Abhishek Mukherjee
Introduction to Hadoop : A bird eye's view | Abhishek MukherjeeIntroduction to Hadoop : A bird eye's view | Abhishek Mukherjee
Introduction to Hadoop : A bird eye's view | Abhishek Mukherjee
 
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
 
Finalprojectpresentation
FinalprojectpresentationFinalprojectpresentation
Finalprojectpresentation
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Meet Hadoop Family: part 2
Meet Hadoop Family: part 2Meet Hadoop Family: part 2
Meet Hadoop Family: part 2
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
 
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
 
MapReduce in Cloud Computing
MapReduce in Cloud ComputingMapReduce in Cloud Computing
MapReduce in Cloud Computing
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By Spark
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processing
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defenseLarge-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
 
NGBT_poster_v0.4
NGBT_poster_v0.4NGBT_poster_v0.4
NGBT_poster_v0.4
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
 

Viewers also liked

Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightRed_Hat_Storage
 
Webinar | Target Modernizes Retail with Engaging Digital Experiences
Webinar | Target Modernizes Retail with Engaging Digital ExperiencesWebinar | Target Modernizes Retail with Engaging Digital Experiences
Webinar | Target Modernizes Retail with Engaging Digital ExperiencesDataStax
 
Target: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at TargetTarget: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at TargetDataStax Academy
 
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016Sebastian Verheughe
 
Best buy strategic analysis (bb team) final
Best buy strategic analysis (bb team) finalBest buy strategic analysis (bb team) final
Best buy strategic analysis (bb team) finalRichard Chan, MBA
 
Electronics Industry (Marketing Management)
Electronics Industry (Marketing Management)Electronics Industry (Marketing Management)
Electronics Industry (Marketing Management)Shabbir Akhtar
 
Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014Dan Cundiff
 
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-SystemsDemystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-SystemsHortonworks
 
Operating Model
Operating ModelOperating Model
Operating Modelrmuse70
 
Using APIs to Create an Omni-Channel Retail Experience
Using APIs to Create an Omni-Channel Retail ExperienceUsing APIs to Create an Omni-Channel Retail Experience
Using APIs to Create an Omni-Channel Retail ExperienceCA API Management
 
Best buy-analysis
Best buy-analysisBest buy-analysis
Best buy-analysisTaposh Roy
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...DataStax
 
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014Lisa Fischer
 

Viewers also liked (15)

Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
 
Webinar | Target Modernizes Retail with Engaging Digital Experiences
Webinar | Target Modernizes Retail with Engaging Digital ExperiencesWebinar | Target Modernizes Retail with Engaging Digital Experiences
Webinar | Target Modernizes Retail with Engaging Digital Experiences
 
Target: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at TargetTarget: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at Target
 
Hadoop for the Masses
Hadoop for the MassesHadoop for the Masses
Hadoop for the Masses
 
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
 
Best buy strategic analysis (bb team) final
Best buy strategic analysis (bb team) finalBest buy strategic analysis (bb team) final
Best buy strategic analysis (bb team) final
 
Electronics Industry (Marketing Management)
Electronics Industry (Marketing Management)Electronics Industry (Marketing Management)
Electronics Industry (Marketing Management)
 
Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014
 
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-SystemsDemystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
 
Operating Model
Operating ModelOperating Model
Operating Model
 
Using APIs to Create an Omni-Channel Retail Experience
Using APIs to Create an Omni-Channel Retail ExperienceUsing APIs to Create an Omni-Channel Retail Experience
Using APIs to Create an Omni-Channel Retail Experience
 
Best buy
Best buyBest buy
Best buy
 
Best buy-analysis
Best buy-analysisBest buy-analysis
Best buy-analysis
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
 

Similar to Target Holding - Big Dikes and Big Data

PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkSpark Summit
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analyticsAnirudh
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Databricks
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internalsAnton Kirillov
 
Azure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAzure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAndre Essing
 
CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingPalani Kumar
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...Deltares
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Anant Corporation
 

Similar to Target Holding - Big Dikes and Big Data (20)

PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Azure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAzure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep Dive
 
CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_Computing
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
ADAPTER
ADAPTERADAPTER
ADAPTER
 
Data Science
Data ScienceData Science
Data Science
 
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
 

Target Holding - Big Dikes and Big Data

  • 1. Big Dikes and Big Data. 12 November 2014, Big Data Groningen Meetup. Frens Jan Rumph, Michiel van der Ree
  • 2. Target Holding: Big Data to Intelligence
      Big Data Analytics is our key competence
        – using machine learning and pattern recognition techniques to extract value from large data sets
      Founded in 2009 and founding partner of Target
        – Dutch public-private cooperation on Big Data, with partners including IBM, Oracle, Astron/Lofar, RUG and UMCG
      Developed various innovative algorithms and technologies, which we apply across multiple markets:
        – Energy & Water management
        – Media & Entertainment
        – Healthy Ageing
        – High Tech Systems
  • 3. Target Holding: Big Data to Intelligence
      Collect big data
        – Domain-specific data
        – Web / public data & social media
        – Sensor data
      Enrich the data
        – Feature extraction & machine learning
        – Classification, ranking, forecasting, segmentation, clustering, natural language processing
      Present & visualize
  • 5. Stichting IJkdijk. IJkdijk field lab. Livedijk (XL)
  • 6. Reduced Time Series Representation
      ● Dike warden (dijkgraaf): I see something weird at sensor X...
        – ... have I seen it before at sensor X? (my talk)
        – ... have I seen it before at other sensors? (Frens Jan's talk)
      ● Query a sensor's history by example
      ● Time series might be...
        – ... too big to store
        – ... too big to analyze
      ● Solution: reduced representation
      ● Seminal techniques:
        – Piecewise linear approximation
        – Symbolic aggregate approximation
  • 7. Piecewise Linear Approximation
      Basic idea:
        – Represent the time series as a sequence of straight lines
        – Lines can be connected (N/2 lines) or disconnected (N/3 lines)
        – High compression rates
        – Segment as you like, dynamic lengths
      Each connected line segment has
        • length
        • left_height
      Each disconnected line segment has
        • length
        • left_height
        • right_height
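A minimal sketch of the disconnected variant, assuming fixed-width windows and simple endpoint interpolation (the slide allows arbitrary, dynamic segment lengths, and real implementations often fit the lines by least squares). The names `PlaSketch`, `Line`, `approximate` and `reconstruct` are made up for this illustration.

```scala
object PlaSketch {
  // One disconnected line segment: its length plus the heights at both ends.
  final case class Line(length: Int, leftHeight: Double, rightHeight: Double)

  // Assumption: cut the series into windows of at most `width` points and
  // keep only the first and last value of each window.
  def approximate(series: Seq[Double], width: Int): Seq[Line] = {
    require(width >= 2, "a line needs at least two points")
    series.grouped(width).map(w => Line(w.length, w.head, w.last)).toList
  }

  // Rebuild an approximate series by interpolating linearly within each line.
  def reconstruct(lines: Seq[Line]): Seq[Double] =
    lines.flatMap { l =>
      if (l.length == 1) Seq(l.leftHeight)
      else (0 until l.length).map { i =>
        l.leftHeight + (l.rightHeight - l.leftHeight) * i / (l.length - 1)
      }
    }
}
```

Each window of `width` points shrinks to three numbers, which is where the compression comes from. A series that is already piecewise linear along the window boundaries is reconstructed exactly; anything else is only approximated.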
  • 8. Symbolic Aggregate Approximation
      Basic idea:
        – Segment using a fixed frame width
        – Convert the numerical time series into an equivalent symbolic representation
        – String analysis techniques can then be used to analyze the time series
      Example symbol string: baabccbc
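A rough sketch of how such a symbol string could be produced, assuming the usual SAX recipe: z-normalize the series, average each fixed-width frame, and map each average to a letter using breakpoints that split a standard normal distribution into equally probable regions. The three-letter alphabet (matching the a/b/c example on the slide) and the names `SaxSketch`, `zNormalize` and `toSymbols` are choices made for this illustration.

```scala
object SaxSketch {
  // Breakpoints that split a standard normal distribution into three
  // (roughly) equally probable regions, giving the alphabet a, b, c.
  private val breakpoints = Seq(-0.43, 0.43)

  // Rescale the series to zero mean and unit standard deviation.
  def zNormalize(series: Seq[Double]): Seq[Double] = {
    require(series.nonEmpty, "series must not be empty")
    val mean = series.sum / series.length
    val std = math.sqrt(series.map(x => (x - mean) * (x - mean)).sum / series.length)
    if (std == 0.0) series.map(_ => 0.0) else series.map(x => (x - mean) / std)
  }

  // Average each fixed-width frame, then map the average to a symbol:
  // the symbol index is the number of breakpoints at or below the average.
  def toSymbols(series: Seq[Double], frameWidth: Int): String =
    zNormalize(series)
      .grouped(frameWidth)
      .map(frame => frame.sum / frame.length)
      .map(avg => ('a' + breakpoints.count(_ <= avg)).toChar)
      .mkString
}
```

For instance, `SaxSketch.toSymbols(Seq(0.0, 0.0, 5.0, 5.0, 10.0, 10.0), 2)` yields one symbol per frame of two points.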
  • 9. Symbolic Aggregate Approximation (figure)
  • 10. Symbolic Aggregate Approximation (figure)
  • 11. 3S Representation: Segment Symbolic Shape-Based Representation
      Basic idea:
        – A time series is decomposed into monotonic segments of variable lengths
        – Each segment is fitted to a monotonic shape and therefore represented as a symbol of an alphabet
        – Symbolic, like SAX, but the symbols also capture shape and direction
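The first step, decomposing a series into monotonic segments of variable length, could look roughly like the sketch below. It only finds the maximal rising or falling runs; fitting each run to a shape from the alphabet is not shown, and the deck does not say how the authors handle noise or flat stretches, so treat this as an assumption-laden illustration. `MonotonicSketch` and `segments` are hypothetical names.

```scala
import scala.collection.mutable.ListBuffer

object MonotonicSketch {
  // Returns (startIndex, endIndex) pairs, both inclusive. Neighbouring
  // segments share their turning point. Flat steps are absorbed into the
  // current run (an assumption of this sketch).
  def segments(series: IndexedSeq[Double]): List[(Int, Int)] = {
    val out = ListBuffer.empty[(Int, Int)]
    var start = 0
    var direction = 0 // -1 falling, +1 rising, 0 not yet known
    for (i <- 1 until series.length) {
      val step = math.signum(series(i) - series(i - 1)).toInt
      if (step != 0 && direction != 0 && step != direction) {
        out += ((start, i - 1)) // close the run at the turning point
        start = i - 1
        direction = step
      } else if (direction == 0) {
        direction = step
      }
    }
    if (series.length >= 2) out += ((start, series.length - 1))
    out.toList
  }
}
```

On real sensor data some smoothing or a tolerance would likely be needed first, since every small wiggle would otherwise start a new segment.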
  • 12. 3S Representation (f,μ,σ,θ,N): Segment Symbolic Shape-Based Representation
      Storing more than one symbol:
        – "String matching" (Levenshtein, Hamming, etc.) → INFORMATION LOSS!
        – Each segment is approximated by:
          ● μ + σ θ f(x_n)
        – Physical meaning:
          ● μ → offset
          ● σ → amplitude
          ● θ → linear drift with regard to...
          ● f → shape
          ● N → length, number of data points
  • 13. Time Series Retrieval: 3S Representation [(f,μ,σ,θ,N)_i]
      Fast and accurate matching:
        – Euclidean distance between segments
        – In constant time, i.e. independent of segment length N
          ● summation of polynomials
        – Allows for different invariances: with respect to μ, σ, N and θ
  • 14. Time Series Retrieval: fast, flexible and accurate matching using the 3S representation (figure)
  • 15. Time Series Retrieval: but what if you want to search in the history of multiple sensors?
  • 16. Storage and processing with 3S representation
  • 17. Storage and processing with 3S representation
      Use case: search time series by example across the time series of many sensors
      ● Storage and processing of sensor data hits the limits of 'traditional' database or file system based approaches, given 'enough' sensors.
      ● Technical dive into a distributed architecture:
        – with distributed storage: Apache Cassandra
        – with distributed processing: Apache Spark
      (No guarantees: the ideal architecture depends heavily on use case specifics ...)
  • 18. Distributed Storage
      ● Distributed storage advantages:
        – scalability: more nodes → more storage and I/O capacity
        – availability: more nodes → make progress during failure
        – reliability: more nodes → don't lose data on failure
      ● Many solutions are available; at Target Holding we use Apache Cassandra (C*) extensively for high-volume data
        – because, among other things, it scales well, is easy to operate, performs OK on disks (SSDs are not strictly needed) and gives easier access to data than a file system
        – it is also the storage system of choice within DDSC
  • 19. Distributed Processing
      ● Distributed processing advantages:
        – scalability: more nodes → more compute capacity
        – availability: more nodes → make progress during failure
        – reliability: more nodes → don't lose data on failure
        – with local processing being CPU bound instead of IO bound
      ● With C* as a starting point, the options are to build our own, use Hadoop M/R, or use Apache Spark, which we are investigating
        – because of its integration with C*, high-level abstraction, rich tool set and stream processing capability
        – and because it's getting a lot of traction in the Hadoop ecosystem (adoption by Cloudera, Hortonworks, MapR, Apache Mahout, etc.)
  • 20. Spark with Cassandra
      ● A typical Spark with Cassandra deployment collocates Spark Workers with Cassandra nodes (image courtesy of DataStax)
      ● This allows data locality: push-down of filtering, transformations, etc.
  • 21. Storage and processing with 3S representation
      ● Distribution is based on the time series identifier, e.g. the sensor id
        – or something which identifies the location of measurement, ...
      ● Store the tuple <f, μ, σ, θ, N> together with a timestamp
        – the full 3S time series for each sensor must fit completely on one node
      ● Goal: find the series of segments which are closest to the example (simplified for presentation)
      ● Approach: produce a global top-k out of local top-k's (also applies without the simplification)
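The "global top-k out of local top-k's" approach can be sketched with plain Scala collections, where each inner sequence stands in for the candidate matches found on one node. `TopKSketch`, `Match`, `topK` and `globalTopK` are hypothetical names for this illustration, not part of the deck.

```scala
object TopKSketch {
  // A candidate match: which sensor, where in its history, how far from the example.
  final case class Match(sensor: String, position: Int, distance: Double)

  // The k matches with the smallest distance.
  def topK(matches: Seq[Match], k: Int): Seq[Match] =
    matches.sortBy(_.distance).take(k)

  // Each partition is reduced to its own top k first, so at most k matches
  // per partition have to be communicated before the final selection.
  def globalTopK(partitions: Seq[Seq[Match]], k: Int): Seq[Match] =
    topK(partitions.flatMap(p => topK(p, k)), k)
}
```

The result equals the top k of all matches taken together, because any match in the global top k must also be in the top k of its own partition.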
  • 22. Storage and processing with 3S representation
      ● Locally find the best matches, then repeat globally
        – Parallelizes and distributes most of the computation
        – Limits IO to the communication of the local best matches
      Diagram (parallel distributed execution):
        – Read segments, possibly restricted by sensor ids and time range
        – Select best local matches
          ● Group by sensor id
          ● Create sliding window over time series
          ● Calculate distance for each window
            – Zip with example
            – Calculate euclidean distance per segment pair
            – Sum
          ● Take top k ordered by distance
        – Take top k ordered by distance
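The local step (group by sensor, slide a window over each sensor's segments, sum the per-pair distances, keep the best k) can be tried out without a cluster. In the sketch below, `Segment` carries only a subset of the 3S parameters, and `distance` is a stand-in (Euclidean distance in parameter space), not the constant-time 3S distance from the talk; all names are assumptions for this illustration.

```scala
object WindowSketch {
  // Illustrative subset of a 3S segment; field names are assumptions.
  final case class Segment(sensor: String, offset: Double, amplitude: Double, drift: Double)

  // Stand-in distance between two segments, NOT the 3S distance.
  def distance(a: Segment, b: Segment): Double =
    math.sqrt(
      math.pow(a.offset - b.offset, 2) +
      math.pow(a.amplitude - b.amplitude, 2) +
      math.pow(a.drift - b.drift, 2))

  // Returns (sensor, index of first segment in window, summed distance)
  // for the k windows closest to the example.
  def localMatches(segments: Seq[Segment], example: Seq[Segment], k: Int): Seq[(String, Int, Double)] = {
    require(example.nonEmpty, "example must contain at least one segment")
    segments.groupBy(_.sensor).toSeq.flatMap { case (sensor, segs) =>
      segs.sliding(example.length).zipWithIndex
        // drop the short window produced when a sensor has too few segments
        .filter { case (window, _) => window.length == example.length }
        .map { case (window, start) =>
          val total = window.zip(example).map { case (a, b) => distance(a, b) }.sum
          (sensor, start, total)
        }
    }.sortBy(_._3).take(k)
  }
}
```

The sketch assumes the segments of each sensor arrive in time order, which is what a table clustered by timestamp would provide.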
  • 23. Storage and processing with 3S representation
      Diagram labels:
        – Application (aka driver)
        – Master: coordinate cluster
        – Worker + C*, Worker + C*, Worker + C*, ...
        – Read segments; Select best local matches; Take top k ordered by distance
        – Parallel distributed execution
  • 25. Apache Cassandra
      ● Key-value store (with some enhancements)
      ● Based on Dynamo distribution and Bigtable local storage
      ● Partitioned (distributed) map of
        – sorted maps of
          ● primitives, structs
          ● maps, lists, sets
          ● counters (CRDT)
      Warning ...
      ● personal mental model ...
      ● the truth is in the code ...
      ● caveat emptor
  • 26. CQL
      ● Cassandra Query Language helps with working with C*
          /* 3S time series in CQL */
          CREATE TABLE symbolic (
            s text,      -- sensor identifier
            t timestamp, -- start of segment
            o float,     -- offset
            a float,     -- amplitude
            d float,     -- drift
            f int,       -- function / shape
            l int,       -- length (number of data points)
            /* partition by sensor identifier, order by timestamp */
            PRIMARY KEY ((s), t)
          );
  • 27. Distribution & Consistency
      ● Partitioning is based on the hashed key in conjunction with the positions of node tokens (image courtesy of DataStax)
      ● Consistent replication when R + W > N
        – R = number of nodes read from, W = number of nodes written to, N = replication factor
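The R + W > N rule is simple enough to check in a few lines: if the set of replicas read from and the set written to must overlap in at least one node, a read sees the latest acknowledged write. The sketch below is plain arithmetic, not Cassandra API; `ConsistencySketch`, `isStronglyConsistent` and `quorum` are names invented for this illustration.

```scala
object ConsistencySketch {
  // True when any R replicas and any W replicas out of N must share a node.
  def isStronglyConsistent(r: Int, w: Int, n: Int): Boolean = {
    require(r >= 1 && w >= 1 && r <= n && w <= n, "need 1 <= R, W <= N")
    r + w > n
  }

  // Quorum size for a given replication factor: a strict majority.
  def quorum(n: Int): Int = n / 2 + 1
}
```

With a replication factor of 3, reading and writing at quorum (2 nodes each) satisfies the rule, while reading and writing a single node each does not.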
  • 29. Apache Spark
      ● Spark is a distributed computing platform with fairly rich primitives operating on distributed data sets.
      ● Spark can be used with data from different data sources
        – HDFS, Cassandra and Elasticsearch, to name a few
      ● Spark has libraries for SQL, graph processing and machine learning
  • 30. Operator graphs
      ● Spark allows execution of DAGs of operators
        – without using disk for intermediate results
        – employing pipelining where possible
        – (cyclic / iterative data flows are not cyclic ...)
  • 31. Architecture
      ● Applications allocate CPUs and memory on Worker Nodes, coordinated by a Master
      ● Applications schedule Jobs, which are DAGs of Tasks
      ● Tasks consume & produce Resilient Distributed Datasets
  • 32. Expressive 'language'
      ● Spark is developed in Scala and also supports Java and Python.
      ● I consider Spark expressive:
          val wordCount = sc
            .textFile("hdfs://...")
            .flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
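For readers without a cluster at hand, the same word count can be approximated on ordinary Scala collections. This is only an analogy: `groupBy` followed by a sum stands in for Spark's `reduceByKey`, and `WordCountSketch` is a name made up for this example.

```scala
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1) // stands in for the shuffle behind reduceByKey
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

The shape of the pipeline is the same as in the Spark version; the difference is that Spark runs each step in parallel across partitions of the data set.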
  • 33. Storage and processing with 3S representation
      ● Locally find the best matches, then repeat globally
        – Parallelizes and distributes most of the computation
        – Limits IO to the communication of the local best matches
      Diagram (parallel distributed execution):
        – Read segments, possibly restricted by sensor ids and time range
        – Select best local matches
          ● Group by sensor id
          ● Create sliding window over time series
          ● Calculate distance for each window
            – Zip with example
            – Calculate euclidean distance per segment pair
            – Sum
          ● Take top k ordered by distance
        – Take top k ordered by distance
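  • 34. Algorithm in Spark: read segments
          import org.apache.spark.{SparkConf, SparkContext}
          import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

          // configuration setters are elided on the slide
          val conf = new SparkConf()
            .setAAA(...).setBBB(...) ...setZZZ(...)
          val sc = new SparkContext(conf)

          // the example to search for, broadcast once to every worker
          val example = sc.broadcast(Array(
            new Segment(...), ..., new Segment(...)
          ))
          val k = 10

          // Read segments
          val segments = sc
            .cassandraTable(keyspace, table)
            .map(fromRow)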
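The slide leaves `Segment` and `fromRow` undefined. One possible shape, assuming a case class that mirrors the columns of the `symbolic` table, is sketched below. To keep the example runnable without Spark or the Cassandra connector, a plain `Map[String, Any]` stands in for a database row; this is a hypothetical substitute, not the connector's row type, and the timestamp is assumed to arrive as epoch milliseconds.

```scala
object SegmentSketch {
  // Mirrors the columns s, t, o, a, d, f, l of the `symbolic` table.
  final case class Segment(s: String, t: Long, o: Float, a: Float, d: Float, f: Int, l: Int)

  // Hypothetical stand-in for a database row: column name -> value.
  type Row = Map[String, Any]

  // Throws if a column is missing or has an unexpected type.
  def fromRow(row: Row): Segment = Segment(
    s = row("s").asInstanceOf[String],
    t = row("t").asInstanceOf[Long],
    o = row("o").asInstanceOf[Float],
    a = row("a").asInstanceOf[Float],
    d = row("d").asInstanceOf[Float],
    f = row("f").asInstanceOf[Int],
    l = row("l").asInstanceOf[Int]
  )
}
```

In a real job the mapping would read typed columns from the connector's row object instead of casting values out of a map.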
  • 35. Algorithm in Spark: select best local matches (distance for each window)
          // `distance` is not defined on the slide
          val matches = segments.mapPartitions(
            _.toSeq                                   // an Iterator has no groupBy, so materialize the partition
              .groupBy(seg => seg.s)                  // Group by sensor id
              .iterator                               // mapPartitions must return an Iterator
              .flatMap({ case (s, segs) =>
                segs
                  .sliding(example.value.length)      // Create sliding window over time series
                  .map(w => (
                    s, w,
                    w.zip(example.value)              // Zip with example
                      .map(distance)                  // Calculate euclidean distance per segment pair
                      .map(math.abs)
                      .sum                            // Sum
                  ))
              }))
  • 36. Algorithm in Spark: take top k ordered by distance
          // takeOrdered takes the top k within each partition, then merges those into a global top k
          val top = matches
            .takeOrdered(k)(Ordering.by(_._3))