The massive amounts of time series data that are continuously generated and collected call for distributed, large-scale time series processing platforms. Indexing plays a critical role in speeding up the time series similarity queries on which most of these systems rely. However, state-of-the-art techniques, including the widely adopted iSAX-based indexes, fall short in leveraging the parallel power of modern distributed systems to efficiently construct an index over billions of time series (TBs of data).
We propose an indexing framework based on Apache Spark, built on a novel index tree and an associated new signature, to index and query billion-scale time series. The framework is composed of a global centralized index and local distributed indexes. The new index not only significantly reduces the depth and size of the index tree, but also preserves similarity relationships more effectively than existing techniques. We conducted extensive experiments on both synthetic and real-world datasets.
Evaluation results demonstrate that, over a dataset of 1 billion time series, construction of the un-clustered index is about 60% faster than with state-of-the-art systems, whereas construction of the clustered index is 83% faster.
Moreover, the average response time of Exact-Match queries is decreased by 50%, and the accuracy of the kNN-Approximate queries has increased from 3% to 40% compared to existing techniques.
Spark-ITS: Indexing for Large-Scale Time Series Data on Spark with Liang Zhang
1. Spark-ITS: Indexing for Large-Scale Time Series Data on Spark
Liang Zhang (lzhang6@wpi.edu)
Data Science Dept., Worcester Polytechnic Institute
#SAISEco5
2. Prof. Elke A. Rundensteiner
Prof. Mohamed Y. Eltabakh
Liang Zhang
Noura Alghamdi
Data Science Research Group @ Worcester Polytechnic Institute
Liang Zhang, Noura Alghamdi, Mohamed Y. Eltabakh, Elke A. Rundensteiner. "TARDIS: Distributed Indexing Framework for Big Time Series Data." Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE), 2019.
4. Time Series are Continuously Produced Everywhere
• How to deal with billions of time series?
Examples: Climate data, Web log, Stock price, EEG
5. Almost all Time Series Data Mining Tasks rely on Similarity Query
Tasks: Clustering, Motif Discovery, Classification, Outlier Detection
Similarity query types: Whole Matching, Subsequence Matching
Esling, Philippe, and Carlos Agon. "Time-series data mining." ACM Computing Surveys (CSUR) 45.1 (2012): 12.
6. Spark-ITS
• A new index tree and an effective signature that simplify cardinality conversion and better preserve similarity
• A distributed index framework to support large-scale time series datasets
• Efficient algorithms for processing Exact Match and kNN Approximate queries
7. Spark-ITS Overview
[Diagram components: Global Index, Indexed Data, Local Index, Partition, Query]
Global Index
1. Sampling
2. Node Statistic
3. Build Index Tree
4. Assign Partition ID
Indexed Data
1. Read and convert data
2. Shuffle data
Local Index
1. Construct Local Structure
2. Construct Bloom Filter
8. Background: iSAX Representation
PAA: Piecewise Aggregate Approximation
iSAX: indexable Symbolic Aggregate approXimation
• A time series of length 16
• PAA representation with 4 segments
• SAX representation with 4 segments and cardinality 4: [11, 10, 01, 00]
• iSAX representation with 4 segments and variable cardinality: [1_2, 1_2, 01_4, 0_2] (the subscript is the cardinality of each character)
Shieh, Jin, and Eamonn Keogh. "iSAX: indexing and mining terabyte sized time series." SIGKDD, ACM, 2008.
Camerra, A., Palpanas, T., Shieh, J., and Keogh, E. "iSAX 2.0: Indexing and mining one billion time series." ICDM, 2010.
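The conversion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Spark-ITS implementation: the function names `paa`, `sax`, and `demote` are hypothetical, the input series is invented so that the result reproduces the slide's example word, and the breakpoints are the usual SAX breakpoints for cardinality 4 (quartiles of a standard normal distribution).

```python
from bisect import bisect_right
from statistics import mean

# Breakpoints that split a standard normal distribution into 4 equiprobable
# regions (cardinality 4), as commonly used for SAX.
BREAKPOINTS_CARD4 = [-0.6745, 0.0, 0.6745]

def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be divisible by the segment count")
    width = len(series) // segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(segments)]

def sax(paa_values, breakpoints, bits):
    """Map each PAA value to the index of its region, written as a bit string."""
    return [format(bisect_right(breakpoints, v), f"0{bits}b") for v in paa_values]

def demote(symbol, bits):
    """Lower one character's cardinality by keeping only its leading bits."""
    return symbol[:bits]

# Hypothetical series of length 16 (not taken from the slides).
series = [1.5, 1.2, 1.0, 0.9, 0.6, 0.4, 0.3, 0.1,
          -0.1, -0.2, -0.4, -0.5, -0.9, -1.1, -1.3, -1.6]

paa_values = paa(series, 4)                       # 4 segment means
word = sax(paa_values, BREAKPOINTS_CARD4, bits=2)  # SAX word, cardinality 4
# iSAX word with variable cardinality: only the third character keeps 2 bits.
isax_word = [demote(word[0], 1), demote(word[1], 1), word[2], demote(word[3], 1)]
print(word, isax_word)
```

Because a lower-cardinality character is just a prefix of the higher-cardinality one, demotion never requires going back to the raw series.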
9. Word-level Similarity
[Figure: two panels plotting time series A, B, and C (x-axis 1 to 11, y-axis -2 to 2) against the SAX regions 000 to 111]
State-of-the-art: Character-level Similarity
A: [0_1, 0_1, 011_3, 1_1]
B: [0_1, 0_1, 010_3, 1_1]
C: [0_1, 0_1, 010_3, 1_1]
B and C are similar
Proposed: Word-level Similarity
A: [01_2, 01_2, 01_2, 10_2]
B: [00_2, 00_2, 01_2, 11_2]
C: [01_2, 01_2, 01_2, 10_2]
A and C are similar
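The difference between the two panels can be reproduced with a short Python sketch. It assumes the subscripts on this slide are bit counts per character, and it merges the two panels into one table that keeps the finest resolution shown for each segment; the helper names `signature` and `group` are hypothetical.

```python
from collections import defaultdict

# Characters from the slide, at the finest resolution shown for each segment.
FULL = {
    "A": ["01", "01", "011", "10"],
    "B": ["00", "00", "010", "11"],
    "C": ["01", "01", "010", "10"],
}

def signature(symbols, bits_per_segment):
    """Keep the leading bits of each character (iSAX cardinality reduction)."""
    return tuple(s[:b] for s, b in zip(symbols, bits_per_segment))

def group(bits_per_segment):
    """Group series names that share a signature at the given resolution."""
    groups = defaultdict(list)
    for name, symbols in FULL.items():
        groups[signature(symbols, bits_per_segment)].append(name)
    return sorted(groups.values())

# Character-level: only the third character has been split down to 3 bits.
char_level = group([1, 1, 3, 1])
# Word-level: every character is refined together, here to 2 bits each.
word_level = group([2, 2, 2, 2])
print(char_level, word_level)
```

With the character-level resolution, B and C land in the same group even though they differ in three of four segments; with the word-level resolution, A and C are grouped, which matches the captions on the slide.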
20. Exact Matching Query
[Diagram: Query → Master (Global Index) → Pid: 6 → Worker, Partition 6: Bloom Filter (Exist? Yes / No) → Local Index → Records in leaf node → euDist = 0. Example labels: 000*, 001*, 001*; iSAX-T: 002.]
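The exact-match flow in this diagram can be sketched in plain Python. Several simplifications here are assumptions rather than details from the talk: the global index is modeled as a dictionary from signature to partition ID instead of a tree, the Bloom filter is keyed on the signature string, a "No" from the filter returns an empty result, and the class and function names are hypothetical.

```python
import hashlib
import math

class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, key):
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def eu_dist(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Partition:
    """Worker-side state: a Bloom filter plus a signature -> records map
    standing in for the leaf nodes of the local index (an assumption)."""
    def __init__(self):
        self.bloom = BloomFilter()
        self.leaves = {}

    def insert(self, sig, rid, series):
        self.bloom.add(sig)
        self.leaves.setdefault(sig, []).append((rid, series))

    def exact_match(self, sig, query):
        if not self.bloom.might_contain(sig):   # "Exist?" -> No: skip the index
            return []
        # "Exist?" -> Yes: scan the leaf and keep records with euDist = 0.
        return [rid for rid, s in self.leaves.get(sig, []) if eu_dist(s, query) == 0]

def exact_match(global_index, partitions, sig, query):
    """Master side: resolve the partition ID, then delegate to that worker."""
    pid = global_index.get(sig)
    return [] if pid is None else partitions[pid].exact_match(sig, query)

partitions = {6: Partition()}
partitions[6].insert("002", rid=1, series=[0.1, 0.2, 0.3])
partitions[6].insert("002", rid=2, series=[0.1, 0.2, 0.4])
global_index = {"002": 6}
hits = exact_match(global_index, partitions, "002", [0.1, 0.2, 0.3])
misses = exact_match(global_index, partitions, "002", [9.0, 9.0, 9.0])
print(hits, misses)
```

The Bloom filter only saves work for queries whose signature is absent from the partition; a positive answer still has to be confirmed against the records in the leaf.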
21. KNN Approximate Query: One Partition Access
[Diagram: Query → Master (iSAX-T Skeleton) → Pid: 6 → Worker, Partition 6 (Local Index) → (euDist, rid). Example labels: 000*, 001*, 001*; iSAX-T: 002.]
Records in leaf/internal node:
1. euDist
2. sort
3. Top(K) dist as threshold
Records in leaf/internal node:
1. euDist
2. sort
3. take Top(K)
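The two stages listed on this slide (compute euDist, sort, use the K-th distance as a threshold, then take the Top(K)) might look like the following sketch. It is an illustration under assumptions: nodes are plain lists of `(rid, series)` pairs, the threshold is applied to exact distances rather than to a lower bound on node distance, and the function names are hypothetical.

```python
import heapq
import math

def eu_dist(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_in_node(records, query, k):
    """Steps 1-3 for one node: euDist, sort, take Top(K) as (euDist, rid)."""
    scored = ((eu_dist(series, query), rid) for rid, series in records)
    return heapq.nsmallest(k, scored)  # returned in ascending order

def knn_one_partition(target_node, other_nodes, query, k):
    # Stage 1: the K-th best distance in the target node becomes the threshold.
    first = knn_in_node(target_node, query, k)
    threshold = first[-1][0] if len(first) == k else math.inf
    # Stage 2: other nodes in the same partition contribute only records
    # that are at least as close as the threshold.
    candidates = list(first)
    for node in other_nodes:
        for rid, series in node:
            d = eu_dist(series, query)
            if d <= threshold:
                candidates.append((d, rid))
    return heapq.nsmallest(k, candidates)

target_node = [(1, [0.0, 0.0]), (2, [3.0, 4.0])]
other_nodes = [[(3, [1.0, 0.0])], [(4, [10.0, 10.0])]]
result = knn_one_partition(target_node, other_nodes, [0.0, 0.0], k=2)
print(result)
```

The answer is approximate with respect to the whole dataset because only one partition is consulted, even though the search inside that partition is exhaustive in this sketch.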
22. KNN Approximate Query: Multi-Partitions Access
[Diagram: Query → Master (iSAX-T Skeleton) → Partition 6 and Sibling Pid List → Workers (one Local Index each) → (euDist, rid). Example labels: 000*, 001*, 001*; iSAX-T: 002.]
Records in leaf/internal node:
1. euDist
2. sort
3. Top(K) dist as threshold
Records in leaf/internal node:
1. euDist
2. sort
3. take Top(K)
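Extending the search to sibling partitions can be sketched as a merge of per-partition results. The sketch below assumes each partition is a plain list of `(rid, series)` pairs, that the threshold from the home partition is forwarded to the siblings, and that the master merges the `(euDist, rid)` lists it receives; the function names and the sequential loop (in place of distributed workers) are illustrative only.

```python
import heapq
import math

def eu_dist(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_top_k(records, query, k, threshold=math.inf):
    """Worker side: euDist, sort, take Top(K), ignoring records past the threshold."""
    scored = ((eu_dist(series, query), rid) for rid, series in records)
    return heapq.nsmallest(k, (pair for pair in scored if pair[0] <= threshold))

def knn_multi_partition(partitions, home_pid, sibling_pids, query, k):
    # The home partition supplies the initial Top(K) and the threshold.
    home = local_top_k(partitions[home_pid], query, k)
    threshold = home[-1][0] if len(home) == k else math.inf
    # Each sibling partition returns its own sorted (euDist, rid) list.
    results = [home]
    for pid in sibling_pids:
        results.append(local_top_k(partitions[pid], query, k, threshold))
    # Master side: merge the sorted lists and keep the global Top(K).
    return heapq.nsmallest(k, heapq.merge(*results))

partitions = {
    6: [(1, [0.0]), (2, [5.0])],
    7: [(3, [1.0]), (4, [9.0])],
    8: [(5, [2.0])],
}
result = knn_multi_partition(partitions, home_pid=6, sibling_pids=[7, 8],
                             query=[0.0], k=2)
print(result)
```

Forwarding the threshold keeps the lists sent back to the master short, since a record farther away than the current K-th distance can never enter the final answer.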
24. Experimental Setup
Datasets (normalized; each point is stored as a float):
Dataset        Size          Length   Source
Random Walk    1 billion     256
Texmex         1 billion     128      http://corpus-texmex.irisa.fr/
DNA            200 million   192      https://genmone.ucsc.edu
Noaa Climate   200 million   64       https://www.ncdc.gov/
HW & SW Configuration:
Spark      2.0.2, Standalone mode
Hadoop     2.7.3
Platform   Ubuntu 16.04 LTS
HW         2 nodes, each consisting of 56 Xeon E5 processors, 500 GB RAM, and a 7 TB SATA hard drive
Baseline (State-of-the-Art): Yagoubi, Djamel-Edine, et al. "DPiSAX: Massively Distributed Partitioned iSAX." ICDM 2017.
Parameter                                  Baseline   Spark-ITS
Initial cardinality                        512        64
Word length                                8          8
Sampling percent                           10%        10%
Leaf node split threshold of local index   1000       1000
The initial cardinality of the baseline system is its default value; the baseline needs a large initial cardinality to guarantee enough bit levels for binary splits.
25. Index Construction Time
[Figure: stacked bars of Global Index and Local Index construction time (minutes, axis 0 to 2,000) versus #Time Series (200m, 400m, 600m, 800m, 1b) for Spark-ITS and the Baseline (State-of-the-Art). Labeled values: 2,323 and 334. Annotation: 80+%.]
Dataset: Random Walk Benchmark
26. Index Construction Time: Breakdown
[Figure, panel 1: Global Index Time Breakdown. Stacked bars (minutes, axis 0 to 40) for Spark-ITS and the Baseline (State-of-the-Art) at 200m, 400m, 600m, 800m, and 1b time series. Components: Sampling, Statistic, Build Index, Assign Pid.]
[Figure, panel 2: Repartition and Local Index Time Breakdown. Stacked bars (minutes, axis 0 to 2,000) for Spark-ITS and the Baseline (State-of-the-Art) at 200m, 400m, 600m, 800m, and 1b time series. Components: iSAX Read and Conversion, Shuffle and Build Local Index.]
Dataset: Random Walk Benchmark
29. Conclusion
• Index Tree
– Large fan-out decreases the depth of leaf nodes
– Better preserves similarity at the word level
– The signature simplifies cardinality conversion
• Spark-ITS: Index Construction
– Block sampling and node statistics collection to build the global index quickly
– Synchronously builds local indices within a partition
– Constructs the index 80+% faster
• Spark-ITS: Query
– Exact Matching: query time decreases by 50%
– kNN Approximate: accuracy increases more than 10-fold
30. Acknowledge Funding from...
Xianjin Tech Co., Ltd.
Saudi Arabian Cultural Mission
WPI Computer Science Dept.,
NSF CNS: 305258 II-EN
NSF CRI: 0551584