Model based similarity measure in time cloud


The presentation will be delivered by Thanh-Nguyen Ngo at the 14th Asia-Pacific Web Conference (APWeb) on April 12th, 2012 in Kunming, China.


This paper presents a new approach to measuring similarity over massive time-series data. Our approach is built on two principles. The first is to parallelize the large amount of computation using a scalable cloud serving system called TimeCloud. The second is to benefit from the filter-and-refinement approach to query processing: similarity computation is performed efficiently over approximated data at the filter step, and the subsequent refinement step measures precise similarities for only the small number of candidates that result from the filtering. To this end, we establish a set of firm theoretical backgrounds, as well as techniques for processing kNN queries. Our experimental results suggest that the proposed approach is efficient and scalable.



  1. Model-Based Similarity Measure in TimeCloud. Thanh-Nguyen Ngo, Hoyoung Jeung, Karl Aberer. LSIR – IC – EPFL. February 2012
  2. Outline: Motivation; Model-Based Time-Series; Model-Based Similarity Measure; kNN Processing; Experiments; Conclusion
  3. Motivation: The demand for storing and processing massive time-series in the cloud is growing rapidly. Measuring similarity is a fundamental operation in a wide range of applications that process temporally ordered data. Computing similar time-series over a large volume of data remains a difficult problem.
  4. Model-Based Time-Series. Definition (Time-Series): A time-series $t$ of length $n$ is a temporally ordered sequence $t = [t_1, \ldots, t_n]$, where each point in time $i$ is mapped to a $d$-dimensional attribute vector $t_i = (t_{i1}, \ldots, t_{id})$ of values $t_{ij}$ with $j \in \{1, \ldots, d\}$. A time-series is called univariate for $d = 1$ and multivariate for $d > 1$.
  5. Model-Based Time-Series. Definition (Common Points): Two points of two time-series are called common if they occur at the same time. Definition (Common Interval): The common interval of two segments or two time-series is the greatest interval $[a, b]$ such that the times $a$ and $b$ belong to both segments or time-series. Two segments limited by the common interval are called common segments.
  6. Model-Based Similarity Measure. Definition (Euclidean Distance): The Euclidean distance between two time-series is the Euclidean distance of their common segments $s = [s_1, \ldots, s_n]$ and $t = [t_1, \ldots, t_n]$ of length $n$, and it is defined as $\mathrm{Eucl}(s, t) = \sqrt{\sum_{i=1}^{n} (s_i - t_i)^2}$.
  7. Model-Based Similarity Measure. Definition (Maximum Error Bound of Time-Series): Given a time-series $t = [t_1, \ldots, t_n]$ and its representation $t' = [t'_1, \ldots, t'_n]$ in its model, the maximum error bound of $t$ over its model is a value $\mathrm{meb}(t)$ such that $|t_i - t'_i| \le \mathrm{meb}(t)$ for all $i = 1, \ldots, n$.
  8. Model-Based Similarity Measure. Theorem: Given two time-series $s$, $t$ and their representations $s'$, $t'$ in their models, assume the common segments of $s$ and $t$ have $n$ time-series points. Then $|\mathrm{Eucl}(s, t) - \mathrm{Eucl}(s', t')| \le \sqrt{n}\,(\mathrm{meb}(s) + \mathrm{meb}(t))$.
  9. kNN Processing - The Filter Stage. Theorem: Let $t'_i$ and $q'$ be the representations of $t_i$ and $q$ in their models, respectively. Let $d_i$ be the distance between $t'_i$ and $q'$, with maximum error $e_i$. Let $a_i = d_i - e_i$ and $b_i = d_i + e_i$. Without loss of generality, assume $b_1 \le \ldots \le b_n$. The candidate set $S = \{t_i \mid a_i \le b_k\}$ contains the $k$ nearest time-series of $q$ and is minimal.
  10. kNN Processing - The Refinement Stage. Theorem: Let $t'_i$ and $q'$ be the representations of $t_i$ and $q$ in their models, respectively. Let $d_i$ be the distance between $t'_i$ and $q'$, with maximum error $e_i$. Let $a_i = d_i - e_i$ and $b_i = d_i + e_i$. Without loss of generality, assume $a_1 \le \ldots \le a_m$. The set $R = \{t_i \mid b_i \le a_{m-k+1}\}$ is a subset of the result set.
  11. Experiments. Setup: 2.4 GHz Intel Core2 Quad CPU; Java implementation; Ubuntu 10.10. Default parameters: length of time series: 512; number of nearest neighbors: 10; error ratio: 3%; number of time series: 1,000
  12. Model-Based View Construction
  13. Effect of Maximum Error Ratios
  14. Effect of Number of Nearest Neighbors
  15. Effect of Number of Time Series
  16. Conclusion: Process kNN queries based on model-based similarity measures. Establish a set of theoretical foundations for approximated time-series data processing. Build query processing mechanisms on the filter-and-refine approach. Run more than three times faster than straightforward processing. Facilitate scalability of the computation using the TimeCloud system.
  17. Questions?
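
The definitions of common points, common interval, and Euclidean distance (slides 5 and 6) can be illustrated with a minimal Python sketch. It assumes univariate series stored as dictionaries from timestamp to value, which is a representation chosen here for illustration and not taken from the slides; the names `common_interval` and `eucl` are likewise hypothetical.

```python
import math

def common_interval(s, t):
    """Return (a, b): the earliest and latest timestamps present in both
    series, or None if the series share no timestamp."""
    shared = set(s) & set(t)
    if not shared:
        return None
    return min(shared), max(shared)

def eucl(s, t):
    """Euclidean distance over the common points of s and t.

    Assumption: inside the common interval both series are sampled at the
    same timestamps, so the common points form the common segments."""
    shared = sorted(set(s) & set(t))
    if not shared:
        raise ValueError("series share no common point")
    return math.sqrt(sum((s[ts] - t[ts]) ** 2 for ts in shared))

# Two univariate series keyed by timestamp (illustrative values only).
s = {1: 0.0, 2: 1.0, 3: 2.0, 4: 5.0}
t = {2: 1.0, 3: 5.0, 4: 1.0, 5: 9.0}

interval = common_interval(s, t)   # timestamps 2..4 occur in both series
distance = eucl(s, t)              # sqrt(0^2 + 3^2 + 4^2)
```

Every common point lies inside the common interval by construction, so no separate clipping step is needed in this sketch.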
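
The maximum error bound and the distance theorem (slides 7 and 8) can be checked numerically on a small example. The slides do not say which model is used, so this sketch assumes a piecewise-constant model in which each block of points is replaced by its mean; `block_means` and the sample values are hypothetical.

```python
import math

def eucl(s, t):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(s, t)))

def block_means(series, width):
    """Assumed model: replace each block of `width` points by its mean."""
    out = []
    for start in range(0, len(series), width):
        block = series[start:start + width]
        out.extend([sum(block) / len(block)] * len(block))
    return out

def meb(series, model):
    """Tightest maximum error bound: the largest pointwise deviation."""
    return max(abs(x - y) for x, y in zip(series, model))

s = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
t = [2.0, 2.0, 7.0, 1.0, 3.0, 9.0, 4.0, 4.0]
s_model = block_means(s, 4)
t_model = block_means(t, 4)

# Left-hand side of the theorem: gap between exact and model-based distance.
gap = abs(eucl(s, t) - eucl(s_model, t_model))
# Right-hand side: sqrt(n) * (meb(s) + meb(t)).
bound = math.sqrt(len(s)) * (meb(s, s_model) + meb(t, t_model))
```

Here `gap` stays below `bound`, as the theorem requires. Any value larger than the one returned by `meb` is also a valid maximum error bound, only a looser one.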
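
The filter-and-refine processing of kNN queries (slides 9 and 10) can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' Java implementation: the model is an assumed piecewise-constant one, the error `e_i` is taken from the bound on slide 8, and the refinement simply recomputes exact distances for the candidates instead of applying the early-acceptance set `R`. All function names are hypothetical.

```python
import math
import random

def eucl(s, t):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(s, t)))

def block_means(series, width):
    """Assumed model: replace each block of `width` points by its mean."""
    out = []
    for start in range(0, len(series), width):
        block = series[start:start + width]
        out.extend([sum(block) / len(block)] * len(block))
    return out

def meb(series, model):
    """Largest pointwise deviation between a series and its model."""
    return max(abs(x - y) for x, y in zip(series, model))

def knn_filter(model_dists, errors, k):
    """Filter stage: keep index i when a_i <= k-th smallest b_j."""
    lower = [d - e for d, e in zip(model_dists, errors)]
    upper = [d + e for d, e in zip(model_dists, errors)]
    kth_upper = sorted(upper)[k - 1]
    return [i for i, a in enumerate(lower) if a <= kth_upper]

def knn_refine(query, series, candidates, k):
    """Refinement stage (simplified): exact distances for candidates only."""
    return sorted(candidates, key=lambda i: eucl(query, series[i]))[:k]

# Synthetic data: series i is a constant offset i plus small noise.
rng = random.Random(0)
n, k = 16, 3

def noisy(offset):
    return [offset + rng.uniform(-0.05, 0.05) for _ in range(n)]

query = noisy(0.0)
series = [noisy(float(i)) for i in range(20)]

q_model = block_means(query, 4)
models = [block_means(t, 4) for t in series]
model_dists = [eucl(q_model, m) for m in models]
# e_i = sqrt(n) * (meb(t_i) + meb(q)), following the theorem on slide 8.
errors = [math.sqrt(n) * (meb(t, m) + meb(query, q_model))
          for t, m in zip(series, models)]

candidates = knn_filter(model_dists, errors, k)
result = knn_refine(query, series, candidates, k)
brute_force = sorted(range(len(series)),
                     key=lambda i: eucl(query, series[i]))[:k]
```

On this data the filter keeps only three of the twenty series, and the refined result matches the brute-force answer, so exact distances are computed for the candidates alone.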