Vasilis-DaMn05.ppt

1. Clustering and Partitioning for Spatial and Temporal Data Mining
   Vasilis Megalooikonomou
   Data Engineering Laboratory (DEnLab)
   Dept. of Computer and Information Sciences, Temple University, Philadelphia, PA
   www.cis.temple.edu/~vasilis

2. Outline
   • Introduction
     - Motivation – problems in the spatial domain and the time domain
     - Challenges
   • Spatial data
     - Partitioning and clustering
     - Detection of discriminative patterns
     - Results
   • Temporal data
     - Partitioning
     - Vector quantization
     - Results
   • Conclusions – discussion

3. Introduction
   • Large spatial and temporal databases
   • Meta-analysis of data pooled from multiple studies
   • Goal: to understand patterns and discover associations, regularities, and anomalies in spatial and temporal data

4. Problem
   • Spatial data mining: given a large collection of spatial data (e.g., 2D or 3D images) and other data, find interesting things, i.e.:
     - associations among image data, or among image and non-image data
     - discriminative areas among groups of images
     - rules/patterns
     - images similar to a query image (queries by content)

5. Challenges
   • How to apply data mining techniques to images?
   • Learning from images directly
   • Heterogeneity and variability of image data
   • Preprocessing (segmentation, spatial normalization, etc.)
   • Exploiting the high correlation between neighboring objects
   • Large dimensionality
   • Complexity of associations
   • Efficient management of topological/distance information
   • Spatial knowledge representation / Spatial Access Methods (SAMs)

6. Example: Association Mining – Spatial Data
   • Discover associations among spatial and non-spatial data:
     - Images {i_1, i_2, …, i_L}
     - Spatial regions {s_1, s_2, …, s_K}
     - Non-spatial variables {c_1, c_2, …, c_M}
   [Figure: example images i_1, …, i_7, each annotated with non-spatial variables such as c_1, c_2, c_3, c_7, c_9, c_6]

7. Example: fMRI Contrast Maps
   [Figure: fMRI contrast maps for a control subject and a patient]

8. Applications
   • Medical imaging, bioinformatics, geography, meteorology, etc.

9. Voxel-based Analysis
   • No model is imposed on the image data
   • Each voxel's changes are analyzed independently; a map of statistical significance is built
   • Discriminatory significance is measured by statistical tests (t-test, ranksum test, F-test, etc.)
   • Statistical Parametric Mapping (SPM)
   • Significance of associations is measured by the chi-squared test or Fisher's exact test (a contingency table for each pair of variables)
   • Cluster voxels by findings
   [V. Megalooikonomou, C. Davatzikos, E. Herskovits, SIGKDD 1999]

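To make the voxel-wise idea concrete, here is a minimal sketch (our own illustration, not the authors' pipeline) that t-tests every voxel across two groups and thresholds the resulting p-map; the array shapes and the synthetic data are assumptions. Testing every voxel is exactly what creates the multiple comparison problem addressed on later slides.

```python
import numpy as np
from scipy import stats

def voxelwise_ttest(controls, patients, alpha=0.05):
    """controls, patients: arrays of shape (n_subjects, x, y, z)."""
    # Independent two-sample t-test across subjects, at every voxel.
    t_map, p_map = stats.ttest_ind(controls, patients, axis=0)
    return t_map, p_map, p_map < alpha   # binary map of significant voxels

# Synthetic example: 10 controls vs. 12 patients, 16x16x8 volumes.
rng = np.random.default_rng(0)
controls = rng.normal(0.0, 1.0, size=(10, 16, 16, 8))
patients = rng.normal(0.3, 1.0, size=(12, 16, 16, 8))
_, _, significant = voxelwise_ttest(controls, patients)
print("voxels flagged:", int(significant.sum()), "of", significant.size)
```
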
10. Analysis by Grouping of Voxels
   • Grouping of voxels (atlas-based):
     - Prior knowledge increases sensitivity
     - Data reduction: 10^7 voxels → R regions (structures)
     - Map an ROI onto at least one region
     - Only as good as the atlas being used
   • Analysis, with M non-spatial variables and R regions:
     - Categorical structural variables: M × R contingency tables, chi-square / Fisher's exact test (multiple comparison problem); log-linear analysis, multivariate Bayesian
     - Continuous structural variables: logistic regression, Mann-Whitney

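A hedged sketch of the categorical, region-level analysis: for one (variable, region) pair, build a 2×2 contingency table and apply the chi-square and Fisher's exact tests; repeating this over all M × R pairs is what raises the multiple comparison problem noted above. The variable names and the binary coding are assumptions.

```python
import numpy as np
from scipy import stats

def region_association(var_labels, region_findings):
    """var_labels, region_findings: binary arrays of shape (n_subjects,)."""
    table = np.zeros((2, 2), dtype=int)
    for v, r in zip(var_labels, region_findings):
        table[v, r] += 1                         # 2x2 contingency table
    chi2, p_chi2, _, _ = stats.chi2_contingency(table)
    _, p_fisher = stats.fisher_exact(table)      # preferable for small counts
    return p_chi2, p_fisher
```
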
11.–14. Dynamic Recursive Partitioning
   • Adaptive partitioning of a 3D volume
   • Partitioning criterion:
     - discriminative power of feature(s) of the hyper-rectangle, and
     - size of the hyper-rectangle

15. Dynamic Recursive Partitioning (cont.)
   • Extract features from the discriminative regions
   • Reduces the multiple comparison problem (# tests = # partitions < # voxels)
   • Tests are downward closed: sub-regions are examined only when the parent region is promising
   [V. Megalooikonomou, D. Pokrajac, A. Lazarevic, and Z. Obradovic, SPIE Conference on Visualization and Data Analysis, 2002]

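The following is a minimal sketch of the DRP recursion under stated assumptions (the per-region feature is the mean intensity, splits are into octants, and the t-test is the criterion); it illustrates the idea, not the authors' exact algorithm. A region is split further only while it looks discriminative, so insignificant branches are pruned and far fewer tests are run than in voxel-wise analysis.

```python
import numpy as np
from scipy import stats

def drp(controls, patients, box, depth=0, max_depth=3, alpha=0.05, found=None):
    """controls/patients: (n_subjects, x, y, z); box: ((x0,x1),(y0,y1),(z0,z1))."""
    if found is None:
        found = []
    (x0, x1), (y0, y1), (z0, z1) = box
    # Feature of the hyper-rectangle: mean intensity per subject.
    fc = controls[:, x0:x1, y0:y1, z0:z1].mean(axis=(1, 2, 3))
    fp = patients[:, x0:x1, y0:y1, z0:z1].mean(axis=(1, 2, 3))
    _, p = stats.ttest_ind(fc, fp)
    if p >= alpha:
        return found                  # prune: sub-regions are never tested
    if depth == max_depth or min(x1 - x0, y1 - y0, z1 - z0) < 2:
        found.append((box, p))        # discriminative region at the leaf
        return found
    mx, my, mz = (x0 + x1) // 2, (y0 + y1) // 2, (z0 + z1) // 2
    for xs in ((x0, mx), (mx, x1)):   # recurse into the eight octants
        for ys in ((y0, my), (my, y1)):
            for zs in ((z0, mz), (mz, z1)):
                drp(controls, patients, (xs, ys, zs),
                    depth + 1, max_depth, alpha, found)
    return found
```
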
16. Other Methods for Spatial Data Classification
   • Distributional distances:
     - Mahalanobis distance
     - Kullback-Leibler divergence (parametric, non-parametric)
   • Maximum likelihood:
     - Estimate probability densities and compute the likelihood
     - EM (Expectation-Maximization) method to model spatial regions using a base function (Gaussian)
   • Static partitioning:
     - Reduces the number of attributes compared to voxel-wise analysis
     - Space is partitioned into 3D hyper-rectangles (variables: properties of the voxels inside each hyper-rectangle); the discretization is increased incrementally
   Distinguishing among distributions: D. Pokrajac, V. Megalooikonomou, A. Lazarevic, D. Kontos, Z. Obradovic, Artificial Intelligence in Medicine, Vol. 33, No. 3, pp. 261-280, Mar. 2005.

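To illustrate the distributional-distance option, a short sketch with the closed-form Gaussian versions of both measures; the modelling choice (Gaussians) and the names are ours.

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of point x from a distribution N(mu, cov)."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) ) for k-dim Gaussians."""
    k = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    d = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + d @ inv1 @ d - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))
```
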
17. Experimental Results
   [Figure: areas discovered by DRP with the t-test (significance threshold = 0.05, maximum tree depth = 3); the colorbar shows significance]
   [D. Kontos, V. Megalooikonomou, D. Pokrajac, A. Lazarevic, Z. Obradovic, O. B. Boyko, J. Ford, F. Makedon, A. J. Saykin, MICCAI 2004]

   Comparison of the number of tests performed:

   Threshold   Depth   DRP     Voxel-wise
   0.05        3          569   201774
   0.05        4         4425   201774
   0.01        4         4665   201774

   Classification accuracy (%):

   Method                         Criterion     Threshold   Tree depth   Controls   Patients   Total
   DRP                            correlation   0.4         3            82         93         88
   DRP                            t-test        0.05        3            89         100        94
   DRP                            t-test        0.05        4            84         100        92
   DRP                            t-test        0.01        4            87         100        93
   DRP                            ranksum       0.05        3            87         100        93
   DRP                            ranksum       0.05        4            80         100        90
   DRP                            ranksum       0.01        4            87         96         91
   Maximum Likelihood / EM        –             –           –            77         67         72
   Maximum Likelihood / k-means   –             –           –            77         83         80
   Kullback-Leibler / EM          –             –           –            79         57         68
   Kullback-Leibler / k-means     –             –           –            77         66         71

18. Experimental Results
   • Impact:
     - Assists in the interpretation of images (e.g., facilitating diagnosis)
     - Enables researchers to integrate, manipulate, and analyze large volumes of image data
   [Figure: discriminative sub-regions detected when applying (a) DRP and (b) voxel-wise analysis, with the ranksum test and significance threshold 0.05, to the real fMRI volume data]

19. Time Sequence Analysis
   Time sequence: a sequence (ordered collection) of real values, X = x_1, x_2, …, x_n
   • Time series data abound in many applications
   • Challenges:
     - High dimensionality
     - Large number of sequences
     - Definition of a similarity metric
   • Similarity analysis (e.g., find stocks that behave like IBM's)
   • Goals: high accuracy (and high speed) in similarity searches among time series and in discovering interesting patterns
   • Applications: clustering, classification, similarity searches, summarization

20. Dimensionality Reduction Techniques
   • DFT: Discrete Fourier Transform
   • DWT: Discrete Wavelet Transform
   • SVD: Singular Value Decomposition
   • APCA: Adaptive Piecewise Constant Approximation
   • PAA: Piecewise Aggregate Approximation
   • SAX: Symbolic Aggregate approXimation
   • …

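As a concrete instance of the first technique listed, a tiny DFT-based reduction (our own illustration): keep only the first k Fourier coefficients as the low-dimensional representation, since distances between coefficient vectors approximate distances between the original series.

```python
import numpy as np

def dft_reduce(x, k=8):
    """Represent a series by its first k (low-frequency) DFT coefficients."""
    return np.fft.rfft(x)[:k]

x = np.sin(np.linspace(0, 6 * np.pi, 128))
y = np.sin(np.linspace(0, 6 * np.pi, 128) + 0.2)
# Compare series in the reduced k-dimensional space instead of 128-d.
print(np.linalg.norm(dft_reduce(x) - dft_reduce(y)))
```
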
21. Similarity Distances for Time Series
   • Euclidean distance: most common, but sensitive to shifts
   • Dynamic Time Warping (DTW): improves accuracy but is slow, O(n^2)
   • Envelope-based DTW: faster, O(n)
   A more intuitive idea: two series should be considered similar if they have enough non-overlapping, time-ordered pairs of similar subsequences (Agrawal et al., VLDB 1995)

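For reference, the textbook O(n²) DTW recurrence (a generic implementation, not code from the talk):

```python
import numpy as np

def dtw(x, y):
    """Classic O(n*m) dynamic time warping distance between two series."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

print(dtw([0, 1, 2, 3], [0, 0, 1, 2, 3]))   # small: the series differ by a shift
```
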
22. Partitioning – Piecewise Constant Approximations
   • Original time series (n points)
   • Piecewise Constant Approximation (PCA), also known as Piecewise Aggregate Approximation (PAA) [Yi and Faloutsos '00; Keogh et al. '00] (n' segments)
   • Adaptive Piecewise Constant Approximation (APCA) [Keogh et al. '01] (n'' segments)

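PAA itself is nearly a one-liner; a minimal sketch (assuming the series length divides evenly into the number of segments):

```python
import numpy as np

def paa(x, n_segments):
    """Mean of each equal-length segment; len(x) must divide evenly."""
    return np.asarray(x, dtype=float).reshape(n_segments, -1).mean(axis=1)

print(paa([1, 2, 3, 4, 5, 6, 7, 8], 4))   # -> [1.5 3.5 5.5 7.5]
```
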
23. Multiresolution Vector Quantized Approximation (MVQ)
   Partitions a sequence into equal-length segments and uses vector quantization (VQ) to represent each sequence by the appearance frequencies of key subsequences:
   1) Uses a 'vocabulary' of subsequences (codebook) – training is involved
   2) Takes multiple resolutions into account – keeps both local and global information
   3) Unlike wavelets, partially ignores the ordering of 'codewords'
   4) Can exploit prior knowledge about the data
   5) Employs a new distance metric
   [V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos, ICDE 2005]

24. Methodology
   [Figure: the MVQ pipeline – codebook generation (s = 16 codewords of length l), series transformation, and series encoding; each series becomes a string of codeword labels and, from it, a vector of codeword frequencies]

25. Methodology
   • Creating a 'vocabulary': frequently appearing patterns in subsequences
     - Q: How to create it? A: Use Vector Quantization, in particular the Generalized Lloyd Algorithm (GLA)
     - Output: a codebook with s codewords
   • Representing time series: X = x_1, x_2, …, x_n is encoded with the new representation f = (f_1, f_2, …, f_s), where f_i is the frequency of the i-th codeword in X

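A hedged sketch of one resolution level of this scheme: k-means (a standard stand-in for GLA) trains the codebook of key subsequences, and each series is then encoded as its codeword-frequency vector f. The window length w, codebook size s, and non-overlapping segmentation are assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def subsequences(series, w):
    """Non-overlapping windows of length w."""
    return np.array([series[i:i + w] for i in range(0, len(series) - w + 1, w)],
                    dtype=float)

def train_codebook(series_list, w=16, s=16):
    subs = np.vstack([subsequences(x, w) for x in series_list])
    codebook, _ = kmeans2(subs, s, minit='++', seed=0)
    return codebook

def encode(series, codebook, w=16):
    labels, _ = vq(subsequences(series, w), codebook)
    # f = (f_1, ..., f_s): appearance frequency of each codeword.
    return np.bincount(labels, minlength=len(codebook))
```
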
26. Methodology
   • New distance metric: the histogram model is used to calculate similarity at each resolution level, comparing the codeword frequencies f_{i,t} and f_{i,q} of series t and q over i = 1, 2, …, s

27. Methodology
   • Time series summarization: high-level information (frequently appearing patterns) is more useful, and the new representation provides exactly this kind of information
   [Figure: example series in which both codewords (patterns) 3 and 5 appear twice]

28. Methodology
   Problems of frequency-based encoding at a single resolution:
   • It is hard to define an appropriate resolution (codeword length)
   • It may lose global information

29. Methodology
   Solution: use multiple resolutions, which addresses both problems – no single codeword length has to be chosen, and global information is retained.

30. Methodology
   Proposed distance metric: a weighted sum of the similarities at all resolution levels,

   sim(t, q) = Σ_{i=1}^{c} w_i · sim_i(t, q)

   where c is the number of resolution levels and sim_i is the similarity at level i.
   • Lacking any prior knowledge, equal weights for all resolution levels work well most of the time

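A sketch of the combined measure. The per-level similarity below is a histogram intersection over the frequency vectors, which is one plausible reading of the histogram model on the previous slides; the paper's exact per-level formula may differ.

```python
import numpy as np

def level_similarity(f_t, f_q):
    """Histogram-intersection similarity of two codeword-frequency vectors."""
    union = np.maximum(f_t, f_q).sum()
    return np.minimum(f_t, f_q).sum() / union if union else 1.0

def mvq_similarity(hists_t, hists_q, weights=None):
    """hists_*: one frequency vector per resolution level (c levels)."""
    c = len(hists_t)
    w = np.full(c, 1.0 / c) if weights is None else np.asarray(weights, float)
    return float(sum(wi * level_similarity(ft, fq)
                     for wi, ft, fq in zip(w, hists_t, hists_q)))
```
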
31. MVQ: Example of Codebooks
   • Codebook for the first level
   • Codebook for the second level (more codewords, since there are more details)

32. Experiments: Datasets
   • SYNDATA (control chart data): synthetic
   • CAMMOUSE: 3 × 5 sequences obtained using the Camera Mouse program
   • RTT: round-trip-time measurements from UCR to CMU, with a sending rate of 50 msec, over one day

33. Experiments
   • Best-match searching
   • Matching accuracy: the percentage of k-nearest neighbors (found by the different approaches) that are in the same class

34. Experiments: Best-Match Searching

   SYNDATA:
   Method            Weight Vector   Accuracy
   Single-level VQ   [1 0 0 0 0]     0.55
   Single-level VQ   [0 1 0 0 0]     0.70
   Single-level VQ   [0 0 1 0 0]     0.65
   Single-level VQ   [0 0 0 1 0]     0.48
   Single-level VQ   [0 0 0 0 1]     0.46
   MVQ               [1 1 1 1 1]     0.83
   Euclidean         –               0.51

   CAMMOUSE:
   Method            Weight Vector   Accuracy
   Single-level VQ   [1 0 0 0 0]     0.56
   Single-level VQ   [0 1 0 0 0]     0.60
   Single-level VQ   [0 0 1 0 0]     0.44
   Single-level VQ   [0 0 0 1 0]     0.56
   Single-level VQ   [0 0 0 0 1]     0.60
   MVQ               [1 1 1 1 1]     0.83
   Euclidean         –               0.58

35. Experiments: Best-Match Searching
   [Figure: precision-recall curves for the different methods on (a) the SYNDATA dataset and (b) the CAMMOUSE dataset, with the MVQ curves marked]

36. Experiments: Clustering
   Given two clusterings, G = G_1, G_2, …, G_K (the true clusters) and A = A_1, A_2, …, A_k (the clustering produced by a given method), clustering accuracy is evaluated with the cluster similarity

   Sim(G, A) = (1/K) · Σ_{i=1}^{K} max_j sim(G_i, A_j),   with   sim(G_i, A_j) = 2 · |G_i ∩ A_j| / (|G_i| + |A_j|)

   [Gavrilov, M., Anguelov, D., Indyk, P. and Motwani, R., KDD 2000]

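The measure is straightforward to compute; a small sketch with cluster memberships represented as sets (an assumption about the data layout): each true cluster is matched to its best-overlapping found cluster and the Dice overlaps are averaged.

```python
def cluster_similarity(true_clusters, found_clusters):
    """Sim(G, A) as defined above; both arguments are lists of sets."""
    def sim(g, a):
        return 2 * len(g & a) / (len(g) + len(a))   # Dice overlap
    return sum(max(sim(g, a) for a in found_clusters)
               for g in true_clusters) / len(true_clusters)

G = [{1, 2, 3}, {4, 5, 6}]          # true clusters
A = [{1, 2}, {3, 4, 5, 6}]          # clusters found by some method
print(round(cluster_similarity(G, A), 2))   # 0.83
```
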
37. Experiments: Clustering

   SYNDATA:
   Method            Weight Vector   Accuracy
   Single-level VQ   [1 0 0 0 0]     0.69
   Single-level VQ   [0 1 0 0 0]     0.71
   Single-level VQ   [0 0 1 0 0]     0.63
   Single-level VQ   [0 0 0 1 0]     0.51
   Single-level VQ   [0 0 0 0 1]     0.49
   MVQ               [1 1 1 1 1]     0.82
   DFT               –               0.67
   SAX               –               0.65
   DTW               –               0.80
   Euclidean         –               0.55

   RTT:
   Method            Weight Vector   Accuracy
   Single-level VQ   [1 0 0 0 0]     0.55
   Single-level VQ   [0 1 0 0 0]     0.52
   Single-level VQ   [0 0 1 0 0]     0.57
   Single-level VQ   [0 0 0 1 0]     0.80
   Single-level VQ   [0 0 0 0 1]     0.79
   MVQ               [0 0 0 1 1]     0.81
   DFT               –               0.54
   SAX               –               0.54
   DTW               –               0.62
   Euclidean         –               0.50

38. Experiments: Summarization (SYNDATA)
   [Figure: typical series from the SYNDATA dataset]

39. Experiments: Summarization (cont.)
   [Figure: summarization output at the first and second resolution levels]

40. MVQ: Example with Two Time Series
   • Given two time series, t1 and t2:
   • At the first level, both are encoded with the same codeword (3), so they are not distinguishable
   • At the second level, more detail is recorded and the two series have different encodings: the first is encoded with codewords 1 and 4, the second with codewords 9 and 12

41. Analysis of Images by Projection to 1D
   • Hilbert space-filling curve
   • Binning
   • Statistical tests of significance on groups of points
   • Identification of discriminative areas by back-projection
   [Figure: (a) linear mapping of a 3D fMRI scan, (b) effect of binning, representing each bin by its mean measurement, (c) the discriminative voxels after applying the t-test with θ = 0.05]
   [D. Kontos, V. Megalooikonomou, N. Ghubade, and C. Faloutsos, IEEE Engineering in Medicine and Biology Society (EMBS), 2003]

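A hedged sketch of the binning-and-testing step, assuming each 3D volume has already been unrolled into a 1D sequence along the Hilbert curve (the curve computation itself is omitted); the bin count, names, and divisibility of the sequence length are assumptions.

```python
import numpy as np
from scipy import stats

def bin_means(seq, n_bins):
    """Represent each of n_bins equal-length bins by its mean measurement."""
    return np.asarray(seq, dtype=float).reshape(n_bins, -1).mean(axis=1)

def discriminative_bins(controls, patients, n_bins=256, theta=0.05):
    """controls/patients: (n_subjects, n_voxels), Hilbert-ordered per row."""
    C = np.array([bin_means(s, n_bins) for s in controls])
    P = np.array([bin_means(s, n_bins) for s in patients])
    _, p = stats.ttest_ind(C, P, axis=0)     # one test per bin
    return np.where(p < theta)[0]            # bins to back-project into 3D
```
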
42. Applying Time Series Techniques
   • Results: 87%-98% classification accuracy (t-test, CATX)
   • Variation: concatenate the values of the statistically significant areas → spatial sequences
   • Pattern analysis using the similarity between spatial sequences and time sequences:
     - SVD, DFT, DWT, PCA (clustering accuracy: 89-100%)
   [Figure: areas discovered with (a) θ = 0.05 and (b) θ = 0.01; the colorbar shows significance]
   [Q. Wang, D. Kontos, G. Li and V. Megalooikonomou, ICASSP 2004]

43. Conclusions
   • 'Find patterns/interesting things' efficiently and robustly in spatial and temporal data
   • Use of partitioning and clustering
   • Analysis at multiple resolutions
   • Reduction of the number of tests performed
   • Intelligent exploration of the space to find discriminative areas
   • Reduction of dimensionality
   • Symbolic representation
   • Good summarization

44. Collaborators
   Faculty: Zoran Obradovic, Orest Boyko, James Gee, Andrew Saykin, Christos Faloutsos, Christos Davatzikos, Edward Herskovits, Fillia Makedon, Dragoljub Pokrajac
   Students: Despina Kontos, Qiang Wang, Guo Li
   Others: James Ford, Aleksandar Lazarevic

45. Thank You!
   Acknowledgements – this research has been funded by:
   • National Science Foundation CAREER award 0237921
   • National Science Foundation Grant 0083423
   • National Institutes of Health Grant R01 MH68066, funded by NIMH, NINDS, and NIA
