Vasilis-DaMn05.ppt

Transcript

  • 1. Clustering and Partitioning for Spatial and Temporal Data Mining Vasilis Megalooikonomou Data Engineering Laboratory (DEnLab) Dept. of Computer and Information Sciences Temple University Philadelphia, PA www.cis.temple.edu/~vasilis
  • 2. Outline
    • Introduction
      • Motivation – Problems:
        • Spatial domain
        • Time domain
      • Challenges
    • Spatial data
      • Partitioning and Clustering
      • Detection of discriminative patterns
      • Results
    • Temporal data
      • Partitioning
      • Vector Quantization
      • Results
    • Conclusions - Discussion
  • 3. Introduction
    • Large spatial and temporal databases
    • Meta-analysis of data pooled from multiple studies
    • Goal: To understand patterns and discover associations, regularities and anomalies in spatial and temporal data
  • 4. Problem
    • Spatial Data Mining:
      • Given a large collection of spatial data, e.g., 2D or 3D images, and other data, find interesting things, i.e.:
        • associations among image data or among image and non-image data
        • discriminative areas among groups of images
        • rules/patterns
        • similar images to a query image (queries by content)
  • 5. Challenges
    • How to apply data mining techniques to images?
    • Learning from images directly
    • Heterogeneity and variability of image data
    • Preprocessing (segmentation, spatial normalization, etc)
    • Exploration of high correlation between neighboring objects
    • Large dimensionality
    • Complexity of associations
    • Efficient management of topological/distance information
    • Spatial knowledge representation / Spatial Access Methods (SAMs)
  • 6. Example: Association Mining – Spatial Data
    • Discover associations among spatial and non-spatial data:
      • Images {i_1, i_2, …, i_L}
      • Spatial regions {s_1, s_2, …, s_K}
      • Non-spatial variables {c_1, c_2, …, c_M}
    [Figure: images i_1 … i_7, each annotated with non-spatial variables drawn from c_1, c_2, c_3, c_6, c_7, c_9]
  • 7. Example: fMRI contrast maps [Figure panels: Control, Patient]
  • 8. Applications
    • Medical Imaging, Bioinformatics, Geography, Meteorology, etc.
  • 9. Voxel-based Analysis
    • No model on the image data
    • Each voxel’s changes analyzed independently - a map of statistical significance is built
    • Discriminatory significance measured by statistical tests (t-test, ranksum test, F-test, etc)
    • Statistical Parametric Mapping (SPM)
    • Significance of associations measured by chi-squared test, Fisher’s exact test (a contingency table for each pair of vars)
    • Cluster voxels by findings
    [V. Megalooikonomou, C. Davatzikos, E. Herskovits, SIGKDD 1999]
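A minimal sketch of the voxel-wise step described on this slide: every voxel is tested independently between two groups, which yields a map of test statistics. It assumes Welch's two-sample t statistic and thresholds on |t| rather than on a p-value (to stay within the Python standard library). The names `welch_t` and `voxelwise_map`, the cutoff, and the toy data are hypothetical and not taken from the cited work.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of measurements."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def voxelwise_map(controls, patients):
    """controls/patients: lists of subjects, each a flat list of voxel values.
    Returns one t statistic per voxel; each voxel is analyzed independently."""
    n_voxels = len(controls[0])
    return [
        welch_t([s[v] for s in controls], [s[v] for s in patients])
        for v in range(n_voxels)
    ]

# Toy data (assumption): 4 voxels, only voxel 2 differs between the groups.
controls = [[1.0, 2.0, 5.0, 3.0], [1.2, 2.1, 5.2, 2.9], [0.9, 1.9, 4.9, 3.1]]
patients = [[1.1, 2.0, 8.0, 3.0], [1.0, 2.2, 8.3, 3.2], [1.0, 1.8, 7.9, 2.8]]
t_map = voxelwise_map(controls, patients)
significant = [v for v, t in enumerate(t_map) if abs(t) > 4.0]  # illustrative cutoff
```

Note that the number of tests equals the number of voxels, which is what makes the multiple comparison problem severe for this approach.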
  • 10. Analysis by grouping of voxels
    • Grouping of voxels (atlas-based)
      • Prior knowledge increases sensitivity
      • Data reduction: 10^7 voxels → R regions (structures)
      • Map a ROI onto at least one region
      • As good as the atlas being used
    • M non-spatial variables, R regions
    • Analysis
      • Categorical structural variables
        • M x R contingency tables, Chi-square/Fisher exact test
        • multiple comparison problem
        • log-linear analysis, multivariate Bayesian
      • Continuous structural variables
        • Logistic regression, Mann-Whitney
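As an illustration of the contingency-table analysis, the sketch below computes Pearson's chi-square statistic for a single 2x2 table, i.e. one (region, non-spatial variable) pair with both variables binary. The function name `chi_square_2x2` and the counts are hypothetical. No continuity correction is applied, and the p-value uses the closed form that holds for one degree of freedom.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table.
    table[i][j]: number of subjects with region status i and variable value j."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: abnormality in a region (rows: yes/no) vs. deficit (cols: yes/no).
chi2, p = chi_square_2x2([[18, 7], [6, 19]])
```

With M variables and R regions this test is repeated for every pair, which is where the multiple comparison problem arises.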
  • 15. Dynamic Recursive Partitioning
    • Adaptive partitioning of a 3D volume
    • Partitioning criterion:
      • discriminative power of feature(s) of hyper-rectangle and
      • size of hyper-rectangle
    • Extract features from discriminative regions
    • Reduce multiple comparison problem
      • (# tests = # partitions < # voxels)
    • tests downward closed
    [V. Megalooikonomou, D. Pokrajac, A. Lazarevic, and Z. Obradovic, SPIE Conference on Visualization and Data Analysis, 2002]
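The partitioning idea can be sketched as follows. This is a simplified 2D version (quadrants instead of 3D octants) under stated assumptions: the feature of a hyper-rectangle is its mean value, discriminative power is measured by a Welch t statistic compared against a fixed cutoff, and a region is split only when it is not discriminative, is large enough, and the maximum depth has not been reached. All names (`drp`, `region_mean`, `make_image`) and the toy data are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def region_mean(img, region):
    r0, r1, c0, c1 = region
    return mean(img[r][c] for r in range(r0, r1) for c in range(c0, c1))

def drp(group_a, group_b, region, t_cut, max_depth, depth=0):
    """Returns (discriminative_regions, number_of_tests)."""
    r0, r1, c0, c1 = region
    t = welch_t([region_mean(s, region) for s in group_a],
                [region_mean(s, region) for s in group_b])
    if abs(t) > t_cut:                      # discriminative: keep, do not split
        return [region], 1
    if depth == max_depth or r1 - r0 < 2 or c1 - c0 < 2:
        return [], 1                        # too deep or too small to split
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    found, n_tests = [], 1
    for sub in [(r0, rm, c0, cm), (r0, rm, cm, c1),
                (rm, r1, c0, cm), (rm, r1, cm, c1)]:
        sub_found, sub_tests = drp(group_a, group_b, sub, t_cut, max_depth, depth + 1)
        found += sub_found
        n_tests += sub_tests
    return found, n_tests

def make_image(offset, bump):
    """8x8 toy image: +bump in the top-left quadrant, -bump in the bottom-right."""
    return [[offset + 0.1 * ((r * 8 + c) % 3)
             + (bump if r < 4 and c < 4 else -bump if r >= 4 and c >= 4 else 0.0)
             for c in range(8)] for r in range(8)]

group_a = [make_image(o, 0.0) for o in (0.0, 0.05, -0.05)]
group_b = [make_image(o, 2.0) for o in (0.02, -0.03, 0.04)]
regions, n_tests = drp(group_a, group_b, (0, 8, 0, 8), t_cut=4.0, max_depth=2)
```

In this toy case the two opposite effects cancel over the whole image, so the root is split, and the two affected quadrants are found with far fewer tests than the 64 a voxel-wise analysis would need.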
  • 16. Other Methods for Spatial Data Classification
    • Distributional Distances:
      • - Mahalanobis distance
      • - Kullback-Leibler divergence (parametric, non-parametric)
    • Maximum Likelihood:
      • - Estimate probability densities and compute likelihood
        • EM (Expectation-Maximization) method to model spatial regions using some base function (Gaussian)
    • Static partitioning:
      • Reduction of the # of attributes as compared to voxel-wise analysis
      • Space partitioned into 3D hyper-rectangles (variables: properties of voxels inside hyper-rectangles) - incrementally increase discretization
    [Distinguishing among distributions: D. Pokrajac, V. Megalooikonomou, A. Lazarevic, D. Kontos, Z. Obradovic, Artificial Intelligence in Medicine, Vol. 33, No. 3, pp. 261-280, Mar. 2005]
  • 17. Experimental Results
    [Figure: areas discovered by DRP with t-test: significance threshold = 0.05, maximum tree depth = 3; the colorbar shows significance]
    [D. Kontos, V. Megalooikonomou, D. Pokrajac, A. Lazarevic, Z. Obradovic, O. B. Boyko, J. Ford, F. Makedon, A. J. Saykin, MICCAI 2004]
    Comparison of number of tests performed:
      Threshold  Depth  DRP tests  Voxel-wise tests
      0.05       3      569        201774
      0.05       4      4425       201774
      0.01       4      4665       201774
    Classification accuracy (%):
      Method                        Criterion    Threshold  Tree depth  Controls  Patients  Total
      DRP                           correlation  0.4        3           82        93        88
      DRP                           t-test       0.05       3           89        100       94
      DRP                           t-test       0.05       4           84        100       92
      DRP                           t-test       0.01       4           87        100       93
      DRP                           ranksum      0.05       3           87        100       93
      DRP                           ranksum      0.05       4           80        100       90
      DRP                           ranksum      0.01       4           87        96        91
      Maximum Likelihood / EM       -            -          -           77        67        72
      Maximum Likelihood / k-means  -            -          -           77        83        80
      Kullback-Leibler / EM         -            -          -           79        57        68
      Kullback-Leibler / k-means    -            -          -           77        66        71
  • 18. Experimental Results
    • Impact:
    • Assist in interpretation of images (e.g., facilitating diagnosis)
    • Enable researchers to integrate, manipulate and analyze large volumes of image data
    [Figure: discriminative sub-regions detected when applying (a) DRP and (b) voxel-wise analysis, with the ranksum test and significance threshold 0.05, to the real fMRI volume data]
  • 19. Time Sequence Analysis
    • Time series data abound in many applications …
    • Challenges:
      • High dimensionality
      • Large number of sequences
      • Similarity metric definition
    • Similarity analysis (e.g., find stocks similar to that of IBM)
    • Goals: high accuracy (and high speed) in similarity searches among time series and in discovering interesting patterns
    • Applications: clustering, classification, similarity searches, summarization
    Time sequence: a sequence (ordered collection) of real values: X = x_1, x_2, …, x_n
  • 20. Dimensionality Reduction Techniques
        • DFT: Discrete Fourier Transform
        • DWT: Discrete Wavelet Transform
        • SVD: Singular Value Decomposition
        • APCA: Adaptive Piecewise Constant Approximation
        • PAA: Piecewise Aggregate Approximation
        • SAX: Symbolic Aggregate approXimation
  • 21. Similarity distances for time series
      • Euclidean Distance:
        • most common, sensitive to shifts
      • Dynamic Time Warping (DTW):
        • improves accuracy but is slow: O(n^2)
      • Envelope-based DTW:
        • faster: O(n)
    A more intuitive idea: two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar (Agrawal et al., VLDB 1995)
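To make the contrast between the first two distances concrete, here is a small sketch of Euclidean distance and of the classic dynamic-programming form of DTW, whose table has n x m cells and hence the quadratic cost. The squared pointwise cost and the final square root are assumptions of this sketch. The envelope-based variant is not shown.

```python
import math

def euclidean(x, y):
    """Point-by-point distance; sensitive to shifts along the time axis."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y):
    """Dynamic Time Warping via an (n+1) x (m+1) dynamic-programming table."""
    n, m = len(x), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping steps.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return math.sqrt(d[n][m])

x = [0, 0, 1, 2, 1, 0, 0]
y = [0, 1, 2, 1, 0, 0, 0]  # same shape as x, shifted by one time step
```

For this pair the Euclidean distance is 2.0, while DTW aligns the two peaks and returns 0.0, which illustrates why Euclidean distance is called sensitive to shifts.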
  • 22. Partitioning – Piecewise Constant Approximations
    • Original time series (n points)
    • Piecewise Constant Approximation (PCA) or Piecewise Aggregate Approximation (PAA) [Yi and Faloutsos ’00; Keogh et al. ’00] (n' segments)
    • Adaptive Piecewise Constant Approximation (APCA) [Keogh et al. ’01] (n" segments)
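A minimal sketch of the equal-length variant (PAA): the series is cut into segments of equal length and each segment is replaced by its mean. The function name `paa` is hypothetical, and the sketch assumes the series length is divisible by the number of segments. APCA, which lets segment lengths adapt to the data, is not shown.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")
    w = len(series) // n_segments
    return [sum(series[i:i + w]) / w for i in range(0, len(series), w)]

# 8 points reduced to 4 segment means.
approx = paa([1, 3, 2, 4, 10, 12, 7, 9], 4)  # [2.0, 3.0, 11.0, 8.0]
```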
  • 23. Multiresolution Vector Quantized approximation (MVQ)
    Partitions a sequence into equal-length segments and uses VQ to represent each sequence by the appearance frequencies of key subsequences
    1) Uses a ‘vocabulary’ of subsequences (codebook) – training is involved
    2) Takes multiple resolutions into account – keeps both local and global information
    3) Unlike wavelets, partially ignores the ordering of ‘codewords’
    4) Can exploit prior knowledge about the data
    5) Employs a new distance metric
    [V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos, ICDE 2005]
  • 24. Methodology
    [Figure: three-stage pipeline: codebook generation (s = 16), series transformation (each series rewritten as a string of codewords, e.g. "c m d b c a i f …"), and series encoding (each series represented by a 16-entry codeword frequency vector, e.g. 1121000000001000)]
  • 25. Methodology
      • Creating a ‘vocabulary’: frequently appearing patterns in subsequences
      • Q: How to create it? A: Use Vector Quantization, in particular the Generalized Lloyd Algorithm (GLA)
      • Output: a codebook with s codewords
      • Representing time series: X = x_1, x_2, …, x_n is encoded with a new representation f = (f_1, f_2, …, f_s), where f_i is the frequency of the i-th codeword in X
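The two steps on this slide, building the codebook and encoding a series as codeword frequencies, can be sketched as below. This is a simplified Lloyd iteration under stated assumptions: segments are non-overlapping, the codebook is initialized naively from evenly spaced training segments (GLA implementations typically use more careful initialization), and a fixed number of iterations is run. All function names and the toy data are hypothetical.

```python
def segments(series, length):
    """Cut a series into consecutive non-overlapping segments of equal length."""
    return [series[i:i + length] for i in range(0, len(series) - length + 1, length)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(codebook, seg):
    """Index of the codeword closest to seg."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], seg))

def lloyd(training, s, iterations=20):
    """Lloyd iteration: assign each segment to its nearest codeword, then re-center.
    Assumes len(training) >= s."""
    step = max(1, len(training) // s)
    codebook = [list(training[i * step]) for i in range(s)]  # naive init (assumption)
    for _ in range(iterations):
        cells = [[] for _ in range(s)]
        for seg in training:
            cells[nearest(codebook, seg)].append(seg)
        for k, cell in enumerate(cells):
            if cell:  # keep the old codeword if its cell is empty
                codebook[k] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

def encode(series, codebook, length):
    """Frequency vector f = (f_1, ..., f_s): how often each codeword is the nearest."""
    f = [0] * len(codebook)
    for seg in segments(series, length):
        f[nearest(codebook, seg)] += 1
    return f

# Toy training data: rising, falling, and flat patterns of length 4.
train = [0, 1, 2, 3, 3, 2, 1, 0, 1, 1, 1, 1,
         0, 1, 2, 3.2, 3, 2, 1, 0.2, 1, 1, 1.1, 1]
codebook = lloyd(segments(train, 4), s=3)
f = encode([0, 1, 2, 3, 0, 1, 2, 3, 3, 2, 1, 0, 1, 1, 1, 1], codebook, 4)
```

Here the encoded series contains the rising pattern twice and the other two patterns once each, so its representation is the frequency vector [2, 1, 1]; the order in which the patterns occurred is not recorded.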
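  • 26. Methodology
    New distance metric: the histogram model is used to calculate similarity at each resolution level
    [Equation: similarity between series t and q as a function of their codeword frequencies f_{i,t} and f_{i,q}, i = 1, 2, …, s]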
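The exact formula did not survive in this transcript, so the sketch below only illustrates the idea of a histogram-based similarity between two codeword frequency vectors. The specific form used here, a normalized L1 difference turned into a similarity in (0, 1], is an assumption and should not be read as the definition from the slide. The name `histogram_similarity` is hypothetical.

```python
def histogram_similarity(f_t, f_q):
    """Similarity of two codeword frequency vectors of equal length s.
    ASSUMED form: dis = sum_i |f_it - f_iq| / (1 + f_it + f_iq), sim = 1 / (1 + dis)."""
    dis = sum(abs(a - b) / (1 + a + b) for a, b in zip(f_t, f_q))
    return 1.0 / (1.0 + dis)

same = histogram_similarity([2, 1, 1, 0], [2, 1, 1, 0])   # identical histograms
other = histogram_similarity([2, 1, 1, 0], [0, 1, 1, 2])  # differing histograms
```

Whatever the precise form, the measure depends only on how often each codeword appears, not on where it appears in the series.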
  • 27. Methodology
    • Time series summarization:
    • High level information (frequently appearing patterns) is more useful
    • The new representation can provide this kind of information
    Both codewords (patterns) 3 and 5 show up 2 times
  • 28. Methodology
    Problems of frequency-based encoding:
    • It is hard to define an appropriate resolution (codeword length)
    • It may lose global information
  • 29. Methodology
    Solution: use multiple resolutions, which addresses both problems above:
    • No single resolution (codeword length) has to be chosen
    • Coarser levels keep the global information
  • 30. Methodology
    Proposed distance metric: weighted sum of the similarities at all resolution levels:
    S(q, t) = \sum_{i=1}^{c} w_i S_i(q, t)
    • where S_i(q, t) is the similarity at level i, w_i is its weight, and c is the number of resolution levels
    • lacking any prior knowledge, equal weights for all resolution levels work well most of the time
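A short sketch of the weighted combination across resolution levels. Each series is assumed to be given as one codeword frequency vector per level. The per-level similarity `level_similarity` is a stand-in whose exact form is an assumption, and the names and toy vectors are hypothetical.

```python
def level_similarity(f_t, f_q):
    """Stand-in per-level similarity between two frequency vectors (assumed form)."""
    dis = sum(abs(a - b) / (1 + a + b) for a, b in zip(f_t, f_q))
    return 1.0 / (1.0 + dis)

def mvq_similarity(levels_t, levels_q, weights):
    """Weighted sum of per-level similarities over c resolution levels."""
    return sum(w * level_similarity(ft, fq)
               for w, ft, fq in zip(weights, levels_t, levels_q))

# Two series that share a codeword at the coarse level but differ at the fine level.
t1 = [[0, 0, 1], [1, 0, 0, 1]]
t2 = [[0, 0, 1], [0, 1, 1, 0]]
coarse_only = mvq_similarity(t1, t2, [1, 0])  # single level: looks identical
both_levels = mvq_similarity(t1, t2, [1, 1])  # equal weights on both levels
```

A weight vector such as [1, 0] reduces the measure to a single-level comparison, while equal weights let the finer level reveal differences that the coarse level misses.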
  • 31. MVQ: Example of Codebooks
    • Codebook for the first level
    • Codebook for the second level (more codewords since there are more details)
  • 32. Experiments Datasets
    • SYNDATA (control chart data): synthetic
    • CAMMOUSE: 3 × 5 sequences obtained using the Camera Mouse Program
    • RTT: round-trip time (RTT) measurements from UCR to CMU, with a probe sent every 50 msec, for a day
  • 33. Experiments
    • Best Match Searching:
    Matching accuracy: percentage of the k nearest neighbors (found by each approach) that are in the same class as the query
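The matching-accuracy measure can be sketched as follows, assuming that every series is used in turn as the query, that the query itself is excluded from its neighbor list, and that the result is pooled over all queries. The function name, the distance, and the toy data are hypothetical.

```python
def knn_matching_accuracy(series, labels, distance, k):
    """Fraction of k nearest neighbors that share the class of their query."""
    hits = total = 0
    for q in range(len(series)):
        others = [i for i in range(len(series)) if i != q]
        others.sort(key=lambda i: distance(series[q], series[i]))
        for i in others[:k]:
            hits += labels[i] == labels[q]
            total += 1
    return hits / total

def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

data = [[0, 0], [0, 1], [5, 5], [5, 6]]
classes = ["a", "a", "b", "b"]
acc_k1 = knn_matching_accuracy(data, classes, sq_euclidean, k=1)
acc_k2 = knn_matching_accuracy(data, classes, sq_euclidean, k=2)
```

Any distance or similarity-derived ranking can be plugged in through the `distance` argument, which is how different representations can be compared on the same data.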
  • 34. Experiments
    • Best Match Searching
    SYNDATA:
      Method           Weight Vector  Accuracy
      Single level VQ  [1 0 0 0 0]    0.55
      Single level VQ  [0 1 0 0 0]    0.70
      Single level VQ  [0 0 1 0 0]    0.65
      Single level VQ  [0 0 0 1 0]    0.48
      Single level VQ  [0 0 0 0 1]    0.46
      MVQ              [1 1 1 1 1]    0.83
      Euclidean        -              0.51
    CAMMOUSE:
      Method           Weight Vector  Accuracy
      Single level VQ  [1 0 0 0 0]    0.56
      Single level VQ  [0 1 0 0 0]    0.60
      Single level VQ  [0 0 1 0 0]    0.44
      Single level VQ  [0 0 0 1 0]    0.56
      Single level VQ  [0 0 0 0 1]    0.60
      MVQ              [1 1 1 1 1]    0.83
      Euclidean        -              0.58
  • 35. Experiments
    • Best Match Searching
    [Figure: precision-recall curves for different methods (a) on the SYNDATA dataset, (b) on the CAMMOUSE dataset; the MVQ curve is labeled in each panel]
  • 36. Experiments
    • Clustering experiments
    Given two clusterings, G = G_1, G_2, …, G_K (the true clusters) and A = A_1, A_2, …, A_k (the clustering result of a certain method), the clustering accuracy is evaluated with the cluster similarity defined as:
    Sim(G, A) = \frac{1}{K} \sum_{i=1}^{K} \max_{1 \le j \le k} Sim(G_i, A_j)
    with
    Sim(G_i, A_j) = \frac{2 |G_i \cap A_j|}{|G_i| + |A_j|}
    [Gavrilov, M., Anguelov, D., Indyk, P. and Motwani, R., KDD 2000]
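A small sketch of this evaluation, assuming the commonly used form of the measure: each true cluster is matched to its best-overlapping found cluster using 2|G_i ∩ A_j| / (|G_i| + |A_j|), and the best matches are averaged over the true clusters. Clusters are represented as Python sets of item identifiers; the function name and the example clusterings are hypothetical.

```python
def cluster_similarity(true_clusters, found_clusters):
    """Average, over true clusters, of the best overlap score with any found cluster."""
    def sim(g, a):
        return 2 * len(g & a) / (len(g) + len(a))
    best = [max(sim(g, a) for a in found_clusters) for g in true_clusters]
    return sum(best) / len(true_clusters)

G = [{1, 2, 3}, {4, 5, 6}]   # true clusters
A = [{1, 2}, {3, 4, 5, 6}]   # result of some clustering method (item 3 misplaced)
score = cluster_similarity(G, A)
perfect = cluster_similarity(G, G)
```

The score is 1.0 only when every true cluster is reproduced exactly, and note that the measure is not symmetric in its two arguments.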
  • 37. Experiments
    • Clustering experiments
    SYNDATA:
      Method           Weight Vector  Accuracy
      Single level VQ  [1 0 0 0 0]    0.69
      Single level VQ  [0 1 0 0 0]    0.71
      Single level VQ  [0 0 1 0 0]    0.63
      Single level VQ  [0 0 0 1 0]    0.51
      Single level VQ  [0 0 0 0 1]    0.49
      MVQ              [1 1 1 1 1]    0.82
      DFT              -              0.67
      SAX              -              0.65
      DTW              -              0.80
      Euclidean        -              0.55
    RTT:
      Method           Weight Vector  Accuracy
      Single level VQ  [1 0 0 0 0]    0.55
      Single level VQ  [0 1 0 0 0]    0.52
      Single level VQ  [0 0 1 0 0]    0.57
      Single level VQ  [0 0 0 1 0]    0.80
      Single level VQ  [0 0 0 0 1]    0.79
      MVQ              [0 0 0 1 1]    0.81
      DFT              -              0.54
      SAX              -              0.54
      DTW              -              0.62
      Euclidean        -              0.50
  • 38. Experiments: Summarization (SYNDATA) [Figure: typical series]
  • 39. Experiments [Figure panels: First Level, Second Level]
  • 40. MVQ: Example: Two Time Series
    • Given two time series t1 and t2 [Figure: the two series]
    • In the first level, they are encoded with the same codeword (3), so they are not distinguishable
    • In the second level, more details are recorded. The two series have different encoded forms: the first series is encoded with codewords 1 and 4, the second with codewords 9 and 12.
  • 41. Analysis of images by projection to 1D
    • Hilbert Space Filling Curve
    • Binning
    • Statistical tests of significance on groups of points
    • Identification of discriminative areas by back-projection
    [Figure: (a) linear mapping of a 3D fMRI scan, (b) effect of binning by representing each bin with its V_mean measurement, (c) the discriminative voxels after applying the t-test with θ = 0.05]
    [D. Kontos, V. Megalooikonomou, N. Ghubade, and C. Faloutsos. IEEE Engineering in Medicine and Biology Society (EMBS), 2003]
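The projection, binning, and back-projection steps can be sketched as below. This is a 2D simplification (the slide concerns 3D scans) that uses the standard iterative index-to-coordinate conversion for a Hilbert curve on an n x n grid with n a power of two. Function names and the toy image are hypothetical, and the statistical test on the bins is omitted.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def linearize(image):
    """Read an n x n image (list of rows) in Hilbert order, giving a 1D sequence."""
    n = len(image)
    return [image[y][x] for x, y in (hilbert_d2xy(n, d) for d in range(n * n))]

def bin_means(sequence, bin_size):
    """Represent each bin of consecutive points by its mean (length divisible by bin_size)."""
    return [sum(sequence[i:i + bin_size]) / bin_size
            for i in range(0, len(sequence), bin_size)]

def bin_to_pixels(n, b, bin_size):
    """Back-projection: the pixel coordinates covered by bin b."""
    return [hilbert_d2xy(n, d) for d in range(b * bin_size, (b + 1) * bin_size)]

image = [[1, 2],
         [4, 3]]
seq = linearize(image)   # [1, 4, 3, 2]
bins = bin_means(seq, 2)
```

Because consecutive points on the curve are spatial neighbors, each bin corresponds to a compact area of the image, which is what makes back-projecting a significant bin meaningful.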
  • 42. Applying time series techniques
    [Figure: areas discovered with (a) θ = 0.05, (b) θ = 0.01; the colorbar shows significance]
    • Variation: concatenate the values of statistically significant areas → spatial sequences
    • Pattern analysis using the similarity between spatial sequences and time sequences
      • SVD, DFT, DWT, PCA (clustering accuracy: 89-100%)
    Results: 87%-98% classification accuracy (t-test, CATX) [Q. Wang, D. Kontos, G. Li and V. Megalooikonomou, ICASSP 2004]
  • 43. Conclusions
    • ‘Find patterns/interesting things’ efficiently and robustly in spatial and temporal data
    • Use of partitioning and clustering
    • Analysis at multiple resolutions
    • Reduction of the number of tests performed
    • Intelligent exploration of the space to find discriminative areas
    • Reduction of dimensionality
    • Symbolic representation
    • Nice summarization
  • 44. Collaborators
    • Faculty:
    • Zoran Obradovic
    • Orest Boyko
    • James Gee
    • Andrew Saykin
    • Christos Faloutsos
    • Christos Davatzikos
    • Edward Herskovits
    • Fillia Makedon
    • Dragoljub Pokrajac
    • Students:
    • Despina Kontos
    • Qiang Wang
    • Guo Li
    • Others:
    • James Ford
    • Alexandar Lazarevic
  • 45. Thank you!
    • Acknowledgements
    • This research has been funded by:
      • National Science Foundation CAREER award 0237921
      • National Science Foundation Grant 0083423
      • National Institutes of Health Grant R01 MH68066 funded by NIMH, NINDS, and NIA