Vasilis-DaMn05.ppt: Presentation Transcript

  • Clustering and Partitioning for Spatial and Temporal Data Mining Vasilis Megalooikonomou Data Engineering Laboratory (DEnLab) Dept. of Computer and Information Sciences Temple University Philadelphia, PA www.cis.temple.edu/~vasilis
  • Outline
    • Introduction
      • Motivation – Problems:
        • Spatial domain
        • Time domain
      • Challenges
    • Spatial data
      • Partitioning and Clustering
      • Detection of discriminative patterns
      • Results
    • Temporal data
      • Partitioning
      • Vector Quantization
      • Results
    • Conclusions - Discussion
  • Introduction
    • Large spatial and temporal databases
    • Meta-analysis of data pooled from multiple studies
    • Goal: To understand patterns and discover associations, regularities and anomalies in spatial and temporal data
  • Problem
    • Spatial Data Mining:
      • Given a large collection of spatial data, e.g., 2D or 3D images, and other data, find interesting things, i.e.:
        • associations among image data or among image and non-image data
        • discriminative areas among groups of images
        • rules/patterns
        • similar images to a query image (queries by content)
  • Challenges
    • How to apply data mining techniques to images?
    • Learning from images directly
    • Heterogeneity and variability of image data
    • Preprocessing (segmentation, spatial normalization, etc)
    • Exploiting the high correlation between neighboring objects
    • Large dimensionality
    • Complexity of associations
    • Efficient management of topological/distance information
    • Spatial knowledge representation / Spatial Access Methods (SAMs)
  • Example: Association Mining – Spatial Data
    • Discover associations among spatial and non-spatial data:
      • Images {i_1, i_2, …, i_L}
      • Spatial regions {s_1, s_2, …, s_K}
      • Non-spatial variables {c_1, c_2, …, c_M}
    (Figure: example associations between images i_1, …, i_7 and non-spatial variables such as c_1, c_2, c_3, c_6, c_7, c_9.)
  • Example: fMRI contrast maps Control Patient
  • Applications
    • Medical Imaging, Bioinformatics, Geography, Meteorology, etc..
  • Voxel-based Analysis
    • No model on the image data
    • Each voxel’s changes analyzed independently - a map of statistical significance is built
    • Discriminatory significance measured by statistical tests (t-test, ranksum test, F-test, etc)
    • Statistical Parametric Mapping (SPM)
    • Significance of associations measured by chi-squared test, Fisher’s exact test (a contingency table for each pair of vars)
    • Cluster voxels by findings
    [V. Megalooikonomou, C. Davatzikos, E. Herskovits, SIGKDD 1999]
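The voxel-based analysis above can be sketched as follows: every voxel is tested independently across the two groups and a map of statistical significance is built. This is a minimal illustration on synthetic data, not the authors' code; the group sizes, volume shape, and planted effect are all assumptions, and the Welch t statistic is computed by hand with a rough 0.05 critical value for ~22 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups of 3D images (subjects x X x Y x Z), e.g. controls vs. patients.
controls = rng.normal(0.0, 1.0, size=(12, 8, 8, 8))
patients = rng.normal(0.0, 1.0, size=(12, 8, 8, 8))
patients[:, 2:4, 2:4, 2:4] += 2.0   # planted discriminative region

# Welch two-sample t statistic at every voxel (axis 0 = subjects).
m1, m2 = controls.mean(axis=0), patients.mean(axis=0)
v1, v2 = controls.var(axis=0, ddof=1), patients.var(axis=0, ddof=1)
t_map = (m2 - m1) / np.sqrt(v1 / 12 + v2 / 12)

# Threshold |t|; 2.07 is roughly the two-sided 0.05 critical value for ~22 df.
# Note the multiple-comparison problem: 8*8*8 = 512 tests, one per voxel.
significant = np.abs(t_map) > 2.07
print(significant.sum(), "of", t_map.size, "voxels flagged")
```

Even in this toy volume some of the ~500 null voxels pass the uncorrected threshold, which is exactly the multiple-comparison problem the slide mentions.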
  • Analysis by grouping of voxels
    • Grouping of voxels (atlas-based)
      • Prior knowledge increases sensitivity
      • Data reduction: 10^7 voxels → R regions (structures)
      • Map an ROI onto at least one region
      • Only as good as the atlas being used
    • M non-spatial variables, R regions
    • Analysis
    • Categorical structural variables
    • Continuous structural variables
    • M x R contingency tables, Chi-square/Fisher exact test
    • multiple comparison problem
    • log-linear analysis, multivariate Bayesian
    • Logistic regression, Mann-Whitney
  • Dynamic Recursive Partitioning
    • Adaptive partitioning of a 3D volume
    • Partitioning criterion:
      • discriminative power of feature(s) of hyper-rectangle and
      • size of hyper-rectangle
    • Extract features from discriminative regions
    • Reduce multiple comparison problem
      • (# tests = # partitions < # voxels)
    • tests are downward closed
    [V. Megalooikonomou, D. Pokrajac, A. Lazarevic, and Z. Obradovic, SPIE Conference on Visualization and Data Analysis, 2002]
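The partitioning scheme above can be sketched as follows. This is an illustration of the idea, not the authors' implementation: a 3D hyper-rectangle is split octree-style while its feature (here, the mean intensity per subject) is discriminative and the region is still large; one test is performed per partition, so #tests = #partitions < #voxels. The threshold, depth limit, and minimum side are illustrative assumptions.

```python
import numpy as np

def t_stat(a, b):
    """Welch two-sample t statistic on 1-D arrays of per-subject features."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return abs(a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

def drp(g1, g2, box, min_side=2, t_crit=2.07, depth=0, max_depth=3):
    """Return a list of discriminative boxes ((x0,x1),(y0,y1),(z0,z1))."""
    (x0, x1), (y0, y1), (z0, z1) = box
    # Feature of the hyper-rectangle: mean intensity per subject.
    f1 = g1[:, x0:x1, y0:y1, z0:z1].reshape(len(g1), -1).mean(axis=1)
    f2 = g2[:, x0:x1, y0:y1, z0:z1].reshape(len(g2), -1).mean(axis=1)
    if t_stat(f1, f2) <= t_crit:          # not discriminative: stop here
        return []
    if depth == max_depth or min(x1 - x0, y1 - y0, z1 - z0) <= min_side:
        return [box]                       # small enough: report it
    found = []
    mx, my, mz = (x0 + x1) // 2, (y0 + y1) // 2, (z0 + z1) // 2
    for xs in ((x0, mx), (mx, x1)):        # octree-style split into 8 children
        for ys in ((y0, my), (my, y1)):
            for zs in ((z0, mz), (mz, z1)):
                found += drp(g1, g2, (xs, ys, zs), min_side, t_crit,
                             depth + 1, max_depth)
    return found or [box]                  # discriminative but unlocalizable

rng = np.random.default_rng(1)
controls = rng.normal(size=(12, 8, 8, 8))
patients = rng.normal(size=(12, 8, 8, 8))
patients[:, 0:4, 0:4, 0:4] += 1.5          # planted effect in one octant
boxes = drp(controls, patients, ((0, 8), (0, 8), (0, 8)))
print(boxes)
```

The recursion visits only partitions whose parents were discriminative, which is the downward-closed property that keeps the number of tests far below the number of voxels.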
  • Other Methods for Spatial Data Classification
    • Distributional Distances:
      • - Mahalanobis distance
      • - Kullback-Leibler divergence (parametric, non-parametric)
    • Maximum Likelihood:
      • - Estimate probability densities and compute likelihood
        • EM (Expectation-Maximization) method to model spatial regions using some base function (Gaussian)
    • Static partitioning:
      • Reduction of the # of attributes as compared to voxel-wise analysis
      • Space partitioned into 3D hyper-rectangles (variables: properties of voxels inside hyper-rectangles) - incrementally increase discretization
    Distinguishing among distributions: D. Pokrajac, V. Megalooikonomou, A. Lazarevic, D. Kontos, Z. Obradovic, Artificial Intelligence in Medicine, Vol. 33, No. 3, pp. 261-280, Mar. 2005.
  • Experimental Results
    Areas discovered by DRP with t-test: significance threshold = 0.05, maximum tree depth = 3. Colorbar shows significance. [D. Kontos, V. Megalooikonomou, D. Pokrajac, A. Lazarevic, Z. Obradovic, O. B. Boyko, J. Ford, F. Makedon, A. J. Saykin, MICCAI 2004]
    Comparison of number of tests performed:
      Thresh. | Depth | DRP  | Voxel-wise
      0.05    | 3     | 569  | 201774
      0.05    | 4     | 4425 | 201774
      0.01    | 4     | 4665 | 201774
    Classification accuracy (%):
      Method                       | Criterion   | Threshold | Tree depth | Controls | Patients | Total
      DRP                          | correlation | 0.4       | 3          | 82       | 93       | 88
      DRP                          | t-test      | 0.05      | 3          | 89       | 100      | 94
      DRP                          | t-test      | 0.05      | 4          | 84       | 100      | 92
      DRP                          | t-test      | 0.01      | 4          | 87       | 100      | 93
      DRP                          | ranksum     | 0.05      | 3          | 87       | 100      | 93
      DRP                          | ranksum     | 0.05      | 4          | 80       | 100      | 90
      DRP                          | ranksum     | 0.01      | 4          | 87       | 96       | 91
      Kullback-Leibler / k-means   |             |           |            | 77       | 66       | 71
      Kullback-Leibler / EM        |             |           |            | 79       | 57       | 68
      Maximum Likelihood / k-means |             |           |            | 77       | 83       | 80
      Maximum Likelihood / EM      |             |           |            | 77       | 67       | 72
  • Experimental Results
    • Impact:
    • Assist in interpretation of images (e.g., facilitating diagnosis)
    • Enable researchers to integrate, manipulate and analyze large volumes of image data
    Discriminative sub-regions detected when applying (a) DRP and (b) voxel-wise analysis with ranksum test and significance threshold 0.05 to the real fMRI volume data
  • Time Sequence Analysis
    • Time series data abound in many applications …
    • Challenges:
      • High dimensionality
      • Large number of sequences
      • Similarity metric definition
    • Similarity analysis (e.g., find stocks similar to that of IBM)
    • Goals: high accuracy, (high speed) in similarity searches among time series and in discovering interesting patterns
    • Applications: clustering, classification, similarity searches, summarization
    Time Sequence: a sequence (ordered collection) of real values: X = x_1, x_2, …, x_n
  • Dimensionality Reduction Techniques
        • DFT: Discrete Fourier Transform
        • DWT: Discrete Wavelet Transform
        • SVD: Singular Value Decomposition
        • APCA: Adaptive Piecewise Constant Approximation
        • PAA: Piecewise Aggregate Approximation
        • SAX: Symbolic Aggregate approXimation
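One of the listed techniques, PAA, is simple enough to sketch directly: the series is cut into equal-length segments and each segment is replaced by its mean. This is a minimal stand-alone illustration, not taken from any of the cited papers.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    n = len(series)
    out = []
    for i in range(n_segments):
        lo = i * n // n_segments          # segment boundaries by integer math
        hi = (i + 1) * n // n_segments
        seg = series[lo:hi]
        out.append(sum(seg) / len(seg))
    return out

x = [1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 0.0, 0.0]
print(paa(x, 4))   # -> [1.5, 3.5, 10.0, 0.0]
```

An n-point series becomes an n'-point one, which is exactly the dimensionality reduction the slide is after.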
  • Similarity distances for time series A more intuitive idea: two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar (Agrawal et al. VLDB, 1995)
      • Euclidean Distance:
        • most common, sensitive to shifts
      • Envelope-based DTW:
        • faster: O(n)
      • Dynamic Time Warping:
        • more accurate but slow: O(n^2)
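The O(n^2) cost of Dynamic Time Warping comes from its dynamic-programming table, which a short sketch makes concrete (squared-difference local cost, unconstrained warping path; the envelope-based variant mentioned above prunes this table to reach O(n)).

```python
def dtw(a, b):
    """Classic dynamic-programming DTW distance between two sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# A shifted copy of a series: Euclidean distance is large, DTW distance is 0.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
print(dtw(a, b))   # -> 0.0
```

The example shows why DTW is less sensitive to shifts than the Euclidean distance: the warping path absorbs the one-step offset at zero cost.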
  • Partitioning – Piecewise Constant Approximations: original time series (n points); Piecewise Constant Approximation (PCA) or Piecewise Aggregate Approximation (PAA) [Yi and Faloutsos '00, Keogh et al. '00] (n' segments); Adaptive Piecewise Constant Approximation (APCA) [Keogh et al. '01] (n'' segments)
  • Multiresolution Vector Quantized approximation (MVQ): partitions a sequence into equal-length segments and uses VQ to represent each sequence by appearance frequencies of key subsequences
    1) Uses a 'vocabulary' of subsequences (codebook) – training is involved
    2) Takes multiple resolutions into account – keeps both local and global information
    3) Unlike wavelets, partially ignores the ordering of 'codewords'
    4) Can exploit prior knowledge about the data
    5) Employs a new distance metric
    [V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos, ICDE 2005]
  • Methodology
    (Figure: codebook generation with s = 16 codewords, series transformation, and series encoding; each series becomes a string of codewords and a vector of codeword frequencies.)
  • Methodology
      • Creating a 'vocabulary': frequently appearing patterns in subsequences
      • Q: How to create it? A: Use Vector Quantization, in particular the Generalized Lloyd Algorithm (GLA)
      • Output: a codebook with s codewords
      • Representing time series: X = x_1, x_2, …, x_n is encoded with a new representation f = (f_1, f_2, …, f_s), where f_i is the frequency of the i-th codeword in X
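The codebook step can be sketched as follows. The Generalized Lloyd Algorithm is essentially k-means on fixed-length subsequences, so a tiny deterministic k-means stands in for it here; the codebook size, segment length, and toy training set are all illustrative assumptions, not values from the paper.

```python
def gla(subseqs, s, iters=20):
    """Train a codebook of s codewords (centroids) on fixed-length subsequences."""
    # Deterministic initialization: s items spread evenly through the data.
    codebook = [list(subseqs[i * len(subseqs) // s]) for i in range(s)]
    for _ in range(iters):
        buckets = [[] for _ in range(s)]
        for v in subseqs:                      # assign to nearest codeword
            k = min(range(s), key=lambda i: sum((a - b) ** 2
                    for a, b in zip(v, codebook[i])))
            buckets[k].append(v)
        for i, bucket in enumerate(buckets):   # recompute centroids
            if bucket:
                codebook[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook

def encode(series, codebook, seg_len):
    """Frequencies f = (f_1, ..., f_s) of the nearest codeword per segment."""
    f = [0] * len(codebook)
    for lo in range(0, len(series) - seg_len + 1, seg_len):
        seg = series[lo:lo + seg_len]
        k = min(range(len(codebook)), key=lambda i: sum((a - b) ** 2
                for a, b in zip(seg, codebook[i])))
        f[k] += 1
    return f

# Toy training set: flat vs. rising 4-point subsequences.
train = [[0, 0, 0, 0]] * 6 + [[0, 1, 2, 3]] * 6
cb = gla(train, s=2)
print(encode([0, 0, 0, 0, 0, 1, 2, 3], cb, seg_len=4))   # -> [1, 1]
```

The encoded vector counts how often each key subsequence appears, which is the frequency representation f the slide describes.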
  • Methodology
    New distance metric: the histogram model is used to calculate similarity at each resolution level, comparing the codeword frequencies f_{i,t} and f_{i,q} of the two series over codewords i = 1, 2, …, s
  • Methodology
    • Time series summarization:
    • High level information (frequently appearing patterns) is more useful
    • The new representation can provide this kind of information
    Both codeword (pattern) 3 & 5 show up 2 times
  • Methodology
    Problems of frequency-based encoding:
    • It is hard to define an appropriate resolution (codeword length)
    • It may lose global information
  • Methodology
    Solution: use multiple resolutions, which sidesteps committing to a single codeword length and preserves global as well as local information
  • Methodology
    Proposed distance metric: a weighted sum of the similarities at all resolution levels, sim(T, Q) = Σ_{i=1}^{c} w_i · sim_i(T, Q), where sim_i(T, Q) is the similarity at level i and c is the number of resolution levels
    • Lacking any prior knowledge, assigning equal weights to all resolution levels works well most of the time
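The multi-resolution combination can be sketched as below. The per-level similarity here is a simple histogram intersection over codeword frequencies, a stand-in consistent with the histogram model described above but not necessarily the paper's exact formula; equal default weights follow the slide's recommendation.

```python
def level_similarity(f_t, f_q):
    """Histogram intersection of two codeword-frequency vectors (in [0, 1])."""
    inter = sum(min(a, b) for a, b in zip(f_t, f_q))
    total = max(sum(f_t), sum(f_q))
    return inter / total if total else 1.0

def mvq_similarity(levels_t, levels_q, weights=None):
    """Weighted sum of per-level similarities over c resolution levels."""
    c = len(levels_t)
    if weights is None:
        weights = [1.0 / c] * c            # no prior knowledge: equal weights
    return sum(w * level_similarity(ft, fq)
               for w, ft, fq in zip(weights, levels_t, levels_q))

# Two series that agree at the coarse level but differ at the fine level.
t = [[2, 0], [1, 1, 0, 0]]
q = [[2, 0], [0, 0, 1, 1]]
print(mvq_similarity(t, q))   # -> 0.5 (level 1: 1.0, level 2: 0.0)
```

The example shows why multiple levels matter: the coarse level alone would call the two series identical, while the fine level alone would call them completely different.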
  • MVQ: Example of Codebooks
    • Codebook for the first level
    • Codebook for the second level (more codewords since there are more details)
  • Experiments Datasets
    • SYNDATA (control chart data): synthetic
    • CAMMOUSE: 3×5 sequences obtained using the Camera Mouse program
    • RTT: round-trip time measurements from UCR to CMU with a sending rate of 50 msec for a day
  • Experiments
    • Best Match Searching:
    Matching accuracy: percentage of the k nearest neighbors (found by the different approaches) that are in the same class as the query
  • Experiments
    • Best Match Searching
    SYNDATA:
      Method          | Weight Vector | Accuracy
      Single-level VQ | [1 0 0 0 0]   | 0.55
      Single-level VQ | [0 1 0 0 0]   | 0.70
      Single-level VQ | [0 0 1 0 0]   | 0.65
      Single-level VQ | [0 0 0 1 0]   | 0.48
      Single-level VQ | [0 0 0 0 1]   | 0.46
      MVQ             | [1 1 1 1 1]   | 0.83
      Euclidean       |               | 0.51
    CAMMOUSE:
      Method          | Weight Vector | Accuracy
      Single-level VQ | [1 0 0 0 0]   | 0.56
      Single-level VQ | [0 1 0 0 0]   | 0.60
      Single-level VQ | [0 0 1 0 0]   | 0.44
      Single-level VQ | [0 0 0 1 0]   | 0.56
      Single-level VQ | [0 0 0 0 1]   | 0.60
      MVQ             | [1 1 1 1 1]   | 0.83
      Euclidean       |               | 0.58
  • Experiments
    • Best Match Searching
    Precision-recall curves for the different methods on (a) the SYNDATA dataset and (b) the CAMMOUSE dataset
  • Experiments
    • Clustering experiments
    Given two clusterings G = G_1, G_2, …, G_K (the true clusters) and A = A_1, A_2, …, A_K (the clustering result of a certain method), clustering accuracy is evaluated with the cluster similarity Sim(G, A) = (1/K) Σ_i max_j sim(G_i, A_j), where sim(G_i, A_j) = 2|G_i ∩ A_j| / (|G_i| + |A_j|) [Gavrilov, M., Anguelov, D., Indyk, P. and Motwani, R., KDD 2000]
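This cluster-similarity measure is short enough to sketch directly; the clusterings below are toy examples, not the experimental data.

```python
def cluster_similarity(G, A):
    """Sim(G, A) = (1/K) * sum_i max_j 2|G_i ∩ A_j| / (|G_i| + |A_j|),
    for clusterings given as lists of sets."""
    def sim(g, a):
        return 2 * len(g & a) / (len(g) + len(a))
    return sum(max(sim(g, a) for a in A) for g in G) / len(G)

G = [{1, 2, 3}, {4, 5, 6}]     # true clusters
A = [{1, 2}, {3, 4, 5, 6}]     # clustering result: item 3 misplaced
print(round(cluster_similarity(G, A), 3))
```

The measure is 1.0 when the two clusterings match exactly and decreases as items land in the wrong clusters.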
  • Experiments
    • Clustering experiments
    SYNDATA:
      Method          | Weight Vector | Accuracy
      Single-level VQ | [1 0 0 0 0]   | 0.69
      Single-level VQ | [0 1 0 0 0]   | 0.71
      Single-level VQ | [0 0 1 0 0]   | 0.63
      Single-level VQ | [0 0 0 1 0]   | 0.51
      Single-level VQ | [0 0 0 0 1]   | 0.49
      MVQ             | [1 1 1 1 1]   | 0.82
      DFT             |               | 0.67
      SAX             |               | 0.65
      DTW             |               | 0.80
      Euclidean       |               | 0.55
    RTT:
      Method          | Weight Vector | Accuracy
      Single-level VQ | [1 0 0 0 0]   | 0.55
      Single-level VQ | [0 1 0 0 0]   | 0.52
      Single-level VQ | [0 0 1 0 0]   | 0.57
      Single-level VQ | [0 0 0 1 0]   | 0.80
      Single-level VQ | [0 0 0 0 1]   | 0.79
      MVQ             | [0 0 0 1 1]   | 0.81
      DFT             |               | 0.54
      SAX             |               | 0.54
      DTW             |               | 0.62
      Euclidean       |               | 0.50
  • Experiments Summarization (SYNDATA) Typical series:
  • MVQ: Example: Two Time Series
    • Given two time series t1 and t2 (shown at the first and second resolution levels):
    • In the first level, both are encoded with the same codeword (3), so they are not distinguishable
    • In the second level, more details are recorded and the two series have different encoded forms: the first series is encoded with codewords 1 and 4, the second with codewords 9 and 12
  • Applying time series techniques to spatial data
    • Hilbert Space Filling Curve
    • Binning
    • Statistical tests of significance on groups of points
    • Identification of discriminative areas by back-projection
    Analysis of images by projection to 1D: (a) linear mapping of a 3D fMRI scan, (b) effect of binning, representing each bin with its mean measurement, (c) the discriminative voxels after applying the t-test with θ = 0.05 [D. Kontos, V. Megalooikonomou, N. Ghubade, and C. Faloutsos, IEEE Engineering in Medicine and Biology Society (EMBS), 2003]
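The projection-and-binning step can be sketched as follows. Note the assumption: plain row-major flattening stands in for the Hilbert space-filling curve (which preserves spatial locality much better); the toy volume and bin size are also illustrative.

```python
import numpy as np

volume = np.arange(64, dtype=float).reshape(4, 4, 4)   # toy 3D "scan"

# 3D -> 1D mapping (row-major flattening as a stand-in for the Hilbert curve).
sequence = volume.ravel()

# Binning: 16 bins of 4 consecutive voxels, each represented by its mean,
# after which time-series techniques apply to the binned 1-D sequence.
bins = sequence.reshape(16, 4).mean(axis=1)
print(bins[:4])
```

With groups of such binned sequences, the per-bin statistical tests and back-projection to 3D proceed exactly as in the voxel-wise case, but over far fewer tests.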
  • Applying time series techniques
    Areas discovered: (a) θ = 0.05, (b) θ = 0.01. The colorbar shows significance.
    • Variation: concatenate the values of statistically significant areas into spatial sequences
    • Pattern analysis using the similarity between spatial sequences and time sequences
      • SVD, DFT, DWT, PCA (clustering accuracy: 89-100%)
    Results: 87%-98% classification accuracy (t-test, CATX) [Q. Wang, D. Kontos, G. Li and V. Megalooikonomou, ICASSP 2004]
  • Conclusions
    • ‘Find patterns/interesting things’ efficiently and robustly in spatial and temporal data
    • Use of partitioning and clustering
    • Analysis at multiple resolutions
    • Reduction of the number of tests performed
    • Intelligent exploration of the space to find discriminative areas
    • Reduction of dimensionality
    • Symbolic representation
    • Nice summarization
  • Collaborators
    • Faculty:
    • Zoran Obradovic
    • Orest Boyko
    • James Gee
    • Andrew Saykin
    • Christos Faloutsos
    • Christos Davatzikos
    • Edward Herskovits
    • Fillia Makedon
    • Dragoljub Pokrajac
    • Students:
    • Despina Kontos
    • Qiang Wang
    • Guo Li
    • Others:
    • James Ford
    • Aleksandar Lazarevic
  • Thank you!
    • Acknowledgements
    • This research has been funded by:
      • National Science Foundation CAREER award 0237921
      • National Science Foundation Grant 0083423
      • National Institutes of Health Grant R01 MH68066 funded by NIMH, NINDS, and NIA