SVD Filtered Temporal Usage Clustering

SVD Filtered Temporal Usage
Pattern Analysis & Clustering
Liang Xie
SCSUG Educational Forum 2009
San Antonio, TX

Business Objective
• Provide a robust algorithm to cluster customers based on their temporal transactional data
• Issues:
  • Data
    • High dimensionality: 360 features, multi-million records
    • Need to capture amplitude at different resolutions
    • High volatility due to noise
    • Possible outliers
  • Algorithm
    • Robustness
    • Efficiency
    • Easy to implement in SAS!
• We chose an SVD-based algorithm
  • Successfully applied to gene-expression analysis by Alter et al. (PNAS, 2000)

SVD as a Filter
• SVD definition:
  • Singular Value Decomposition is a mathematical tool for decomposing a rectangular matrix
  • The left Eigenvector matrix U can be regarded as an input rotation matrix, Sigma is the scaling matrix, and the right Eigenvector matrix V is the output rotation matrix
  • SVD is similar to Fourier analysis
• Filter:
  • Each row of X is a linear combination of the right Eigenvectors
  • Each column of X is a linear combination of the left Eigenvectors
$X = U \Sigma V'$
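The two "linear combination" statements can be checked numerically. The following is a minimal sketch in Python with NumPy rather than SAS; the random matrix and all variable names are illustrative assumptions, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # 6 observations (rows), 4 features (columns)

# Thin SVD: X = U @ diag(s) @ Vt, where the rows of Vt are the right vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Row i of X is a linear combination of the right vectors,
# with weights U[i, :] * s
row0 = (U[0] * s) @ Vt

# Column j of X is a linear combination of the left vectors,
# with weights s * Vt[:, j]
col0 = U @ (s * Vt[:, 0])

# Filtering: drop a component by zeroing its singular value
s_filtered = s.copy()
s_filtered[0] = 0.0
X_filtered = (U * s_filtered) @ Vt
```

After zeroing the first singular value, `X_filtered` has no remaining component along the first right vector, which is what makes the decomposition usable as a filter.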

Relationship Between PCA and SVD
• SAS/STAT doesn’t explicitly support SVD
• We can coax SAS/STAT into doing SVD by linking one computation method of SVD to PCA
• SVD and PCA are essentially the same: SVD of the covariance matrix of the original data X is equivalent to PCA of X
• PCA on the non-centered covariance matrix of X is equivalent to SVD of X, with proper scaling
$\mathrm{SVD}(X'X) = V S V'$
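The link between the two decompositions follows in one line from the definition of the SVD. Writing the SVD of X as U Σ V' with orthonormal columns in U, and taking S to denote the eigenvalue matrix of X'X (a reading of the slide's notation, stated here as an assumption):

```latex
\begin{aligned}
X'X &= (U \Sigma V')'\,(U \Sigma V') \\
    &= V \Sigma\, U'U\, \Sigma V' \\
    &= V \Sigma^{2} V' , \qquad \text{so } S = \Sigma^{2}.
\end{aligned}
```

The eigenvectors of X'X are therefore the right Eigenvectors of X, and the singular values of X are the square roots of the eigenvalues of X'X.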

SVD in SAS/STAT
• We call PROC PRINCOMP to conduct SVD in SAS/STAT
• The uncorrected covariance matrix in PROC PRINCOMP is X’X/n, not X’X, so the square roots of its eigenvalues must be scaled by $\sqrt{n}$ to give the singular values of X
• PROC PRINCOMP NOINT COV SING=
  • ‘COV’ computes the principal components from the covariance matrix
  • ‘NOINT’ omits the intercept from the model
  • ‘SING=’ specifies the singularity criterion to ensure accuracy
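The scaling can be verified numerically. This is a minimal sketch in Python with NumPy, not the SAS code from the talk: it eigen-decomposes the non-centered covariance X'X/n and rescales by n to recover the singular values. The matrix size and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))

# Uncorrected (non-centered) covariance with divisor n
C = X.T @ X / n

# Eigen-decomposition of the symmetric matrix C; eigh returns ascending order,
# so reverse to get the largest eigenvalue first
eigvals, eigvecs = np.linalg.eigh(C)
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

# Recover the singular values of X: sigma_i = sqrt(n * lambda_i)
sigma_from_pca = np.sqrt(n * eigvals)

# Reference values straight from an SVD of X
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
```

The eigenvectors agree with the right Eigenvectors only up to sign, so any comparison should use absolute values of the inner products.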

Performance
• Accuracy
  • Tested the code on a Hilbert matrix
  • With ‘SING=1e-16’ specified, our results are comparable to those obtained from R and MATLAB
• Efficiency
  • Tested the code on an arbitrary rectangular matrix with 1.7 million rows and 400 columns
  • On a Core 2 Duo 1.86 GHz PC, SAS takes 7 min 56 sec to finish all data processing and computations; user CPU time is 5 min 52 sec
  • Note that the 32-bit Windows version of R is not able to handle data this big:
> X<-matrix(runif(1.7E6*400), ncol=400)
Error in runif(1700000 * 400) :
cannot allocate vector of length 680000000
• A multi-threaded/parallel SVD algorithm from SAS is highly desired!
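Both checks on this slide can be illustrated in a few lines. The sketch below (Python with NumPy, not the SAS code from the talk) compares the covariance route against a direct SVD on a small Hilbert matrix, then works out the memory needed for the 1.7 million by 400 matrix. The 5 x 5 size and all names are illustrative assumptions, and it does not reproduce the SAS, R, or MATLAB figures mentioned above.

```python
import numpy as np

def hilbert(n):
    """n x n Hilbert matrix: entry (i, j) is 1 / (i + j - 1), 1-based."""
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1)

# Accuracy: covariance route vs. direct SVD on an ill-conditioned matrix
H = hilbert(5)
n = H.shape[0]
sigma_direct = np.linalg.svd(H, compute_uv=False)
lam = np.linalg.eigvalsh(H.T @ H / n)[::-1]          # descending eigenvalues
sigma_via_cov = np.sqrt(np.clip(n * lam, 0.0, None))
rel_err = np.abs(sigma_via_cov - sigma_direct) / sigma_direct

# Memory: 1.7 million x 400 doubles vs. the 400 x 400 matrix X'X
rows, cols, bytes_per_double = 1_700_000, 400, 8
full_matrix_gb = rows * cols * bytes_per_double / 1e9
gram_matrix_mb = cols * cols * bytes_per_double / 1e6
```

Forming H'H squares the condition number, so the covariance route is least accurate for the smallest singular values, which is why a tight singularity criterion matters. The full matrix needs about 5.44 GB as doubles, more than the 4 GB a 32-bit process can address, while X'X itself is only about 1.28 MB.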

Temporal Usage Pattern Analysis
• Time-series usage data from customers for one year at 60-minute intervals
• Hourly usage data is normalized to:
  • Year total
  • Monthly total
• We want to identify segments with distinct usage patterns over one year, so that the marketing department can design customized messages for them
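As a rough illustration of the two normalizations, the sketch below divides each hourly value by the customer's year total and by the total of the month it falls in. It is written in Python with NumPy; the array layout (12 months of 30 days, matching the 360 daily features), the simulated values, and all names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical layout: (customers, months, days per month, hours per day)
usage = rng.gamma(shape=2.0, scale=1.0, size=(4, 12, 30, 24))

# Share of the year's total usage falling in each hour
year_total = usage.sum(axis=(1, 2, 3), keepdims=True)
share_of_year = usage / year_total

# Share of each month's total usage falling in each hour of that month
month_total = usage.sum(axis=(2, 3), keepdims=True)
share_of_month = usage / month_total

# Daily profile: sum over hours, giving 360 features per customer
daily_profile = share_of_year.sum(axis=3).reshape(usage.shape[0], -1)
```

Normalizing to totals removes the overall usage level, so customers are compared on the shape of their usage over time rather than on volume.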

Traditional Approach
• Direct k-means clustering using PROC FASTCLUS on all features
• Problems:
  • Not robust: sensitive to outliers
  • Ambiguity in choosing the optimal number of clusters a priori
  • High dimensionality affects the distance measure between each pair of observations:
    • In high-dimensional spaces, distances between points become relatively uniform
  • Combining the lack of robustness with high dimensionality, we can get segments occupied by only a few observations, which is usually not desired
  • The k-means clustering algorithm doesn’t take the time-series nature of the data into consideration; all features are treated as independent
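The claim that distances become relatively uniform in high dimensions can be seen with a small experiment. This minimal Python/NumPy sketch measures the relative contrast (max distance minus min distance, divided by min distance) from one query point to a cloud of uniformly distributed points; the point count, the uniform distribution, and the function name are illustrative assumptions.

```python
import numpy as np

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from one query point to the rest."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()

contrast_low = relative_contrast(dim=2)      # a low-dimensional reference
contrast_high = relative_contrast(dim=360)   # same width as the daily profile
```

In 360 dimensions the nearest and farthest points end up at nearly the same distance, so a distance-based method such as k-means has much less to work with than in 2 dimensions.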

Our Approach
• Apply SVD to the original data to obtain the Eigenvectors and singular values
• Remove the components associated with the first singular value (Low Pass Filtering)
• Apply SVD again to the SVD-filtered matrix
• Calculate the Pearson correlation of each observation with the right Eigenvectors obtained in the previous step
• Apply the k-means clustering algorithm to this matrix of correlation elements
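The steps above can be sketched end to end. This is a minimal Python/NumPy illustration under stated assumptions, not the author's SAS implementation: a plain Lloyd's k-means stands in for PROC FASTCLUS, the toy data (a shared baseline plus two seasonal shapes) is invented for the example, and every function name is hypothetical.

```python
import numpy as np

def svd_filter(X, drop=1):
    """Remove the components tied to the first `drop` singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = s.copy()
    s[:drop] = 0.0
    return (U * s) @ Vt

def correlation_elements(X, n_keep):
    """Pearson correlation of each row of X with the leading right vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:n_keep]
    Xc = X - X.mean(axis=1, keepdims=True)
    Vc = V - V.mean(axis=1, keepdims=True)
    num = Xc @ Vc.T
    den = np.linalg.norm(Xc, axis=1)[:, None] * np.linalg.norm(Vc, axis=1)[None, :]
    return num / den

def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; an assumed stand-in for PROC FASTCLUS."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels, centers

# Toy data: a shared baseline plus one of two seasonal shapes per group
rng = np.random.default_rng(4)
t = np.arange(12)
shape_a = np.sin(2 * np.pi * t / 12)
shape_b = np.cos(2 * np.pi * t / 12)
X = np.vstack(
    [5.0 + shape_a + 0.1 * rng.normal(size=12) for _ in range(20)]
    + [5.0 + shape_b + 0.1 * rng.normal(size=12) for _ in range(20)]
)

X_f = svd_filter(X, drop=1)                 # remove the first component
Z = correlation_elements(X_f, n_keep=2)     # second SVD, then correlations
labels, _ = kmeans(Z, k=2)                  # cluster the correlation elements
```

In this toy case the first component is dominated by the shared baseline, so removing it leaves the difference in seasonal shape as the main signal, and clustering happens in a 2-column correlation space instead of on all 12 raw features.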

Some Notes
• For a data matrix containing a 360-day profile, we only need a few of the correlation elements. We keep adding correlation elements until 85% of the variation in the data is accounted for
• To determine the optimal number of clusters, we applied the Bayesian Information Criterion (BIC). This measure is very robust and simple to calculate:
  • BIC = Distortion + (Num of Var) * log(Num of Obs) * K
  • Distortion = sum of the total variance of each cluster = sum of Distance from the PROC FASTCLUS output
• With hourly data, we split the analysis into two steps:
  • Daily level
  • Hourly level for a ‘typical day’ in a month
• Apply the SVD Filtered Clustering algorithm in each step
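The two selection rules on this slide are short enough to write out. The sketch below is in Python with NumPy and makes two assumptions: "variation accounted for" is measured by squared singular values, and the natural logarithm is used in the BIC penalty. The function names and the distortion numbers in the usage lines are made up purely for illustration.

```python
import numpy as np

def n_components_for(s, target=0.85):
    """Smallest number of leading components whose squared singular values
    account for at least `target` of the total variation."""
    share = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(share, target) + 1)

def bic(distortion, n_var, n_obs, k):
    """BIC = Distortion + (Num of Var) * log(Num of Obs) * K, as on the slide."""
    return distortion + n_var * np.log(n_obs) * k

# Usage with illustrative singular values (cumulative shares 0.60, 0.87, ...)
n_keep = n_components_for(np.array([3.0, 2.0, 1.0, 1.0]))

# Usage with made-up distortions for K = 2..5; pick the K with the lowest BIC
distortions = {2: 900.0, 3: 400.0, 4: 390.0, 5: 385.0}
scores = {k: bic(d, n_var=3, n_obs=1000, k=k) for k, d in distortions.items()}
best_k = min(scores, key=scores.get)
```

The penalty term grows linearly in K, so adding a cluster only pays off when it reduces the distortion by more than (Num of Var) * log(Num of Obs).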

Simulated Data
• We simulate data using the Heterogeneous Mixed Model of Verbeke
• High usage in Months B-D and Month H
• Some outliers were deliberately generated by adding abnormal ad-hoc error terms
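A simulation in this spirit might look like the sketch below. It is written in Python with NumPy and is only a loose illustration: two groups share a random-intercept structure but differ in their mean monthly profile, with one group peaking in months B-D and H. The group sizes, variances, outlier count, and names are all assumptions, and the code does not reproduce the actual model specification or data used in the talk.

```python
import numpy as np

rng = np.random.default_rng(5)
months = list("ABCDEFGHIJKL")
n_per_group = 50

# Hypothetical group mean profiles: flat vs. high usage in months B-D and H
flat = np.full(12, 1.0)
peaked = flat.copy()
peaked[[1, 2, 3, 7]] = 3.0          # months B, C, D and H

def simulate(mean_profile, n):
    """Group mean + customer-level random intercept + month-level noise."""
    intercept = rng.normal(0.0, 0.2, size=(n, 1))
    noise = rng.normal(0.0, 0.1, size=(n, 12))
    return mean_profile + intercept + noise

X = np.vstack([simulate(flat, n_per_group), simulate(peaked, n_per_group)])

# Deliberate outliers: add a large ad-hoc error term to a few customers
outlier_rows = rng.choice(len(X), size=3, replace=False)
X[outlier_rows] += rng.normal(0.0, 5.0, size=(3, 12))
```

A matrix like `X` is a convenient test bed for comparing clustering on raw features against clustering on SVD-filtered correlation elements.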

Clustering Result on Filtered Data

THANK YOU
• You can reach me at:
  • xie1978@yahoo.com
  • www.linkedin.com/liangxie
• My blog:
  • http://sas-programming.blogspot.com
