Mining Group Correlations over Data Streams 2011/12/02 Publication/ICCSE 2011 Presenter/Yuan-Chung Chang
Mining group correlations over data streams

  1. 1. Mining Group Correlations over Data Streams 2011/12/02 Publication/ICCSE 2011 Presenter/Yuan-Chung Chang
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Related Work </li></ul><ul><li>Fundamental Theory </li></ul><ul><li>MGDS algorithms </li></ul><ul><li>Experiment </li></ul><ul><li>Conclusions </li></ul>
  3. 3. Introduction <ul><li>Mining correlations over data streams has attracted considerable attention recently. </li></ul><ul><li>Work on group correlation analysis over data streams is relatively scarce. </li></ul><ul><li>The existing literature focuses mainly on a single time window and incurs high space and time complexity. </li></ul><ul><li>This paper proposes an online canonical correlation analysis algorithm called MGDS (Mining Group Data Streams). </li></ul><ul><ul><li>The MGDS algorithm dynamically maintains a few statistics derived from the raw data and calculates correlations from them. </li></ul></ul>
  4. 4. Related Work <ul><li>The correlation analysis of multidimensional or multiple data streams </li></ul><ul><ul><li>StreamSVD algorithm (2003) </li></ul></ul><ul><ul><ul><li>StreamSVD samples the observed values based on low-rank approximation theory and uses SVD to analyze the correlation of streams. </li></ul></ul></ul><ul><ul><li>StreamCCA algorithm (2006) </li></ul></ul><ul><ul><ul><li>StreamCCA applies canonical correlation analysis (CCA) from classical statistical theory to data streams. </li></ul></ul></ul><ul><li>StreamSVD and StreamCCA both need to keep the entire history of stream values and cannot compute the correlation over a changing time range. </li></ul>
  5. 5. Fundamental Theory (1/7) <ul><li>Definition </li></ul><ul><ul><li>Correlation coefficients </li></ul></ul><ul><ul><ul><li>$x_{ji}$ and $y_{jk}$ denote the $j$th values of the $i$th and $k$th data streams $X_i$ and $Y_k$, respectively. </li></ul></ul></ul>
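The formula on this slide did not survive extraction. Assuming the slide used the standard sample (Pearson) correlation coefficient, with $n$ observations per stream and sample means $\bar{x}_i$ and $\bar{y}_k$ (symbols introduced here for illustration), it would read:

```latex
\rho(X_i, Y_k) =
  \frac{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(y_{jk} - \bar{y}_k)}
       {\sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\,
        \sqrt{\sum_{j=1}^{n} (y_{jk} - \bar{y}_k)^2}},
\qquad
\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ji},
\quad
\bar{y}_k = \frac{1}{n}\sum_{j=1}^{n} y_{jk}
```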
  6. 6. Fundamental Theory (2/7) <ul><li>Definition </li></ul><ul><ul><li>Multidimensional data stream </li></ul></ul><ul><ul><ul><li>A multidimensional data stream can be viewed as a mapping of one-dimensional data streams. </li></ul></ul></ul><ul><ul><ul><li>For example, if at time $j$ the values of the $N$ streams are $x_{j1}, x_{j2}, \ldots, x_{ji}, \ldots, x_{jN}$, then the value of the corresponding multidimensional data stream is $[x_{j1}, x_{j2}, \ldots, x_{ji}, \ldots, x_{jN}]$. </li></ul></ul></ul>
  7. 7. Fundamental Theory (3/7) <ul><li>Definition </li></ul><ul><ul><li>Base window </li></ul></ul><ul><ul><ul><li>Suppose there are $N$ streams and the current time is $t$. Within a time window of length $w$, the observed values $[x_{ti}, \ldots, x_{(t+w-1)i}]$ $(1 \le i \le N)$ of the $N$ streams constitute a base window. </li></ul></ul></ul><ul><ul><li>Correlation query window </li></ul></ul><ul><ul><ul><li>A set of successive base windows </li></ul></ul></ul>
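The two window definitions can be illustrated with a small sketch. It assumes the stream is stored as a list of rows (one row per time step, one column per stream) and that a trailing partial window is simply dropped. The function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def split_into_base_windows(rows: Sequence[Sequence[float]], w: int) -> List[List[Sequence[float]]]:
    """Cut a multidimensional stream into consecutive base windows of w rows.

    Assumption: a trailing window with fewer than w rows is dropped.
    """
    if w <= 0:
        raise ValueError("window length w must be positive")
    full = len(rows) // w
    return [list(rows[k * w:(k + 1) * w]) for k in range(full)]


def query_window(base_windows: List[list], last: int, count: int) -> List[list]:
    """Return `count` successive base windows ending at index `last` (inclusive)."""
    start = last - count + 1
    if start < 0 or last >= len(base_windows):
        raise IndexError("query window falls outside the available base windows")
    return base_windows[start:last + 1]


# N = 2 streams observed over 10 time steps.
rows = [[float(j), float(2 * j)] for j in range(10)]
windows = split_into_base_windows(rows, w=3)
recent = query_window(windows, last=2, count=2)
```

Here `recent` is a correlation query window made of the two most recent complete base windows.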
  8. 8. Fundamental Theory (4/7) <ul><li>Principle of CCA </li></ul><ul><ul><li>Canonical correlation analysis is a way of measuring the linear relationship between two sets of data streams. </li></ul></ul><ul><ul><ul><li>The canonical variable $U_i$, generated from $X$, represents most of the information in $X$. </li></ul></ul></ul><ul><ul><ul><li>The canonical variable $V_i$, generated from $Y$, represents most of the information in $Y$. </li></ul></ul></ul><ul><ul><ul><li>$a_i^T$ and $b_i^T$ are linear transformations that give the weights of the different dimensions of $U_i$ and $V_i$ in the correlation. </li></ul></ul></ul>
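The formulas on this slide were lost in extraction. Under the standard definition of CCA, and writing $\Sigma_{11}$, $\Sigma_{22}$ and $\Sigma_{12}$ for $\mathrm{Cov}(X)$, $\mathrm{Cov}(Y)$ and $\mathrm{Cov}(X, Y)$ (notation assumed here), the canonical variables and the correlation they maximize can be written as:

```latex
U_i = a_i^T X, \qquad V_i = b_i^T Y, \qquad
\mathrm{Corr}(U_i, V_i) =
  \frac{a_i^T \Sigma_{12}\, b_i}
       {\sqrt{a_i^T \Sigma_{11}\, a_i}\,\sqrt{b_i^T \Sigma_{22}\, b_i}}
```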
  9. 9. Fundamental Theory (5/7) <ul><li>Principle of CCA </li></ul><ul><ul><li>Theorem 1 </li></ul></ul><ul><ul><ul><li>Suppose $p \le q$ and let the random vectors $X_{p \times 1}$ and $Y_{q \times 1}$ have $\mathrm{Cov}(X) = \Sigma_{11}$ $(p \times p)$, $\mathrm{Cov}(Y) = \Sigma_{22}$ $(q \times q)$ and $\mathrm{Cov}(X, Y) = \Sigma_{12}$ $(p \times q)$, where $\Sigma$ has full rank. </li></ul></ul></ul><ul><ul><ul><li>For coefficient vectors $a_{p \times 1}$ and $b_{q \times 1}$, form the linear combinations $U = a^T X$ and $V = b^T Y$. </li></ul></ul></ul><ul><ul><ul><li>The first canonical variate pair: $\max \mathrm{Corr}(U_1, V_1) = \rho_1$, where $U_1 = e_1^T \Sigma_{11}^{-1/2} X$ and $V_1 = f_1^T \Sigma_{22}^{-1/2} Y$. </li></ul></ul></ul><ul><ul><ul><li>The $k$th pair of canonical variates: $\max \mathrm{Corr}(U_k, V_k) = \rho_k$, where $U_k = e_k^T \Sigma_{11}^{-1/2} X$ and $V_k = f_k^T \Sigma_{22}^{-1/2} Y$. </li></ul></ul></ul>
  10. 10. Fundamental Theory (6/7) <ul><li>Principle of CCA </li></ul><ul><ul><li>Theorem 1 </li></ul></ul><ul><ul><ul><li>Here $\rho_1^2 \ge \rho_2^2 \ge \cdots \ge \rho_p^2$ are the eigenvalues of $\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$, and $e_1, e_2, \ldots, e_p$ are the associated $(p \times 1)$ eigenvectors. </li></ul></ul></ul><ul><ul><li>If $S$ is the sample covariance of the normalized observed values of the streams, then to obtain the correlation between two sets of data streams and identify the leading data streams in the correlation analysis, $\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$ in Theorem 1 is replaced by $S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}$, which makes the correlation analysis possible. </li></ul></ul>
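The sample version of Theorem 1 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names and the synthetic data are hypothetical, the covariance is computed directly from raw values, and the matrix is symmetrized before the eigendecomposition only to suppress rounding noise. Because canonical correlations do not change when individual streams are rescaled, the sketch skips the normalization step.

```python
import numpy as np


def inv_sqrt_spd(m: np.ndarray) -> np.ndarray:
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def canonical_correlations(x: np.ndarray, y: np.ndarray):
    """Sample canonical correlations of x (n x p) and y (n x q), with p <= q.

    Returns (rho, a): rho holds the canonical correlations in descending
    order, and column k of a is the coefficient vector of U_k = a_k^T X.
    """
    p = x.shape[1]
    s = np.cov(np.hstack([x, y]), rowvar=False)  # joint sample covariance S
    s11, s12 = s[:p, :p], s[:p, p:]
    s21, s22 = s[p:, :p], s[p:, p:]
    s11_inv_sqrt = inv_sqrt_spd(s11)
    # S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2}
    m = s11_inv_sqrt @ s12 @ np.linalg.solve(s22, s21) @ s11_inv_sqrt
    m = (m + m.T) / 2.0  # remove asymmetry caused by rounding
    vals, vecs = np.linalg.eigh(m)  # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]
    rho = np.sqrt(np.clip(vals[order], 0.0, 1.0))
    a = s11_inv_sqrt @ vecs[:, order]  # a_k = S11^{-1/2} e_k
    return rho, a


# Synthetic example: the first column of y is a noisy copy of the first column of x.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
y = np.hstack([x[:, :1] + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 2))])
rho, a = canonical_correlations(x, y)
u1 = x @ a[:, 0]  # first canonical variable of the X group
```

The entries of `a[:, 0]` with the largest magnitude indicate which streams of the X group lead the first canonical correlation.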
  11. 11. Fundamental Theory (7/7) <ul><li>Principle of CCA </li></ul><ul><ul><li>We rewrite the sample covariance of any two data streams in terms of the statistics that need to be kept, as follows. </li></ul></ul>
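The slide's formulas did not survive extraction, so the following is only a sketch of the general idea. It assumes that each base window is summarized by its count, the per-stream sums and the pairwise sums of products, and it uses the identity $s_{ik} = \frac{1}{n-1}\left(\sum_j x_{ji} x_{jk} - \frac{1}{n}\sum_j x_{ji} \sum_j x_{jk}\right)$ to recover the sample covariance over a query window from merged summaries. The class and function names are hypothetical, and the exact statistics kept by MGDS may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BaseStats:
    n: int                    # number of rows summarized
    sums: List[float]         # per-stream sums
    cross: List[List[float]]  # pairwise sums of products


def base_stats(window: Sequence[Sequence[float]]) -> BaseStats:
    """Summarize one base window; the raw rows can be discarded afterwards."""
    dims = len(window[0])
    sums = [0.0] * dims
    cross = [[0.0] * dims for _ in range(dims)]
    for row in window:
        for i in range(dims):
            sums[i] += row[i]
            for k in range(dims):
                cross[i][k] += row[i] * row[k]
    return BaseStats(len(window), sums, cross)


def merge(stats: Sequence[BaseStats]) -> BaseStats:
    """Combine the summaries of the base windows inside a query window."""
    dims = len(stats[0].sums)
    total = BaseStats(0, [0.0] * dims, [[0.0] * dims for _ in range(dims)])
    for s in stats:
        total.n += s.n
        for i in range(dims):
            total.sums[i] += s.sums[i]
            for k in range(dims):
                total.cross[i][k] += s.cross[i][k]
    return total


def covariance(total: BaseStats) -> List[List[float]]:
    """Sample covariance matrix recovered from the merged statistics alone."""
    n = total.n
    dims = len(total.sums)
    return [[(total.cross[i][k] - total.sums[i] * total.sums[k] / n) / (n - 1)
             for k in range(dims)] for i in range(dims)]


window1 = [[1.0, 2.0], [2.0, 4.0]]
window2 = [[3.0, 6.0], [4.0, 8.0]]
merged = merge([base_stats(window1), base_stats(window2)])
cov = covariance(merged)
```

Because merging only adds summaries, a query window of any length can be answered without revisiting the raw values.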
  12. 12. MGDS algorithms
  13. 13. Algorithm 1: Generate base statistic
  14. 14. Algorithm 2: Analysis algorithm
  15. 15. Experiment (1/6) <ul><li>Computer platform </li></ul><ul><ul><li>Intel(R) Core(TM)2 Quad CPU Q8400 / 2.66 GHz / 3G / 250G </li></ul></ul><ul><ul><li>The OS is Windows XP SP3. </li></ul></ul><ul><ul><li>MATLAB 2007b is used to run the programs and generate the synthetic data sets. </li></ul></ul>
  16. 16. Experiment (2/6) <ul><li>Data set </li></ul><ul><ul><li>Linear data set with noise </li></ul></ul><ul><ul><ul><li>The values of every stream are taken from linearly increasing data on the interval [0, 50000], with random noise drawn from $N(0, 3^2)$ added. </li></ul></ul></ul><ul><ul><ul><li>We expect a high correlation. </li></ul></ul></ul><ul><ul><li>Gauss data set </li></ul></ul><ul><ul><ul><li>The values of the streams in groups X and Y follow $N(50, 15^2)$ and $N(100, 25^2)$, respectively. </li></ul></ul></ul><ul><ul><ul><li>We expect a low correlation. </li></ul></ul></ul><ul><ul><li>Real stock data set </li></ul></ul><ul><ul><ul><li>Data for 15 stocks from Shenzhen Securities and Shanghai Securities, from Jan. 2005 to Dec. 2010. </li></ul></ul></ul><ul><ul><ul><li>We expect a very high correlation. </li></ul></ul></ul>
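The two synthetic data sets can be approximated as follows. The original experiments used Matlab, so this standard-library Python sketch is only an illustration: the stream length of 1000, the fixed seed and the helper names are assumptions made here.

```python
import math
import random
from typing import List


def linear_stream(length: int, rng: random.Random, lo: float = 0.0,
                  hi: float = 50000.0, noise_sd: float = 3.0) -> List[float]:
    """Linearly increasing values on [lo, hi] plus N(0, noise_sd^2) noise."""
    step = (hi - lo) / (length - 1)
    return [lo + j * step + rng.gauss(0.0, noise_sd) for j in range(length)]


def gauss_stream(length: int, rng: random.Random, mean: float, sd: float) -> List[float]:
    """Independent draws from N(mean, sd^2)."""
    return [rng.gauss(mean, sd) for _ in range(length)]


def pearson(a: List[float], b: List[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length streams."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the example is reproducible
linear_a = linear_stream(1000, rng)
linear_b = linear_stream(1000, rng)
gauss_x = gauss_stream(1000, rng, 50.0, 15.0)    # a stream of group X
gauss_y = gauss_stream(1000, rng, 100.0, 25.0)   # a stream of group Y

r_linear = pearson(linear_a, linear_b)
r_gauss = pearson(gauss_x, gauss_y)
```

As expected from the slide, two linear streams are almost perfectly correlated because the noise is tiny relative to the trend, while independent Gaussian streams show a correlation close to zero.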
  17. 17. Experiment (3/6) <ul><li>The relationship between the number of streams and the time used per analysis </li></ul><ul><ul><li>We use the Gauss data set in this experiment to compare how the time of each correlation analysis under MGDS and under the naive algorithm varies with the number of streams ( p+q ). </li></ul></ul><ul><ul><li>The size of the base window is 500 values, the number of correlation query windows is 30 , and the numbers of streams are { p=40,60,80,100,120 ; q=60,90,120,150,180 }. </li></ul></ul><ul><ul><li>The naive algorithm calculates the high-level statistics from the raw values instead of from the base statistics. </li></ul></ul>
  18. 18. Experiment (4/6)
  19. 19. Experiment (5/6) <ul><li>The relationship between the size of the base window and the correlation coefficient </li></ul><ul><ul><li>We use the Gauss data set, the linear data set and the real stock data set to evaluate the effectiveness of MGDS as the size of the base window changes. </li></ul></ul><ul><ul><li>The number of streams is p=q=15 , the number of correlation query windows is 5 , and the size of the base window varies over { W=50,100,150,200,250 }. </li></ul></ul>
  20. 20. Experiment (6/6)
  21. 21. Conclusions <ul><li>This paper proposes the MGDS algorithm, which is based on base windows. </li></ul><ul><li>Unlike other algorithms, MGDS does not need to keep all stream values: it compresses the original values into statistics and bases the correlation analysis only on these statistics, which greatly reduces space and time complexity. </li></ul><ul><li>Unlike other algorithms, MGDS does not limit the correlation analysis range to a single window; the range can change flexibly depending on requirements, and the results of the MGDS algorithm are accurate. </li></ul>
  22. 22. <ul><li>Thank you for listening </li></ul><ul><li>Q & A </li></ul>