  • Slide notes: Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains. How can streaming data be summarized efficiently and incrementally?

    1. Beyond Streams and Graphs: Dynamic Tensor Analysis. Jimeng Sun, Christos Faloutsos, Dacheng Tao. Speaker: Jimeng Sun
    2. Motivation
       • Goal: incremental pattern discovery on streaming applications
         • Streams:
           • E1: Environmental sensor networks
           • E2: Cluster/data center monitoring
         • Graphs:
           • E3: Social network analysis
         • Tensors:
           • E4: Network forensics
           • E5: Financial auditing
           • E6: fMRI brain image analysis
       • How to summarize streaming data effectively and incrementally?
    3. E3: Social network analysis
       • Traditionally, people focus on static networks and find community structures
       • We plan to monitor the change of the community structure over time and identify abnormal individuals
    4. E4: Network forensics
       • Directional network flows
       • A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
         • 450 GB/hour with compression
       • Task: identify abnormal traffic patterns and find their cause
       [Figure: normal vs. abnormal traffic on source-destination matrices]
       Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
    5. Static data model
       • For a timestamp, the stream measurements can be modeled using a tensor
       • Dimension: a single stream
         • E.g., <Christos, "graph">
       • Mode: a group of dimensions of the same kind
         • E.g., Source, Destination, Port
       [Figure: a Source x Destination tensor at Time = 0]
    6. Static data model (cont.)
       • Tensor
         • Formally, X ∈ R^(N1 × N2 × … × NM)
         • A generalization of matrices
         • Represented as a multi-array, or data cube
       • Correspondence: 1st order = vector, 2nd order = matrix, 3rd order = 3-D array
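The multi-array view above can be sketched in a few lines of NumPy; the sizes and entries below are made up for illustration, not taken from the talk:

```python
import numpy as np

# A 3rd-order tensor (source x destination x port) for one timestamp,
# stored as a 3-D array.  Sizes and values are illustrative only.
X = np.zeros((4, 4, 2))
X[0, 1, 0] = 3.0   # e.g. 3 flows from source 0 to destination 1 on port 0
X[2, 3, 1] = 1.0

print(X.ndim)      # order of the tensor: 3
```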
    7. Dynamic data model (our focus)
       • Streams come with structure
         • (time, source, destination, port)
         • (time, author, keyword)
       [Figure: a Source x Destination tensor growing along the time mode]
    8. Dynamic data model (cont.)
       • Tensor streams
         • A sequence of Mth-order tensors ⟨X_1, …, X_n⟩, where n is increasing over time
       • Correspondence: 1st order = multiple streams, 2nd order = time-evolving graphs, 3rd order = sequences of 3-D arrays
       [Figure: an (author, keyword) tensor sequence over time]
    9. Dynamic tensor analysis
       [Figure: old tensors summarized into old core tensors via projection matrices U_Source and U_Destination; a new tensor arrives and updates them]
    10. Roadmap
       • Motivation and main ideas
       • Background and related work
       • Dynamic and streaming tensor analysis
       • Experiments
       • Conclusion
    11. Background: singular value decomposition (SVD)
       • SVD gives the best rank-k approximation in L2
       • PCA is an important application of SVD
       [Figure: A (m x n) factored as U Σ V^T, truncated to rank k]
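The best-rank-k property can be checked directly in NumPy (a generic sketch, not code from the talk; the matrix is random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation of A

# Eckart-Young: the L2/Frobenius residual equals the energy
# in the discarded singular values.
err = np.linalg.norm(A - A_k)
print(np.isclose(err, np.sqrt((s[k:] ** 2).sum())))   # True
```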
    12. Latent semantic indexing (LSI)
       • Singular vectors are useful for clustering and correlation detection
       [Figure: document-term matrix factored into document-concept, concept-association, and concept-term matrices; DB terms (query, cache) and DM terms (pattern, cluster, frequent) form separate concepts]
    13. Tensor operation: matricize X_(d)
       • Unfold a tensor into a matrix
       • Example: a 2x2x2 tensor with entries 1-8 unfolds into a 2x4 matrix
       (Acknowledgment to Tammy Kolda for this slide)
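A minimal sketch of the unfolding, using one common convention (the exact ordering of the flattened modes varies between papers, so the column order here is an implementation choice, not necessarily the slide's):

```python
import numpy as np

def matricize(X, mode):
    """Mode-d unfolding: rows indexed by `mode`, all other modes flattened."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

# The slide's 2x2x2 example with entries 1..8.
X = np.arange(1, 9).reshape(2, 2, 2)
print(matricize(X, 0))
# [[1 2 3 4]
#  [5 6 7 8]]
```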
    14. Tensor operation: mode-product
       • Multiply a tensor with a matrix along one mode
       [Figure: a (source, destination, port) tensor multiplied along the source mode, mapping sources to source groups]
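The mode-product can be sketched via the unfolding from the previous slide: every mode-d fiber of the tensor is mapped through the matrix. The sizes below (3 sources aggregated into 5 groups) are made up for illustration:

```python
import numpy as np

def mode_product(X, U, mode):
    """Multiply tensor X by matrix U along `mode` (X x_d U)."""
    Xd = np.moveaxis(X, mode, 0)                       # bring the mode to front
    Y = (U @ Xd.reshape(Xd.shape[0], -1))              # act on the unfolding
    Y = Y.reshape((U.shape[0],) + Xd.shape[1:])
    return np.moveaxis(Y, 0, mode)                     # restore the mode order

X = np.arange(24, dtype=float).reshape(3, 4, 2)
U = np.ones((5, 3))        # e.g. a grouping matrix over the source mode
Y = mode_product(X, U, 0)
print(Y.shape)             # (5, 4, 2)
```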
    15. Related work
       • Graph mining
         • Explorative: [Faloutsos04], [Kumar99], [Leskovec05], …
         • Algorithmic: [Yan05], [Cormode05], …
       • Stream mining: scan data once to identify patterns
         • Sampling: [Vitter85], [Gibbons98]
         • Sketches: [Indyk00], [Cormode03]
       • Multilinear analysis
         • Tensor decompositions: Tucker, PARAFAC, HOSVD
       • Low-rank approximation
         • PCA, SVD: orthogonal projections
       Our work sits at the intersection of these areas.
    16. Roadmap
       • Motivation and main ideas
       • Background and related work
       • Dynamic and streaming tensor analysis
       • Experiments
       • Conclusion
    17. Tensor analysis
       • Given a sequence of tensors X_1, …, X_n, find the projection matrices U_1, …, U_M such that the reconstruction error e is minimized
       • Note that this is a generalization of PCA when n is a constant
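The slide's formula is lost in this export; the following is a hedged reconstruction in the usual Tucker-style projection form, with ×_d the mode-d product and U_d the mode-d projection matrix:

```latex
e \;=\; \sum_{t=1}^{n} \left\| \mathcal{X}_t \;-\; \mathcal{X}_t
  \times_1 \left(U_1 U_1^{\mathsf{T}}\right)
  \times_2 \left(U_2 U_2^{\mathsf{T}}\right)
  \cdots
  \times_M \left(U_M U_M^{\mathsf{T}}\right) \right\|_F^2
```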
    18. Why do we care?
       • Anomaly detection
         • Reconstruction-error driven
         • Multiple resolutions
       • Multiway latent semantic indexing (LSI)
       [Figure: (author, keyword, time) example with authors such as Philip Yu and Michael Stonebraker and query/pattern keywords]
    19. 1st-order DTA: problem
       • Given x_1, …, x_n, where each x_i ∈ R^N, find U ∈ R^(N×R) such that the error e is small
       • Note that Y = XU
       [Figure: sensor streams X (n×N over time) projected onto indoor/outdoor patterns Y (n×R)]
    20. 1st-order DTA
       • Input: new data vector x ∈ R^N, old variance matrix C ∈ R^(N×N)
       • Output: new projection matrix U ∈ R^(N×R)
       • Algorithm:
         1. Update the variance matrix: C_new = x^T x + C
         2. Diagonalize: U Λ U^T = C_new
         3. Determine the rank R and return U
       Diagonalization has to be done for every new x!
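The three steps above can be sketched directly in NumPy. Here x is treated as a column vector, so the slide's x^T x update becomes an outer product; the data and sizes are made up for illustration:

```python
import numpy as np

def dta_1st_order(x, C, R):
    """One step of 1st-order DTA (a sketch of the slide's algorithm).
    x: new data vector (length N); C: old variance matrix (N x N);
    R: target rank.  Returns (U, C_new)."""
    C_new = C + np.outer(x, x)               # step 1: update variance matrix
    evals, evecs = np.linalg.eigh(C_new)     # step 2: diagonalize C_new
    order = np.argsort(evals)[::-1]          # largest eigenvalues first
    U = evecs[:, order[:R]]                  # step 3: keep a rank-R projection
    return U, C_new

N = 4
C = np.zeros((N, N))
rng = np.random.default_rng(1)
for _ in range(10):                          # feed a toy stream of vectors
    U, C = dta_1st_order(rng.standard_normal(N), C, R=2)
print(U.shape)   # (4, 2)
```

Note how the full eigendecomposition runs once per incoming vector, which is exactly the cost STA avoids on the next slide.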
    21. 1st-order STA
       • Adjust U smoothly as new data arrive, without diagonalization [VLDB05]
       • For each new point x:
         • Project onto the current line
         • Estimate the error
         • Rotate the line in the direction of the error, in proportion to its magnitude
       • For each new point x and for i = 1, …, k:
         • y_i := U_i^T x  (projection onto U_i)
         • d_i ← d_i + y_i^2  (energy ≈ i-th eigenvalue)
         • e_i := x − y_i U_i  (error)
         • U_i ← U_i + (1/d_i) y_i e_i  (update estimate)
         • x ← x − y_i U_i  (repeat with the remainder)
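The update rules above translate almost line-for-line into NumPy. The toy stream below (one dominant direction v plus noise) is an assumption for illustration; the first component should track v:

```python
import numpy as np

def sta_update(x, U, d):
    """One 1st-order STA step (SPIRIT-style): for each component, project,
    track energy, and rotate toward the error -- no diagonalization."""
    x = x.astype(float).copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x                   # y_i := U_i^T x (projection)
        d[i] += y * y                     # energy ~ i-th eigenvalue
        e = x - y * U[:, i]               # e_i := x - y_i U_i (error)
        U[:, i] += (y / d[i]) * e         # U_i <- U_i + (1/d_i) y_i e_i
        x -= y * U[:, i]                  # deflate: repeat with the remainder
    return U, d

rng = np.random.default_rng(0)
N, k = 5, 2
v = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # dominant direction of the toy stream
U = np.eye(N)[:, :k]
d = np.ones(k)
for _ in range(500):
    x = 5.0 * rng.standard_normal() * v + 0.1 * rng.standard_normal(N)
    U, d = sta_update(x, U, d)
print(abs(U[:, 0] @ v))   # close to 1: the first component tracks v
```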
    22. Mth-order DTA
       [Figure-only slide: the 1st-order update applied along every mode of the tensor]
    23. Mth-order DTA: complexity
       • Storage: O(Π N_i), i.e., the size of an input tensor at a single timestamp
       • Computation: Σ N_i^3 (or Σ N_i^2) for diagonalizing C, plus (Σ N_i) Π N_i for the matrix multiplication X_(d)^T X_(d)
       • For low-order tensors (< 3rd order), diagonalization is the main cost
       • For high-order tensors, matrix multiplication is the main cost
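Combining the unfolding and the 1st-order step gives a sketch of Mth-order DTA: one variance matrix per mode, updated from the mode-d unfolding and then diagonalized. The exact bookkeeping is assumed from the slides, and the sizes are made up:

```python
import numpy as np

def dta_mth_order(X, Cs, ranks):
    """One Mth-order DTA step (sketch): for each mode d, update that mode's
    variance matrix via the mode-d unfolding, then diagonalize it."""
    Us = []
    for d in range(X.ndim):
        Xd = np.moveaxis(X, d, 0).reshape(X.shape[d], -1)  # matricize X_(d)
        Cs[d] = Cs[d] + Xd @ Xd.T                          # update variance
        evals, evecs = np.linalg.eigh(Cs[d])               # diagonalize
        order = np.argsort(evals)[::-1]
        Us.append(evecs[:, order[:ranks[d]]])              # rank-R_d projection
    return Us, Cs

rng = np.random.default_rng(3)
shape, ranks = (4, 3, 2), (2, 2, 1)
Cs = [np.zeros((n, n)) for n in shape]
for _ in range(5):                          # a toy stream of 3rd-order tensors
    Us, Cs = dta_mth_order(rng.standard_normal(shape), Cs, ranks)
print([U.shape for U in Us])   # [(4, 2), (3, 2), (2, 1)]
```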
    24. Mth-order STA
       • Run 1st-order STA along each mode
       • Complexity:
         • Storage: O(Π N_i)
         • Computation: Σ R_i Π N_i, which is smaller than DTA's
       [Figure: the 1st-order update of U_1 from x, y_1, and e_1]
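A self-contained sketch of "run 1st-order STA along each mode": here each column of the mode-m unfolding is fed through that mode's 1st-order update. This per-column scheme is an assumption for illustration, not necessarily the paper's exact variant:

```python
import numpy as np

def sta_step(x, U, d):
    # 1st-order STA update on a single vector (same rules as slide 21).
    x = x.astype(float).copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x
        d[i] += y * y
        U[:, i] += (y / d[i]) * (x - y * U[:, i])
        x -= y * U[:, i]
    return U, d

def mth_order_sta(X, Us, ds):
    """One Mth-order STA step (sketch): for each mode m, feed every column
    of the mode-m unfolding to that mode's projection -- no diagonalization."""
    for m in range(X.ndim):
        Xm = np.moveaxis(X, m, 0).reshape(X.shape[m], -1)
        for col in Xm.T:
            Us[m], ds[m] = sta_step(col, Us[m], ds[m])
    return Us, ds

rng = np.random.default_rng(1)
shape, ranks = (4, 3, 2), (2, 2, 1)
Us = [np.eye(n)[:, :r] for n, r in zip(shape, ranks)]
ds = [np.ones(r) for r in ranks]
for _ in range(20):                         # a toy stream of 3rd-order tensors
    Us, ds = mth_order_sta(rng.standard_normal(shape), Us, ds)
print([U.shape for U in Us])   # [(4, 2), (3, 2), (2, 1)]
```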
    25. Roadmap
       • Motivation and main ideas
       • Background and related work
       • Dynamic and streaming tensor analysis
       • Experiments
       • Conclusion
    26. Experiments
       • Objectives
         • Computational efficiency
         • Accurate approximation
         • Real applications
           • Anomaly detection
           • Clustering
    27. Data set 1: network data
       • TCP flows collected at the CMU backbone
       • Raw data: 500 GB with compression
       • Construct 3rd-order tensors over hourly windows with <source, destination, port>
       • Each tensor: 500 x 500 x 100
       • 1200 timestamps (hours)
       [Figure: sparse data from 10 AM to 11 AM on 01/06/2005; values follow a power-law distribution]
    28. Data set 2: bibliographic data (DBLP)
       • Papers from the VLDB and KDD conferences
       • Construct 2nd-order tensors over yearly windows with <author, keyword>
       • Each tensor: 4584 x 3741
       • 11 timestamps (years)
    29. Computational cost
       • OTA is the offline tensor analysis
       • Performance metric: CPU time (sec)
       • Observations:
         • DTA and STA are orders of magnitude faster than OTA
         • The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time)
       [Figures: 3rd-order network tensor; 2nd-order DBLP tensor]
    30. Accuracy comparison
       • Performance metric: the ratio of reconstruction error between DTA/STA and OTA, with the error of OTA fixed at 20%
       • Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes
       [Figures: 3rd-order network tensor; 2nd-order DBLP tensor]
    31. Network anomaly detection
       • Reconstruction error gives an indication of anomalies
       • The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin)
       [Figure: reconstruction error over time for normal vs. abnormal traffic]
    32. Multiway LSI
       • Two groups are correctly identified: databases (DB) and data mining (DM)
       • People and concepts drift over time
       • DM (2004): keywords: streams, pattern, support, cluster, index, gener, queri; authors: Jiawei Han, Jian Pei, Philip S. Yu, Jianyong Wang, Charu C. Aggarwal
       • DB (2004): keywords: distribut, systems, view, storage, servic, process, cache; authors: Surajit Chaudhuri, Mitch Cherniack, Michael Stonebraker, Ugur Cetintemel
       • DB (1995): keywords: queri, parallel, optimization, concurr, objectorient; authors: Michael Carey, Michael Stonebraker, H. Jagadish, Hector Garcia-Molina
    33. Conclusion
       • The tensor stream is a general data model
       • DTA/STA incrementally decompose tensors into core tensors and projection matrices
       • The results of DTA/STA can be used in other applications
         • Anomaly detection
         • Multiway LSI
    34. Final word: think structurally!
       • The world is not flat; neither should data mining be.
       Contact: Jimeng Sun [email_address]