Mining Unusual Patterns in Data Streams in Multi-Dimensional ...


  • 05/10/10

    1. 1. Mining Unusual Patterns in Data Streams in Multi-Dimensional Space. Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj
    2. 2. Outline <ul><li>Characteristics of data streams </li></ul><ul><li>Mining unusual patterns in data streams </li></ul><ul><li>Multi-dimensional regression analysis of data streams </li></ul><ul><li>Stream cubing and stream OLAP methods </li></ul><ul><li>Mining other kinds of patterns in data streams </li></ul><ul><li>Research problems </li></ul>
    3. 3. Data Streams <ul><li>Data Streams </li></ul><ul><ul><li>Data streams: continuous, ordered, changing, fast, huge amount </li></ul></ul><ul><ul><li>Traditional DBMS: data stored in finite, persistent data sets </li></ul></ul><ul><li>Characteristics </li></ul><ul><ul><li>Huge volumes of continuous data, possibly infinite </li></ul></ul><ul><ul><li>Fast changing and requiring fast, real-time response </li></ul></ul><ul><ul><li>The data stream model captures many of today's data processing needs </li></ul></ul><ul><ul><li>Random access is expensive: single linear scan algorithms (can only have one look) </li></ul></ul><ul><ul><li>Store only the summary of the data seen thus far </li></ul></ul><ul><ul><li>Most stream data are low-level or multi-dimensional in nature and need multi-level and multi-dimensional processing </li></ul></ul>
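The "single linear scan" and "store only the summary" constraints can be illustrated with a small sketch. It uses Welford's one-pass update, which is a standard technique chosen here for illustration and not something the slides name; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-size summary of a stream: count, mean, and variance."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's one-pass update; the raw value x is not kept afterwards.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.n if self.n else 0.0


def summarize(stream):
    """Consume the stream exactly once and return only its summary."""
    summary = RunningSummary()
    for x in stream:  # single linear scan
        summary.update(x)
    return summary
```

Memory use stays constant no matter how many items arrive, which is the property the slide asks of any stream algorithm.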
    4. 4. Stream Data Applications <ul><li>Telecommunication calling records </li></ul><ul><li>Business: credit card transaction flows </li></ul><ul><li>Network monitoring and traffic engineering </li></ul><ul><li>Financial market: stock exchange </li></ul><ul><li>Engineering & industrial processes: power supply & manufacturing </li></ul><ul><li>Sensor, monitoring & surveillance: video streams </li></ul><ul><li>Security monitoring </li></ul><ul><li>Web logs and Web page click streams </li></ul><ul><li>Massive data sets (even saved but random access is too expensive) </li></ul>
    5. 5. Challenges of Stream Data Mining <ul><li>Multiple, continuous, rapid, time-varying, ordered streams </li></ul><ul><li>Main memory computation </li></ul><ul><li>Mining queries are either continuous or ad-hoc </li></ul><ul><li>Mining queries are often complex </li></ul><ul><ul><li>Involving multiple streams, large amount of data, and history </li></ul></ul><ul><ul><li>Finding patterns, models, anomaly, differences, … </li></ul></ul><ul><li>Mining dynamics (changes, trends and evolutions) of data streams </li></ul><ul><li>Multi-level/multi-dimensional processing and data mining </li></ul><ul><ul><li>Most stream data are at pretty low-level or multi-dimensional in nature </li></ul></ul>
    6. 6. Stream Data Mining Tasks <ul><li>Multi-dimensional (on-line) analysis of streams </li></ul><ul><li>Clustering data streams </li></ul><ul><li>Classification of data streams </li></ul><ul><li>Mining frequent patterns in data streams </li></ul><ul><li>Mining sequential patterns in data streams </li></ul><ul><li>Mining partial periodicity in data streams </li></ul><ul><li>Mining notable gradients in data streams </li></ul><ul><li>Mining outliers and unusual patterns in data streams </li></ul><ul><li>…… </li></ul>
    7. 7. Multi-Dimensional Stream Analysis: Examples <ul><li>Analysis of Web click streams </li></ul><ul><ul><li>Raw data at low levels: seconds, web page addresses, user IP addresses, … </li></ul></ul><ul><ul><li>Analysts want: changes, trends, and unusual patterns at reasonable levels of detail </li></ul></ul><ul><ul><li>E.g., “Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.” </li></ul></ul><ul><li>Analysis of power consumption streams </li></ul><ul><ul><li>Raw data: power consumption flow for every household, every minute </li></ul></ul><ul><ul><li>Patterns one may find: average hourly power consumption of manufacturing companies in Chicago in the last 2 hours today is 30% higher than on the same day a week ago </li></ul></ul>
    8. 8. A Key Step — Stream Data Reduction <ul><li>Challenges of OLAPing stream data </li></ul><ul><ul><li>Raw data cannot be stored </li></ul></ul><ul><ul><li>Simple aggregates are not powerful enough </li></ul></ul><ul><ul><li>History shape and patterns at different levels are desirable: multi-dimensional regression analysis </li></ul></ul><ul><li>Proposal </li></ul><ul><ul><li>A scalable multi-dimensional stream “data cube” that can aggregate regression model of stream data efficiently without accessing the raw data </li></ul></ul><ul><li>Stream data compression </li></ul><ul><ul><li>Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis </li></ul></ul>
    9. 9. Regression Cube for Time-Series <ul><li>Initially, one time-series per base cell </li></ul><ul><ul><li>Too costly to store all these time-series </li></ul></ul><ul><ul><li>Too costly to compute regression at multi-dimensional space </li></ul></ul><ul><li>Regression cube </li></ul><ul><ul><li>Base cube: only store regression parameters of base cells (e.g., 2 points vs. 1000 points) </li></ul></ul><ul><ul><li>All the upper level cuboids can be computed precisely for linear regression on both standard dimensions and time dimensions </li></ul></ul><ul><ul><li>For quadratic regression, we need 5 points </li></ul></ul><ul><ul><li>In general, we need: </li></ul></ul><ul><ul><ul><li>where k = 2 for quadratic. </li></ul></ul></ul>
    10. 10. Basics of General Linear Regression <ul><li>n tuples in one cell: $(x_i, y_i)$, $i = 1..n$, where $y_i$ is the measure attribute to be analyzed </li></ul><ul><li>For sample $i$, a vector of $k$ user-defined predictors $u_i$: </li></ul><ul><li>The linear regression model: </li></ul><ul><li>where $\eta$ is a $k \times 1$ vector of regression parameters </li></ul>
    11. 11. Theory of General Linear Regression <ul><li>Collect the predictor vectors $u_i$ into the model matrix $U$ </li></ul><ul><li>The ordinary least squares (OLS) estimate $\hat{\eta}$ of $\eta$ is the argument that minimizes the residual sum of squares function </li></ul><ul><li>Main theorem to determine the OLS regression parameters </li></ul>
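The formulas on these two slides did not survive in the transcript. As a hedged illustration of what an OLS estimate computes, the sketch below solves the normal equations, (sum of u_i u_i^T) η = (sum of u_i y_i), for a list of predictor vectors. This is the textbook closed form and is not necessarily the exact statement of the slide's theorem; the function names are hypothetical, and the solver assumes the system is non-singular.

```python
def solve(a, b):
    """Solve the k x k system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def ols(predictors, ys):
    """predictors: n vectors u_i of length k; ys: the n measures y_i.

    Returns (eta_hat, theta), where theta is the k x k matrix sum_i u_i u_i^T.
    """
    k = len(predictors[0])
    theta = [[sum(u[a] * u[b] for u in predictors) for b in range(k)]
             for a in range(k)]
    rhs = [sum(u[a] * y for u, y in zip(predictors, ys)) for a in range(k)]
    return solve(theta, rhs), theta


# Usage: a straight line y = 1 + 2t sampled at t = 0..4, with predictors (1, t).
eta_hat, theta = ols([[1.0, float(t)] for t in range(5)],
                     [1.0 + 2.0 * t for t in range(5)])
```

Only `eta_hat` and `theta` are needed afterwards, and neither grows with the number of tuples in the cell.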
    12. 12. Linearly Compressed Representation (LCR) <ul><li>Stream data compression for multi-dimensional regression analysis </li></ul><ul><li>Define, for $i, j = 0, \ldots, k-1$: </li></ul><ul><li>The linearly compressed representation (LCR) of one cell: </li></ul><ul><li>Size of LCR of one cell: </li></ul><ul><li>quadratic in $k$, independent of the number of tuples $n$ in one cell </li></ul>
    13. 13. Matrix Form of LCR <ul><li>LCR consists of and , where </li></ul><ul><li>and </li></ul><ul><li>where </li></ul><ul><ul><ul><li>provides OLS regression parameters essential for regression analysis </li></ul></ul></ul><ul><ul><ul><li>is an auxiliary matrix that facilitates aggregations of LCR in standard and regression dimensions in a data cube environment </li></ul></ul></ul><ul><ul><ul><li>LCR only stores the upper triangle of </li></ul></ul></ul>
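The definitions on the two LCR slides were images and are missing from the transcript. The block below is a reconstruction under assumptions, not a copy of the slides: the symbol $\Theta$ for the auxiliary matrix and the notation $u_{hi}$ for the $i$-th predictor of tuple $h$ are assumed. It is consistent with what the text does state: the parameter vector plus the upper triangle of a symmetric $k \times k$ matrix gives a size quadratic in $k$ and independent of $n$.

```latex
\hat{\eta} = (\hat{\eta}_0, \ldots, \hat{\eta}_{k-1})^{T}, \qquad
\Theta = [\theta_{ij}]_{k \times k}, \qquad
\theta_{ij} = \sum_{h=1}^{n} u_{hi}\, u_{hj}, \quad i, j = 0, \ldots, k-1

\mathrm{LCR} = \{\hat{\eta}_i\} \cup \{\theta_{ij} : i \le j\}, \qquad
|\mathrm{LCR}| = k + \frac{k(k+1)}{2} = \frac{k^{2} + 3k}{2}
```

Under this reading, $k = 1$ gives 2 stored values and $k = 2$ gives 5.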
    14. 14. Aggregation in Standard Dimensions <ul><li>Given LCR of m cells that differ in one standard dimension, what is the LCR of the cell aggregated in that dimension? </li></ul><ul><ul><li>for m base cells </li></ul></ul><ul><ul><li>for an aggregated cell </li></ul></ul><ul><li>The lossless aggregation formula </li></ul>
    15. 15. Stock Price Example — Aggregation in Standard Dimensions <ul><li>Simple linear regression on time series data </li></ul><ul><ul><li>Cells of two companies </li></ul></ul><ul><ul><li>After aggregation: </li></ul></ul>
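The aggregation formula and the stock price figures are missing from the transcript. The sketch below shows one way the lossless property can be checked for simple linear regression. It assumes the cells share the same time ticks and that the aggregated measure is the sum of the cells' measures; under those assumptions the OLS algebra gives an aggregated parameter vector equal to the sum of the cells' parameter vectors, with the auxiliary matrix unchanged. Function names and the sample numbers are hypothetical.

```python
def fit_line(ts, ys):
    """OLS fit of y = a + b*t.

    Returns ((a, b), theta) where theta = (n, sum t, sum t^2) is the
    upper triangle of the 2 x 2 auxiliary matrix.
    """
    n = len(ts)
    st, stt = sum(ts), sum(t * t for t in ts)
    sy, sty = sum(ys), sum(t * y for t, y in zip(ts, ys))
    det = n * stt - st * st
    a = (stt * sy - st * sty) / det
    b = (n * sty - st * sy) / det
    return (a, b), (n, st, stt)


def aggregate_standard(cells):
    """Aggregate cells that share the same time ticks (hence the same theta).

    Assumes the aggregated measure is the sum of the cells' measures.
    """
    a = sum(eta[0] for eta, _ in cells)
    b = sum(eta[1] for eta, _ in cells)
    return (a, b), cells[0][1]


# Usage: two hypothetical series over the same four time ticks.
ticks = [0.0, 1.0, 2.0, 3.0]
cell_1 = fit_line(ticks, [1.0, 3.0, 2.0, 5.0])
cell_2 = fit_line(ticks, [4.0, 4.0, 6.0, 7.0])
merged = aggregate_standard([cell_1, cell_2])
```

`merged` is computed from the two compressed cells alone; the raw series are not consulted.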
    16. 16. Aggregation in Regression Dimensions <ul><li>Given LCR of m cells that differ in one regression dimension, what is the LCR of the cell aggregated in that dimension? </li></ul><ul><li>for m base cells </li></ul><ul><li>for the aggregated cell </li></ul><ul><li>The lossless aggregation formula </li></ul>
    17. 17. Stock Price Example — Aggregation in Time Dimension <ul><li>Cells of two adjacent </li></ul><ul><li>time intervals: </li></ul><ul><li>After aggregation </li></ul>
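For the time (regression) dimension the formula is also missing, so the following is a sketch derived from the normal equations and not taken from the slide. It assumes the cells cover disjoint sets of time ticks, so the aggregated cell is the concatenation of their data. The auxiliary matrices then add up, and multiplying each cell's auxiliary matrix by its parameter vector recovers that cell's (sum of y, sum of t*y), which is all the aggregated fit needs. Names and numbers are hypothetical.

```python
def fit_line(ts, ys):
    """OLS fit of y = a + b*t; returns ((a, b), (n, sum t, sum t^2))."""
    n = len(ts)
    st, stt = sum(ts), sum(t * t for t in ts)
    sy, sty = sum(ys), sum(t * y for t, y in zip(ts, ys))
    det = n * stt - st * st
    return ((stt * sy - st * sty) / det, (n * sty - st * sy) / det), (n, st, stt)


def aggregate_time(cells):
    """Aggregate cells covering disjoint time intervals, using only (eta, theta)."""
    n = sum(th[0] for _, th in cells)
    st = sum(th[1] for _, th in cells)
    stt = sum(th[2] for _, th in cells)
    # theta_i @ eta_i equals (sum y, sum t*y) of cell i by the normal equations.
    sy = sum(th[0] * eta[0] + th[1] * eta[1] for eta, th in cells)
    sty = sum(th[1] * eta[0] + th[2] * eta[1] for eta, th in cells)
    det = n * stt - st * st
    a = (stt * sy - st * sty) / det
    b = (n * sty - st * sy) / det
    return (a, b), (n, st, stt)


# Usage: two adjacent intervals, t = 0..2 and t = 3..6.
early = fit_line([0.0, 1.0, 2.0], [1.0, 2.0, 4.0])
late = fit_line([3.0, 4.0, 5.0, 6.0], [3.0, 7.0, 6.0, 9.0])
whole = aggregate_time([early, late])
```

Up to floating-point rounding, `whole` matches a direct fit over all seven points.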
    18. 18. Feasibility of Stream Regression Analysis <ul><li>Efficient storage and scalable (independent of the number of tuples in data cells) </li></ul><ul><li>Lossless aggregation without accessing the raw data </li></ul><ul><li>Fast aggregation: computationally efficient </li></ul><ul><li>Regression models of data cells at all levels </li></ul><ul><li>General results: cover a large and widely used class of regression models </li></ul><ul><ul><li>Including quadratic, polynomial, and nonlinear models </li></ul></ul>
    19. 19. A Stream Cube Architecture <ul><li>A tilted time frame </li></ul><ul><ul><li>Different time granularities </li></ul></ul><ul><ul><ul><li>second, minute, quarter, hour, day, week, … </li></ul></ul></ul><ul><li>Critical layers </li></ul><ul><ul><li>Minimum interest layer (m-layer) </li></ul></ul><ul><ul><li>Observation layer (o-layer) </li></ul></ul><ul><ul><li>User: watches at the o-layer and occasionally needs to drill down to the m-layer </li></ul></ul><ul><li>Partial materialization of stream cubes </li></ul><ul><ul><li>Full materialization: too space- and time-consuming </li></ul></ul><ul><ul><li>No materialization: slow response at query time </li></ul></ul><ul><ul><li>Partial materialization: what do we mean by “partial”? </li></ul></ul>
    20. 20. A Tilted Time-Frame Model [Figure: three time axes, each ending at "Now". Up to 7 days: 7 days, 24 hrs, 4 qtrs, 15 minutes, 25 sec. Up to a year: 12 months, 31 days, 24 hours, 4 qtrs. Logarithmic (exponential) scale: slots labeled 1t, 2t, 4t, 4t, 8t, 16t.]
    21. 21. Benefits of Tilted Time-Frame Model <ul><li>Each cell stores the measures according to the tilted time frame </li></ul><ul><ul><li>Limited memory space: impossible to store the history at full scale </li></ul></ul><ul><li>More emphasis on recent data </li></ul><ul><ul><li>Most applications emphasize recent data (sliding window) </li></ul></ul><ul><li>Natural partition on different time granularities </li></ul><ul><ul><li>Putting different weights on remote data </li></ul></ul><ul><ul><li>Useful even for uniform weight </li></ul></ul><ul><li>Tilted time frame forms a new time dimension </li></ul><ul><ul><li>for mining changes and evolutions </li></ul></ul><ul><li>Essential for mining unusual patterns or outliers </li></ul><ul><ul><li>Finding those with dramatic changes </li></ul></ul><ul><ul><li>E.g., exceptional stocks: those not following the trends </li></ul></ul>
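A minimal sketch of a tilted time frame, assuming the "up to a year" layout of 4 quarters, 24 hours, 31 days, and 12 months and an additive measure such as a count or sum. The class name and the roll-up policy (fold a full level into one unit of the next coarser level when a new value arrives) are illustrative choices, not taken from the slides. A regression measure would need the time-dimension aggregation in place of the plain sum.

```python
class TiltedTimeFrame:
    """Keeps at most 4 + 24 + 31 + 12 = 71 aggregates, finest granularity first."""

    LEVELS = [("quarter", 4), ("hour", 24), ("day", 31), ("month", 12)]

    def __init__(self):
        # One list of aggregates per granularity, oldest entry first.
        self.slots = {name: [] for name, _ in self.LEVELS}

    def add_quarter(self, value):
        """Record the aggregate of the newest 15-minute interval."""
        self._push(0, value)

    def _push(self, level, value):
        name, capacity = self.LEVELS[level]
        slots = self.slots[name]
        if len(slots) == capacity:
            # This level is full: fold it into one unit of the next coarser
            # level. At the coarsest level the old data is simply dropped.
            if level + 1 < len(self.LEVELS):
                self._push(level + 1, sum(slots))
            slots.clear()
        slots.append(value)


# Usage: feed 101 quarter-hour counts of 1.0 each.
frame = TiltedTimeFrame()
for _ in range(101):
    frame.add_quarter(1.0)
```

Recent data stays at quarter-hour resolution while older data is held only as hourly, daily, or monthly totals, so memory stays bounded however long the stream runs.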
    22. 22. Two Critical Layers in the Stream Cube [Figure: three layers, top to bottom. o-layer (observation): (*, theme, quarter). m-layer (minimal interest): (user-group, URL-group, minute). (Primitive) stream data layer: (individual-user, URL, second).]
    23. 23. On-Line Materialization vs. On-Line Computation <ul><li>On-line materialization </li></ul><ul><ul><li>Materialization takes precious resources and time </li></ul></ul><ul><ul><ul><li>Only incremental materialization (with sliding window) </li></ul></ul></ul><ul><ul><li>Only materialize “cuboids” of the critical layers? </li></ul></ul><ul><ul><ul><li>Some intermediate cells should also be materialized </li></ul></ul></ul><ul><ul><li>Popular path approach vs. exception cell approach </li></ul></ul><ul><ul><ul><li>Materialize intermediate cells along the popular paths </li></ul></ul></ul><ul><ul><ul><li>Exception cells: how to set up exception thresholds? </li></ul></ul></ul><ul><ul><ul><li>Note that exceptions do not behave monotonically </li></ul></ul></ul><ul><li>Computation problem </li></ul><ul><ul><li>How to compute and store stream cubes efficiently? </li></ul></ul><ul><ul><li>How to discover unusual cells between the critical layers? </li></ul></ul>
    24. 24. Stream Cube Structure: from m-layer to o-layer [Figure: lattice of cells. Cells shown: (A1, *, C1) (A1, *, C2) (A1, *, C2) (A1, *, C2) (A1, B1, C2) (A1, B2, C1) (A2, *, C2) (A2, B1, C1) (A1, B2, C2) (A2, B1, C2) (A2, B2, C1) (A2, B2, C2)]
    25. 25. Stream Cube Computation <ul><li>Cube structure from m-layer to o-layer </li></ul><ul><li>Three approaches </li></ul><ul><ul><li>All cuboids approach </li></ul></ul><ul><ul><ul><li>Materializing all cells (too much in both space and time) </li></ul></ul></ul><ul><ul><li>Exceptional cells approach </li></ul></ul><ul><ul><ul><li>Materializing only exceptional cells (saves space but not time to compute and definition of exception is not flexible ) </li></ul></ul></ul><ul><ul><li>Popular path approach </li></ul></ul><ul><ul><ul><li>Computing and materializing cells only along a popular path </li></ul></ul></ul><ul><ul><ul><li>Using H-tree structure to store computed cells (which form the stream cube —a selectively materialized cube ) </li></ul></ul></ul>
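    26. 26. An H-Tree Cubing Structure [Figure: H-tree with a root; first-level nodes entertainment, sports, politics; second-level nodes uic, uiuc; third-level nodes jeff, mary, jim. The observation layer and the minimal interest layer are marked. Nodes carry quant-info (Q.I.): Sum: xxxx, Cnt: yyyy, Regression.]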
    27. 27. Partial Materialization Using H-Tree <ul><li>H-tree: </li></ul><ul><ul><li>Introduced for computing data cubes and iceberg cubes </li></ul></ul><ul><ul><ul><li>J. Han, J. Pei, G. Dong, and K. Wang, “ Efficient Computation of Iceberg Cubes with Complex Measures ”, SIGMOD'01 </li></ul></ul></ul><ul><ul><li>Compressed database, fast cubing, and space preserving in cube computation </li></ul></ul><ul><li>Using H-tree for partial stream cubing </li></ul><ul><ul><li>Space preserving: </li></ul></ul><ul><ul><ul><li>Intermediate aggregates can be computed incrementally and saved in tree nodes </li></ul></ul></ul><ul><ul><li>Facilitate computing other cells and multi-dimensional analysis </li></ul></ul><ul><ul><li>H-tree with computed cells can be viewed as “stream cube” </li></ul></ul>
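The idea of saving intermediate aggregates in tree nodes can be sketched with a plain prefix tree. This is a simplification and not the H-tree of the cited SIGMOD'01 paper: it omits header tables and side links, keeps only a sum and a count as quant-info, and stores them at every node on the path. All class, method, and attribute names are hypothetical.

```python
class HTreeNode:
    def __init__(self):
        self.children = {}  # attribute value -> child node
        self.total = 0.0    # "Sum" part of the quant-info
        self.count = 0      # "Cnt" part of the quant-info


class HTree:
    """Prefix tree over a fixed attribute order (for example, the popular path)."""

    def __init__(self, dimensions):
        self.dimensions = dimensions  # attribute order, root to leaf
        self.root = HTreeNode()

    def insert(self, cell, measure):
        """Add one m-layer tuple, updating every aggregate on its path."""
        node = self.root
        node.total += measure
        node.count += 1
        for dim in self.dimensions:
            node = node.children.setdefault(cell[dim], HTreeNode())
            node.total += measure
            node.count += 1

    def lookup(self, *prefix):
        """Return (sum, count) for the cell identified by a value prefix, or None."""
        node = self.root
        for value in prefix:
            node = node.children.get(value)
            if node is None:
                return None
        return node.total, node.count


# Usage with values taken from the H-tree figure; the measures are made up.
tree = HTree(["theme", "school", "user"])
tree.insert({"theme": "sports", "school": "uiuc", "user": "jeff"}, 3.0)
tree.insert({"theme": "sports", "school": "uiuc", "user": "mary"}, 2.0)
tree.insert({"theme": "politics", "school": "uic", "user": "jeff"}, 5.0)
```

Each insertion touches one node per dimension, so the aggregates for every cell along the chosen attribute order are maintained incrementally, without rescanning earlier tuples.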
    28. 28. Time and Space vs. Number of Tuples at the m-Layer (Dataset D3L3C10T400K) [Figure: a) Time vs. m-layer size; b) Space vs. m-layer size]
    29. 29. Time and Space vs. the Number of Levels [Figure: a) Time vs. # levels; b) Space vs. # levels]
    30. 30. Mining Other Unusual Patterns in Stream Data <ul><li>Clustering and outlier analysis for stream mining </li></ul><ul><ul><li>Clustering data streams (Guha, Motwani et al. 2000-2002) </li></ul></ul><ul><ul><li>History-sensitive, high-quality incremental clustering </li></ul></ul><ul><li>Classification of stream data </li></ul><ul><ul><li>Evolution of decision trees: Domingos et al. (2000, 2001) </li></ul></ul><ul><ul><li>Incremental integration of new streams in decision-tree induction </li></ul></ul><ul><li>Frequent pattern analysis </li></ul><ul><ul><li>Approximate frequent patterns ( Manku & Motwani VLDB’02) </li></ul></ul><ul><ul><li>Evolution and dramatic changes of frequent patterns </li></ul></ul>
    31. 31. Conclusions <ul><li>Stream data mining: A rich and largely unexplored field </li></ul><ul><ul><li>Current research focus in database community: </li></ul></ul><ul><ul><ul><li>DSMS system architecture, continuous query processing, supporting mechanisms </li></ul></ul></ul><ul><ul><li>Stream data mining and stream OLAP analysis </li></ul></ul><ul><ul><ul><li>Powerful tools for finding general and unusual patterns </li></ul></ul></ul><ul><ul><ul><li>Effectiveness, efficiency and scalability: lots of open problems </li></ul></ul></ul><ul><li>Our philosophy: </li></ul><ul><ul><li>A multi-dimensional stream analysis framework </li></ul></ul><ul><ul><li>Time is a special dimension: tilted time frame </li></ul></ul><ul><ul><li>What to compute and what to save? — Critical layers </li></ul></ul><ul><ul><li>Very partial materialization/precomputation : popular path approach </li></ul></ul><ul><ul><li>Mining dynamics of stream data </li></ul></ul>
    32. 32. References <ul><li>B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems”, PODS'02 (tutorial). </li></ul><ul><li>S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record, 30:109--120, 2001. </li></ul><ul><li>Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang, “Online analytical processing stream data: Is it feasible?”, DMKD'02. </li></ul><ul><li>Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression analysis of time-series data streams”, VLDB'02. </li></ul><ul><li>P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00. </li></ul><ul><li>M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only get one look”, SIGMOD'02 (tutorial). </li></ul><ul><li>J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over continuous data streams”, SIGMOD'01. </li></ul><ul><li>S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”, FOCS'00. </li></ul><ul><li>G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01. </li></ul>
    33. 33. www.cs.uiuc.edu/~hanj <ul><li>Thank you! </li></ul>