Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
MAIDS: Mining Alarming Incidents  in Data Streams A discussion on the MAIDS project  May 20, 2003
Outline <ul><li>Characteristics of data streams </li></ul><ul><li>Mining dynamics in data streams </li></ul><ul><li>Multi-...
Characteristics of Data Streams <ul><li>Data Streams </li></ul><ul><ul><li>Data streams — continuous, ordered, changing, f...
Stream Data Applications <ul><li>Telecommunication calling records </li></ul><ul><li>Business: credit card transaction flo...
Architecture: Stream Query Processing Scratch Space (Main memory and/or Disk) User/Application Stream Query Processor Resu...
Challenges of Stream Data Processing <ul><li>Multiple, continuous, rapid, time-varying, ordered  streams </li></ul><ul><li...
Processing Stream Queries <ul><li>Query types </li></ul><ul><ul><li>One-time query vs.  continuous query  (being evaluated...
Projects on DSMS (Data Stream Management System)  <ul><li>Research projects and system prototypes </li></ul><ul><ul><li>ST...
Stream Data Mining vs. Stream Querying <ul><li>Stream mining — A more challenging task </li></ul><ul><ul><li>It shares mos...
Stream Data Mining Tasks <ul><li>Multi-dimensional (on-line) statistical analysis  of streams </li></ul><ul><li>Clustering...
Challenges for Mining Dynamics in Data Streams <ul><li>Most stream data are at pretty low-level or multi-dimensional in na...
Multi-Dimensional Stream Analysis: Examples <ul><li>Analysis of  Web click streams </li></ul><ul><ul><li>Raw data at low l...
A Key Step — Stream Data Reduction <ul><li>Challenges of OLAPing stream data </li></ul><ul><ul><li>Raw data cannot be stor...
A Tilted Time-Frame Model Up to 7 days: Up to a year: Logarithmic (exponential) scale: 2t 1t Time Now 4t 8t 16t 31   days ...
Benefits of Tilted Time-Frame Model <ul><li>Each cell stores the measures according to tilt-time-frame </li></ul><ul><ul><...
A Stream Cube Architecture <ul><li>A   tilted time frame   </li></ul><ul><ul><li>Different time granularities </li></ul></...
Two Critical Layers in the Stream Cube (*, theme, quarter) (user-group, URL-group, minute) m-layer (minimal interest) (ind...
Stream Cube Structure: from m-layer to o-layer (A1, *, C1) (A1, *, C2) (A1, *, C2) (A1, *, C2) (A1, B1, C2) (A1, B2, C1) (...
An H-Tree Cubing Structure root entertainment sports politics uic uiuc uic uiuc jeff mary jeff Jim Q.I. Q.I. Q.I. Observat...
Benefits of H-Tree and H-Cubing <ul><li>H-tree and H-cubing  </li></ul><ul><ul><li>Developed for computing data cubes and ...
Use of Stream Cubing: Regression Analysis <ul><li>Regression modeling of cells at all dimensions and levels </li></ul><ul>...
Clustering Data Streams <ul><li>Network intrusion detection: one example  </li></ul><ul><ul><li>Detect bursts of activitie...
Clustering  Evolving Data Streams <ul><li>Why clustering evolving data streams?  </li></ul><ul><ul><li>Finding evolutions ...
Decision Tree Induction with Stream Data <ul><li>VFDT/CVFDT  </li></ul><ul><ul><li>P. Domingos and G. Hulten, “Mining high...
Single-Pass Algorithm (An Example) yes no Packets > 10 Protocol = http Protocol = ftp yes yes no Packets > 10 Bytes > 60K ...
Our Approaches for Stream Classification <ul><li>Why our approaches? </li></ul><ul><ul><li>Is decision-tree good for model...
Stream Classification by Naïve Bayesian <ul><li>Naïve Bayesian + boosting </li></ul><ul><ul><li>A working paper with Jiong...
Stream Classification by K-Nearest Neighbor <ul><li>Classification based on nearest neighbors </li></ul><ul><ul><li>C. Aga...
Mining Frequent Patterns for Stream Data <ul><li>Frequent pattern mining is valuable in stream applications </li></ul><ul>...
Our Approach on Frequent Stream Patterns <ul><li>Approach 1: Mining with Multiple Time Granularities </li></ul><ul><ul><li...
A Discussion of Our Work Plan <ul><li>System architecture design </li></ul><ul><ul><li>Need preprocessing of stream data f...
Conclusions <ul><li>Stream data mining:  A rich and largely unexplored field </li></ul><ul><ul><li>Current research focus ...
References <ul><li>B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom, “Models and issues in data stream systems ” ,...
www.cs.uiuc.edu/~hanj <ul><li>Thank you !!! </li></ul>
Upcoming SlideShare
Loading in …5
×

MAIDS: Mining Alarming Incidents in Data Streams (PPT 156 KB)

1,626 views

Published on

  • Be the first to comment

MAIDS: Mining Alarming Incidents in Data Streams (PPT 156 KB)

  1. 1. MAIDS: Mining Alarming Incidents  in Data Streams A discussion on the MAIDS project May 20, 2003
  2. 2. Outline <ul><li>Characteristics of data streams </li></ul><ul><li>Mining dynamics in data streams </li></ul><ul><li>Multi-dimensional analysis of data streams </li></ul><ul><li>Stream query and stream cubing </li></ul><ul><ul><li>Querying, statistical summary, OLAP, regression, gradient analysis, … </li></ul></ul><ul><li>Mining frequent patterns in data streams </li></ul><ul><li>Clustering data streams </li></ul><ul><li>Classification of data streams </li></ul><ul><li>Planning for implementation and testing </li></ul>
  3. 3. Characteristics of Data Streams <ul><li>Data Streams </li></ul><ul><ul><li>Data streams — continuous, ordered, changing, fast, huge amount </li></ul></ul><ul><ul><li>Traditional DBMS — data stored in finite, persistent data sets </li></ul></ul><ul><li>Characteristics </li></ul><ul><ul><li>Huge volumes of continuous data, possibly infinite </li></ul></ul><ul><ul><li>Fast changing and requires fast, real-time response </li></ul></ul><ul><ul><li>Data stream captures nicely our data processing needs of today </li></ul></ul><ul><ul><li>Random access is expensive — single linear scan algorithm ( can only have one look ) </li></ul></ul><ul><ul><li>Store only the summary of the data seen thus far </li></ul></ul><ul><ul><li>Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing </li></ul></ul>
  4. 4. Stream Data Applications <ul><li>Telecommunication calling records </li></ul><ul><li>Business: credit card transaction flows </li></ul><ul><li>Network monitoring and traffic engineering </li></ul><ul><li>Financial market: stock exchange </li></ul><ul><li>Engineering & industrial processes: power supply & manufacturing </li></ul><ul><li>Sensor, monitoring & surveillance: video streams </li></ul><ul><li>Security monitoring </li></ul><ul><li>Web logs and Web page click streams </li></ul><ul><li>Massive data sets (even saved but random access is too expensive) </li></ul>
  5. 5. Architecture: Stream Query Processing Scratch Space (Main memory and/or Disk) User/Application Stream Query Processor Results Multiple streams SDMS (Stream Data Management System) Continuous Query
  6. 6. Challenges of Stream Data Processing <ul><li>Multiple, continuous, rapid, time-varying, ordered streams </li></ul><ul><li>Main memory computations </li></ul><ul><li>Queries are often continuous </li></ul><ul><ul><li>Evaluated continuously as stream data arrives </li></ul></ul><ul><ul><li>Answer updated over time </li></ul></ul><ul><li>Queries are often complex </li></ul><ul><ul><li>Beyond element-at-a-time processing </li></ul></ul><ul><ul><li>Beyond stream-at-a-time processing </li></ul></ul><ul><ul><li>Beyond relational queries (scientific, data mining, OLAP) </li></ul></ul><ul><li>Multi-level/multi-dimensional processing and data mining </li></ul><ul><ul><li>Most stream data are at pretty low-level or multi-dimensional in nature </li></ul></ul>
  7. 7. Processing Stream Queries <ul><li>Query types </li></ul><ul><ul><li>One-time query vs. continuous query (being evaluated continuously as stream continues to arrive) </li></ul></ul><ul><ul><li>Predefined query vs. ad-hoc query (issued on-line) </li></ul></ul><ul><li>Unbounded memory requirements </li></ul><ul><ul><li>For real-time response, main memory algorithm should be used </li></ul></ul><ul><ul><li>Memory requirement is unbounded if one will join future tuples </li></ul></ul><ul><li>Approximate query answering </li></ul><ul><ul><li>With bounded memory, it is not always possible to produce exact answers </li></ul></ul><ul><ul><li>High-quality approximate answers are desired </li></ul></ul><ul><ul><li>Data reduction and synopsis construction methods </li></ul></ul><ul><ul><ul><li>Sketches, random sampling, histograms, wavelets, etc. </li></ul></ul></ul>
  8. 8. Projects on DSMS (Data Stream Management System) <ul><li>Research projects and system prototypes </li></ul><ul><ul><li>STREAM (Stanford): A general-purpose DSMS </li></ul></ul><ul><ul><li>Cougar (Cornell): sensors </li></ul></ul><ul><ul><li>Aurora (Brown/MIT): sensor monitoring, dataflow </li></ul></ul><ul><ul><li>Hancock (AT&T): telecom streams </li></ul></ul><ul><ul><li>Niagara (OGI/Wisconsin): Internet XML databases </li></ul></ul><ul><ul><li>OpenCQ (Georgia Tech): triggers, incr. view maintenance </li></ul></ul><ul><ul><li>Tapestry ( Xerox): pub/sub content-based filtering </li></ul></ul><ul><ul><li>Telegraph (Berkeley): adaptive engine for sensors </li></ul></ul><ul><ul><li>Tradebot (www.tradebot.com): stock tickers & streams </li></ul></ul><ul><ul><li>Tribeca (Bellcore): network monitoring </li></ul></ul><ul><ul><li>Streaminer & MAIDS (UIUC & NCSA): new projects for stream data mining </li></ul></ul>
  9. 9. Stream Data Mining vs. Stream Querying <ul><li>Stream mining — A more challenging task </li></ul><ul><ul><li>It shares most of the difficulties with stream querying </li></ul></ul><ul><ul><li>Patterns are hidden and more general than querying </li></ul></ul><ul><ul><li>It may require exploratory analysis </li></ul></ul><ul><ul><ul><li>Not necessarily continuous queries </li></ul></ul></ul><ul><li>Stream data mining tasks </li></ul><ul><ul><li>Multi-dimensional on-line analysis of streams </li></ul></ul><ul><ul><li>Mining outliers and unusual patterns in stream data </li></ul></ul><ul><ul><li>Clustering data streams </li></ul></ul><ul><ul><li>Classification of stream data </li></ul></ul>
  10. 10. Stream Data Mining Tasks <ul><li>Multi-dimensional (on-line) statistical analysis of streams </li></ul><ul><li>Clustering data streams </li></ul><ul><li>Classification of data streams </li></ul><ul><li>Mining frequent patterns in data streams </li></ul><ul><li>Mining sequential patterns in data streams </li></ul><ul><li>Mining partial periodicity in data streams </li></ul><ul><li>Mining notable gradients in data streams </li></ul><ul><li>Mining outliers and unusual patterns in data streams </li></ul><ul><li>…… , more? </li></ul>
  11. 11. Challenges for Mining Dynamics in Data Streams <ul><li>Most stream data are at pretty low-level or multi-dimensional in nature: needs ML/MD processing </li></ul><ul><li>Analysis requirements </li></ul><ul><ul><li>Multi-dimensional trends and unusual patterns </li></ul></ul><ul><ul><li>Capturing important changes at multi-dimensions/levels </li></ul></ul><ul><ul><li>Fast, real-time detection and response </li></ul></ul><ul><ul><li>Comparing with data cube: Similarity and differences </li></ul></ul><ul><li>Stream (data) cube or stream OLAP: Is this feasible? </li></ul><ul><ul><li>Can we implement it efficiently? </li></ul></ul>
  12. 12. Multi-Dimensional Stream Analysis: Examples <ul><li>Analysis of Web click streams </li></ul><ul><ul><li>Raw data at low levels: seconds, web page addresses, user IP addresses, … </li></ul></ul><ul><ul><li>Analysts want: changes, trends, unusual patterns, at reasonable levels of details </li></ul></ul><ul><ul><li>E.g., Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours .” </li></ul></ul><ul><li>Analysis of power consumption streams </li></ul><ul><ul><li>Raw data: power consumption flow for every household, every minute </li></ul></ul><ul><ul><li>Patterns one may find: average hourly power consumption surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago </li></ul></ul>
  13. 13. A Key Step — Stream Data Reduction <ul><li>Challenges of OLAPing stream data </li></ul><ul><ul><li>Raw data cannot be stored </li></ul></ul><ul><ul><li>Simple aggregates are not powerful enough </li></ul></ul><ul><ul><li>History shape and patterns at different levels are desirable: multi-dimensional regression analysis </li></ul></ul><ul><li>Proposal </li></ul><ul><ul><li>A scalable multi-dimensional stream “data cube” that can aggregate regression model of stream data efficiently without accessing the raw data </li></ul></ul><ul><li>Stream data compression </li></ul><ul><ul><li>Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis </li></ul></ul>
  14. 14. A Tilted Time-Frame Model Up to 7 days: Up to a year: Logarithmic (exponential) scale: 2t 1t Time Now 4t 8t 16t 31 days 24 hours 4 qtrs 12 months Time Now 24hrs 4qtrs 15minutes 7 days Time Now 25sec.
  15. 15. Benefits of Tilted Time-Frame Model <ul><li>Each cell stores the measures according to tilt-time-frame </li></ul><ul><ul><li>Limited memory space: Impossible to store the history in full scale </li></ul></ul><ul><li>Emphasis more on recent data </li></ul><ul><ul><li>Most applications emphasize on recent data (slide window) </li></ul></ul><ul><li>Natural partition on different time granularities </li></ul><ul><ul><li>Putting different weights on remote data </li></ul></ul><ul><ul><li>Useful even for uniform weight </li></ul></ul><ul><li>Tilted time-frame forms a new time dimension </li></ul><ul><ul><li>for mining changes and evolutions </li></ul></ul><ul><li>Essential for mining dynamic patterns or outliers </li></ul><ul><ul><li>Finding those with dramatic changes </li></ul></ul><ul><ul><li>E.g., exceptional stocks —not following the trends </li></ul></ul>
  16. 16. A Stream Cube Architecture <ul><li>A tilted time frame </li></ul><ul><ul><li>Different time granularities </li></ul></ul><ul><ul><ul><li>second, minute, quarter, hour, day, week, … </li></ul></ul></ul><ul><li>Critical layers </li></ul><ul><ul><li>Minimum interest layer (m-layer) </li></ul></ul><ul><ul><li>Observation layer (o-layer) </li></ul></ul><ul><ul><li>User: watches at o-layer and occasionally needs to drill-down down to m-layer </li></ul></ul><ul><li>Partial materialization of stream cubes </li></ul><ul><ul><li>Full materialization: too space and time consuming </li></ul></ul><ul><ul><li>No materialization: slow response at query time </li></ul></ul><ul><ul><li>Partial materialization: what do we mean “partial”? </li></ul></ul>
  17. 17. Two Critical Layers in the Stream Cube (*, theme, quarter) (user-group, URL-group, minute) m-layer (minimal interest) (individual-user, URL, second) (primitive) stream data layer o-layer (observation)
  18. 18. Stream Cube Structure: from m-layer to o-layer (A1, *, C1) (A1, *, C2) (A1, *, C2) (A1, *, C2) (A1, B1, C2) (A1, B2, C1) (A2, *, C2) (A2, B1, C1) (A1, B2, C2) (A2, B1, C2) A2, B2, C1) (A2, B2, C2)
  19. 19. An H-Tree Cubing Structure root entertainment sports politics uic uiuc uic uiuc jeff mary jeff Jim Q.I. Q.I. Q.I. Observation layer Minimal int. layer Regression: Sum: xxxx Cnt: yyyy Quant-Info
  20. 20. Benefits of H-Tree and H-Cubing <ul><li>H-tree and H-cubing </li></ul><ul><ul><li>Developed for computing data cubes and iceberg cubes </li></ul></ul><ul><ul><ul><li>J. Han, J. Pei, G. Dong, and K. Wang, “ Efficient Computation of Iceberg Cubes with Complex Measures ”, SIGMOD'01 </li></ul></ul></ul><ul><ul><ul><li>D. Xin, J. Han, X. Li, B. Wah, “Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration”, VLDB'03, Berlin, Germany, Sept. 2003. </li></ul></ul></ul><ul><ul><li>Compressed database, fast cubing, space preserving </li></ul></ul><ul><li>Using H-tree for stream cubing </li></ul><ul><ul><li>Space preserving: Intermediate aggregates can be computed incrementally and saved in tree nodes </li></ul></ul><ul><ul><li>Facilitate computing other cells and multi-dimensional analysis </li></ul></ul><ul><ul><li>H-tree with computed cells can be viewed as stream cube </li></ul></ul>
  21. 21. Use of Stream Cubing: Regression Analysis <ul><li>Regression modeling of cells at all dimensions and levels </li></ul><ul><ul><li>Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression analysis of time-series data streams”, VLDB'02. </li></ul></ul><ul><li>Efficient storage and scalable and fast aggregation </li></ul><ul><li>Lossless aggregation without accessing the raw data </li></ul><ul><li>Covered a large and the most popular class of regression </li></ul><ul><ul><li>Including quadratic, polynomial, and nonlinear models </li></ul></ul><ul><li>Methodology can be used for other statistical summary, gradient analysis, and so on </li></ul>
  22. 22. Clustering Data Streams <ul><li>Network intrusion detection: one example </li></ul><ul><ul><li>Detect bursts of activities or abrupt changes in real time — by on-line clustering </li></ul></ul><ul><li>Two major methodologies: </li></ul><ul><ul><li>Motwani et al. (Stanford and HP Lab) </li></ul></ul><ul><ul><ul><li>S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”, FOCS'00. </li></ul></ul></ul><ul><ul><ul><li>Merging and changing k-media cluster centers </li></ul></ul></ul><ul><ul><li>Our approach (UIUC and IBM) </li></ul></ul><ul><ul><ul><li>Tilted time frame to store historical data in compressed way </li></ul></ul></ul><ul><ul><ul><li>Mining evolving data streams </li></ul></ul></ul>
  23. 23. Clustering Evolving Data Streams <ul><li>Why clustering evolving data streams? </li></ul><ul><ul><li>Finding evolutions of clusters not just current clusters </li></ul></ul><ul><ul><li>C. Aggarwal, J. Han, J. Wang, P. S. Yu, “A Framework for Clustering Evolving Data Streams”, VLDB'03 </li></ul></ul><ul><li>Methodology </li></ul><ul><ul><li>Tilted time frame work: compression + mining changes </li></ul></ul><ul><ul><li>Micro-clustering: better quality than k-means/k-median </li></ul></ul><ul><ul><ul><li>incremental, online processing and maintenance </li></ul></ul></ul><ul><ul><li>Two stages: micro-clustering and macro-clustering </li></ul></ul><ul><ul><li>With limited “overhead” to achieve high efficiency, scalability, quality of results and power of evolution/change detection </li></ul></ul>
  24. 24. Decision Tree Induction with Stream Data <ul><li>VFDT/CVFDT </li></ul><ul><ul><li>P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00 </li></ul></ul><ul><ul><li>G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”,KDD'01 </li></ul></ul><ul><li>VFDT (Very Fast Decision Tree) (Domingos and Hulten’00) </li></ul><ul><ul><li>With high probability, constructs an identical model that a traditional (greedy) method would learn </li></ul></ul><ul><ul><li>If it cannot be inserted into the same branch, construct “ shadow branches ” as preparation for changes </li></ul></ul><ul><ul><li>If the shadow becomes dominant, switch of tree branches occurs </li></ul></ul><ul><li>CVFDT: Extension to time changing data </li></ul>
  25. 25. Single-Pass Algorithm (An Example) yes no Packets > 10 Protocol = http Protocol = ftp yes yes no Packets > 10 Bytes > 60K Protocol = http SP(Bytes) - SP(Packets) > Data Stream Data Stream From Gehrke’s SIGMOD tutorial slides
  26. 26. Our Approaches for Stream Classification <ul><li>Why our approaches? </li></ul><ul><ul><li>Is decision-tree good for modeling fast changing data? —too fast changes may make trees obsolete </li></ul></ul><ul><ul><li>May other models have better survivability (adaptability)? </li></ul></ul><ul><ul><li>Can we find and compare evolution behavior? </li></ul></ul><ul><li>Our methodology </li></ul><ul><ul><li>Tilted time framework (compression and evolution) </li></ul></ul><ul><ul><li>Instead of decision-trees, consider other models that do not need dramatic changes, e.g., Naïve Bayesian with boosting, K-NN? </li></ul></ul><ul><ul><li>Incremental updating, dynamic maintenance, and model construction </li></ul></ul><ul><ul><li>Comparing of models to find changes </li></ul></ul>
  27. 27. Stream Classification by Naïve Bayesian <ul><li>Naïve Bayesian + boosting </li></ul><ul><ul><li>A working paper with Jiong Yang, Wei Wang and Xifeng Yan </li></ul></ul><ul><ul><li>A stable model that registers attribute-value pairs (similar to Raiforest-based classification) </li></ul></ul><ul><ul><li>History can be recorded using tilted time framework </li></ul></ul><ul><ul><li>Using boosting to increase classification accuracy </li></ul></ul><ul><ul><li>Adaptable to dramatic changes </li></ul></ul><ul><ul><li>Incremental updating, dynamic maintenance, and fast model construction </li></ul></ul><ul><ul><li>Comparing of models to find changes </li></ul></ul>
  28. 28. Stream Classification by K-Nearest Neighbor <ul><li>Classification based on nearest neighbors </li></ul><ul><ul><li>C. Agarwal (IBM), J. Han, J. Wang, P. S. Yu (IBM), “A Framework for Effective Classification of Evolving Data”, a working paper </li></ul></ul><ul><li>Two kinds of stream classification tasks </li></ul><ul><ul><li>Type I: Prediction of peers in the current stream </li></ul></ul><ul><ul><li>Type II: Prediction of the behavior in the next window </li></ul></ul><ul><li>Registration of basic statistics (clustering features) using B-tree (Birch-like) structure </li></ul><ul><li>Different philosophies for training and model construction for type I and II tasks </li></ul>
  29. 29. Mining Frequent Patterns for Stream Data <ul><li>Frequent pattern mining is valuable in stream applications </li></ul><ul><ul><li>e.g., network intrusion mining (Dokas, et al’02) </li></ul></ul><ul><li>Mining precise freq. patterns in stream data: unrealistic </li></ul><ul><ul><li>Even store them in a compressed form, such as FPtree </li></ul></ul><ul><li>How to mine frequent patterns with good approximation? </li></ul><ul><ul><li>Approximate frequent patterns (Manku & Motwani, VLDB’02) </li></ul></ul><ul><ul><li>Major ideas: not tracing items until it becomes first frequent </li></ul></ul><ul><ul><li>Adv: guarantee error bound </li></ul></ul><ul><ul><li>Disadv: keep a large set of traces </li></ul></ul><ul><li>Our comments </li></ul><ul><ul><li>Keep only current frequent patterns? No changes can be detected </li></ul></ul>
  30. 30. Our Approach on Frequent Stream Patterns <ul><li>Approach 1: Mining with Multiple Time Granularities </li></ul><ul><ul><li>C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu, “Mining Frequent Patterns in Data Streams at Multiple Time Granularities”, Next Gen. Data Mining, MIT Press, 2003 </li></ul></ul><ul><ul><li>Keep pattern-trees at the tilted time window frame (using tree-sharing method) </li></ul></ul><ul><ul><li>Mining evolution and dramatic changes of frequent patterns </li></ul></ul><ul><li>Approach 2: Mining only interested itemsets </li></ul><ul><ul><li>Identify interested items in stream environment </li></ul></ul><ul><ul><li>Keep precise/compressed history in tilted time window </li></ul></ul><ul><ul><li>Mining using FP-tree and related fast mining method </li></ul></ul>
  31. 31. A Discussion of Our Work Plan <ul><li>System architecture design </li></ul><ul><ul><li>Need preprocessing of stream data flow </li></ul></ul><ul><ul><li>Two working modes </li></ul></ul><ul><ul><ul><li>Disk-based and true stream processing </li></ul></ul></ul><ul><ul><li>Main memory algorithms </li></ul></ul><ul><ul><ul><li>“cube” and tilted time frame structure </li></ul></ul></ul><ul><li>Test data sets </li></ul><ul><ul><li>Network flow (multi-dimensional data) </li></ul></ul><ul><ul><li>Web click stream analysis </li></ul></ul><ul><ul><li>Stock trading data? </li></ul></ul><ul><ul><li>Multiple stream weather data? </li></ul></ul>
  32. 32. Conclusions <ul><li>Stream data mining: A rich and largely unexplored field </li></ul><ul><ul><li>Current research focus in database community: </li></ul></ul><ul><ul><ul><li>DSMS system architecture, continuous query processing, supporting mechanisms </li></ul></ul></ul><ul><ul><li>Stream data mining and stream OLAP analysis </li></ul></ul><ul><ul><ul><li>Powerful tools for finding general and unusual patterns </li></ul></ul></ul><ul><ul><ul><li>Effectiveness, efficiency and scalability: lots of open problems </li></ul></ul></ul><ul><li>Our philosophy: </li></ul><ul><ul><li>A multi-dimensional stream analysis framework </li></ul></ul><ul><ul><li>Time is a special dimension: tilted time frame </li></ul></ul><ul><ul><li>What to compute and what to save? — Critical layers </li></ul></ul><ul><ul><li>Very partial materialization/precomputation : popular path approach </li></ul></ul><ul><ul><li>Mining dynamics of stream data </li></ul></ul>
  33. 33. References <ul><li>B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom, “Models and issues in data stream systems ” , PODS'02 (tutorial). </li></ul><ul><li>S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record, 30:109--120, 2001. </li></ul><ul><li>Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang. “Online analytical processing stream data: Is it feasible?”, DMKD'02. </li></ul><ul><li>Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression analysis of time-series data streams”, VLDB'02. </li></ul><ul><li>P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00. </li></ul><ul><li>M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only get one look”, SIGMOD'02 (tutorial). </li></ul><ul><li>J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over continuous data streams”, SIGMOD'01. </li></ul><ul><li>S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”, FOCS'00. </li></ul><ul><li>G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01. </li></ul>
  34. 34. www.cs.uiuc.edu/~hanj <ul><li>Thank you !!! </li></ul>

×