DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams



This paper proposes a novel tree structure, DSTree, for mining frequent itemsets from data streams. The experiments show comparable performance in terms of accuracy and efficiency.

Published in: Technology


  1. 1. DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams Presenter / Meng-Lun Wu Source / ICDM’06, IEEE Author / Carson Kai-Sang Leung, Quamrul I. Khan
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Related Work </li></ul><ul><li>DSTree </li></ul><ul><li>Discussion </li></ul><ul><li>Experimental Results </li></ul><ul><li>Conclusions </li></ul>
  3. 3. Introduction <ul><li>With advances in technology, a flood of data can be produced in many applications. </li></ul><ul><ul><li>Ex. Sensor networks and Web click streams. </li></ul></ul><ul><li>This calls for efficient techniques for extracting useful information from streams of data. </li></ul><ul><li>Mining from data streams is more challenging due to </li></ul><ul><ul><li>Property 1: Data streams are continuous and unbounded. </li></ul></ul><ul><ul><li>Property 2: Data in the streams are not necessarily uniformly distributed; their distributions are usually changing with time. </li></ul></ul>
  4. 4. Introduction (cont.) <ul><li>Stream mining algorithms can be broadly categorized into two classes. </li></ul><ul><ul><li>Exact algorithms </li></ul></ul><ul><ul><ul><li>Finding truly frequent itemsets, especially maximal, closed, or “short” itemsets. </li></ul></ul></ul><ul><ul><ul><ul><li>i.e., itemsets with frequency ≥ the user-defined minimum support threshold minsup. </li></ul></ul></ul></ul><ul><ul><li>Approximate algorithms </li></ul></ul><ul><ul><ul><li>Finding “frequent” itemsets by using approximate procedures, which may lead to some false positives or false negatives. </li></ul></ul></ul><ul><li>The key contribution of this work is the DSTree (Data Stream Tree), which is designed for exact stream mining of regular frequent itemsets. </li></ul>
  5. 5. Related Work <ul><li>In this section, we briefly discuss some data structures that are relevant to our work. </li></ul><ul><li>CanTree (proposed in ICDM’05) vs. DSTree </li></ul><ul><ul><li>CanTree is designed for incremental mining, whereas DSTree is designed for stream mining. </li></ul></ul><ul><ul><li>Each node in the CanTree keeps just one frequency count, whereas each node in the DSTree keeps a list of frequency counts. </li></ul></ul><ul><ul><li>When transactions are deleted from the CanTree, the frequency counts of the affected nodes are decremented, whereas in the DSTree the list of frequency counts at each affected node simply shifts. </li></ul></ul>
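  6. 6. Related Work (cont.) <ul><li>FP-streaming (FP-tree and FP-stream) vs. DSTree </li></ul>
     <table>
     <tr><th></th><th>FP-streaming</th><th>DSTree</th></tr>
     <tr><td>Algorithm type</td><td>Approximate stream mining</td><td>Exact stream mining</td></tr>
     <tr><td>Constructing tree</td><td>Each batch builds one FP-tree</td><td>The batches of transactions in the window build one DSTree</td></tr>
     <tr><td>Store counts</td><td>Keeps just one frequency count</td><td>Keeps a list of frequency counts</td></tr>
     <tr><td>Support threshold</td><td>Yes</td><td>No</td></tr>
     <tr><td>Each path</td><td>Represents an itemset</td><td>Represents a transaction in the current window</td></tr>
     <tr><td>Window</td><td>Tilted-time window</td><td>Sliding window</td></tr>
     </table>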
  7. 7. DSTree <ul><li>Due to Property 1 of data streams, </li></ul><ul><ul><li>The DSTree is designed for (exact) stream mining. </li></ul></ul><ul><ul><li>The construction of the DSTree only requires one scan of the streaming data. </li></ul></ul><ul><ul><li>The tree captures the contents of transactions in each batch of streaming data. </li></ul></ul><ul><li>Due to the dynamic nature and Property 2 of data streams, </li></ul><ul><ul><li>We arrange transaction items according to some canonical order. </li></ul></ul><ul><ul><ul><li>E.g., lexicographic order or alphabetical order. </li></ul></ul></ul><ul><ul><li>The frequency of a node in the DSTree is at least as high as the sum of frequencies of its children. </li></ul></ul><ul><ul><li>The ordering of items is unaffected by the continuous changes in item frequencies. </li></ul></ul>
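The properties on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `Node` and `DSTree`, the methods `add_batch` and `support`, and the toy batches are all assumptions made for illustration, and `support` is a simple tree walk rather than the mining procedure used in the paper. Each node keeps one count per batch in the window, items are inserted in lexicographic order, and sliding the window shifts every count list instead of deleting transactions one by one.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    item: str
    counts: list  # one frequency count per batch in the window, oldest first
    children: dict = field(default_factory=dict)


class DSTree:
    """Toy prefix tree over the batches of a sliding window (hypothetical API)."""

    def __init__(self, window_size):
        self.w = window_size
        self.root = Node("", [0] * window_size)
        self.batches_seen = 0

    def add_batch(self, transactions):
        if self.batches_seen >= self.w:
            self._shift(self.root)  # window is full: drop the oldest batch
        slot = min(self.batches_seen, self.w - 1)
        for transaction in transactions:
            node = self.root
            node.counts[slot] += 1
            for item in sorted(set(transaction)):  # canonical order
                if item not in node.children:
                    node.children[item] = Node(item, [0] * self.w)
                node = node.children[item]
                node.counts[slot] += 1
        self.batches_seen += 1

    def _shift(self, node):
        # Shift the count list left by one batch and open an empty newest slot.
        node.counts = node.counts[1:] + [0]
        for key in list(node.children):
            child = node.children[key]
            self._shift(child)
            if sum(child.counts) == 0:  # no transaction in the window uses it
                del node.children[key]

    def support(self, itemset):
        """Frequency of `itemset` over the transactions in the current window."""
        target = sorted(set(itemset))
        if not target:
            return sum(self.root.counts)
        return self._support(self.root, target)

    def _support(self, node, remaining):
        total = 0
        for item, child in node.children.items():
            if item == remaining[0]:
                if len(remaining) == 1:
                    total += sum(child.counts)
                else:
                    total += self._support(child, remaining[1:])
            elif item < remaining[0]:
                total += self._support(child, remaining)
            # item > remaining[0]: by the canonical order it cannot occur below
        return total


tree = DSTree(window_size=2)
tree.add_batch([["a", "c", "d"], ["a", "b"], ["b", "d"]])
tree.add_batch([["a", "d"], ["b", "d"], ["c"]])
tree.add_batch([["a", "b", "d"], ["d"]])  # the first batch slides out
print(tree.support({"b", "d"}))
```

Because the item order is fixed in advance, inserting a new batch never forces existing paths to be reordered when item frequencies drift.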
  8. 8. DSTree Example <ul><li>Let minsup be 3 and let the window size w be 2 batches (indicating that only two batches of transactions are kept). </li></ul><ul><li>If we call the mining process at time T’, we get frequent itemsets {a}:4, {a,c}:3, {a,d}:3, {b}:4, {b,d}:4, {c}:3 and {d}:5. </li></ul>
  9. 9. Discussion <ul><li>(a) Applicability for finding other patterns: </li></ul><ul><li>DSTree also provides users with such functionalities as stream mining of maximal, closed, and constrained itemsets. </li></ul><ul><ul><li>A frequent itemset is maximal if none of its proper supersets is frequent. </li></ul></ul><ul><ul><li>A frequent itemset X is closed if none of the proper supersets of X has the same frequency as X. </li></ul></ul><ul><ul><li>DSTree can provide the algorithms for these patterns with additional functionality. </li></ul></ul><ul><ul><ul><li>These algorithms can use DSTree and arrange tree items according to some canonical order. </li></ul></ul></ul><ul><ul><ul><ul><li>C succ  max (S.Price)  30 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>C conv  avg (S.Price)  7 </li></ul></ul></ul></ul>
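The two definitions above can be checked on the frequent itemsets listed in the DSTree example slide. The following Python sketch is only an illustration: the function names `maximal_itemsets` and `closed_itemsets` are assumptions, and the filtering is a brute-force post-processing step over an already mined result, not how a DSTree-based miner would find these patterns.

```python
def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {x for x in frequent if not any(x < y for y in frequent)}


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of the same frequency.

    A superset with the same frequency as a frequent itemset is itself
    frequent, so it is enough to look inside `frequent`.
    """
    return {
        x
        for x, freq in frequent.items()
        if not any(x < y and frequent[y] == freq for y in frequent)
    }


# Frequent itemsets and frequencies from the example slide (minsup = 3).
frequent = {
    frozenset("a"): 4,
    frozenset("ac"): 3,
    frozenset("ad"): 3,
    frozenset("b"): 4,
    frozenset("bd"): 4,
    frozenset("c"): 3,
    frozenset("d"): 5,
}

print(sorted("".join(sorted(x)) for x in maximal_itemsets(frequent)))
print(sorted("".join(sorted(x)) for x in closed_itemsets(frequent)))
```

On this input the maximal itemsets are {a,c}, {a,d} and {b,d}. The closed itemsets additionally include {a} and {d}, while {b} and {c} are not closed because {b,d} and {a,c} have the same frequencies as {b} and {c} respectively.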
  10. 10. Discussion (cont.) <ul><li>(b) Extensions – different windowing techniques </li></ul><ul><ul><li>It is important to note that the DSTree is not confined to the sliding window. </li></ul></ul><ul><ul><ul><li>Tilted-time window: more weight can be put on recent data and less weight on older data. </li></ul></ul></ul><ul><li>(c) Efficiency and memory issues: </li></ul><ul><ul><li>DSTree does not need to keep any extra tree structures, unlike the FP-streaming algorithm, where space is required for both the FP-tree and the FP-stream structure. </li></ul></ul>
  11. 11. Experimental Results & First Experiment <ul><li>Experiment Data </li></ul><ul><ul><li>IBM Almaden Research Center: 1 M records with an average transaction length of 10 items, and a domain of 1,000 items. </li></ul></ul><ul><ul><li>Each batch contains 0.1 M transactions and the window size is set to w=5 batches. </li></ul></ul><ul><ul><li>The experiments mainly evaluated the accuracy and efficiency of DSTree. </li></ul></ul><ul><li>First Experiment </li></ul><ul><ul><li>Accuracy: Comparing the frequent itemsets returned by mining directly from these transactions with those returned by mining from our DSTree. </li></ul></ul><ul><ul><ul><li>Accuracy: 100% </li></ul></ul></ul>
  12. 12. Second Experiment <ul><li>We compared the runtime of mining from DSTree with that of using the FP-streaming mining algorithm. </li></ul>
  13. 13. Third Experiment <ul><li>We compared the DSTree with its relevant structures (e.g., CanTree, FP-tree and FP-stream structure). </li></ul><ul><ul><li>When minsup=0.05%, the size of the DSTree was about 1.25X that of the FP-stream; when minsup=0.01%, the size of the DSTree was <0.90X that of the FP-stream. </li></ul></ul><ul><ul><li>The results also confirmed that the size of the DSTree did not depend on minsup whereas that of the FP-stream did. </li></ul></ul><ul><ul><li>Of the two structures, the DSTree kept transactions while the FP-stream kept itemsets. </li></ul></ul><ul><ul><li>Thus mining from the DSTree gave exact results, but mining with FP-streaming gave approximate results. </li></ul></ul>
  14. 14. Fourth Experiment <ul><li>This experiment showed that the sizes of both DSTree and CanTree were unaffected by changes in minsup. </li></ul><ul><ul><li>The size of the DSTree is smaller than that of the CanTree because the latter keeps all transactions whereas the former only keeps transactions in the current window. </li></ul></ul><ul><ul><ul><li>The size of DSTree  0.5X that of the CanTree. </li></ul></ul></ul><ul><ul><ul><li>The DSTree required a lower maintenance cost than the CanTree. </li></ul></ul></ul><ul><ul><ul><li>Whenever transactions were deleted from the CanTree, it needed to either decrement the frequency counts of nodes or remove the nodes corresponding to the deleted transactions. </li></ul></ul></ul><ul><ul><ul><li>In contrast, DSTree did not require expensive deletion of transactions; it just shifted the frequency lists. </li></ul></ul></ul>
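The maintenance difference described on this slide can be sketched in Python as follows. This is a simplified illustration, not code from either paper: both trees are flattened into dictionaries keyed by prefix path, and the function names `cantree_delete` and `dstree_slide` and the toy data are assumptions. With one count per node, every expired transaction has to be walked again to decrement its path; with a list of counts per node, each node is visited once and its list is shifted.

```python
def cantree_delete(counts, expired_transactions):
    """One count per node: re-walk each expired transaction and decrement."""
    updates = 0
    for transaction in expired_transactions:
        path = ()
        for item in sorted(set(transaction)):
            path += (item,)
            counts[path] -= 1
            updates += 1
            if counts[path] == 0:
                del counts[path]  # node no longer represents any transaction
    return updates


def dstree_slide(count_lists):
    """List of counts per node: drop the oldest batch by shifting each list."""
    for path in list(count_lists):
        shifted = count_lists[path][1:] + [0]
        if any(shifted):
            count_lists[path] = shifted
        else:
            del count_lists[path]  # no batch in the window uses this node
    return count_lists


# Hypothetical data: old batch [a,b], [a]; newer batch [a,c].
single_counts = {("a",): 3, ("a", "b"): 1, ("a", "c"): 1}
cantree_delete(single_counts, [["a", "b"], ["a"]])

per_batch = {("a",): [2, 1], ("a", "b"): [1, 0], ("a", "c"): [0, 1]}
dstree_slide(per_batch)

print(single_counts)
print({path: sum(c) for path, c in per_batch.items()})
```

Both structures end up describing the same remaining transaction, but only the first needed the expired transactions themselves in order to update its counts.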
  15. 15. Fifth Experiment <ul><li>The paper also ran the usual experiments (e.g., on the effect of minsup). </li></ul><ul><li>As expected, when minsup increased, the runtime decreased. </li></ul>
  16. 16. Sixth Experiment <ul><li>This experiment tested scalability with the number of transactions. </li></ul><ul><li>The results show that mining with DSTree had linear scalability. </li></ul>
  17. 17. Conclusions <ul><li>A key contribution of this paper is to propose the novel structure of DSTree (Data Stream Tree). </li></ul><ul><li>This tree captures the contents of transactions in a window, and arranges tree nodes according to some canonical order. </li></ul><ul><li>By exploiting its nice properties, DSTree can be easily maintained when the window slides. </li></ul><ul><li>It can also be used for efficient stream mining of maximal itemsets, closed itemsets, as well as constrained itemsets. </li></ul>