Data mining, data warehousing and knowledge discovery

1. Data Mining, Data Warehousing and Knowledge Discovery: Basic Algorithms and Concepts
Srinath Srinivasa, IIIT Bangalore, sri@iiitb.ac.in
2. Overview
• Why Data Mining?
• Data Mining concepts
• Data Mining algorithms
  – Tabular data mining
  – Association, Classification and Clustering
  – Sequence data mining
  – Streaming data mining
• Data Warehousing concepts
3. Why Data Mining
From a managerial perspective:
• Analyzing trends
• Wealth generation
• Security
• Strategic decision making
4. Data Mining
• Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
• No query…
• …but an “interestingness criterion”
5. Data Mining

Data + Interestingness criterion = Hidden patterns

6. Data Mining

Data + Interestingness criterion = Hidden patterns
(the output is characterized by the type of patterns sought)

7. Data Mining

Data + Interestingness criterion = Hidden patterns
(the inputs are characterized by the type of data and the type of interestingness criterion)
8. Type of Data
• Tabular (Ex: transaction data)
  – Relational
  – Multi-dimensional
• Spatial (Ex: remote sensing data)
• Temporal (Ex: log information)
  – Streaming (Ex: multimedia, network traffic)
  – Spatio-temporal (Ex: GIS)
• Tree (Ex: XML data)
• Graphs (Ex: WWW, biomolecular data)
• Sequence (Ex: DNA, activity logs)
• Text, multimedia, …
9. Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
10. Data Mining vs Statistical Inference
Statistics: conceptual model (hypothesis) → statistical reasoning → “proof” (validation of the hypothesis)

11. Data Mining vs Statistical Inference
Data mining: data → mining algorithm (based on an interestingness criterion) → pattern (model, rule, hypothesis) discovery
12. Data Mining Concepts
Associations and item-sets:
An association is a rule of the form: if X then Y. It is denoted X → Y.
Example: If India wins in cricket, sales of sweets go up.
For any rule, if X → Y and Y → X, then X and Y are called an “interesting item-set”.
Example: People buying school uniforms in June also buy school bags, and people buying school bags in June also buy school uniforms.

13. Data Mining Concepts
Support and confidence:
The support for a rule R is the fraction of all transactions in which R occurs.
The confidence of a rule X → Y is the fraction of transactions containing X that also contain Y, i.e. support(X ∪ Y) / support(X).
14. Data Mining Concepts
Support and confidence example. Transactions:

  {Bag, Uniform, Crayons}
  {Books, Bag, Uniform}
  {Bag, Uniform, Pencil}
  {Bag, Pencil, Books}
  {Uniform, Crayons, Bag}
  {Bag, Pencil, Books}
  {Crayons, Uniform, Bag}
  {Books, Crayons, Bag}
  {Uniform, Crayons, Pencil}
  {Pencil, Uniform, Books}

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
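These support and confidence figures can be checked directly with a few lines of code; a minimal sketch (the transaction list below is transcribed from the slide):

```python
# Transactions from the slide: each set is one purchase basket.
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},  {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Bag", "Uniform"}, transactions))       # 0.5
print(confidence({"Bag"}, {"Uniform"}, transactions))  # 0.625
```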
15. Mining for Frequent Item-sets
The Apriori algorithm. Given minimum required support s as the interestingness criterion:
1. Search for all individual elements (1-element item-sets) that have a minimum support of s.
2. Repeat:
   a. From the i-element item-sets found in the previous pass, search for all (i+1)-element item-sets that have a minimum support of s.
   b. These become the set of all frequent (i+1)-element item-sets that are interesting.
3. Until the item-set size reaches its maximum.
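The level-wise search above can be sketched as follows (an illustrative implementation, not the original Apriori code; candidates are generated by pairwise unions of frequent item-sets, and the transactions are the school-supplies baskets from the previous slide):

```python
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},  {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def apriori(transactions, minsup):
    """Return all item-sets with support >= minsup, found level by level."""
    n = len(transactions)
    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    frequent = list(level)
    k = 1
    while level:
        # Candidate (k+1)-item-sets: unions of frequent k-item-sets
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = [c for c in candidates if sup(c) >= minsup]
        frequent.extend(level)
        k += 1
    return frequent

result = apriori(transactions, 0.3)
print(sorted(map(sorted, result)))
```

With minsup = 0.3 this reproduces the item-sets listed on the next two slides, including the single frequent 3-element item-set {Bag, Uniform, Crayons}.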
16. Mining for Frequent Item-sets
The Apriori algorithm (example), with minimum support = 0.3, using the transactions from slide 14:

Interesting 1-element item-sets:
{Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}

Interesting 2-element item-sets:
{Bag, Uniform}, {Bag, Crayons}, {Bag, Pencil}, {Bag, Books},
{Uniform, Crayons}, {Uniform, Pencil}, {Pencil, Books}

17. Mining for Frequent Item-sets
The Apriori algorithm (example), with minimum support = 0.3:

Interesting 3-element item-sets:
{Bag, Uniform, Crayons}
18. Mining for Association Rules
Association rules are of the form A → B, which are directional.
Association rule mining requires two thresholds: minsup and minconf.

19. Mining for Association Rules
Mining association rules using Apriori. General procedure:
1. Use Apriori to generate frequent itemsets of different sizes.
2. At each iteration, divide each frequent itemset X into two parts, LHS and RHS. This represents a rule of the form LHS → RHS.
3. The confidence of such a rule is support(X)/support(LHS).
4. Discard all rules whose confidence is less than minconf.
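Steps 2–4 can be sketched in a few lines (illustrative code, reusing the slide’s transactions):

```python
from itertools import combinations

transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},  {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def sup(s):
    return sum(s <= t for t in transactions) / len(transactions)

def rules_from_itemset(X, minconf):
    """Divide frequent itemset X into LHS -> RHS; keep rules whose
    confidence sup(X)/sup(LHS) is at least minconf."""
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):
        for lhs in map(frozenset, combinations(X, r)):
            conf = sup(X) / sup(lhs)
            if conf >= minconf:
                rules.append((set(lhs), set(X - lhs), conf))
    return rules

for lhs, rhs, conf in rules_from_itemset({"Bag", "Uniform", "Crayons"}, 0.7):
    print(lhs, "->", rhs, round(conf, 3))
```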
20. Mining for Association Rules
Mining association rules using Apriori. Example:
The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3. It can be divided into the following rules:
{Bag} → {Uniform, Crayons}
{Bag, Uniform} → {Crayons}
{Bag, Crayons} → {Uniform}
{Uniform} → {Bag, Crayons}
{Uniform, Crayons} → {Bag}
{Crayons} → {Bag, Uniform}
21. Mining for Association Rules
Mining association rules using Apriori. Confidence for these rules:
{Bag} → {Uniform, Crayons}      0.375
{Bag, Uniform} → {Crayons}      0.6
{Bag, Crayons} → {Uniform}      0.75
{Uniform} → {Bag, Crayons}      0.428
{Uniform, Crayons} → {Bag}      0.75
{Crayons} → {Bag, Uniform}      0.6
If minconf is 0.7, then we have discovered the following rules…
22. Mining for Association Rules
Mining association rules using Apriori:
People who buy a school bag and a set of crayons are likely to buy a school uniform.
People who buy a school uniform and a set of crayons are likely to buy a school bag.
23. Generalized Association Rules
Since customers can buy any number of items in one transaction, the transaction relation would be in the form of a list of individual purchases.

Bill No.  Date        Item
15563     23.10.2003  Books
15563     23.10.2003  Crayons
15564     23.10.2003  Uniform
15564     23.10.2003  Crayons

24. Generalized Association Rules
A transaction for the purposes of data mining is obtained by performing a GROUP BY of the table over various fields.

25. Generalized Association Rules
A GROUP BY over Bill No. would show frequent buying patterns across different customers.
A GROUP BY over Date would show frequent buying patterns across different days.
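In code, the GROUP BY step amounts to collecting the purchase rows into one basket per key; a sketch using the rows from the slide:

```python
from collections import defaultdict

rows = [  # (bill_no, date, item), transcribed from the slide
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

def group_by(rows, key_index, item_index=2):
    """Collect items into one basket per key value (a GROUP BY)."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row[key_index]].add(row[item_index])
    return dict(baskets)

by_bill = group_by(rows, 0)  # per-customer baskets
by_date = group_by(rows, 1)  # per-day baskets
print(by_bill[15563])        # {'Books', 'Crayons'}
```

The resulting baskets can be fed straight into the Apriori code sketched earlier.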
26. Classification and Clustering
Given a set of data elements:
• Classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes.
• Clustering groups data elements into different groups, based on the similarity between elements within a single group.
27. Classification Techniques
Decision Tree Identification

Outlook   Temp  Play?
Sunny     30    Yes
Overcast  15    No
Sunny     16    Yes
Cloudy    27    Yes
Overcast  25    Yes
Overcast  17    No
Cloudy    17    No
Cloudy    35    Yes

Classification problem: Weather → Play (Yes, No)
28. Classification Techniques
Hunt’s method for decision tree identification. Given N element types and m decision classes:
1. For i ← 1 to N do:
   a. Add element i to the (i−1)-element item-sets from the previous iteration.
   b. Identify the set of decision classes for each item-set.
   c. If an item-set has only one decision class, that item-set is done; remove it from subsequent iterations.
2. Done
29. Classification Techniques
Decision Tree Identification Example

Outlook   Temp      Play?
Sunny     Warm      Yes
Overcast  Chilly    No
Sunny     Chilly    Yes
Cloudy    Pleasant  Yes
Overcast  Pleasant  Yes
Overcast  Chilly    No
Cloudy    Chilly    No
Cloudy    Warm      Yes

Splitting on Outlook: Sunny → Yes; Cloudy → Yes/No; Overcast → Yes/No

30. Classification Techniques
Decision Tree Identification Example (continued)
The Sunny branch contains only “Yes” decisions, so it is done; the Cloudy and Overcast branches are still mixed (Yes/No) and must be split further.

31. Classification Techniques
Decision Tree Identification Example
Splitting the Cloudy branch on Temp: Warm → Yes; Pleasant → Yes; Chilly → No

32. Classification Techniques
Decision Tree Identification Example
Splitting the Overcast branch on Temp: Pleasant → Yes; Chilly → No

33. Classification Techniques
Decision Tree Identification Example
The resulting decision tree:
  Outlook = Sunny → Yes
  Outlook = Cloudy: Temp = Warm → Yes; Temp = Pleasant → Yes; Temp = Chilly → No
  Outlook = Overcast: Temp = Pleasant → Yes; Temp = Chilly → No
34. Classification Techniques
Decision Tree Identification Example
• Top-down technique for decision tree identification
• The decision tree created is sensitive to the order in which items are considered
• If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets

35. Other Classification Algorithms
Quinlan’s depth-first strategy builds the decision tree in a depth-first fashion, by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive.
SLIQ (Supervised Learning in Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value of each node is calculated, and the nodes having the lowest entropy values are selected and expanded.
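The entropy criterion these strategies rely on can be sketched as follows (illustrative only, not the QUEST implementation; it uses the play/weather table from the earlier slides):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, attr, label="Play?"):
    """Weighted entropy of the partition induced by attribute attr;
    lower means purer subsets, i.e. a more informative split."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[label])
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * entropy(g) for g in groups.values())

rows = [  # the Outlook/Temp table from the slides
    {"Outlook": "Sunny",    "Temp": "Warm",     "Play?": "Yes"},
    {"Outlook": "Overcast", "Temp": "Chilly",   "Play?": "No"},
    {"Outlook": "Sunny",    "Temp": "Chilly",   "Play?": "Yes"},
    {"Outlook": "Cloudy",   "Temp": "Pleasant", "Play?": "Yes"},
    {"Outlook": "Overcast", "Temp": "Pleasant", "Play?": "Yes"},
    {"Outlook": "Overcast", "Temp": "Chilly",   "Play?": "No"},
    {"Outlook": "Cloudy",   "Temp": "Chilly",   "Play?": "No"},
    {"Outlook": "Cloudy",   "Temp": "Warm",     "Play?": "Yes"},
]
print(split_entropy(rows, "Outlook"), split_entropy(rows, "Temp"))
```

On this table the Temp split actually yields the lower weighted entropy, which illustrates why an entropy-driven strategy can pick a different attribute order than the hand-built example above.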
36. Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of the same cluster is greater than similarity among members of different clusters.
Similarity measures: Euclidean distance or other application-specific measures.
37. Euclidean Distance for Tables
[Figure: the tuples (Overcast, Chilly, Don’t Play) and (Cloudy, Pleasant, Play) plotted as points in a space whose axes carry the attribute values Sunny/Cloudy/Overcast, Warm/Pleasant/Chilly and Play/Don’t Play; the Euclidean distance between the points measures their dissimilarity.]
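One common way to get such a geometric embedding is one-hot encoding: each attribute value becomes its own 0/1 coordinate. This encoding is one reasonable choice, not something the slides prescribe; a sketch:

```python
from math import sqrt

def one_hot(row, domains):
    """Map a categorical tuple to a 0/1 vector, one coordinate per value."""
    vec = []
    for value, domain in zip(row, domains):
        vec.extend(1.0 if value == v else 0.0 for v in domain)
    return vec

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

domains = [("Sunny", "Cloudy", "Overcast"),
           ("Warm", "Pleasant", "Chilly"),
           ("Play", "Don't Play")]
p = one_hot(("Overcast", "Chilly", "Don't Play"), domains)
q = one_hot(("Cloudy", "Pleasant", "Play"), domains)
print(euclid(p, q))  # all three attributes differ
```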
38. Clustering Techniques
General strategy:
1. Draw a graph connecting items which are close to one another with edges.
2. Partition the graph into maximally connected subcomponents:
   a. Construct an MST for the graph.
   b. Merge items that are connected by the minimum weight of the MST into a cluster.

39. Clustering Techniques
Clustering types:
• Hierarchical clustering: clusters are formed at different levels by merging clusters at a lower level.
• Partitional clustering: clusters are formed at only one level.
40. Clustering Techniques
Nearest-neighbour clustering algorithm. Given n elements x1, x2, …, xn, and a threshold t:
1. Assign x1 to cluster 1; k ← 1; j ← 2
2. Repeat:
   a. Find the nearest neighbour of xj among x1, …, x(j−1); let it belong to cluster m.
   b. If the distance to the nearest neighbour > t, create a new cluster, k ← k+1, and assign xj to it; else assign xj to cluster m.
   c. j ← j+1
3. Until j > n
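A sketch of the algorithm in one dimension (the distance function is a placeholder; any application-specific measure can be substituted):

```python
def nearest_neighbour_clustering(points, t, dist=lambda a, b: abs(a - b)):
    """Assign each point to the cluster of its nearest already-seen
    neighbour, or to a new cluster if that neighbour is farther than t."""
    labels = [0]  # the first point starts cluster 0
    k = 0
    for j in range(1, len(points)):
        # nearest neighbour among points already clustered
        i = min(range(j), key=lambda i: dist(points[j], points[i]))
        if dist(points[j], points[i]) > t:
            k += 1            # open a new cluster
            labels.append(k)
        else:
            labels.append(labels[i])
    return labels

print(nearest_neighbour_clustering([1.0, 1.2, 9.0, 9.3, 1.1], t=2.0))
# -> [0, 0, 1, 1, 0]
```

Note the result depends on the order in which elements arrive, which is characteristic of this algorithm.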
41. Clustering Techniques
Iterative partitional clustering. Given n elements x1, x2, …, xn, and k clusters, each with a center:
1. Assign each element to its closest cluster center.
2. After all assignments have been made, compute the cluster centroids for each of the clusters.
3. Repeat the above two steps with the new centroids until the algorithm converges.
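This is the familiar k-means scheme. A minimal one-dimensional sketch (initial centers are assumed given, and a fixed iteration count stands in for a proper convergence test):

```python
def kmeans(points, centers, iters=10):
    """Iterative partitional clustering of 1-D points around k centers."""
    for _ in range(iters):
        # 1. assign each element to its closest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # 2. recompute centroids (keep the old center if a cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.4], centers=[0.0, 5.0])
print(centers)  # roughly [1.0, 9.2]
```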
42. Mining Sequence Data
Characteristics of sequence data:
• Collection of data elements which are ordered sequences
• In a sequence, each item has an index associated with it
• A k-sequence is a sequence of length k. The support for a k-sequence is the number of m-sequences (m ≥ k) that contain it as a subsequence
• Sequence data: transaction logs, DNA sequences, patient ailment history, …
43. Mining Sequence Data
Some definitions:
• A sequence is a list of itemsets of finite length.
• Example: {pen, pencil, ink}{pencil, ink}{ink, eraser}{ruler, pencil}, e.g. the purchases of a single customer over time.
• The order of items within an itemset does not matter, but the order of itemsets does.
• A subsequence is a sequence with some itemsets deleted.
44. Mining Sequence Data
Some definitions:
• A sequence S’ = {a1, a2, …, am} is said to be contained within another sequence S if S contains a subsequence {b1, b2, …, bm} such that a1 ⊆ b1, a2 ⊆ b2, …, am ⊆ bm.
• Hence, {pen}{pencil}{ruler, pencil} is contained in {pen, pencil, ink}{pencil, ink}{ink, eraser}{ruler, pencil}.
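Containment can be tested greedily, matching each itemset of the smaller sequence against the earliest possible itemset of the larger one:

```python
def contained_in(small, big):
    """True if sequence `small` (a list of itemsets) is contained in `big`:
    some subsequence b1..bm of `big` satisfies small[i] <= bi for every i."""
    i = 0
    for itemset in big:
        if i < len(small) and small[i] <= itemset:
            i += 1  # this itemset of `small` is covered; match the next one
    return i == len(small)

big = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
       {"ink", "eraser"}, {"ruler", "pencil"}]
print(contained_in([{"pen"}, {"pencil"}, {"ruler", "pencil"}], big))  # True
```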
45. Mining Sequence Data
Apriori algorithm for sequences:
1. L1 ← set of all interesting 1-sequences
2. k ← 1
3. While Lk is not empty do:
   a. Generate all candidate (k+1)-sequences
   b. Lk+1 ← set of all interesting (k+1)-sequences
   c. k ← k+1
4. Done
46. Mining Sequence Data
Generating candidate sequences. Given L1, L2, …, Lk, the candidate sequences of Lk+1 are generated as follows: for each sequence s in Lk, concatenate s with the interesting 1-sequences (as in the example on the next slides, extensions at either end are considered).
47. Mining Sequence Data
Example (minsup = 0.5). Input sequences:
abcde, bdae, aebd, be, eabda, aaaa, baaa, cbdb, abbab, abde

Interesting 1-sequences: a, b, d, e
Candidate 2-sequences: aa, ab, ad, ae, ba, bb, bd, be, da, db, dd, de, ea, eb, ed, ee
48. Mining Sequence Data
Example (minsup = 0.5), continued:

Interesting 2-sequences: ab, bd
Candidate 3-sequences: aba, abb, abd, abe, aab, bab, dab, eab, bda, bdb, bdd, bde, bbd, dbd, ebd
Interesting 3-sequences = {}
49. Mining Sequence Data
Language inference: given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequences as behaviour.
Example input set of sequences: aabb, ababcac, abbac, …
[Figure: the input set of sequences is mapped to an output state machine.]

50. Mining Sequence Data
• Inferring the syntax of a language given its sentences
• Applications: discerning behavioural patterns, emergent-properties discovery, collaboration modeling, …
• State machine discovery is the reverse of state machine construction
• Discovery is “maximalist” in nature…
51. Mining Sequence Data
“Maximal” nature of language inference:
[Figure: for the input sequences abc, aabc, aabbc, abbc, the “most general” state machine is a single state with self-loops on a, b, c, while the “most specific” state machine accepts exactly the input sequences.]
52. Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
Given a set of n sequences:
1. Create a state machine for the first sequence.
2. For j ← 2 to n do:
   a. Create a state machine for the jth sequence.
   b. Merge this machine into the earlier one as follows:
      i. Merge all halt states in the new state machine with the halt state in the existing state machine.
      ii. If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path.
3. Done
53. Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
[Figure: the state machines for two input sequences (aabcbaac and aabc) are merged step by step; paths that share the same suffix into the halt state are collapsed into a single path.]
54. Mining Streaming Data
Characteristics of streaming data:
• Large data sequence
• No storage
• Often an infinite sequence
• Examples: stock market quotes, streaming audio/video, network traffic

55. Mining Streaming Data
Running mean:
Let n = number of items read so far, and avg = running average calculated so far.
On reading the next number num:
  avg ← (n*avg + num) / (n+1)
  n ← n+1
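The update rule as code, using O(1) memory and never storing the sequence:

```python
class RunningMean:
    """Streaming average: update in O(1) per element, no storage."""
    def __init__(self):
        self.n = 0
        self.avg = 0.0

    def add(self, num):
        # avg <- (n*avg + num) / (n+1); n <- n+1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        return self.avg

m = RunningMean()
for x in [2.0, 4.0, 6.0]:
    m.add(x)
print(m.avg)  # 4.0
```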
56. Mining Streaming Data
Running variance:
  var = ∑(num − avg)²
      = ∑num² − 2*avg*∑num + n*avg²
Let A = ∑num² of all numbers read so far,
    B = 2*∑num*avg of all numbers read so far,
    C = ∑avg² of all numbers read so far,
    avg = average of the numbers read so far,
    n = number of numbers read so far.

57. Mining Streaming Data
Running variance:
On reading the next number num:
  avg ← (avg*n + num) / (n+1)
  n ← n+1
  A ← A + num²
  B ← B + 2*avg*num
  C ← C + avg²
  var = A − B + C
(B and C accumulate the running value of avg at each step, so this tracks ∑(num − avg)² only approximately.)
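Because B and C above use the running average, which keeps changing, the recurrence only approximates the true variance. Welford’s one-pass algorithm (not from the slides, but standard) tracks the exact sum of squared deviations with the same O(1) memory:

```python
class RunningVariance:
    """Welford's one-pass algorithm for streaming mean and variance."""
    def __init__(self):
        self.n = 0
        self.avg = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, num):
        self.n += 1
        delta = num - self.avg
        self.avg += delta / self.n
        self.m2 += delta * (num - self.avg)

    @property
    def variance(self):
        """Population variance of everything read so far."""
        return self.m2 / self.n if self.n else 0.0

rv = RunningVariance()
for x in [2.0, 4.0, 6.0]:
    rv.add(x)
print(rv.avg, rv.variance)  # 4.0 and 8/3
```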
58. Mining Streaming Data
γ-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999):
Let streaming data be in the form of “frames”, where each frame comprises one or more data elements.
Support for data element k within a frame is defined as (#occurrences of k) / (#elements in the frame).
γ-Consistency for data element k is the “sustained” support for k over all frames read so far, with a “leakage” of (1−γ).

59. Mining Streaming Data
γ-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999):
  level_t(k) = (1−γ) * level_{t−1}(k) + γ * sup(k)
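The recurrence is an exponentially weighted (leaky) average of the per-frame support; as a sketch:

```python
def gamma_consistency(frames, k, gamma):
    """Track the 'sustained' support of element k across frames:
    level_t = (1 - gamma) * level_{t-1} + gamma * sup_t(k)."""
    level = 0.0
    for frame in frames:
        sup = frame.count(k) / len(frame)  # support of k in this frame
        level = (1 - gamma) * level + gamma * sup
    return level

frames = [["a", "b", "a", "c"], ["a", "a", "a", "b"], ["b", "c", "c", "a"]]
print(gamma_consistency(frames, "a", gamma=0.5))  # 0.375
```

With γ close to 1 the level tracks the most recent frames; with γ close to 0 old frames leak away only slowly.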
60. Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis
• Also called “data marts”
• A critical component of the decision support system (DSS) of enterprises
• Some typical DW queries:
  – Which item sells best in each region that has retail outlets?
  – Which advertising strategy is best for South India?
  – Which (age group/occupation) in South India likes fast food, and which (age group/occupation) likes to cook?
61. Data Warehousing
[Figure: transactional (OLTP) systems for order processing, inventory and sales feed through a data-cleaning step into the data warehouse (OLAP).]
62. OLTP vs OLAP

Transactional Data (OLTP)         | Analysis Data (OLAP)
----------------------------------|---------------------------------
Small or medium size databases    | Very large databases
Transient data                    | Archival data
Frequent insertions and updates   | Infrequent updates
Small query shadow                | Very large query shadow
Normalization important           | De-normalization important
to handle updates                 | to handle queries
63. Data Cleaning
• Performs logical transformation of transactional data to suit the data warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
64. Data Cleaning
[Figure: transactional tables (e.g. Orders with Order_id, Cust_id, Prod_id, Price; Inventory) are transformed into warehouse dimensions such as Customers, Products, Orders, Inventory and Time, with derived attributes like Price_chng, Cust_prof and Tot_sales.]
65. Multi-dimensional Data Model
[Figure: a data cube of Orders with dimensions Customers, Products and Time (Jan’01, Jun’01, Jan’02, Jun’02); each cell holds a measure such as Price.]
66. Some MDBMS Operations
• Roll-up: aggregate along a dimension (collapse dimensions)
• Drill-down: move to finer detail (add dimensions)
• Vector-distance operations (ex: clustering)
• Vector space browsing
67. Star Schema
[Figure: a central fact table surrounded by dimension tables, each linked directly to the fact table.]
68. WWW Based References
• http://www.kdnuggets.com/
• http://www.megaputer.com/
• http://www.almaden.ibm.com/cs/quest/index.html
• http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
• http://www.cs.su.oz.au/~thierry/ckdd.html
• http://www.dwinfocenter.org/
• http://datawarehouse.itoolbox.com/
• http://www.knowledgestorm.com/
• http://www.bitpipe.com/
• http://www.dw-institute.com/
• http://www.datawarehousing.com/
69. References
• R. Agrawal, R. Srikant: “Fast Algorithms for Mining Association Rules”, Proc. of the 20th Int’l Conference on Very Large Data Bases, Santiago, Chile, Sept. 1994.
• R. Agrawal, R. Srikant: “Mining Sequential Patterns”, Proc. of the Int’l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
• R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant: “The Quest Data Mining System”, Proc. of the 2nd Int’l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
• Surajit Chaudhuri, Umesh Dayal: “An Overview of Data Warehousing and OLAP Technology”, ACM SIGMOD Record, 26(1), March 1997.
• Jennifer Widom: “Research Problems in Data Warehousing”, Proc. of the Int’l Conference on Information and Knowledge Management, 1995.

70. References
• A. Shoshani: “OLAP and Statistical Databases: Similarities and Differences”, Proc. of ACM PODS 1997.
• Panos Vassiliadis, Timos Sellis: “A Survey on Logical Models for OLAP Databases”, ACM SIGMOD Record.
• M. Gyssens, Laks V. S. Lakshmanan: “A Foundation for Multi-Dimensional Databases”, Proc. of VLDB 1997, Athens, Greece.
• Srinath Srinivasa, Myra Spiliopoulou: “Modeling Interactions Based on Consistent Patterns”, Proc. of CoopIS 1999, Edinburgh, UK.
• Srinath Srinivasa, Myra Spiliopoulou: “Discerning Behavioral Patterns by Mining Transaction Logs”, Proc. of ACM SAC 2000, Como, Italy.
