CS6905 - Data Mining using OLAP

Transcript

  • 1. CS6905 Data Mining (by Daniel Lemire) CS6905 - Data Mining using OLAP. OLAP primarily supports user-driven queries. However, many data warehouses are used for data mining. What is the relationship between the two?
  • 2. Buzz words
      • Differences between OLAP and Data Mining
      • OLAP as a Deductive Process
      • Association Rules
      • Attribute-Value Focusing
      • Iceberg Queries
  • 3. References for this lecture
      • The course textbook
      • OLAP Mining: An Integration of OLAP with Data Mining, Jiawei Han
      • Discovery of Multiple-Level Association Rules from Large Databases, Jiawei Han and Yongjian Fu
      • Computing Iceberg Queries Efficiently, Min Fang et al.
      • Many more!
  • 4. What is data mining? A fashionable industry term (danger, danger). Han defines data mining as: discover some non-trivial and interesting knowledge or patterns. My definition: precise answers to imprecise queries.
  • 5. Is OLAP a form of data mining? NO. OLAP is meant for end users; Data Mining is for experts. OLAP provides views; Data Mining provides rules and relations.
  • 6. What does our textbook say? OLAP provides descriptive modelling; Data Mining provides explanatory modelling.
  • 7. Briefly recall. Deduction: from general to specific, applying rules to instances. Induction: from specific to general, finding some rules out of many facts.
  • 8. What does the course summary say? OLAP supports deductive analysis: the user provides a rule, and it is tested and made precise. Data Mining supports inductive analysis: feed in the data source, find a rule.
  • 9. What does Data Mining do?
      • Characterization: do not breathe ⇔ dead
      • Comparison: dogs are bigger than cats
      • Classification: caucasian, asian, african
      • Association: sunburn when I go outside
      • Prediction: you are likely to like beer and beautiful women
  • 10. An Itemset. An itemset is simply a non-empty set of attribute values. Typically, itemsets are large. A k-itemset is an itemset containing exactly k elements.
  • 11. Association Rules
      • Formally defined as $A_1 \wedge \dots \wedge A_i \rightarrow B_1 \wedge \dots \wedge B_j$.
      • Support of $A_1 \wedge \dots \wedge A_i$: $P(A_1 \wedge \dots \wedge A_i)$.
      • Confidence of the rule $A_1 \wedge \dots \wedge A_i \rightarrow B_1 \wedge \dots \wedge B_j$: $P(B_1 \wedge \dots \wedge B_j \mid A_1 \wedge \dots \wedge A_i) = \frac{P(B_1 \wedge \dots \wedge B_j \wedge A_1 \wedge \dots \wedge A_i)}{P(A_1 \wedge \dots \wedge A_i)}$.
      • (Some authors define support differently.)
  • 12. Association Rules: Example
        Monster species   Color   Vegetarian?
        Ziziz             Red     yes
        YiYoz             Blue    yes
        Filoufoul         Red     no
        Coucou            Red     yes
        Passpass          Blue    yes
  • 13. Association Rules: Example
      • BLUE monsters are vegetarian: support = 40%, confidence = 100%
      • RED monsters are vegetarian: support = 40%, confidence = 66.7%
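The two figures above can be checked with a short sketch. It assumes that the support of a rule is the fraction of rows matching both sides, P(color ∧ vegetarian), which is the reading that reproduces the 40% quoted for both rules; the function name `rule_stats` is made up for this illustration.

```python
# Monster table from the example slide: (species, color, vegetarian?)
MONSTERS = [
    ("Ziziz", "Red", "yes"),
    ("YiYoz", "Blue", "yes"),
    ("Filoufoul", "Red", "no"),
    ("Coucou", "Red", "yes"),
    ("Passpass", "Blue", "yes"),
]


def rule_stats(rows, color):
    """Return (support, confidence) for the rule: color -> vegetarian."""
    with_color = [r for r in rows if r[1] == color]
    both = [r for r in with_color if r[2] == "yes"]
    # Assumption: support of the rule is P(color AND vegetarian).
    support = len(both) / len(rows)
    # Confidence is the conditional probability P(vegetarian | color).
    confidence = len(both) / len(with_color)
    return support, confidence


for color in ("Blue", "Red"):
    s, c = rule_stats(MONSTERS, color)
    print(f"{color}: support = {s:.1%}, confidence = {c:.1%}")
```

If support were instead taken on the left-hand side alone, the RED rule would get 60% rather than 40%, which is one way authors differ on the definition.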
  • 14. Strong Association Rules
      • The user sets support and confidence thresholds, e.g. at least 100 relations, 80% confidence.
      • Rules above the support threshold have Large support.
      • Rules above the confidence threshold have High confidence.
      • Rules satisfying both are said to be Strong.
  • 15. Classical Association Rules Mining
      • The classical reference is Fast Algorithms for Mining Association Rules by Agrawal and Srikant.
      • They presented the Apriori algorithm: the reference algorithm.
      • Not very OLAPish though.
  • 16. Back to OLAP!
      • To understand, we turn the monster database into a cube:
        GROUP BY Species   red   blue
        vegetarian         2     2
        eats humans        1     0
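As a rough sketch of how such a count cube yields support and confidence directly, the cells can be built with a counter and then divided. The helper names `support` and `confidence` are assumptions of this example, and support is again read as the fraction of all rows falling in a cell.

```python
from collections import Counter

# (color, vegetarian?) pairs taken from the monster table.
ROWS = [
    ("Red", "yes"),
    ("Blue", "yes"),
    ("Red", "no"),
    ("Red", "yes"),
    ("Blue", "yes"),
]

# Equivalent of GROUP BY color, vegetarian with COUNT(*).
cube = Counter(ROWS)
total = sum(cube.values())


def support(color, veg):
    """Fraction of all rows that fall in the (color, veg) cell."""
    return cube[(color, veg)] / total


def confidence(color, veg):
    """Cell count divided by the total of its color column."""
    color_total = sum(n for (c, _), n in cube.items() if c == color)
    return cube[(color, veg)] / color_total


print(cube[("Red", "yes")], cube[("Red", "no")])    # red column
print(cube[("Blue", "yes")], cube[("Blue", "no")])  # blue column
print(support("Red", "yes"), confidence("Red", "yes"))
```

No pass over the raw rows is needed once the cube exists; every quantity is a ratio of cell counts.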
  • 17. Are cubes always related to support?
      • Look at these sales... Where's the support?
                     Week Days   Week-Ends
        hot-dogs     432.32$     132$
        fries        332.35$     745.12$
      • GROUP BY cubes give support for free; other cubes do not!
  • 18. Ok, so that's statistics, right?
      • No.
      • Statistics samples a problem and uses a model to predict.
      • OLAP does brute-force computation.
      • Recall that OLAP wasn't thinkable when statistics became popular.
  • 19. Ok, so it's simple, right?
      • No.
      • Efficient methods to automatically search for strong rules exist (*).
      • They often fail to be useful.
      • Machines don't do pattern recognition well!
      (*) Agrawal et al., Mining association rules between sets of items in large databases.
  • 20. General Association Rule Mining
      • First, focus on getting (at least) minimum support.
      • Then, among the instances that are left, go for (at least) minimum confidence.
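The two phases can be illustrated with a deliberately naive sketch on the monster data. This is brute-force enumeration, not Apriori or any algorithm from the cited papers; the item encoding (`color=Red`, `veg=yes`), the function names, and the thresholds are assumptions chosen for the example.

```python
from itertools import combinations

# Each monster becomes a set of attribute values (an itemset).
TRANSACTIONS = [
    {"color=Red", "veg=yes"},
    {"color=Blue", "veg=yes"},
    {"color=Red", "veg=no"},
    {"color=Red", "veg=yes"},
    {"color=Blue", "veg=yes"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def strong_rules(transactions, min_support, min_confidence):
    items = sorted(set().union(*transactions))
    # Phase 1: keep only itemsets with at least the minimum support.
    frequent = [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
        if support(frozenset(combo), transactions) >= min_support
    ]
    # Phase 2: among what is left, keep rules with enough confidence.
    rules = []
    for itemset in frequent:
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = support(itemset, transactions) / support(lhs, transactions)
                if conf >= min_confidence:
                    rules.append((sorted(lhs), sorted(itemset - lhs), conf))
    return rules


for lhs, rhs, conf in strong_rules(TRANSACTIONS, 0.4, 0.8):
    print(lhs, "->", rhs, f"confidence = {conf:.0%}")
```

With a 40% support threshold and an 80% confidence threshold, only the rule from blue to vegetarian survives both phases. Enumerating every subset is exponential in the number of items, which is why practical miners prune candidates.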
  • 21. Sample data
                      ok cheese   yellow cheese   skim milk   1% milk   2% milk   fat milk
        whole wheat   12          0               43          13        22        0
        white bread   8           16              432         3304      4343      444
        brown bread   0           32              2           99        441       4324
  • 22. Simplistic Han-Fu algorithm
      • Support threshold at 30%; the user wants relations against bread.
      • First test the relation cheese vs bread: small support, skip.
      • Test the relation milk vs bread: possible!
  • 23. Finer scale data
                      skim milk   1% milk   2% milk   plenty of fat   ···
        whole wheat   43          13        22        0               ···
        white bread   432         3304      4343      444             ···
        brown bread   2           99        441       4324            ···
        ...           ...         ...       ...       ...             ...
  • 24. Attribute-Value Focusing
      • Support threshold: $P(A_1 \wedge \dots \wedge A_i) > 0.3$
      • Skim milk vs bread: small support, skip.
      • Test relation 1% milk vs bread: small support (25%).
      • Test relation 2% milk vs bread: possible!
      • Test relation fat milk vs bread: possible!
  • 25. Finer scale Han-Fu
      • Support threshold at 30% (relative).
      • Skim milk vs bread: small support, skip.
      • Test relation 1% milk vs bread: small support (25%).
      • Test relation 2% milk vs bread: possible!
      • Test relation fat milk vs bread: possible!
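The drill-down from coarse categories to individual products can be sketched as follows. This is a simplified illustration, not the published Han-Fu algorithm: the two-level hierarchy (cheese, milk) is inferred from the column names, and support is assumed to be a column total divided by the grand total of the sample table.

```python
# Sample data: bread type -> counts per dairy product.
COUNTS = {
    "whole wheat": {"ok cheese": 12, "yellow cheese": 0, "skim milk": 43,
                    "1% milk": 13, "2% milk": 22, "fat milk": 0},
    "white bread": {"ok cheese": 8, "yellow cheese": 16, "skim milk": 432,
                    "1% milk": 3304, "2% milk": 4343, "fat milk": 444},
    "brown bread": {"ok cheese": 0, "yellow cheese": 32, "skim milk": 2,
                    "1% milk": 99, "2% milk": 441, "fat milk": 4324},
}

# Assumed concept hierarchy: coarse category -> finer products.
HIERARCHY = {
    "cheese": ["ok cheese", "yellow cheese"],
    "milk": ["skim milk", "1% milk", "2% milk", "fat milk"],
}

GRAND_TOTAL = sum(sum(row.values()) for row in COUNTS.values())


def column_total(product):
    """Total count of one product across all bread types."""
    return sum(row[product] for row in COUNTS.values())


def focus(min_support=0.3):
    """Drill into a category only if the category itself has enough support."""
    kept = {}
    for category, products in HIERARCHY.items():
        category_support = sum(column_total(p) for p in products) / GRAND_TOTAL
        if category_support < min_support:
            continue  # e.g. cheese: small support, skip the whole branch
        for product in products:
            s = column_total(product) / GRAND_TOTAL
            if s >= min_support:
                kept[product] = s
    return kept


for product, s in focus().items():
    print(f"{product}: {s:.1%}")
```

Under these assumptions cheese is pruned at the coarse level, skim milk and 1% milk (about 25%) fall below the threshold, and 2% milk and fat milk remain as candidates.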
  • 26. Conclusion of simplified Han-Fu
      • Would spot brown bread ↔ plenty of fat
      • Would spot 2% whole wheat
  • 27. whole wheat ↔ skim milk
      • Whole wheat has low support.
      • Automated rule mining is likely to fail.
      • Lowering support will not help!
      • Confidence and support are not enough to define interestingness.
      • Need human pattern recognition!
  • 28. Item Tuples
      • To find association rules, it is almost enough to find Item Pairs (*).
      • Item Pairs:
        (white bread, 1% milk): 3304
        (white bread, 2% milk): 4343
        (brown bread, fatty milk): 4324
      (*) Park et al., An effective hash based algorithm for mining association rules.
  • 29. Iceberg Cubes
      • Data cube $C_{i_1,\dots,i_k}$ (positive values, GROUP BY).
      • Choose a threshold $\varepsilon > 0$.
      • Iceberg cube: $IC_{i_1,\dots,i_k} = \begin{cases} C_{i_1,\dots,i_k} & \text{if } C_{i_1,\dots,i_k} > \varepsilon \\ 0 & \text{otherwise} \end{cases}$
      • Effectively ignores regions of small support.
      • References: Min Fang et al., Computing Iceberg Queries Efficiently. Beyer and Ramakrishnan, Bottom-Up Computation of Sparse and Iceberg CUBEs.
  • 30. Iceberg Cube (ε = 4000)
                      skim milk   1% milk   2% milk   plenty of fat   ···
        whole wheat   0           0         0         0               ···
        white bread   0           0         4343      0               ···
        brown bread   0           0         0         4324            ···
        ...           ...         ...       ...       ...             ...
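A minimal sketch of the thresholding step, applied to the finer-scale counts with ε = 4000. It follows the definition literally, keeping a cell only when its value is strictly greater than ε; the dictionary layout and the function name `iceberg` are choices made for this example.

```python
# Finer-scale cube: (bread, milk) -> count.
CUBE = {
    ("whole wheat", "skim milk"): 43,
    ("whole wheat", "1% milk"): 13,
    ("whole wheat", "2% milk"): 22,
    ("whole wheat", "plenty of fat"): 0,
    ("white bread", "skim milk"): 432,
    ("white bread", "1% milk"): 3304,
    ("white bread", "2% milk"): 4343,
    ("white bread", "plenty of fat"): 444,
    ("brown bread", "skim milk"): 2,
    ("brown bread", "1% milk"): 99,
    ("brown bread", "2% milk"): 441,
    ("brown bread", "plenty of fat"): 4324,
}


def iceberg(cube, eps):
    """Zero out every cell whose value does not exceed the threshold."""
    return {cell: (value if value > eps else 0) for cell, value in cube.items()}


ic = iceberg(CUBE, 4000)

# Only the non-zero cells need to be stored, so the cube becomes very sparse.
sparse = {cell: value for cell, value in ic.items() if value != 0}
print(sparse)
```

Only two of the twelve cells survive, which is where the storage benefit of iceberg cubes comes from.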
  • 31. General Iceberg Cubes
      • Our definition only works for GROUP BY cubes.
      • The more general case is
        SELECT A,B,C,COUNT(*),SUM(X) FROM R CUBE BY A,B,C HAVING COUNT(*) >= N
      • as opposed to simple iceberg queries
        SELECT A,B,C,COUNT(*) FROM R GROUP BY A,B,C HAVING COUNT(*) >= N
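To make the difference between the two queries concrete, here is a hedged sketch using Python's standard `sqlite3` module. SQLite has no CUBE BY clause, so the cube is emulated by running one GROUP BY ... HAVING query per non-empty subset of the dimensions; the rows in table R are invented for the illustration, and the grand-total group (no dimensions) is left out for brevity.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B TEXT, C TEXT, X INTEGER)")
# Hypothetical rows, made up for this example.
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?, ?)",
    [
        ("a1", "b1", "c1", 10),
        ("a1", "b1", "c1", 20),
        ("a1", "b2", "c1", 5),
        ("a2", "b1", "c2", 7),
    ],
)


def iceberg_cube(conn, dims, min_count):
    """Emulate CUBE BY with one iceberg query per subset of dimensions."""
    results = {}
    for size in range(len(dims), 0, -1):
        for group in combinations(dims, size):
            cols = ", ".join(group)  # dims are fixed identifiers, not user input
            sql = (
                f"SELECT {cols}, COUNT(*), SUM(X) FROM R "
                f"GROUP BY {cols} HAVING COUNT(*) >= ? ORDER BY {cols}"
            )
            results[group] = conn.execute(sql, (min_count,)).fetchall()
    return results


cube = iceberg_cube(conn, ("A", "B", "C"), min_count=2)
for group, rows in cube.items():
    print(group, rows)
```

The entry for `("A", "B", "C")` alone corresponds to the simple iceberg query; the other entries are the extra group-bys that the cube adds. Running each group-by separately is the naive approach; the cited papers are about computing them more efficiently.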
  • 32. Benefits?
      • Faster analysis of association rules (*): only need to focus on minimal confidence.
      • Really sparse cubes (storage).
      (*) Kamber et al., Metarule-guided mining of multi-dimensional association rules using data cubes.