Data Mining Presentation



  1. 1. Data Mining "Part of Knowledge Discovery" <ul><li>Mining and discovery of new information in terms of patterns or rules from vast amounts of data (Elmasri, ch. 27) </li></ul>
  2. 2. Why data mining? <ul><li>Many organizations have generated large amounts of machine-readable data </li></ul><ul><li>SQL assumes the user is aware of the database schema. </li></ul><ul><li>Helps in extracting meaningful new patterns that cannot be found by merely querying or processing data </li></ul><ul><li>Diagram: Combine data, Find interesting patterns, Business decision </li></ul>
  3. 3. Knowledge Discovery in Databases (KDD) <ul><li>Data selection </li></ul><ul><li>Data cleansing </li></ul><ul><li>Enrichment </li></ul><ul><li>Data transformation (encoding) </li></ul><ul><li>Data mining </li></ul><ul><li>Reporting and display </li></ul>
  4. 4. Goals of Data Mining and KDD <ul><li>Prediction, e.g. sales volume, earthquakes </li></ul><ul><li>Identification, e.g. existence of genes, system intrusions </li></ul><ul><li>Classification into different categories, e.g. discount-seeking shoppers or loyal regular shoppers in a supermarket </li></ul><ul><li>Optimization of limited resources such as time, space, money or materials, and maximization of outputs such as sales or profits </li></ul>
  5. 5. Predict using data mining
  6. 6. Association Rule <ul><li>The database is regarded as a collection of transactions, each involving a set of items (Elmasri). </li></ul><ul><li>Example: Market Basket Analysis </li></ul><ul><li>Items frequently purchased together: </li></ul><ul><ul><li>Ex. milk => juice </li></ul></ul><ul><ul><li>Ex. bread => juice </li></ul></ul><ul><li>Results used for: </li></ul><ul><ul><li>Product placement </li></ul></ul><ul><ul><li>Advertising </li></ul></ul><ul><ul><li>Coupons </li></ul></ul><ul><ul><li>Sales </li></ul></ul><ul><li>Transactions: t1 = {milk, bread, cookies, juice}, t2 = {milk, juice}, t3 = {milk, eggs}, t4 = {bread, cookies, coffee} </li></ul>
  7. 7. Association Rule Definition <ul><li>Association Rule: X => Y, where $X = \{x_1, x_2, \dots, x_n\}$ and $Y = \{y_1, y_2, \dots, y_m\}$ </li></ul><ul><ul><li>X and Y are sets of items, with $x_i$ and $y_j$ being distinct items for all i and j </li></ul></ul><ul><li>If a customer buys X she is also likely to buy Y </li></ul><ul><li>Association Rule: In general LHS => RHS </li></ul><ul><li>Itemset: LHS ∪ RHS, the set of items purchased by customers </li></ul>
  8. 8. Support & Confidence <ul><li>Support of a rule LHS => RHS: Percentage of transactions which contain all items of the itemset LHS ∪ RHS </li></ul><ul><li>*How frequently a specific itemset occurs </li></ul><ul><li>*Prevalence of the rule </li></ul><ul><li>Confidence of a rule LHS => RHS: Ratio of the number of transactions that contain LHS ∪ RHS to the number that contain LHS, i.e. support(LHS ∪ RHS)/support(LHS) </li></ul><ul><li>*Strength of the rule </li></ul>
  9. 9. Example of Support <ul><li>Support: percentage of transactions which contain LHS ∪ RHS </li></ul><ul><li>Support of {milk, juice} is 50%, 2 of 4 transactions. </li></ul><ul><li>Support of {bread, juice} is 25%, 1 of 4 transactions. </li></ul><ul><li>Transactions: t1 = {milk, bread, cookies, juice}, t2 = {milk, juice}, t3 = {milk, eggs}, t4 = {bread, cookies, coffee} </li></ul>
  10. 10. Example of Confidence <ul><li>Confidence: support(LHS ∪ RHS)/support(LHS) </li></ul><ul><li>Confidence of milk => juice is 50/75 = 66.7%, 2 of the 3 transactions that contain milk. </li></ul><ul><li>Confidence of bread => juice is 25/50 = 50%, 1 of the 2 transactions that contain bread. </li></ul><ul><li>Transactions: t1 = {milk, bread, cookies, juice}, t2 = {milk, juice}, t3 = {milk, eggs}, t4 = {bread, cookies, coffee} </li></ul>
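
The support and confidence figures on the two example slides can be checked with a few lines of Python. This is only a minimal sketch: the function names `support` and `confidence` and the constant `TRANSACTIONS` are chosen here for illustration and do not come from the slides; the four transactions are the ones shown in the deck.

```python
# The four example transactions from the slides.
TRANSACTIONS = [
    {"milk", "bread", "cookies", "juice"},  # t1
    {"milk", "juice"},                      # t2
    {"milk", "eggs"},                       # t3
    {"bread", "cookies", "coffee"},         # t4
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(LHS union RHS) / support(LHS) for the rule LHS => RHS."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


if __name__ == "__main__":
    print(support({"milk", "juice"}, TRANSACTIONS))        # 2 of 4 transactions
    print(confidence({"milk"}, {"juice"}, TRANSACTIONS))   # 2 of 3 milk transactions
    print(confidence({"bread"}, {"juice"}, TRANSACTIONS))  # 1 of 2 bread transactions
```

Representing each transaction as a Python `set` makes "contains the itemset" a plain subset test (`<=`).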
  11. 11. Problems <ul><li>The association rule problem is to identify all association rules X => Y with a minimum support and confidence. </li></ul><ul><li>The number of distinct itemsets is $2^m$, where m is the number of items. </li></ul><ul><li>Counting support for all possible itemsets becomes very computationally intensive! </li></ul><ul><li>For m = 1000: $2^{1000}$ = 10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376 </li></ul>
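
To get a feel for how fast the number of itemsets grows, the sketch below enumerates every non-empty itemset over the six items of the example and checks the size of 2^1000. The helper name `all_itemsets` and the list `ITEMS` are illustrative assumptions, not part of the slides; brute-force enumeration like this is exactly what becomes infeasible for realistic m.

```python
from itertools import combinations

# The six distinct items that occur in the example transactions.
ITEMS = ["milk", "bread", "juice", "cookies", "eggs", "coffee"]


def all_itemsets(items):
    """Yield every non-empty subset of `items` (2**m - 1 of them)."""
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


if __name__ == "__main__":
    m = len(ITEMS)
    # Counting the empty set as well gives the 2**m from the slide.
    print(sum(1 for _ in all_itemsets(ITEMS)) + 1, 2 ** m)
    # Number of decimal digits in 2**1000.
    print(len(str(2 ** 1000)))
```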
  12. 12. Techniques <ul><li>The following properties of sets are used to create efficient algorithms: </li></ul><ul><ul><li>Downward Closure: a subset of a large (frequent) itemset must also be large </li></ul></ul><ul><ul><li>Antimonotonicity: a superset of a small (infrequent) itemset is also small </li></ul></ul>
  13. 13. Sets and Combinatorics <ul><li>Combinatorics is a branch of mathematics that studies finite collections of objects that satisfy specified criteria, and is in particular concerned with "counting" the objects in those collections and with deciding whether certain "optimal" objects exist </li></ul>
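
The two properties above translate into a simple pruning test: a candidate of size k can be discarded without counting if any of its (k-1)-subsets is not frequent. The function name `has_infrequent_subset` is a hypothetical helper introduced here, and the sample frequent 2-itemsets assume a minimum support of 0.5 on the example transactions.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_smaller):
    """Return True if some (k-1)-subset of `candidate` is not frequent.

    `frequent_smaller` is the set of frequent itemsets of size k-1,
    each stored as a frozenset.
    """
    k = len(candidate)
    return any(
        frozenset(sub) not in frequent_smaller
        for sub in combinations(candidate, k - 1)
    )


if __name__ == "__main__":
    # Frequent 2-itemsets of the example, assuming minimum support 0.5.
    frequent_2 = {frozenset({"milk", "juice"}), frozenset({"bread", "cookies"})}
    # {milk, bread} is not frequent, so the 3-itemset can be pruned.
    print(has_infrequent_subset(frozenset({"milk", "juice", "bread"}), frequent_2))
```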
  14. 14. Apriori Algorithm <ul><li>Algorithm for finding frequent (large) itemsets: </li></ul><ul><li>1. k = 1; </li></ul><ul><li>2. Find the frequent itemsets, $L_k$, from $C_k$, the set of all candidate itemsets; </li></ul><ul><li>3. Form $C_{k+1}$ from $L_k$; </li></ul><ul><li>4. k = k+1; </li></ul><ul><li>Repeat steps 2-4 until $C_k$ is empty </li></ul>
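
The loop above can be sketched in Python as follows. This is a simplified illustration rather than the textbook's exact procedure: the function name `apriori`, the dictionary return type, and the way candidates are joined (unions of pairs from the current level) are choices made here for brevity.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # C_1: every single item that occurs in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # L_k: the candidates that reach the minimum support.
        supports = {c: support(c) for c in candidates}
        level = {c: s for c, s in supports.items() if s >= min_support}
        frequent.update(level)
        # C_{k+1}: unions of two L_k itemsets that have size k+1 and whose
        # k-subsets are all in L_k (downward closure).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        k += 1
    return frequent


if __name__ == "__main__":
    transactions = [
        {"milk", "bread", "cookies", "juice"},
        {"milk", "juice"},
        {"milk", "eggs"},
        {"bread", "cookies", "coffee"},
    ]
    for itemset, s in apriori(transactions, 0.5).items():
        print(sorted(itemset), s)
```

Each pass over `candidates` corresponds to one scan of the database, which is why reducing the candidate set matters so much in practice.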
  15. 15. Ex. <ul><li>Candidate 1-itemsets ($C_1$) with their support: </li></ul><ul><li>{milk}: 0.75, {bread}: 0.5, {juice}: 0.5, {cookies}: 0.5, {eggs}: 0.25, {coffee}: 0.25 </li></ul><ul><li>$C_2$ itemsets with their support ({eggs} and {coffee} are dropped, so the minimum support is evidently above 0.25): </li></ul><ul><li>{milk,bread}: 0.25, {milk,juice}: 0.5, {bread,juice}: 0.25, {milk,cookies}: 0.25, {bread,cookies}: 0.5, {juice,cookies}: 0.25 </li></ul><ul><li>$C_3$ itemsets: </li></ul><ul><li>None! Consider {milk, juice, bread}: its subset {bread, juice} is not frequent, so it cannot be a candidate </li></ul><ul><li>Transactions: t1 = {milk, bread, cookies, juice}, t2 = {milk, juice}, t3 = {milk, eggs}, t4 = {bread, cookies, coffee} </li></ul>
  16. 16. Improved Algorithms <ul><li>Methods: </li></ul><ul><ul><li>Multiple scans to cope with large itemsets </li></ul></ul><ul><ul><li>Bitmaps </li></ul></ul><ul><ul><li>Hash trees </li></ul></ul><ul><li>Algorithms: </li></ul><ul><ul><li>Sampling Algorithm </li></ul></ul><ul><ul><li>Frequent-Pattern Tree Algorithm </li></ul></ul><ul><ul><li>Partition Algorithm </li></ul></ul>
  17. 17. Sampling Algorithm <ul><li>Pick a sample that fits in RAM </li></ul><ul><li>Compute the support of the itemsets in the sample </li></ul><ul><li>Extend from the sample to the rest of the database </li></ul>
  18. 18. Frequent-Pattern Tree <ul><li>Generate the frequent 1-itemsets </li></ul><ul><li>On a transaction-by-transaction basis, select the items that exist in the frequent 1-itemsets </li></ul><ul><li>Using the rules of the algorithm, a tree structure is grown as the transactions are processed </li></ul><ul><li>Once the tree is constructed, frequent patterns can be mined from it directly, without generating candidate itemsets </li></ul>
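
The construction phase described above might look roughly like this. It is a sketch under assumptions: the class name `FPNode`, the function `build_fp_tree`, the `min_count` parameter, and the alphabetical tie-breaking are all introduced here for illustration, and the header table and the mining step of the full algorithm are left out.

```python
from collections import Counter


class FPNode:
    """One node of the tree: an item, a count, and its child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count each item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode()
    # Second scan: insert the frequent items of each transaction, ordered by
    # descending count (ties broken alphabetically so the order is stable).
    for t in transactions:
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent


if __name__ == "__main__":
    transactions = [
        {"milk", "bread", "cookies", "juice"},
        {"milk", "juice"},
        {"milk", "eggs"},
        {"bread", "cookies", "coffee"},
    ]
    root, frequent = build_fp_tree(transactions, min_count=2)
    for item, child in root.children.items():
        print(item, child.count)
```

Transactions that share a prefix of frequent items share a path in the tree, which is what keeps the structure compact.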
  19. 19. Partition Algorithm <ul><li>Divide database into sections </li></ul><ul><li>Find frequent itemsets for each partition </li></ul><ul><li>Take the union of those itemsets </li></ul><ul><li>Discard false positives </li></ul>
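
The four steps above can be illustrated with a short sketch. The function names `frequent_itemsets` and `partition_mine` and the `num_partitions` parameter are assumptions made for this example, and the local miner is a deliberately naive brute-force search that only suits small partitions. The approach works because an itemset that is frequent in the whole database must be frequent in at least one partition.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force local miner, adequate for a small in-memory partition."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    result = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = frozenset(combo)
            if sum(1 for t in transactions if s <= t) / n >= min_support:
                result.add(s)
    return result


def partition_mine(transactions, min_support, num_partitions):
    # 1. Divide the database into sections.
    size = -(-len(transactions) // num_partitions)  # ceiling division
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    # 2-3. Mine each partition locally and take the union of the results.
    candidates = set().union(
        *(frequent_itemsets(p, min_support) for p in parts)
    )
    # 4. One pass over the whole database discards the false positives.
    n = len(transactions)
    return {
        c for c in candidates
        if sum(1 for t in transactions if c <= t) / n >= min_support
    }


if __name__ == "__main__":
    transactions = [
        {"milk", "bread", "cookies", "juice"},
        {"milk", "juice"},
        {"milk", "eggs"},
        {"bread", "cookies", "coffee"},
    ]
    for itemset in partition_mine(transactions, 0.5, num_partitions=2):
        print(sorted(itemset))
```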
  20. 20. Hierarchical Association Rules <ul><li>Divide each itemset into categories and subcategories (e.g. beverages -> carbonated and noncarbonated) </li></ul><ul><li>Find associations between hierarchies </li></ul><ul><li>Ignore associations within a hierarchy </li></ul><ul><li>Multidimensional (usually time-based sequences) </li></ul>
  21. 21. Discovery of Sequential Patterns <ul><li>The discovery of sequential patterns is based on the concept of a sequence of itemsets. </li></ul><ul><li>e.g. {milk, bread, juice}, {bread, eggs}, {cookies, milk, coffee} could be a sequence based on three visits by the same customer. </li></ul><ul><li>The sequence S1, S2, S3... is a predictor of the fact that a customer buying S1 is likely to buy S2 and then S3 and so on. </li></ul>
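
Checking whether a customer's visit history contains a given sequence of itemsets can be sketched as below. The function names `contains_sequence` and `sequence_support` are hypothetical, the second customer in the usage example is invented purely for illustration, and measuring support as the fraction of customers whose history contains the pattern is an assumption that the slides do not spell out.

```python
def contains_sequence(customer_visits, pattern):
    """True if the itemsets in `pattern` occur, in order, within the visits.

    Each pattern itemset must be a subset of a distinct visit, and the
    matching visits must appear in the same order as in the pattern.
    """
    position = 0
    for visit in customer_visits:
        if position < len(pattern) and pattern[position] <= visit:
            position += 1
    return position == len(pattern)


def sequence_support(customers, pattern):
    """Fraction of customers whose visit history contains `pattern`."""
    return sum(contains_sequence(c, pattern) for c in customers) / len(customers)


if __name__ == "__main__":
    # The three visits from the slide, plus one invented customer.
    slide_customer = [
        {"milk", "bread", "juice"},
        {"bread", "eggs"},
        {"cookies", "milk", "coffee"},
    ]
    other_customer = [{"milk"}, {"juice"}]
    pattern = [{"milk"}, {"bread"}]
    print(contains_sequence(slide_customer, pattern))
    print(sequence_support([slide_customer, other_customer], pattern))
```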
  22. 22. Applications of Data Mining <ul><li>Marketing </li></ul><ul><ul><li>Analysis of consumer behavior </li></ul></ul><ul><ul><li>Advertising campaigns </li></ul></ul><ul><ul><li>Targeted mailings </li></ul></ul><ul><ul><li>Segmentation of customers, stores, or products </li></ul></ul><ul><li>Finance </li></ul><ul><ul><li>Creditworthiness of clients </li></ul></ul><ul><ul><li>Performance analysis of financial investments </li></ul></ul><ul><ul><li>Fraud detection </li></ul></ul><ul><li>Manufacturing </li></ul><ul><ul><li>Optimization of resources </li></ul></ul><ul><ul><li>Optimization of manufacturing processes </li></ul></ul><ul><ul><li>Product design based on customer requirements </li></ul></ul><ul><li>Health Care </li></ul><ul><ul><li>Discovering patterns in X-ray images </li></ul></ul><ul><ul><li>Analyzing side effects of drugs </li></ul></ul><ul><ul><li>Effectiveness of treatments </li></ul></ul>
  23. 23. Future of Data Mining <ul><li>Active research is ongoing </li></ul><ul><ul><li>Neural Networks </li></ul></ul><ul><ul><li>Regression Analysis </li></ul></ul><ul><ul><li>Genetic Algorithms </li></ul></ul><ul><li>Data mining is used in many areas today. We cannot yet begin to imagine what the future holds! </li></ul>