PowerPoint Presentation


Published on

1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

PowerPoint Presentation

  1. 1. Chapter 27 Data Mining Concepts Dr. Muhammad Shafique
  2. 2. Outline <ul><li>Data mining: an introduction </li></ul><ul><li>Data mining and knowledge discovery </li></ul><ul><li>Goals of data mining </li></ul><ul><li>Knowledge discovery during data mining </li></ul><ul><li>Applications of data mining </li></ul><ul><li>Commercial data mining tools </li></ul><ul><li>References </li></ul><ul><li>Summary </li></ul>
  3. 3. Data Mining: An Introduction <ul><li>Data mining refers to the discovery of new information in terms of patterns or rules from vast amounts of data </li></ul><ul><li>Data warehousing and Data mining </li></ul><ul><ul><li>The goal of data warehouse is to support decision making process </li></ul></ul><ul><ul><li>Data mining can be used in conjunction with a data warehouse to help with certain decisions </li></ul></ul><ul><ul><li>Data mining can be applied to operational databases but to make it more efficient and meaningful it is applied to data warehouses </li></ul></ul><ul><li>Data mining applications should be considered early during the design of a data warehouse </li></ul>
  4. 4. Building a Data Warehouse
  5. 5. Data Warehouse Architecture
  6. 6. Data Mining and Knowledge Discovery <ul><li>Knowledge discovery in databases (KDD) --- more general than data mining </li></ul><ul><li>KDD process consists of six phases 1. Data selection 2. Data cleaning 3. Enrichment 4. Data transformation 5. Data mining 6. Display and reporting </li></ul><ul><li>Example Specialty consumer goods retailer </li></ul><ul><ul><li>Association rule: whenever a customer buys product X he also buys product Y </li></ul></ul><ul><ul><li>Sequential pattern: whenever a customer buys a camera then within six months he buys photographic supplies </li></ul></ul><ul><ul><li>Classification trees: credit-card customers, cash customers, etc. </li></ul></ul>
  7. 7. Goals of Data Mining <ul><li>Prediction --- data mining can show how certain attributes within the data will behave in the future </li></ul><ul><li>Identification --- data patterns can be used to identify the existence of an item, event, or an activity </li></ul><ul><li>Classification --- data mining can partition the data so that different classes or categories can be identified based on combinations of parameters </li></ul><ul><li>Optimization --- one eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials </li></ul>
  8. 8. Knowledge Discovery During Data Mining <ul><li>Raw data  Information  knowledge </li></ul><ul><li>Deductive knowledge </li></ul><ul><ul><li>Deduce new information based on applying pre-specified logical rules of deduction on the given data </li></ul></ul><ul><li>Inductive knowledge </li></ul><ul><ul><li>Discover new rules and patterns from the available data </li></ul></ul><ul><li>Data mining addresses inductive knowledge </li></ul><ul><ul><li>Discovered knowledge can be </li></ul></ul><ul><ul><ul><li>Unstructured like rules or propositional logic </li></ul></ul></ul><ul><ul><ul><li>Structured like decision trees, semantic network, neural networks, etc </li></ul></ul></ul>
  9. 9. Types of Knowledge Discovered During Data Mining <ul><li>The knowledge discovered during data mining can be described as </li></ul><ul><ul><li>Association rules --- correlate the presence of a set of items with another range of values for another set of variables </li></ul></ul><ul><ul><li>Classification hierarchies --- create hierarchies of classes </li></ul></ul><ul><ul><li>Sequential patterns --- sequence of actions or events </li></ul></ul><ul><ul><li>Pattern with time series --- similarities detected within positions of the time series </li></ul></ul><ul><ul><li>Categorization and segmentation --- partition a given population of events or items into sets of “similar” elements. </li></ul></ul>
  10. 10. Association Rules <ul><li>An association rule is of the form X  Y where X = {x 1 , x 2 , …., x n } and Y = {y 1 , y 2 , …, y m } are sets of distinct items The rule states that if a customer buys X, he is also likely to buy Y </li></ul><ul><li>The set LHS  RHS is called an itemset </li></ul><ul><li>Interest measures </li></ul><ul><ul><li>Support ( prevalence ) for the rule LHS  RHS is the percentage of transactions that hold all the items in the itemset. </li></ul></ul><ul><ul><li>Confidence ( strength ) for the rule LHS  RHS is the percentage (fraction) of all transactions that include items in LHS and out of these the ones that include items of RHS. </li></ul></ul><ul><ul><ul><li>Confidence is computed as support (LHS  RHS) / support (LHS) </li></ul></ul></ul>
  11. 11. Example <ul><li>Tid time Items </li></ul><ul><li>101 6:35 milk, bread cookies, juice </li></ul><ul><li>203 7:38 milk, juice </li></ul><ul><li>305 8:05 milk, eggs </li></ul><ul><li>792 8:40 bread, cookies, coffee </li></ul><ul><li>Consider two rules milk  juice and bread  juice </li></ul><ul><ul><li>Support {milk, juice} is 50% </li></ul></ul><ul><ul><li>Support {bread, juice} is 25% </li></ul></ul><ul><ul><li>Confidence of milk  juice is 66.7% </li></ul></ul><ul><ul><li>Confidence of Bread  juice is 50% </li></ul></ul>
  12. 12. Association Rules (cont.) <ul><li>The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds. </li></ul><ul><li>The problem of mining association rules is thus decomposed into two sub-problems: </li></ul><ul><ul><li>Generate all itemsets that have a support that exceeds the threshold. These sets of items are called large itemsets. </li></ul></ul><ul><ul><li>For each large itemset, all the rules that have a minimum confidence are generated as follows: For a large itemset X and Y  X, let Z = X - Y; then if support (X)/support (Z)  minimum confidence, the rule Z  Y (i.e., X - Y  Y) is a valid rule. </li></ul></ul>
  13. 13. Association Rules (cont.) <ul><li>Basic Algorithms for Finding Association Rules </li></ul><ul><li>The current algorithms that find large itemsets are designed to work as follows: </li></ul><ul><ul><li>Test the support for itemsets of length 1, called 1-itemsets, by scanning the database. Discard those that do not meet minimum required support. </li></ul></ul><ul><ul><li>Extend the large 1-itemsets into 2-itemsets by appending one item each time, to generate all candidate itemsets of length two. Test the support for all candidate itemsets by scanning the database and eliminate those 2-itemsets that do not meet the minimum support. </li></ul></ul><ul><ul><li>Repeat the above steps; at step k, the previously found ( k - 1) itemsets are extended into k -itemsets and tested for minimum support. </li></ul></ul><ul><ul><li>The process is repeated until no large itemsets can be found. </li></ul></ul>
  14. 14. Association Rules (cont.) <ul><li>Apriori algorithm </li></ul><ul><li>Sampling algorithm </li></ul><ul><li>Frequent-pattern tree algorithm </li></ul><ul><li>Partition algorithm </li></ul>
  15. 15. Approaches to Other Data Mining Problems <ul><li>Discovery of sequential patterns </li></ul><ul><li>Discovery of Patterns in Time Series </li></ul><ul><li>Discovery of Classification Rules </li></ul><ul><li>Regression </li></ul><ul><li>Neural Networks </li></ul><ul><li>Genetic Algorithms </li></ul>
  16. 16. Applications of Data Mining <ul><li>Data mining can be applied to a large variety of decision-making contexts in business like </li></ul><ul><ul><li>Marketing </li></ul></ul><ul><ul><li>Finance </li></ul></ul><ul><ul><li>Manufacturing </li></ul></ul><ul><ul><li>Health care </li></ul></ul>
  17. 17. Commercial Data Mining Tools <ul><li>Some representative data mining tools </li></ul><ul><li>Company Product Technique Platform Interface 1. Acknosoft Kate Decision trees Win NT UNIX Microsoft </li></ul><ul><li>Case-based Access </li></ul><ul><li>reasoning </li></ul><ul><li>2.Angoss Knowledge Decision trees, Win NT ODBC </li></ul><ul><li> Seeker Statistics </li></ul><ul><li>3. Business Business Neural nets, Win NT ODBC </li></ul><ul><li> Objects Miner Machine learning </li></ul><ul><li>4. Data Data Comprehensive, UNIX ODBC </li></ul><ul><li> Distilleries Surveyor Can mix DM ODMG-compliant </li></ul><ul><li>5. IBM Intelligent Classification, UNIX (AIX) IBM DB2 </li></ul><ul><li> Miner Association rules, </li></ul><ul><li> Predictive models </li></ul>
  18. 18. Reference <ul><li>Textbook Chapter 27 </li></ul><ul><ul><li>27.1, 27.2, 27.5, 27.6, 27.7 </li></ul></ul><ul><li>“ Enhancing Mining of Association Rules from Data Cubes” Riadh B Messaoud, et al, proceedings of the 9 th international workshop on data warehousing and OLAP, Arlington, Virginia, USA, 2006, pp 11-18 </li></ul>
  19. 19. Summary <ul><li>Thank you </li></ul>