1. Data Mining in Large Databases
   (Contributing slides by Gregory Piatetsky-Shapiro, and by Rajeev Rastogi and Kyuseok Shim, Lucent Bell Laboratories)
2. Overview
   - Introduction
   - Association Rules
   - Classification
   - Clustering
3. Background
   - Corporations have huge databases containing a wealth of information
   - Business databases potentially constitute a goldmine of valuable business information
   - Very little functionality in database systems to support data mining applications
   - Data mining: the efficient discovery of previously unknown patterns in large databases
4. Applications
   - Fraud Detection
   - Loan and Credit Approval
   - Market Basket Analysis
   - Customer Segmentation
   - Financial Applications
   - E-Commerce
   - Decision Support
   - Web Search
5. Data Mining Techniques
   - Association Rules
   - Sequential Patterns
   - Classification
   - Clustering
   - Similar Time Sequences
   - Similar Images
   - Outlier Discovery
   - Text/Web Mining
6. Examples of Patterns
   - Association rules
     - 98% of people who purchase diapers buy beer
   - Classification
     - People with age less than 25 and salary > 40k drive sports cars
   - Similar time sequences
     - Stocks of companies A and B perform similarly
   - Outlier detection
     - Residential customers with businesses at home
7. Association Rules
   - Given:
     - A database of customer transactions
     - Each transaction is a set of items
   - Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
     - Any number of items in the consequent or antecedent of a rule
     - Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)
8. Association Rules
   - Sample applications
     - Market basket analysis
     - Attached mailing in direct marketing
     - Fraud detection for medical insurance
     - Department store floor/shelf planning
9. Confidence and Support
   - A rule must have some minimum user-specified confidence
     - 1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of cases the customer also bought 3
   - A rule must have some minimum user-specified support
     - 1 & 2 => 3 should hold in some minimum percentage of transactions to have business value
10. Example
    For minimum support = 50% and minimum confidence = 50%, the example database on the slide yields the following rules:
    - 1 => 3 with 50% support and 66% confidence
    - 3 => 1 with 50% support and 100% confidence
11. Problem Decomposition
    1. Find all sets of items that have minimum support
       - Use the Apriori algorithm
    2. Use the frequent itemsets to generate the desired rules
       - Generation is straightforward
12. Problem Decomposition - Example
    For minimum support = 50% and minimum confidence = 50%, consider the rule 1 => 3:
    - Support = Support({1, 3}) = 50%
    - Confidence = Support({1, 3}) / Support({1}) = 66%
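To make the arithmetic above concrete, here is a small Python sketch of support and confidence. The transaction database is hypothetical, chosen only so that it reproduces the slide's numbers (50% support and 66% confidence for 1 => 3, 100% confidence for 3 => 1):

```python
# Hypothetical transaction database, chosen so the slide's numbers hold.
transactions = [{1, 3, 4}, {1, 2}, {1, 3, 5}, {2, 4}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X union Y) / support(X) for the rule X => Y."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

print(support({1, 3}, transactions))       # 0.5  (50% support)
print(confidence({1}, {3}, transactions))  # 0.666... (66% confidence)
print(confidence({3}, {1}, transactions))  # 1.0  (100% confidence)
```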
13. The Apriori Algorithm
    Fk: set of frequent itemsets of size k
    Ck: set of candidate itemsets of size k

    F1 = {frequent single items}
    for (k = 1; Fk is not empty; k++) do {
        Ck+1 = new candidates generated from Fk
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t
        Fk+1 = candidates in Ck+1 with minimum support
    }
    Answer = Uk Fk
14. Key Observation
    - Every subset of a frequent itemset is also frequent => a candidate itemset in Ck+1 can be pruned if even one of its size-k subsets is not contained in Fk
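The loop and the pruning observation above can be sketched in Python. This is an illustrative implementation, not the authors' code; candidates of size k+1 are generated by joining frequent k-itemsets with frequent single items and then applying the subset-pruning rule before the counting scan:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their counts."""
    n = len(transactions)
    min_count = min_support * n
    # F1: frequent single items, found in one pass over the database
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    answer = dict(frequent)
    k = 1
    while frequent:
        items = sorted({i for s in frequent for i in s})
        # Candidate generation with pruning: keep a size-(k+1) set only if
        # every one of its size-k subsets is frequent (the key observation)
        candidates = set()
        for s in frequent:
            for i in items:
                c = s | {i}
                if len(c) == k + 1 and all(frozenset(sub) in frequent
                                           for sub in combinations(c, k)):
                    candidates.add(frozenset(c))
        # One scan of the database to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        answer.update(frequent)
        k += 1
    return answer
```

For example, on the classic four-transaction illustration `[{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}]` with 50% minimum support, the frequent itemsets include {1,3}, {2,5}, and {2,3,5}, while {1,2,3} is pruned because its subset {1,2} is infrequent.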
15. Apriori - Example
    [Figure: database D is scanned to count C1, yielding F1; candidates C2 are generated and counted in a second scan of D, yielding F2]
16. Sequential Patterns
    - Given:
      - A sequence of customer transactions
      - Each transaction is a set of items
    - Find all maximal sequential patterns supported by more than a user-specified percentage of customers
    - Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction
17. Classification
    - Given:
      - A database of tuples, each assigned a class label
    - Develop a model/profile for each class
      - Example profile (good credit):
        (25 <= age <= 40 and income > 40k) or (married = YES)
    - Sample applications:
      - Credit card approval (good, bad)
      - Bank locations (good, fair, poor)
      - Treatment effectiveness (good, fair, poor)
18. Decision Tree
    - An internal node is a test on an attribute
    - A branch represents an outcome of the test, e.g., Color = red
    - A leaf node represents a class label or a class label distribution
    - At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible
    - A new case is classified by following a matching path to a leaf node
19. Decision Trees - The Weather Data
    Outlook   Temperature  Humidity  Windy  Play?
    sunny     hot          high      false  No
    sunny     hot          high      true   No
    overcast  hot          high      false  Yes
    rain      mild         high      false  Yes
    rain      cool         normal    false  Yes
    rain      cool         normal    true   No
    overcast  cool         normal    true   Yes
    sunny     mild         high      false  No
    sunny     cool         normal    false  Yes
    rain      mild         normal    false  Yes
    sunny     mild         normal    true   Yes
    overcast  mild         high      true   Yes
    overcast  hot          normal    false  Yes
    rain      mild         high      true   No
20. Example Tree
    Outlook?
    - sunny    -> Humidity? (high -> No, normal -> Yes)
    - overcast -> Yes
    - rain     -> Windy? (true -> No, false -> Yes)
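A minimal sketch of how a new case follows a matching path through this example tree. The dict-based interface and key names are illustrative assumptions, not part of the original slides:

```python
def classify(example):
    """Follow the example tree: test Outlook first, then Humidity on the
    sunny branch or Windy on the rain branch; overcast is a pure leaf."""
    if example["outlook"] == "overcast":
        return "Yes"
    if example["outlook"] == "sunny":
        return "No" if example["humidity"] == "high" else "Yes"
    # rain branch: play unless it is windy
    return "No" if example["windy"] else "Yes"

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))  # No
print(classify({"outlook": "rain", "humidity": "normal", "windy": True}))  # No
```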
21. Decision Tree Algorithms
    - Building phase
      - Recursively split nodes using the best splitting attribute for each node
    - Pruning phase
      - A smaller, imperfect decision tree generally achieves better accuracy
      - Prune leaf nodes recursively to prevent over-fitting
22. Attribute Selection
    - Which is the best attribute?
      - The one which will result in the smallest tree
      - Heuristic: choose the attribute that produces the "purest" nodes
    - Popular impurity criterion: information gain
      - Information gain increases with the average purity of the subsets that an attribute produces
    - Strategy: choose the attribute that results in the greatest information gain
23. Which attribute to select?
24. Computing Information
    - Information is measured in bits
      - Given a probability distribution, the information required to predict an event is the distribution's entropy
      - Entropy gives the required information in bits (this can involve fractions of bits!)
    - Formula for computing the entropy:
      entropy(p1, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
25. Example: attribute "Outlook"
    - "Outlook" = "Sunny": info([2,3]) = entropy(2/5, 3/5) = 0.971 bits
    - "Outlook" = "Overcast": info([4,0]) = entropy(1, 0) = 0 bits
    - "Outlook" = "Rainy": info([3,2]) = entropy(3/5, 2/5) = 0.971 bits
    - Expected information for the attribute:
      info([2,3], [4,0], [3,2]) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693 bits
26. Computing the information gain
    - Information gain = (information before split) - (information after split)
      gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
    - Information gain for the attributes from the weather data:
      gain("Outlook") = 0.247 bits
      gain("Temperature") = 0.029 bits
      gain("Humidity") = 0.152 bits
      gain("Windy") = 0.048 bits
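The entropy and gain computations above can be reproduced with a short Python sketch; the class counts ([9,5] overall, and [2,3], [4,0], [3,2] for the three Outlook values) come from the weather data table:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Information before the split: 9 "yes" and 5 "no" instances
info_before = entropy([9, 5])                 # ~0.940 bits

# Splitting on Outlook: sunny [2 yes, 3 no], overcast [4, 0], rainy [3, 2]
subsets = [[2, 3], [4, 0], [3, 2]]
info_after = sum(sum(s) / 14 * entropy(s) for s in subsets)  # ~0.693 bits

gain_outlook = info_before - info_after       # ~0.247 bits
print(round(gain_outlook, 3))
```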
27. Continuing to split
28. The final decision tree
    - Note: not all leaves need to be pure; sometimes identical instances have different classes
      - Splitting stops when the data cannot be split any further
29. Decision Trees
    - Pros
      - Fast execution time
      - Generated rules are easy for humans to interpret
      - Scale well for large data sets
      - Can handle high-dimensional data
    - Cons
      - Cannot capture correlations among attributes
      - Consider only axis-parallel cuts
30. Clustering
    - Given:
      - Data points and the number of desired clusters K
    - Group the data points into K clusters
      - Data points within clusters are more similar than across clusters
    - Sample applications:
      - Customer segmentation
      - Market basket customer analysis
      - Attached mailing in direct marketing
      - Clustering companies with similar growth
31. Traditional Algorithms
    - Partitional algorithms
      - Enumerate K partitions optimizing some criterion
      - Example: the square-error criterion
        E = sum over i = 1..K of sum over x in Ci of ||x - mi||^2
        where mi is the mean of cluster Ci
32. K-means Algorithm
    1. Assign initial means
    2. Assign each point to the cluster with the closest mean
    3. Compute a new mean for each cluster
    4. Iterate until the criterion function converges
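The four steps above can be sketched in a compact, standard-library-only Python version. This is illustrative, not an official implementation; convergence is detected simply by the means not moving between iterations:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to the nearest mean, recompute
    the means, and repeat until the means stop moving."""
    rng = random.Random(seed)
    means = rng.sample(points, k)            # step 1: initial means
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # step 2: assign each point to the cluster with the closest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[i].append(p)
        # step 3: compute the new mean of each (non-empty) cluster
        new_means = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        # step 4: stop once the means have converged
        if new_means == means:
            break
        means = new_means
    return means, clusters
```

On two well-separated groups of points, the returned means land on the group averages regardless of which points are sampled as seeds.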
33. K-means example, step 1: pick 3 initial cluster centers (randomly)
34. K-means example, step 2: assign each point to the closest cluster center
35. K-means example, step 3: move each cluster center to the mean of its cluster
36. K-means example, step 4: reassign the points that are now closest to a different cluster center (Q: which points are reassigned?)
37. K-means example, step 4 (continued): A: three points are reassigned
38. K-means example, step 4b: re-compute the cluster means
39. K-means example, step 5: move the cluster centers to the new cluster means
40. Discussion
    - The result can vary significantly depending on the initial choice of seeds
    - Can get trapped in a local minimum (example shown on the slide)
    - To increase the chance of finding the global optimum: restart with different random seeds
41. K-means clustering summary
    - Advantages
      - Simple, understandable
      - Items are automatically assigned to clusters
    - Disadvantages
      - Must pick the number of clusters beforehand
      - All items are forced into a cluster
      - Too sensitive to outliers
42. Traditional Algorithms
    - Hierarchical clustering
      - Nested partitions
      - Tree structure
43. Agglomerative Hierarchical Algorithms
    - The most widely used hierarchical clustering algorithms
    - Initially, each point is a distinct cluster
    - Repeatedly merge the closest clusters until the number of clusters becomes K
      - d_mean(Ci, Cj) = ||mi - mj||  (distance between cluster means)
      - d_min(Ci, Cj) = min over p in Ci, q in Cj of ||p - q||
      - Likewise d_ave(Ci, Cj) and d_max(Ci, Cj)
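The merge loop can be sketched as follows, here using the single-link distance d_min; the other distance functions (d_mean, d_ave, d_max) would slot in the same way. This is an illustrative O(n^3) version, not an optimized implementation:

```python
def agglomerative(points, k):
    """Repeatedly merge the two closest clusters (single link, d_min)
    until only k clusters remain."""
    clusters = [[p] for p in points]   # initially each point is a cluster

    def d_min(a, b):
        # smallest Euclidean distance between any pair of points across a and b
        return min(sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
                   for p in a for q in b)

    while len(clusters) > k:
        # find the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: d_min(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```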
44. Similar Time Sequences
    - Given:
      - A set of time-series sequences
    - Find
      - All sequences similar to the query sequence
      - All pairs of similar sequences
      - Whole matching vs. subsequence matching
    - Sample applications
      - Financial markets
      - Scientific databases
      - Medical diagnosis
45. Whole Sequence Matching
    - Basic idea
      - Extract k features from every sequence
      - Every sequence is then represented as a point in k-dimensional space
      - Use a multi-dimensional index to store and search these points
        - Spatial indices do not work well for high-dimensional data
46. Similar Time Sequences
    - Take Euclidean distance as the similarity measure
    - Obtain the Discrete Fourier Transform (DFT) coefficients of each sequence in the database
    - Build a multi-dimensional index using the first few Fourier coefficients
    - Use the index to retrieve sequences that are at most a user-specified distance away from the query sequence
    - Post-processing:
      - Compute the actual distance between sequences in the time domain
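An illustrative sketch of the feature-extraction step. A naive O(n^2) DFT is used here instead of an FFT to stay dependency-free; with the orthonormal 1/sqrt(n) scaling, Parseval's theorem guarantees that the distance over the first few coefficients never exceeds the true time-domain distance, which is why the index lookup can produce false positives (removed in post-processing) but never false dismissals:

```python
import cmath

def dft_features(seq, k):
    """First k DFT coefficients of a sequence (naive O(n^2) DFT,
    with orthonormal 1/sqrt(n) scaling)."""
    n = len(seq)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(seq)) / (n ** 0.5)
            for f in range(k)]

def feature_distance(a, b):
    """Euclidean distance between two (complex) feature vectors; by
    Parseval's theorem this lower-bounds the time-domain distance."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```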
47. Outlier Discovery
    - Given:
      - Data points and the number of outliers (n) to find
    - Find the top n outlier points
      - Outliers are considerably dissimilar from the remainder of the data
    - Sample applications:
      - Credit card fraud detection
      - Telecom fraud detection
      - Medical analysis
48. Statistical Approaches
    - Model the underlying distribution that generates the dataset (e.g., normal distribution)
    - Use discordancy tests depending on
      - the data distribution
      - the distribution parameters (e.g., mean, variance)
      - the number of expected outliers
    - Drawbacks
      - Most tests are for a single attribute
      - In many cases, the data distribution may not be known
49. Distance-based Outliers
    - For a fraction p and a distance d:
      - A point o is an outlier if at least a fraction p of the points in the dataset lie at a distance greater than d from o
    - General enough to model statistical outlier tests
    - Nested-loop and cell-based algorithms have been developed
    - These scale reasonably well for large datasets
    - The cell-based algorithm does not scale well to high dimensions
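A sketch of the nested-loop algorithm under this definition. It is illustrative only; here the fraction p is taken over the other points (the exact convention varies across formulations):

```python
def db_outliers(points, p, d):
    """Nested-loop DB(p, d)-outlier sketch: a point is an outlier if at
    least fraction p of the other points lie at distance greater than d."""
    outliers = []
    for o in points:
        # count how many other points are farther than d from o
        far = sum(1 for q in points if q is not o and
                  sum((a - b) ** 2 for a, b in zip(o, q)) ** 0.5 > d)
        if far >= p * (len(points) - 1):
            outliers.append(o)
    return outliers
```

For example, in a tight cluster near the origin plus one distant point, only the distant point qualifies for p = 0.9 and a small d.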