Data Mining (overview)

    1. Data Mining (overview)
    2. Presentation Overview
       - Introduction
       - Association Rules
       - Classification
       - Clustering
       - Similar Time Sequences
       - Similar Images
       - Outliers
       - WWW
       - Summary
    3. Background
       - Corporations have huge databases containing a wealth of information
       - Business databases potentially constitute a goldmine of valuable business information
       - Database systems offer very little functionality to support data mining applications
       - Data mining: the efficient discovery of previously unknown patterns in large databases
    4. Applications
       - Fraud detection
       - Loan and credit approval
       - Market basket analysis
       - Customer segmentation
       - Financial applications
       - E-commerce
       - Decision support
       - Web search
    5. Data Mining Techniques
       - Association rules
       - Sequential patterns
       - Classification
       - Clustering
       - Similar time sequences
       - Similar images
       - Outlier discovery
       - Text/web mining
    6. Examples of Discovered Patterns
       - Association rules
         - 98% of people who purchase diapers also buy beer
       - Classification
         - People younger than 25 with salary > 40k drive sports cars
       - Similar time sequences
         - Stocks of companies A and B perform similarly
       - Outlier detection
         - Residential customers of a telecom company who run businesses from home
    7. Association Rules
       - Given:
         - A database of customer transactions
         - Each transaction is a set of items
       - Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
         - Example: 98% of people who purchase diapers and baby food also buy beer
         - Any number of items may appear in the antecedent or consequent of a rule
         - It is possible to specify constraints on rules (e.g., find only rules involving expensive imported products)
    8. Association Rules
       - Sample applications:
         - Market basket analysis
         - Attached mailing in direct marketing
         - Fraud detection for medical insurance
         - Department store floor/shelf planning
    9. Confidence and Support
       - A rule must have some minimum user-specified confidence
         - 1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of cases the customer also bought 3
       - A rule must have some minimum user-specified support (how frequently the rule occurs)
         - 1 & 2 => 3 should hold in some minimum percentage of transactions to have business value
    10. Example
       - For minimum support = 50% and minimum confidence = 50%, we have the following rules:
         - 1 => 3 with 50% support and 66% confidence (1 and 3 occur together in 50% of all transactions, but only 2/3 of the transactions containing 1 also contain 3)
         - 3 => 1 with 50% support and 100% confidence (3 and 1 occur together in 50% of all transactions, and every transaction containing 3 also contains 1)
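Support and confidence are easy to compute directly. Below is a minimal Python sketch; the transaction set is hypothetical (the slide does not give its transactions) and was chosen so the numbers match the two rules above:

```python
# Hypothetical transactions chosen so that:
#   1 => 3 has 50% support, 66% confidence; 3 => 1 has 50% support, 100% confidence.
transactions = [{1, 3}, {1, 3}, {1, 2}, {2}]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({1, 3}, transactions))       # 0.5
print(confidence({1}, {3}, transactions))  # 0.666...
print(confidence({3}, {1}, transactions))  # 1.0
```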
    11. Quantitative Association Rules
       - Quantitative attributes (e.g. age, income)
       - Categorical attributes (e.g. make of car)
       - Example (min support = 40%, min confidence = 50%): [Age: 30..39] and [Married: Yes] => [NumCars: 2]
    12. Temporal Association Rules
       - Can describe the rich temporal character in data
       - Example:
         - {diaper} => {beer} (support = 5%, confidence = 87%)
         - Support of this rule may jump to 25% between 6 and 9 PM on weekdays
       - Problem: how to find rules that follow interesting user-defined temporal patterns
       - The challenge is to design efficient algorithms that do much better than finding every rule in every time unit
    13. Correlation Rules
       - Association rules do not capture correlations
       - Example:
         - Suppose 90% of customers buy coffee, 25% buy tea, and 20% buy both tea and coffee
         - {tea} => {coffee} has high support (0.2) and confidence (0.8)
         - Yet tea and coffee are not positively correlated: if the two were independent, the expected support for buying both would be 0.9 * 0.25 = 0.225, which is above the observed 0.2
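This mismatch is exactly what the lift measure captures (lift is not named on the slide; it is the standard ratio of observed to expected support). A small sketch with the slide's numbers:

```python
# Numbers from the slide: 90% buy coffee, 25% buy tea, 20% buy both.
p_coffee, p_tea, p_both = 0.90, 0.25, 0.20

conf = p_both / p_tea           # P(coffee | tea) = 0.8 -- looks like a strong rule
expected = p_coffee * p_tea     # support expected under independence = 0.225
lift = p_both / expected        # < 1 means tea and coffee are negatively correlated

print(conf, expected, round(lift, 3))
```

Despite 80% confidence, a tea buyer is actually slightly *less* likely to buy coffee than a random customer.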
    14. Sequential Patterns
       - Given:
         - A sequence of customer transactions
         - Each transaction is a set of items
       - Find all maximal sequential patterns supported by more than a user-specified percentage of customers
       - Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction
         - 10% is the support of the pattern
    15. Classification
       - Given:
         - A database of tuples, each assigned a class label
       - Develop a model/profile for each class
         - Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)
       - Sample applications:
         - Credit card approval (good, bad)
         - Bank locations (good, fair, poor)
         - Treatment effectiveness (good, fair, poor)
    16. Decision Trees
       A decision tree is a predictive model that makes a prediction on the basis of a series of decisions.
       [Figure: churn decision tree. The root (50 churners / 50 non-churners) splits on phone technology: old-technology phones give a pure leaf (20 churners / 0 non-churners), while new-technology phones (30 churners / 50 non-churners) split on tenure. Customers of more than 2.3 years form a leaf (5 churners / 40 non-churners); customers of at most 2.3 years (25 churners / 10 non-churners) split on age, with age <= 55 a churner leaf (20 / 0) and age > 55 a non-churner leaf (5 / 10).]
    17. Decision Trees
       A decision tree creates a segmentation of the original data set, built to predict some piece of information. The records that fall into each segment are similar with respect to the information being predicted. Although the tree and the algorithms behind it may be complex, the results are presented in an easy-to-understand way that is quite useful to the business user.
    18. Decision Trees
       Decision trees in business:
       - Automation: a very favorable technique for automating data mining and predictive modeling; they embed automated solutions to steps that other techniques leave as a burden on the user (4/4)
       - Clarity: the models can be viewed as a tree of simple decisions based on familiar predictors, or as a set of rules; users can confirm the tree or modify it by hand on the basis of their own expertise (4/4)
       - ROI: because decision trees work well with relational databases, they provide well-integrated solutions with highly accurate models (3/4)
    19. Decision Trees
       - Pros:
         - Fast execution time
         - Generated rules are easy for humans to interpret
         - Scale well to large data sets
         - Can handle high-dimensional data
       - Cons:
         - Cannot capture correlations among attributes
         - Consider only axis-parallel cuts
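As an illustration, one plausible reading of the churn tree in the earlier figure can be written directly as nested axis-parallel tests. The field names are hypothetical; the split values (phone technology, 2.3 years of tenure, age 55) come from the figure, and each leaf predicts its majority class:

```python
def predict_churn(customer):
    """Walk the (assumed) churn decision tree; leaf counts are from the figure."""
    # Root split: phone technology.
    if customer["phone"] == "old":
        return "churner"        # leaf: 20 churners / 0 non-churners
    # New-technology phone (30 churners / 50 non-churners): split on tenure.
    if customer["tenure_years"] > 2.3:
        return "non-churner"    # leaf: 5 churners / 40 non-churners
    # Tenure <= 2.3 years (25 churners / 10 non-churners): split on age.
    if customer["age"] <= 55:
        return "churner"        # leaf: 20 churners / 0 non-churners
    return "non-churner"        # leaf: 5 churners / 10 non-churners

print(predict_churn({"phone": "new", "tenure_years": 1.0, "age": 40}))  # churner
```

Note that every test compares a single attribute against a constant, which is exactly the "axis-parallel cuts only" limitation listed above.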
    20. Clustering
       - Given:
         - Data points and the number of desired clusters K
       - Group the data points into K clusters
         - Data points within a cluster are more similar to each other than to points in other clusters
       - Sample applications:
         - Customer segmentation
         - Market basket customer analysis
         - Attached mailing in direct marketing
         - Clustering companies with similar growth
    21. Where to Use Clustering and Nearest-Neighbor Prediction
       - Clustering for clarity
         - A high-level view
         - Segmentation
       - Clustering for outlier analysis
         - To see records that stick out from the rest
         - e.g. Wine distributors produce a certain level of profit; one store produces significantly lower profit. It turns out the distributor was delivering to, but not collecting payment from, one of its customers.
       - Nearest neighbor for prediction
         - Objects "near" each other have similar prediction values
         - Examples: finding more journal articles like a given one; predicting the next value of a stock price from its time series
    22. Outlier Discovery
       - Sometimes clustering is performed to see whether one record sticks out from the rest
         - e.g. One store stands out as producing significantly lower profit. Closer examination shows the distributor was not collecting payment from one of its customers.
         - e.g. A sale of men's suits is held in all branches of a department store. All stores but one see at least a 100% jump in revenue. It turns out that store advertised via radio rather than TV as the other stores did.
       - Sample applications:
         - Credit card fraud detection
         - Telecom fraud detection
         - Customer segmentation
         - Medical analysis
    23. Outlier Discovery
       - Given:
         - Data points and the number of outliers (= n) to find
       - Find the top n outlier points
         - Outliers are considerably dissimilar from the remainder of the data
    24. Statistical Approaches
       - Model the underlying distribution that generates the dataset (e.g. a normal distribution)
       - Use discordancy tests, which depend on:
         - the data distribution
         - the distribution parameters (e.g. mean, variance)
         - the number of expected outliers
       - Drawbacks:
         - Most tests apply to a single attribute
         - In many cases, the data distribution is not known
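A minimal sketch of the distribution-based idea: assume the values are normally distributed and rank points by their z-score. The profit figures are made up to echo the wine-distributor example:

```python
import statistics

def top_n_outliers(values, n):
    """Rank points by |z-score| under an assumed normal distribution."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return sorted(values, key=lambda x: abs(x - mean) / stdev, reverse=True)[:n]

# Hypothetical per-store profits: one store is far below the rest.
profits = [100, 98, 103, 101, 99, 40]
print(top_n_outliers(profits, 1))  # [40]
```

This illustrates both drawbacks above: the test looks at a single attribute, and it only works because we assumed a distribution for the data.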
    25. Differences Between the Nearest-Neighbor Technique and Clustering
       Nearest neighbors:
       - Used for prediction as well as consolidation
       - Space is defined by the problem to be solved
       - Generally uses only distance metrics to determine nearness
       Clustering:
       - Used for consolidating data into a high-level view and general grouping of records into like behaviors
       - Space is defined as a default n-dimensional space, or by the user, or as a predefined space driven by past experience
       - Can use metrics other than distance to determine the nearness of two records, e.g. linking points together
    26. How Clustering and Nearest-Neighbor Work
       - Looking at n-dimensional space:
         - The distance between a cluster and a given data point is often measured from the center of mass of the cluster
         - The center can be calculated:
           - by simply averaging the income and age of each record
           - by the square-error criterion
           - by other methods
         - Many clustering problems have hundreds of dimensions; our intuition works only in 2- or 3-dimensional space
       [Figure: customers of a golf equipment business plotted by age (up to 100 yrs.) and income (up to $120,000), forming three clusters plus some outliers. Cluster 1: retirees with modest income. Cluster 2: middle-aged weekend golfers. Cluster 3: wealthy youth with exclusive club memberships.]
    27. Traditional Algorithms
       - Partitional algorithms
         - Enumerate K partitions optimizing some criterion
         - Example: the square-error criterion E = sum over clusters C_i of the sum over points p in C_i of ||p - m_i||^2, where m_i is the mean of cluster C_i
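A minimal k-means (Lloyd's algorithm) sketch that optimizes the square-error criterion E above; the data points and the choice of K = 2 are illustrative:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(pts):
    """Mean of a non-empty list of points."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm; returns the centroids and the square error E."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean m_i of its cluster C_i.
        centroids = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
    # Square-error criterion: E = sum_i sum_{p in C_i} ||p - m_i||^2
    error = sum(min(dist2(p, m) for m in centroids) for p in points)
    return centroids, error

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, err = kmeans(points, 2)
print(round(err, 3))  # two tight clusters -> small square error
```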
    28. How Is "Nearness" Defined?
       - The "trivial" case:

         ID | Name  | Prediction | Age | Balance($) | Income | Eyes | Gender
         1  | Carla | Yes        | 21  | 2300       | High   | Blue | F
         2  | Sue   | ??         | 21  | 2300       | High   | Blue | F

         A record exactly the same as the record to be predicted is considered "close." However, it is unlikely to find exact matches.
       - The Manhattan distance metric adds up the differences in each predictor between the historical record and the record to be predicted
       - The Euclidean distance metric calculates distance the Pythagorean way (the square of the hypotenuse is equal to the sum of the squares of the other two sides)
       - Others...
    29. The Manhattan Distance Metric (an example)

       ID | Name  | Prediction | Age | Balance($) | Income | Eyes  | Gender
       1  | Carla | Yes        | 21  | 2300       | High   | Blue  | F
       2  | Carl  | No         | 27  | 5400       | High   | Brown | M

       Calculating the difference between ages (6 years) and balances ($3100) is simple. For the eye-color predictor, use e.g. match = 0, mismatch = 1. For income, assign numbers: high = 3, medium = 2, low = 1.
       3108 = 6 + 3100 + 0 + 1 + 1
       The result must be normalized (e.g. to 0-100 per dimension):
       225 = 6 + 19 + 0 + 100 + 100
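The worked example translates directly into code. A sketch, using the encodings assumed on the slide (match/mismatch for categorical fields, high = 3 / medium = 2 / low = 1 for income):

```python
def manhattan(a, b, encode):
    """Sum of per-dimension absolute differences; categorical fields go through encoders."""
    return sum(abs(encode[f](a[f]) - encode[f](b[f])) for f in encode)

# Encodings from the slide: income high=3/medium=2/low=1; eyes and gender
# get arbitrary distinct codes so that match = 0 and mismatch = 1.
income_code = {"high": 3, "medium": 2, "low": 1}
eye_code = {"blue": 0, "brown": 1}
gender_code = {"F": 0, "M": 1}

encode = {
    "age": lambda v: v,
    "balance": lambda v: v,
    "income": lambda v: income_code[v],
    "eyes": lambda v: eye_code[v],
    "gender": lambda v: gender_code[v],
}

carla = {"age": 21, "balance": 2300, "income": "high", "eyes": "blue", "gender": "F"}
carl = {"age": 27, "balance": 5400, "income": "high", "eyes": "brown", "gender": "M"}

print(manhattan(carla, carl, encode))  # 3108 = 6 + 3100 + 0 + 1 + 1
```

As the slide notes, the raw sum is dominated by the balance dimension, which is why per-dimension normalization is needed before the distances are meaningful.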
    30. Calculating Dimension Weights
       - Different dimensions may have different weights
         - e.g. In text classification, not all words (dimensions) are created equal: "entrepreneur" is significant, "the" is not
         - Two methods:
           - Use the inverse frequency of the word: "the" 1/10,000, "entrepreneur" 1/10
           - Use the importance of the word to the topic being predicted: "entrepreneur" and "venture capital" are given higher weight than "tornado" when the topic is starting a small business
       - Dimension weights have also been calculated via adaptive algorithms, where random weights are tried initially and then slowly modified to improve the accuracy of the system (neural networks, genetic algorithms)
    31. Hierarchy of Clusters
       - The hierarchy of clusters is viewed as a tree in which the smallest clusters, at the bottom, merge to create the next-highest level of clusters, up to a single large cluster at the top.
         - The agglomerative technique starts with as many clusters as there are records. The clusters nearest each other are merged to form the next-largest cluster, and this merging continues until a hierarchy of clusters is built.
         - The divisive technique takes the opposite approach: it starts with all records in one cluster and then tries to split that cluster into smaller pieces, and so on.
       - The hierarchy allows the end user to choose the level to work with
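A minimal agglomerative sketch on one-dimensional points: start with singleton clusters and repeatedly merge the closest pair (single-link distance), recording every level of the hierarchy so the user can pick one:

```python
def dist(a, b):
    """Distance between two 1-D points."""
    return abs(a - b)

def agglomerate(points):
    """Single-link agglomerative clustering; returns each level of the hierarchy."""
    clusters = [[p] for p in points]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge, shrinking the cluster list by one
        levels.append([list(c) for c in clusters])
    return levels

levels = agglomerate([1, 2, 10, 11, 25])
print(len(levels))  # 5 levels: from five singletons down to one big cluster
```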
    32. Similar Time Sequences
       - Given:
         - A set of time-series sequences
       - Find:
         - All sequences similar to a query sequence
         - All pairs of similar sequences
         - Whole matching vs. subsequence matching
       - Sample applications:
         - Financial markets
         - Market basket data analysis
         - Scientific databases
         - Medical diagnosis
    33. Whole Sequence Matching
       Basic idea:
       - Extract k features from every sequence
       - Every sequence is then represented as a point in k-dimensional space
       - Use a multi-dimensional index to store and search these points
         - Spatial indices do not work well for high-dimensional data
    34. Similar Time Sequences
       - Sequences are normalized with amplitude scaling and offset translation
       - Two subsequences are considered similar if one lies within an envelope of a given width around the other, ignoring outliers
       - Two sequences are said to be similar if they have enough non-overlapping, time-ordered pairs of similar subsequences
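Amplitude scaling and offset translation amount to z-normalization: subtract the mean, then divide by the standard deviation. A sketch with two made-up sequences of the same shape but different offset and amplitude:

```python
import statistics

def normalize(seq):
    """Offset translation (subtract the mean) then amplitude scaling (divide by std dev)."""
    mu = statistics.fmean(seq)
    sigma = statistics.pstdev(seq)
    return [(x - mu) / sigma for x in seq]

a = [1, 2, 3, 2, 1]
b = [105, 115, 125, 115, 105]  # same shape as `a`: scaled by 10, shifted by 95
na, nb = normalize(a), normalize(b)
print(all(abs(x - y) < 1e-9 for x, y in zip(na, nb)))  # True: shapes now coincide
```

After normalization the two sequences can be compared directly, which is the precondition for the envelope-based similarity test described above.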
    35. Similar Sequences Found
       VanEck International Fund and Fidelity Selective Precious Metal and Mineral Fund: two similar mutual funds from different fund groups
    36. Similar Images
       - Given:
         - A set of images
       - Find:
         - All images similar to a given image
         - All pairs of similar images
       - Sample applications:
         - Medical diagnosis
         - Weather prediction
         - Web search engines for images
         - E-commerce
    37. Similar Images
       - QBIC [Nib93, FSN95], [JFS95], WBIIS [WWWFS98]
         - Generate a single signature per image
         - Fail when images contain similar objects at different locations or of varying sizes
       - [Smi97]
         - Divides an image into individual objects
         - Manual extraction can be very tedious and time-consuming
         - Inaccurate in identifying objects, and not robust
    38. WALRUS
       - Automatically extracts regions from an image based on the complexity of the image
       - A single signature is used per region
       - Two images are considered similar if they have enough similar region pairs
    39. WALRUS Similarity Model
    40. WALRUS (Overview)
       Image indexing phase:
       - Compute wavelet signatures for sliding windows
       - Cluster windows to generate regions
       - Insert regions into a spatial index (R*-tree)
       Image querying phase:
       - Compute wavelet signatures for sliding windows
       - Cluster windows to generate regions
       - Find matching regions using the spatial index
       - Compute similarity between the query image and target images
    41. WALRUS Query Image
    42. Web Mining: Challenges
       - Today's search engines are plagued by problems:
         - the abundance problem (99% of the information is of no interest to 99% of the people)
         - limited coverage of the Web (internet sources hidden behind search interfaces)
         - a limited query interface, based on keyword-oriented search
         - limited customization to individual users
    43. The Web Is...
       - The web is a huge collection of documents, with:
         - Semistructured content (HTML, XML)
         - Hyper-link information
         - Access and usage information
         - Dynamic content (i.e. new pages are constantly being generated)
    44. Web Mining
       - Web content mining
         - Extract concept hierarchies/relations from the web
         - Automatic categorization
       - Web log mining
         - Trend analysis (i.e. web dynamics info)
         - Web access association/sequential pattern analysis
       - Web structure mining
         - Google: a page is important if important pages point to it
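Google's "a page is important if important pages point to it" idea is PageRank, which can be sketched as a simple power iteration over a tiny hypothetical link graph (the damping factor 0.85 is the conventional choice):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration: each page's rank flows to the pages it links to."""
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical three-page graph: A and B both point to C; C points back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # C -- the page that other pages point to
```

C ends up with the highest rank because two pages endorse it, and A outranks B because it is endorsed by the important page C.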
    45. Improving Search/Customization
       - Learn about users' interests based on their access patterns
       - Provide users with pages, sites, and advertisements of interest
    46. Summary
       - Data mining:
         - Good science, with a leading position in the research community
         - Recent progress for large databases: association rules, classification, clustering, similar time sequences, similar image retrieval, outlier discovery, etc.
         - Many papers published in major conferences
         - Still a promising and rich field with many challenging research issues
         - Maturing in industry
