Data Mining and Statistics Clustering Techniques

1. COT5230 Data Mining, Week 4: Data Mining and Statistics - Clustering Techniques
   MONASH - Australia's International University
2. References
   - Elder, John F. IV & Pregibon, Daryl (1995) "A Statistical Perspective on KDD", pp. 87-93, Proceedings of the First International Conference on Knowledge Discovery & Data Mining (eds Fayyad, U.M. & Uthurusamy, R.), AAAI Press, Menlo Park, California.
   - Berry & Linoff (1997) Data Mining Techniques: For Marketing, Sales, and Customer Support, Wiley.
   - Berson, A. & Smith, S.J. (1997) Data Warehousing, Data Mining and OLAP, McGraw-Hill.
3. The Link between Pattern and Approach
   - Data mining aims to reveal knowledge about the data under consideration
   - This knowledge takes the form of patterns within the data which embody our understanding of the data
     - Patterns are also referred to as structures, models and relationships
   - The approach chosen is inherently linked to the pattern revealed
4. A Taxonomy of Approaches to Data Mining - 1
   - It is not expected that all the approaches will work equally well with all data sets
   - Visualization of data sets can be combined with, or used prior to, modeling, and assists in selecting an approach and indicating what patterns might be present
5. A Taxonomy of Approaches to Data Mining - 2
6. Verification-driven Data Mining Techniques - 1
   - Verification data mining techniques require the user to postulate some hypothesis
     - Simple query and reporting, or statistical analysis techniques, then confirm this hypothesis
   - Statistics has been neglected to a degree in data mining in comparison to less traditional techniques such as
     - neural networks, genetic algorithms and rule-based approaches to classification
   - Many of these "less traditional" techniques also have a statistical interpretation
7. Verification-driven Data Mining Techniques - 2
   - The reasons for this are various
   - Statistical techniques are most useful for well-structured problems
   - Many data mining problems are not well structured:
     - the statistical techniques break down, or require large amounts of time and effort to be effective
8. Problems with Statistical Approaches - 1
   - Traditional statistical models often highlight linear relationships but not complex non-linear relationships
   - Exploring all possible higher-dimensional relationships usually takes an unacceptably long time
     - non-linear statistical methods require knowledge about
       - the type of non-linearity
       - the ways in which the variables interact
     - this knowledge is often not available in complex multi-dimensional data mining problems
9. Problems with Statistical Approaches - 2
   - Statisticians have traditionally focussed on model estimation, rather than model selection
   - For these reasons, less traditional, more exploratory techniques are often chosen for modern data mining
   - The current high level of interest in data mining centres on many of the newer techniques, which may be termed discovery-driven
   - Lessons from statistics should not be forgotten: estimation of uncertainty and checking of assumptions is as important as ever!
10. Discovery-driven Data Mining Techniques
   - Discovery-driven data mining techniques can also be broken down into two broad areas:
     - techniques which are considered predictive, sometimes termed supervised techniques
     - techniques which are termed informative, sometimes termed unsupervised techniques
   - Predictive techniques build patterns by making a prediction of some unknown attribute given the values of other known attributes
  11. 11. <ul><li>Informative techniques do not present a solution to a known problem </li></ul><ul><ul><li>they present interesting patterns for consideration by some expert in the domain </li></ul></ul><ul><ul><li>the patterns may be termed “informative patterns” </li></ul></ul><ul><li>The main predictive and informative patterns are: </li></ul><ul><ul><li>Regression </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Association </li></ul></ul>
12. Regression
   - Regression is a predictive technique which discovers relationships between input and output patterns, where the values are continuous or real-valued
   - Many traditional statistical regression models are linear
   - Neural networks, though biologically inspired, are in fact non-linear regression models
   - Non-linear relationships occur in many multi-dimensional data mining applications
13. An Example of a Regression Model - 1
   - Consider a mortgage provider that is concerned with retaining mortgages once taken out
   - They may also be interested in how profit on individual loans is related to customers paying off their loans at an accelerated rate
     - For example, a customer may pay an additional amount each month and thus pay off their loan in 15 years instead of 25 years
   - A graph of the relationship between profit and the elapsed time between when a loan is actually paid off and when it was originally contracted to be paid off appears on the next slide
14. An Example of a Regression Model - 2
   (figure: profit plotted against the time between actual and contracted pay-off)
15. An Example of a Regression Model - 3
   - Linear regression on the data does not match the real pattern of the data
   - The curved line represents what might be produced by a non-linear approach (perhaps a neural network)
   - This curved line fits the data much better and could be used as the basis on which to predict profitability
     - Decisions on exit fees and penalties for certain behaviors may be based on this kind of analysis
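A minimal sketch of the contrast the example describes, using synthetic data and NumPy (not the mortgage provider's actual data): a straight-line fit misses the rise-and-fall pattern, while a simple non-linear fit (here a cubic polynomial standing in for the slide's neural network) follows it much more closely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor: years between the contracted and the actual pay-off date.
years_early = np.linspace(0, 10, 80)
# Hypothetical response: profit that rises, peaks, then falls for very early pay-offs.
profit = 10 + 6 * years_early - 0.9 * years_early**2 + rng.normal(0, 2, size=years_early.size)

# Linear regression: a single straight line through the data.
lin_coeffs = np.polyfit(years_early, profit, deg=1)
lin_pred = np.polyval(lin_coeffs, years_early)

# A simple non-linear alternative: a cubic polynomial fit.
poly_coeffs = np.polyfit(years_early, profit, deg=3)
poly_pred = np.polyval(poly_coeffs, years_early)

def mse(y, yhat):
    """Mean squared error of a fit."""
    return float(np.mean((y - yhat) ** 2))

print("linear fit MSE:    ", round(mse(profit, lin_pred), 2))
print("non-linear fit MSE:", round(mse(profit, poly_pred), 2))
```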
16. Exploratory Data Analysis (EDA)
   - Classical statistics has a dogma that the data may not be viewed prior to modeling [Elde95]
     - the aim is to avoid choosing biased hypotheses
   - During the 1970s the term Exploratory Data Analysis (EDA) was used to express the notion that both the choice of model and hints as to appropriate approaches could be data-driven
   - Elder and Pregibon describe the dichotomy thus: "On the one side the argument was that hypotheses and the like must not be biased by choosing them on the basis of what the data seemed to be indicating. On the other side was the belief that pictures and numerical summaries of data are necessary in order to understand how rich a model the data can support."
17. EDA and the Domain Expert - 1
   - It is very hard to build common sense, based on some knowledge of the domain, into automated modeling systems
     - chance discoveries occur when exploring data that may not have occurred otherwise
     - these can also change the approach to the subsequent modeling
18. EDA and the Domain Expert - 2
   - The obstacles to entirely automating the process are:
     - It is hard to quantify a procedure to capture "the unexpected" in plots
     - Even if this could be accomplished, one would need to describe how this maps into the next analysis step in the automated procedure
   - What is needed is a way to represent meta-knowledge about the problem at hand and the procedures commonly used
19. An Interactive Approach to DM
   - A domain expert is someone who has meta-knowledge about the problem
   - An interactive exploration, querying and/or visualization system guided by a domain expert goes beyond current statistical methods
   - Current thinking on statistical theory recognizes such an approach as potentially providing a more effective way of discovering knowledge about a data set
20. Automatic Cluster Detection
   - If there are many competing patterns, a data set can appear to contain just noise
   - Subdividing the data set into clusters in which patterns can be more easily discerned can overcome this
   - When we have no idea how to define the clusters, automatic cluster detection methods can be useful
   - Finding clusters is an unsupervised learning task
21. Automatic Cluster Detection - example
   - The Hertzsprung-Russell diagram, which graphs a star's luminosity against its temperature, reveals three clusters
     - It is interesting to note that each of the clusters has a different relationship between luminosity and temperature
   - In most data mining situations the variables to consider and the clusters that may be formed are not so easily determined
22. The Hertzsprung-Russell diagram
   (figure: luminosity (Sun = 1) plotted against temperature (degrees Kelvin, from about 40,000 down to 2,500), with three labelled clusters: Red Giants, the Main Sequence and White Dwarfs)
23. The K-Means Technique
   - K, the number of clusters that are to be formed, must be decided before beginning
     - Step 1: Select K data points to act as the seeds (or initial centroids)
     - Step 2: Assign each record to the nearest centroid, thus forming a cluster
     - Step 3: Calculate the centroids of the new clusters, then go back to Step 2
     - This is continued until the clusters stop changing
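A minimal sketch of the three steps above in Python with NumPy; the function name, the random seeding strategy and the toy two-dimensional data are illustrative choices, not something the slides prescribe.

```python
import numpy as np

def k_means(records, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select K records at random to act as the seeds (initial centroids).
    centroids = records[rng.choice(len(records), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each record to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(records[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster's records.
        new_centroids = np.array([
            records[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the clusters (centroids) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: three well-separated blobs in two dimensions (X1, X2).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (0, 4)]])
labels, centroids = k_means(data, k=3)
print(centroids)
```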
24. Assign Each Record to the Nearest Centroid
   (figure: records plotted against variables X1 and X2)
25. Calculate the New Centroids
   (figure: records plotted against variables X1 and X2)
26. Determine the New Cluster Boundaries
   (figure: records plotted against variables X1 and X2)
27. Similarity, Association and Distance
   - The method just described assumes that each record can be described as a point in a metric space
     - This is not easily done for many data sets (e.g. categorical and some numeric variables)
   - The records in a cluster should have a natural association, so a measure of similarity is required
     - Euclidean distance is often used, but it is not always suitable
     - Euclidean distance treats changes in each dimension equally, but in databases changes in one field may be more important than changes in another
28. Types of Variables
   - Categories
     - e.g. Food Group: Grain, Dairy, Meat, etc.
   - Ranks
     - e.g. Food Quality: Premium, High Grade, Medium, Low
   - Intervals
     - e.g. temperatures, where the distance (difference) between values is meaningful
   - True Measures
     - measures with a meaningful zero point, so ratios have meaning as well as distances
29. Measures of Similarity
   - Euclidean distance
   - Angle between two vectors (from the origin to each data point)
   - The number of features in common
   - Mahalanobis distance
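A minimal sketch of the four measures listed above, written directly with NumPy; the function names and the random sample data are illustrative. The feature-overlap measure assumes binary (present/absent) features.

```python
import numpy as np

def euclidean(x, y):
    # Straight-line distance between the two points.
    return float(np.linalg.norm(x - y))

def angle_between(x, y):
    # Angle (in radians) between the vectors from the origin to each data point.
    cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def features_in_common(x, y):
    # For binary features: how many features both records share.
    return int(np.sum((x == 1) & (y == 1)))

def mahalanobis(x, y, data):
    # Distance that accounts for the scale and correlation of the variables.
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = x - y
    return float(np.sqrt(diff @ cov_inv @ diff))

sample = np.random.default_rng(0).normal(size=(200, 2))
a, b = sample[0], sample[1]
print(euclidean(a, b), angle_between(a, b), mahalanobis(a, b, sample))
```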
30. Weighting and Scaling
   - Weighting allows some variables to assume greater importance than others
     - The domain expert must decide if certain variables deserve a greater weighting
     - Statistical weighting techniques also exist
   - Scaling attempts to apply a common range to variables so that differences are comparable between variables
     - This can also be statistically based
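A minimal sketch of one common way to do this (z-score scaling followed by multiplicative weights); the particular scaling method, the weights and the toy data are assumptions for illustration, and in practice the weights would come from the domain expert or a statistical technique.

```python
import numpy as np

def scale_and_weight(records, weights):
    # Scaling: put every variable on a common range (here, z-scores:
    # subtract the column mean and divide by the column standard deviation).
    scaled = (records - records.mean(axis=0)) / records.std(axis=0)
    # Weighting: stretch the variables judged more important, so they
    # contribute more to any distance computed afterwards.
    return scaled * weights

# Hypothetical records: income and number of children, on very different scales.
data = np.array([[100_000, 2], [150_000, 3], [90_000, 5]], dtype=float)
prepared = scale_and_weight(data, weights=np.array([1.0, 2.0]))
print(prepared)
```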
31. Variations of the K-Means Technique
   - There are problems with the simple K-means method:
     - It does not deal well with overlapping clusters
     - The clusters can be pulled off centre by outliers
     - Records are either in or out of a cluster, so there is no notion of the likelihood of belonging to a particular cluster
   - A Gaussian Mixture Model varies the approach already outlined by attaching a weighting, based on a probability distribution, to records which are close to or distant from the centroids initially chosen. There is then less chance of outliers distorting the situation. Each record contributes to some degree to each of the centroids
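A minimal sketch of the Gaussian Mixture Model idea using scikit-learn (an assumed dependency, not something the slides prescribe): each record receives a probability of membership in every cluster rather than a hard in-or-out assignment, so a single outlier has less power to drag a centroid.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs in two dimensions.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2)) for c in [(0, 0), (4, 4)]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
membership = gmm.predict_proba(data)  # soft cluster memberships, one row per record
print(membership[:3].round(3))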
32. Agglomeration Methods - 1
   - A true unsupervised technique would not pre-determine the number of clusters
   - A hierarchical technique would offer a hierarchy of clusters from large to small; this can be achieved in a number of ways
   - An agglomerative technique starts out by considering each record as a cluster and gradually builds larger clusters by merging the records which are near each other
33. Agglomeration Methods - 2
   - An example of an agglomerative cluster tree: (figure)
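A minimal sketch of agglomerative clustering with SciPy (an assumed dependency): every record starts as its own cluster, the nearest clusters are merged step by step into a cluster tree, and the tree can then be cut at any level to obtain a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: three blobs in two dimensions.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in [(0, 0), (3, 3), (0, 4)]])

tree = linkage(data, method="ward")                  # the full merge history (cluster tree)
labels = fcluster(tree, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(np.bincount(labels))                           # records per cluster
```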
34. Evaluating Clusters
   - We want the members of a cluster to be close to each other, and we want the clusters themselves to be widely spaced
   - Variance measures are often used: ideally, we want to minimize within-cluster variance and maximize between-cluster variance
   - Variance is not the only important factor; for example, minimizing it alone favours never merging clusters in a hierarchical technique
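A minimal sketch of the variance-based evaluation described above: within-cluster variance measures how tightly members sit around their own centroid, and between-cluster variance measures how widely the centroids are spread around the overall mean. Function names and the toy labelling are illustrative.

```python
import numpy as np

def cluster_variances(records, labels):
    overall_mean = records.mean(axis=0)
    within, between = 0.0, 0.0
    for label in np.unique(labels):
        members = records[labels == label]
        centroid = members.mean(axis=0)
        within += np.sum((members - centroid) ** 2)                        # spread inside the cluster
        between += len(members) * np.sum((centroid - overall_mean) ** 2)  # spread of the centroids
    return within, between

# Example using labels from any clustering method (e.g. the K-means sketch earlier).
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in [(0, 0), (3, 3)]])
labels = np.array([0] * 40 + [1] * 40)
print(cluster_variances(data, labels))  # want within small, between large
```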
35. Strengths and Weaknesses of Automatic Cluster Detection
   - Strengths
     - it is an undirected knowledge discovery technique
     - it works well with many types of data
     - it is relatively simple to carry out
   - Weaknesses
     - it can be difficult to choose the distance measures and weightings
     - it can be sensitive to initial parameter choices
     - the clusters found can be difficult to interpret
