Published in: Education, Technology, Business
  • Each topic is a talk..
  • Cluster2

    1. 1. Clustering in Data Warehouse Department of CE MSPVL Polytechnic College Pavoorchatram 1
    2. 2. Overview <ul><li>Part 1: What is data mining </li></ul><ul><li>Part 2: Association rules </li></ul><ul><li>Part 3: Classification </li></ul><ul><li>Part 4: Clustering </li></ul><ul><li>Part 5: Approaches to data mining problems </li></ul><ul><li>Part 6: Applications of data mining </li></ul><ul><li>Part 7: Commercial tools for data mining </li></ul>
    3. 3. Part 1: Data Mining
    4. 4. Relationship to data warehouse <ul><li>Data mining draws on the data warehouse to inform decisions; the purpose of a data warehouse is to support decision making. </li></ul><ul><li>Data mining can also be applied to operational databases that hold individual transactions. </li></ul><ul><li>Data mining helps in extracting meaningful new patterns. </li></ul><ul><li>Data mining applications should be considered during the design of a data warehouse. The successful use of data mining applications depends on how the data warehouse is constructed. </li></ul>
    5. 5. Define Data Mining <ul><li>Data mining is sorting through data to identify patterns and establish relationships. </li></ul><ul><li>Data mining parameters include: </li></ul><ul><li>Association - looking for patterns where one event is connected to another event </li></ul><ul><li>Sequence or path analysis - looking for patterns where one event leads to another later event </li></ul><ul><li>Classification - looking for new patterns (May result in a change in the way the data is organized but that's ok) </li></ul><ul><li>Clustering - finding and visually documenting groups of facts not previously known </li></ul>
    6. 6. Part 2: Association Rules
    7. 7. Association Rules <ul><li>Association rules describe relationships between sets of items in a large database. </li></ul>
    8. 8. Why Association Rules? Bread, milk; Milk, sugar; Pen, ink
    9. 9. The general form of an association rule is <ul><li>X ⇒ Y </li></ul><ul><li>X: set of items {x1, x2, …, xn} </li></ul><ul><li>Y: set of items {y1, y2, y3, …, yn} </li></ul><ul><li>The rule states that database tuples that satisfy the condition in X are also likely to satisfy the condition in Y. </li></ul>
    10. 10. Consider the Purchase Table <ul><li>Retail shops are often interested in associations between the different items that people buy. If we refer to the purchase table, it is clear that </li></ul><ul><ul><ul><ul><ul><li>People who buy pens also buy ink. </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>People who buy bread also buy milk. </li></ul></ul></ul></ul></ul>
    11. 11. Association rules measures <ul><li>Support </li></ul><ul><li>Confidence </li></ul>
    12. 12. Support <ul><li>Support is the percentage of transactions that contain all the items in the union of the LHS and RHS. </li></ul><ul><li>Suppose the rule PEN ⇒ INK has a support of 75%. That is, the items in LHS ∪ RHS occur together in 75% of transactions. A higher support means the rule applies to more transactions. </li></ul>
    13. 13. Confidence <ul><li>Confidence is the percentage of transactions containing the items in the LHS that also include the items in the RHS. </li></ul><ul><li>Confidence is a measure of how often the rule is true. </li></ul><ul><li>Bread ⇒ Milk </li></ul><ul><li>A confidence of 80% means that 80% of the purchases that include bread also include milk. </li></ul>
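
The two measures on slides 12 and 13 can be sketched in a few lines of Python. This is a minimal illustration, not a mining algorithm: the function names and the four sample transactions are hypothetical, chosen so that the rule pen ⇒ ink reaches the 75% support mentioned on the slide.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Fraction of all transactions that contain every item of LHS and RHS."""
    items = set(lhs) | set(rhs)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]], lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Among transactions containing LHS, the fraction that also contain RHS."""
    lhs_items = set(lhs)
    both = lhs_items | set(rhs)
    lhs_count = sum(1 for t in transactions if lhs_items <= t)
    if lhs_count == 0:
        return 0.0  # the rule never applies, so report zero confidence
    both_count = sum(1 for t in transactions if both <= t)
    return both_count / lhs_count


# Hypothetical purchase table (not taken from the slides).
transactions = [
    {"pen", "ink", "paper"},
    {"pen", "ink"},
    {"pen", "ink", "bread", "milk"},
    {"bread", "milk"},
]

print(support(transactions, {"pen"}, {"ink"}))     # 0.75
print(confidence(transactions, {"pen"}, {"ink"}))  # 1.0
```

Here pen and ink appear together in 3 of 4 transactions (support 0.75), and every transaction with a pen also has ink (confidence 1.0).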
    14. 14. Part 3: Classification (classification rules, decision trees, mathematical formula, neural network)
    15. 15. Some basic operations <ul><li>Predictive: </li></ul><ul><ul><li>Regression </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><li>Descriptive: </li></ul><ul><ul><li>Clustering / similarity matching </li></ul></ul><ul><ul><li>Association rules and variants </li></ul></ul><ul><ul><li>Deviation detection </li></ul></ul>
    16. 16. Classification <ul><li>Given old data about customers and payments, predict a new applicant’s loan eligibility. </li></ul>[Figure: data on previous customers (age, salary, profession, location, customer type) feeds a classifier, which produces decision rules (Salary > 5 L, Prof. = Exec) that label a new applicant’s data as good or bad]
    17. 17. Classification <ul><li>Classification is a data mining (machine learning) technique used to predict group membership for data instances. </li></ul>
    18. 18. Why Data Mining <ul><li>Credit ratings/targeted marketing: </li></ul><ul><ul><li>Given a database of 100,000 names, which persons are the least likely to default on their credit cards? </li></ul></ul><ul><ul><li>Identify likely responders to sales promotions </li></ul></ul><ul><li>Fraud detection </li></ul><ul><ul><li>Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer? </li></ul></ul><ul><li>Customer relationship management: </li></ul><ul><ul><li>Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor? </li></ul></ul>
    19. 19. Classification <ul><li>Classification is defined as a process of finding a set of functions that describe and distinguish data classes. </li></ul>Training data → classification algorithm → classification rules (if age = “31…40” and income = high then rating = good). Training data (Name | Age | Income | Rating): abc | 20 | low | fair; xyz | 31…40 | medium | good; mny | 40…50 | high | excellent
    20. 20. Classification <ul><li>With this function we can find the classes of objects whose class labels are not known, based on a set of training data. </li></ul><ul><li>Training data is data whose class labels are known. </li></ul><ul><li>The following are the different forms of classification: </li></ul><ul><ul><li>Classification Rules </li></ul></ul><ul><ul><li>Decision trees </li></ul></ul><ul><ul><li>Mathematical formula </li></ul></ul><ul><ul><li>Neural network </li></ul></ul>
    21. 21. Classification methods <ul><li>Goal: Predict class Ci = f(x1, x2, …, xn) </li></ul><ul><li>Regression: (linear or any other polynomial) </li></ul><ul><ul><li>a*x1 + b*x2 + c = Ci. </li></ul></ul><ul><li>Decision tree classifier: divide decision space into piecewise constant regions. </li></ul><ul><li>Neural networks: partition by non-linear boundaries </li></ul>
    22. 22. Decision trees <ul><li>A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels. </li></ul>[Figure: example tree with internal nodes Salary < 1 M, Prof = teacher, Age < 30 and leaves Good, Bad, Bad, Good]
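
A decision tree of this kind amounts to nested if/else tests. The sketch below is only an illustration: the slide's figure does not survive in the transcript, so the way the three tests are wired together, the reading of "1 M" as one million, and the function name `classify` are all assumptions.

```python
def classify(salary: float, profession: str, age: int) -> str:
    """Walk an assumed tree: internal nodes test one attribute, leaves return a label."""
    if salary < 1_000_000:          # root node: Salary < 1 M (assumed to be the root)
        if profession == "teacher":  # assumed left child: Prof = teacher
            return "Good"
        return "Bad"
    if age < 30:                     # assumed right child: Age < 30
        return "Bad"
    return "Good"


print(classify(500_000, "teacher", 45))    # Good
print(classify(2_000_000, "executive", 25))  # Bad
```

Each call follows one root-to-leaf path, which is why applying a trained tree is fast and the resulting decision is easy to explain.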
    23. 23. Pros and Cons of decision trees <ul><li>Cons </li></ul><ul><ul><li>Cannot handle complicated relationship between features </li></ul></ul><ul><ul><li>simple decision boundaries </li></ul></ul><ul><ul><li>problems with lots of missing data </li></ul></ul><ul><li>Pros </li></ul><ul><ul><li>Reasonable training time </li></ul></ul><ul><ul><li>Fast application </li></ul></ul><ul><ul><li>Easy to interpret </li></ul></ul><ul><ul><li>Easy to implement </li></ul></ul><ul><ul><li>Can handle large number of features </li></ul></ul>
    24. 24. Neural network <ul><li>Set of nodes connected by directed weighted edges </li></ul>[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3; a more typical NN with inputs x1, x2, x3, hidden nodes and output nodes]
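
The "basic NN unit" in the figure can be sketched as a weighted sum of the inputs passed through an activation function. The slide does not say which activation is used; the sigmoid below, the optional bias term, and the function name `neuron` are assumptions made for illustration.

```python
import math
from typing import Sequence


def neuron(inputs: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Weighted sum of inputs (x1..xn with w1..wn) squashed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Three inputs x1, x2, x3 and three weights w1, w2, w3, as in the figure.
print(neuron([0.0, 0.0, 0.0], [1.0, 2.0, 3.0]))  # 0.5
print(neuron([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))  # about 0.95
```

A more typical network feeds the outputs of several such units (the hidden nodes) into further units (the output nodes).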
    25. 25. Pros and Cons of Neural Network <ul><li>Cons </li></ul><ul><ul><li>Slow training time </li></ul></ul><ul><ul><li>Hard to interpret </li></ul></ul><ul><ul><li>Hard to implement: trial and error for choosing number of nodes </li></ul></ul><ul><li>Pros </li></ul><ul><ul><li>Can learn more complicated class boundaries </li></ul></ul><ul><ul><li>Fast application </li></ul></ul><ul><ul><li>Can handle large number of features </li></ul></ul>Conclusion: Use neural nets only if decision trees/NN fail.
    26. 26. Part 4: Clustering (partitioning clustering algorithm, hierarchical clustering algorithm)
    27. 27. Clustering <ul><li>Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product. </li></ul><ul><li>Group/cluster existing customers based on the time series of their payment history such that similar customers fall in the same cluster. </li></ul><ul><li>Key requirement: a good measure of similarity between instances. </li></ul>
    28. 28. clustering
    29. 29. Similarity
    30. 30. Prevalent ≠ Interesting <ul><li>Analysts already know about prevalent rules </li></ul><ul><li>Interesting rules are those that deviate from prior expectation </li></ul><ul><li>Mining’s payoff is in finding surprising phenomena </li></ul>[Figure: in 1995, “Milk and cereal sell together!” is news; in 1998 the same “Milk and cereal sell together!” gets “Zzzz...”]
    31. 31. Clustering Algorithm <ul><li>Partition clustering Algorithm </li></ul><ul><li>Hierarchical clustering algorithm </li></ul>
    32. 32. Partition clustering Algorithm <ul><li>Partition clustering algorithm divides the records into k disjoint clusters. </li></ul><ul><li>The number of clusters k is given by the user. </li></ul>
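
The slides do not name a specific partitioning algorithm. As one common example, the sketch below uses k-means: the user supplies k, each point is assigned to its nearest centroid, and centroids are recomputed until the assignment stops changing. Taking the first k points as starting centroids is a simplifying assumption, and the sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Point], k: int, iterations: int = 100) -> Tuple[List[int], List[Point]]:
    """Return (cluster index per point, final centroids) for a user-chosen k."""
    centroids: List[Point] = list(points[:k])  # naive initialisation (assumption)
    assignment: List[int] = []
    for _ in range(iterations):
        # Assign every point to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8)]
labels, centers = k_means(points, k=2)
print(labels)  # [0, 0, 1, 1, 0]
```

The result is a flat partition: every record belongs to exactly one of the k clusters, with no tree structure among them.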
    33. 33. Hierarchical clustering algorithm <ul><li>Hierarchical clustering algorithm generates a tree of clusters. </li></ul><ul><li>In the first step, each cluster consists of a single record. </li></ul><ul><li>In the second step, two clusters are grouped together, and this grouping is repeated in later steps. </li></ul><ul><li>In the final step there is a single cluster that contains all the records. </li></ul>
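
The bottom-up procedure described on this slide can be sketched as follows. The slide does not say how the two clusters to merge are chosen; merging the pair with the smallest gap between any two members (single linkage) and the use of one-dimensional values are assumptions made to keep the example short.

```python
from typing import List


def agglomerative(values: List[float]) -> List[List[float]]:
    """Start with one cluster per record and merge the two closest until one remains.

    Returns the merged cluster produced at each step, i.e. the internal
    nodes of the cluster tree from the bottom up.
    """
    clusters = [[v] for v in values]  # first step: every record is its own cluster
    merges: List[List[float]] = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges


for step in agglomerative([1.0, 2.0, 9.0, 10.0, 25.0]):
    print(step)
```

For the five sample values this prints `[1.0, 2.0]`, `[9.0, 10.0]`, `[1.0, 2.0, 9.0, 10.0]` and finally the single cluster holding all five records.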
    34. 34. Part 5: Approaches to data mining problems (discovery of sequential patterns, discovery of patterns in time series, discovery of classification rules, regression)
    35. 35. Discovery of sequential patterns Suppose a customer visits the shop three times and purchases the following sequence of item sets: { milk, bread, juice }, { bread, eggs }, { cookies, milk, coffee }. The problem of discovering sequential patterns is to find all subsequences from the given sets of sequences that have a user-defined minimum support. Table (Trans_id | Time | Item_Purchased): 101 | 6.35 | Milk, bread, juice; 792 | 7.38 | Milk, juice; 1130 | 8.05 | Milk, eggs; 1735 | 8.40 | Bread, cookies, coffee
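
Checking the support of one candidate subsequence can be sketched as below. The first customer is the three-visit sequence from the slide; the other two customers, the function names, and the definition of support as "fraction of customer sequences that contain the pattern" are assumptions for illustration. A full miner would also have to generate the candidate patterns, which is not shown.

```python
from typing import List, Set

Sequence_ = List[Set[str]]  # one customer's visits, in time order


def contains(sequence: Sequence_, pattern: Sequence_) -> bool:
    """True if each pattern item set is a subset of a distinct visit, in order."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1  # matched this pattern element; look for the next in later visits
    return pos == len(pattern)


def sequence_support(sequences: List[Sequence_], pattern: Sequence_) -> float:
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


customers = [
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"cookies", "milk", "coffee"}],  # from the slide
    [{"bread"}, {"milk"}],                                # hypothetical
    [{"milk", "eggs"}, {"juice"}, {"bread", "coffee"}],   # hypothetical
]

pattern = [{"milk"}, {"bread"}]  # milk on one visit, bread on a later visit
print(sequence_support(customers, pattern))  # 2 of 3 customers
```

A pattern is reported only if this value reaches the user-defined minimum support.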
    36. 36. Discovery of patterns in time series <ul><li>Time series are sequences of events, each event being a fixed type of transaction. </li></ul><ul><li>The period during which the stock rose steadily for n days. </li></ul><ul><li>The longest period over which the stock had a change of not more than 1% over the last closing price. </li></ul><ul><li>The quarter of a year during which the stock had the most percentage gain or loss. </li></ul>
    37. 37. Discovery of classification rules <ul><li>Classification is a process of defining a function that classifies a given object into one of many possible classes. </li></ul>
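
As a sketch of the second pattern above, the function below scans a list of daily closing prices for the longest run in which each close differs from the previous close by at most 1%. The function name, the return convention (start index and length) and the sample prices are hypothetical.

```python
from typing import List, Tuple


def longest_stable_period(closes: List[float], max_change: float = 0.01) -> Tuple[int, int]:
    """Return (start index, length in days) of the longest run in which every
    close is within max_change of the previous day's close."""
    if not closes:
        return (0, 0)
    best_start, best_len = 0, 1
    start = 0
    for i in range(1, len(closes)):
        change = abs(closes[i] - closes[i - 1]) / closes[i - 1]
        if change > max_change:
            start = i  # the run is broken; a new run begins today
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return best_start, best_len


closes = [100.0, 100.5, 101.0, 105.0, 105.2, 105.1, 105.3, 110.0]
print(longest_stable_period(closes))  # (3, 4)
```

In the sample data the jumps to 105.0 and 110.0 exceed 1%, so the longest stable run is the four days starting at index 3.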
    38. 38. Example <ul><li>A bank wishes to classify its loan applicants into two groups or classes: </li></ul><ul><ul><li>A group who are loan worthy (eligible) </li></ul></ul><ul><ul><li>Another group who are not loan worthy (not eligible) </li></ul></ul><ul><ul><li>To do this classification, the bank can use the classification rule given below: </li></ul></ul><ul><ul><li>If monthly income is greater than 30,000 then </li></ul></ul><ul><ul><li>they are worthy </li></ul></ul><ul><ul><li>Else not worthy </li></ul></ul>
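
The bank's rule translates directly into code. In this sketch the function name, the `threshold` parameter and the sample applicants are hypothetical; only the 30,000 cut-off comes from the slide.

```python
def is_loan_worthy(monthly_income: float, threshold: float = 30_000) -> bool:
    """Classification rule from the slide: worthy only if income exceeds the threshold."""
    return monthly_income > threshold


# Hypothetical applicants mapped to their monthly income.
applicants = {"A": 45_000, "B": 30_000, "C": 12_500}
for name, income in applicants.items():
    label = "worthy" if is_loan_worthy(income) else "not worthy"
    print(name, label)
```

Applicant B earns exactly 30,000 and is classed as not worthy, because the rule requires income strictly greater than the threshold.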
    39. 39. Regression <ul><li>Regression is defined as a function over variables that gives the value of a target variable. </li></ul>
    40. 40. Example <ul><li>Labtest(Patient id, test1, test2, …, testn) </li></ul><ul><li>This contains the values of n tests for one patient. </li></ul><ul><li>The target variable that we wish to predict is p, the probability of survival of the patient. </li></ul><ul><li>p = f(test1, test2, test3, …, testn) </li></ul><ul><li>This function is called the regression function. </li></ul>
    41. 41. MSPVL Polytechnic college