
Cluster2 Presentation Transcript

  • 1. Clustering in Data Warehouse, Department of CE, MSPVL Polytechnic College, Pavoorchatram
  • 2. Overview
    • Part 1: What is data mining
    • Part 2: Association rules
    • Part 3: Classification
    • Part 4: Clustering
    • Part 5: Approaches to data mining problems
    • Part 6: Applications of data mining
    • Part 7: Commercial tools for data mining
  • 3. Part 1: Data Mining
  • 4. Relationship to data warehouse
    • Data mining uses the data warehouse to support decisions; the purpose of a data warehouse is to support decision making.
    • Data mining can also be applied to operational databases that hold individual transactions.
    • Data mining helps in extracting meaningful new patterns.
    • Data mining applications should be considered during the design of a data warehouse. The successful use of data mining applications depends on how the data warehouse is constructed.
  • 5. Define Data Mining
    • Data mining is sorting through data to identify patterns and establish relationships.
    • Data mining parameters include:
    • Association - looking for patterns where one event is connected to another event
    • Sequence or path analysis - looking for patterns where one event leads to another later event
    • Classification - looking for new patterns (may result in a change in the way the data is organized, but that's OK)
    • Clustering - finding and visually documenting groups of facts not previously known
  • 6. Part 2: Association Rules
  • 7. Association Rules
    • Association rules describe relationships between sets of items in a large database.
  • 8. Why Association Rules?
    • Bread, milk
    • Milk, sugar
    • Pen, ink
  • 9. The general form of an association rule is
    • X → Y
    • X = set of items {x1, x2, …, xn}
    • Y = set of items {y1, y2, y3, …, yn}
    • The above rule states that database tuples that satisfy the condition in X are also likely to satisfy the condition in Y.
  • 10. Consider the Purchase Table
    • Retail shops are often interested in associations between the different items that people buy. From the purchase table it is clear that
            • People who buy a pen also buy ink.
            • People who buy bread also buy milk.
  • 11. Association rule measures
    • Support
    • Confidence
  • 12. Support
    • Support is the percentage of transactions that contain the union of all the items in the LHS and RHS.
    • Suppose the rule pen → ink has a support of 75%. That is, the items in LHS ∪ RHS occur together in 75% of transactions, which is a high support.
  • 13. Confidence
    • Confidence is the percentage of transactions containing the items in the LHS that also contain the items in the RHS.
    • Confidence is a measure of how often the rule is true.
    • Bread → milk
    • A confidence of 80% means that 80% of the purchases that include bread also include milk.
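The two measures can be computed directly from a list of transactions. The sketch below is a minimal illustration, assuming each transaction is a set of item names; the `purchases` data and the function names are hypothetical and do not come from the slides' purchase table.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Fraction of the transactions containing `lhs` that also contain `rhs`."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)


# Hypothetical purchase table, invented for illustration only.
purchases = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "ink", "bread"},
    {"bread", "milk"},
]

print(support(purchases, {"pen", "ink"}))          # pen and ink appear together in 3 of 4 transactions
print(confidence(purchases, {"pen"}, {"ink"}))     # share of pen purchases that also include ink
print(confidence(purchases, {"bread"}, {"milk"}))  # share of bread purchases that also include milk
```

Note that `confidence` divides by the support of the LHS, so it is undefined for an LHS that never occurs in the data.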
  • 14. Part 3: Classification
    • Classification rules
    • Decision trees
    • Mathematical formula
    • Neural network
  • 15. Some basic operations
    • Predictive:
      • Regression
      • Classification
    • Descriptive:
      • Clustering / similarity matching
      • Association rules and variants
      • Deviation detection
  • 16. Classification
    • Given old data about customers and payments, predict new applicant’s loan eligibility.
    [Figure: records of previous customers (age, salary, profession, location, customer type) are fed to a classifier, which produces decision rules (Salary > 5 L, Prof. = Exec) that label a new applicant's data as good/bad]
  • 17. Classification
    • Classification is a data mining (machine learning) technique used to predict group membership for data instances.
  • 18. Why Data Mining
    • Credit ratings/targeted marketing :
      • Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
      • Identify likely responders to sales promotions
    • Fraud detection
      • Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
    • Customer relationship management :
      • Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
  • 19. Classification
    • Classification is defined as a process of finding a set of functions that describe and distinguish data classes.
    Training data → classification algorithm → classification rules

    Name   Age      Income   Rating
    abc    20       Low      Fair
    xyz    31…40    Medium   Good
    mny    40…50    High     Excellent

    Classification rule: if age = "31…40" and income = high then rating = good.
  • 20. Classification
    • Using this function, we can determine the classes of objects whose class labels are not known, based on a set of training data.
    • Training data is data whose class labels are known.
    • The following are the different forms of classification
      • Classification Rules
      • Decision trees
      • Mathematical formula
      • Neural network
  • 21. Classification methods
    • Goal: predict class Ci = f(x1, x2, ..., xn)
    • Regression (linear or any other polynomial):
      • a*x1 + b*x2 + c = Ci
    • Decision tree classifier: divide decision space into piecewise constant regions.
    • Neural networks: partition by non-linear boundaries
  • 22. Decision trees
    • Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
    [Figure: decision tree with internal nodes Salary < 1 M, Prof = teacher, and Age < 30, and leaf labels Good, Bad, Bad, Good]
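A decision tree of this kind can be read as nested conditions. The sketch below is only one possible reading: the branch layout (which test sits under which branch, and which leaf belongs to which outcome) is an assumption, because the figure's structure is not recoverable from the transcript. Reading "1 M" as 1,000,000 is also an assumption, and the function name is hypothetical.

```python
def classify(salary, profession, age):
    """Walk an assumed three-node decision tree and return a class label."""
    # Root node: test on salary (assumed to mean 1,000,000).
    if salary < 1_000_000:
        # Assumed left subtree: test on profession.
        return "Good" if profession == "teacher" else "Bad"
    # Assumed right subtree: test on age.
    return "Bad" if age < 30 else "Good"


print(classify(500_000, "teacher", 40))
print(classify(2_000_000, "executive", 25))
```

Each call follows exactly one path from the root to a leaf, which is why applying a trained tree is fast.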
  • 23. Pros and Cons of decision trees
    • Cons
      • Cannot handle complicated relationships between features
      • Simple decision boundaries
      • Problems with lots of missing data
    • Pros
      • Reasonable training time
      • Fast application
      • Easy to interpret
      • Easy to implement
      • Can handle large number of features
  • 24. Neural network
    • Set of nodes connected by directed weighted edges
    [Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3; a more typical NN with inputs x1, x2, x3, hidden nodes, and output nodes]
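A basic unit with three weighted inputs can be sketched as a weighted sum followed by an activation function. The slides do not name an activation, so the sigmoid used here is an assumption, as are the `bias` parameter and the example numbers.

```python
import math


def unit_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, squashed into (0, 1) by a sigmoid."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical inputs x1..x3 and weights w1..w3.
print(unit_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.8]))
```

A more typical network feeds the outputs of several such units (the hidden nodes) into further units (the output nodes).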
  • 25. Pros and Cons of Neural Network
    • Cons
      • Slow training time
      • Hard to interpret
      • Hard to implement: trial and error for choosing number of nodes
    • Pros
      • Can learn more complicated class boundaries
      • Fast application
      • Can handle large number of features
    Conclusion: Use neural nets only if decision trees/NN fail.
  • 26. Part 4: Clustering
    • Partitioning clustering algorithm
    • Hierarchical clustering algorithm
  • 27. Clustering
    • Unsupervised learning: used when old data with class labels is not available, e.g. when introducing a new product.
    • Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.
    • Key requirement: a good measure of similarity between instances.
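One common choice of similarity measure is Euclidean distance, where a smaller distance means more similar instances. The slides do not say which measure is intended, so this is an assumption; the payment histories below are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical monthly payment histories for three customers.
alice = [100, 100, 100, 100]
bob = [100, 90, 100, 110]
carol = [0, 0, 250, 0]

print(euclidean_distance(alice, bob))    # small distance: similar payers
print(euclidean_distance(alice, carol))  # large distance: dissimilar payers
```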
  • 28. Clustering
  • 29. Similarity
  • 30. Prevalent ≠ Interesting
    • Analysts already know about prevalent rules
    • Interesting rules are those that deviate from prior expectation
    • Mining's payoff is in finding surprising phenomena
    [Figure: in 1995, "Milk and cereal sell together!" is news; in 1998, the same finding gets "Zzzz..."]
  • 31. Clustering Algorithm
    • Partition clustering Algorithm
    • Hierarchical clustering algorithm
  • 32. Partition clustering Algorithm
    • A partition clustering algorithm divides the records into a flat set of non-overlapping clusters.
    • The number of cluster k is given by the user
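The slides do not name a specific partitioning algorithm. As an illustration, the sketch below uses k-means, a widely used partitioning method that takes the number of clusters k from the user. The initialisation (first k points), the fixed iteration count, and the sample points are all simplifying assumptions.

```python
def k_means(points, k, iterations=10):
    """Partition `points` (tuples of numbers) into k clusters."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return clusters


# Hypothetical two-dimensional records.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
clusters = k_means(points, k=2)
print(clusters)
```

The result is a flat list of k groups; every record belongs to exactly one of them.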
  • 33. Hierarchical clustering algorithm
    • A hierarchical clustering algorithm generates a tree of clusters.
    • In the first step, each cluster consists of a single record.
    • In the second step, two clusters are grouped together, and the grouping repeats in later steps.
    • In the final step, there is a single cluster that contains all the records.
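The bottom-up procedure described above can be sketched as follows. The merge criterion is not given in the slides; this sketch assumes single linkage (merge the two clusters whose closest members are nearest) on one-dimensional values, and the sample data is hypothetical.

```python
def agglomerative(values):
    """Merge clusters bottom-up; return the list of clusters at each level."""
    clusters = [[v] for v in values]  # first step: one cluster per record
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest gap between any two members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gap = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels


levels = agglomerative([1, 2, 9, 10, 25])
for level in levels:
    print(level)
```

Each printed level is one cut through the tree of clusters, from individual records down to a single cluster.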
  • 34. Part 5: Approaches to data mining problems
    • Discovery of sequential patterns
    • Discovery of patterns in time series
    • Discovery of classification rules
    • Regression
  • 35. Discovery of sequential patterns
    • Suppose a customer visits the shop three times and purchases the following sequence of item sets: {milk, bread, juice}, {bread, eggs}, {cookies, milk, coffee}.
    • The problem of discovering sequential patterns is to find all subsequences from the given sets of sequences that have a user-defined minimum support.

    Trans_id   Time   Item_Purchased
    101        6.35   Milk, bread, juice
    792        7.38   Milk, juice
    1130       8.05   Milk, eggs
    1735       8.40   Bread, cookies, coffee
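The core step in this problem is measuring the support of one candidate subsequence across many customer sequences. The sketch below shows only that step, not a full mining algorithm. It assumes a pattern is contained in a sequence when each of its item sets is a subset of a later and later visit; the first customer is the one from the slide, and the other two are hypothetical.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs, in order, within `sequence` (lists of item sets)."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1  # this visit covers the next item set of the pattern
    return pos == len(pattern)


def sequence_support(sequences, pattern):
    """Fraction of customer sequences that contain `pattern`."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


customers = [
    # Customer from the slide.
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"cookies", "milk", "coffee"}],
    # Hypothetical customers, added for illustration.
    [{"milk", "juice"}, {"bread", "cookies", "coffee"}],
    [{"bread"}, {"milk", "eggs"}],
]

# Support of "milk on one visit, bread on a later visit".
print(sequence_support(customers, [{"milk"}, {"bread"}]))
```

A pattern would be reported only if this value reaches the user-defined minimum support.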
  • 36. Discovery of patterns in time series
    • A time series is a sequence of events, where each event is a fixed type of transaction.
    • Patterns that can be looked for in a stock's closing prices include:
      • The period during which the stock rose steadily for n days.
      • The longest period over which the stock had a change of not more than 1% over the last closing price.
      • The quarter of a year during which the stock had the most percentage gain or loss.
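As an illustration of the second pattern, the sketch below finds the longest run of days on which the closing price moved by no more than 1% from the previous close. The closing prices, the function name, and the `limit` parameter are hypothetical.

```python
def longest_quiet_period(closes, limit=0.01):
    """Longest run of consecutive day-to-day changes that each stay within
    `limit` (a fraction) of the previous closing price."""
    best = run = 0
    for prev, cur in zip(closes, closes[1:]):
        if abs(cur - prev) <= limit * abs(prev):
            run += 1
            best = max(best, run)
        else:
            run = 0  # a larger move ends the current quiet period
    return best


# Hypothetical daily closing prices.
closes = [100.0, 100.5, 100.0, 104.0, 104.2, 104.0, 103.9, 110.0]
print(longest_quiet_period(closes))
```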
  • 37. Discovery of classification rules
    • Classification is the process of defining a function that classifies a given object into one of many possible classes.
  • 38. Example
    • A bank wishes to classify its loan applicants into two groups or classes.
      • A group who are loan worthy (eligible)
      • Another group who are not loan worthy (not eligible)
      • To do the above classification, the bank can use the classification rule given below:
      • If monthly income is greater than 30,000 then
      • they are worthy
      • Else not worthy
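The bank's rule translates directly into a one-line classifier. The function and parameter names are hypothetical; the 30,000 threshold is the one stated in the rule.

```python
def loan_worthy(monthly_income, threshold=30_000):
    """Apply the rule: income above the threshold means loan worthy."""
    return monthly_income > threshold


for income in (25_000, 30_000, 45_000):
    label = "worthy" if loan_worthy(income) else "not worthy"
    print(income, label)
```

An income of exactly 30,000 is classified as not worthy, because the rule says "greater than".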
  • 39. Regression
    • Regression is defined as a function over variables which gives a target class variable.
  • 40. Example
    • Labtest(Patient id, test1, test2, …, testn)
    • This contains the values of n tests for one patient.
    • The target variable that we wish to predict is p, the probability of survival of the patient.
    • p = f(test1, test2, test3, …, testn)
    • This function is called regression function.
  • 41. MSPVL Polytechnic College