Cluster2
Published in Education, Technology, Business
  • Each topic is a talk..

Transcript

  • 1. Clustering in Data Warehouse. Department of CE, MSPVL Polytechnic College, Pavoorchatram
  • 2. Overview
    • Part 1: What is data mining
    • Part 2: Association rules
    • Part 3: Classification
    • Part 4: Clustering
    • Part 5: Approaches to data mining problems
    • Part 6: Application of data mining
    • Part 7: Commercial tools for data mining
  • 3. Part 1: Data Mining
  • 4. Relationship to data warehouse
    • Data mining uses the data warehouse to make decisions; the purpose of a data warehouse is to support decision making.
    • Data mining can also be applied to operational databases that hold individual transactions.
    • Data mining helps in extracting meaningful new patterns.
    • Data mining applications should be considered during the design of a data warehouse. The successful use of data mining applications depends on how the data warehouse is constructed.
  • 5. Define Data Mining
    • Data mining is sorting through data to identify patterns and establish relationships.
    • Data mining parameters include:
    • Association - looking for patterns where one event is connected to another event
    • Sequence or path analysis - looking for patterns where one event leads to another later event
    • Classification - looking for new patterns (may result in a change in the way the data is organized, but that's OK)
    • Clustering - finding and visually documenting groups of facts not previously known
  • 6. Part 2: Association Rules
  • 7. Association Rules
    • Association rules describe relationships between sets of items in a large database.
  • 8. Why Association Rules?
    • Bread, milk
    • Milk, sugar
    • Pen, ink
  • 9. The general form of association rule is
    • X → Y
    • X = {x1, x2, …, xn}, a set of items
    • Y = {y1, y2, y3, …, yn}, a set of items
    • The above rule can be stated as: database tuples that satisfy the condition in X are also likely to satisfy the condition in Y.
  • 10. Consider the Purchase Table
    • Retail shops are often interested in associations between the different items that people buy. Referring to the purchase table, it is clear that:
            • People who buy pens also buy ink.
            • People who buy bread also buy milk.
  • 11. Association rules measures
    • Support
    • Confidence
  • 12. Support
    • Support is the percentage of transactions that contain the union of all the items in the LHS and RHS.
    • Suppose the rule PEN → INK has a support of 75%. That is, the items in LHS ∪ RHS occur together in 75% of the transactions, which is a high support.
  • 13. Confidence
    • Confidence is the percentage of the transactions containing the items in the LHS that also include the items in the RHS.
    • Confidence is a measure of how often the rule is true.
    • Bread → Milk
    • A confidence of 80% means that 80% of the purchases that include bread also include milk.
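Support and confidence, as defined on the two slides above, can be computed directly from a list of transactions. The following is a minimal sketch; the function names and the small purchase table are hypothetical and are not taken from the slides.

```python
def support(transactions, lhs, rhs):
    """Fraction of transactions that contain every item in LHS ∪ RHS."""
    items = set(lhs) | set(rhs)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing LHS, the fraction that also contain RHS."""
    lhs = set(lhs)
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        return 0.0  # the rule never applies
    both = lhs | set(rhs)
    return sum(both <= t for t in with_lhs) / len(with_lhs)


# Hypothetical purchase table (each transaction is a set of items).
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "ink", "bread", "milk"},
    {"bread", "sugar"},
]

pen_ink_support = support(transactions, {"pen"}, {"ink"})            # 3 of 4 transactions
pen_ink_confidence = confidence(transactions, {"pen"}, {"ink"})      # every pen purchase includes ink
bread_milk_confidence = confidence(transactions, {"bread"}, {"milk"})  # 1 of 2 bread purchases
print(pen_ink_support, pen_ink_confidence, bread_milk_confidence)
```

Note that support is measured over all transactions, while confidence is measured only over the transactions that contain the LHS.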
  • 14. Part 3: Classification
    • Classification rules
    • Decision trees
    • Mathematical formula
    • Neural network
  • 15. Some basic operations
    • Predictive:
      • Regression
      • Classification
    • Descriptive:
      • Clustering / similarity matching
      • Association rules and variants
      • Deviation detection
  • 16. Classification
    • Given old data about customers and payments, predict new applicant’s loan eligibility.
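    [Figure: records of previous customers (age, salary, profession, location, customer type) are fed to a classifier, which produces decision rules such as Salary > 5 L and Prof. = Exec; the rules are applied to a new applicant's data to label the applicant good/bad.]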
  • 17. Classification
    • Classification is a data mining (machine learning) technique used to predict group membership for data instances.
  • 18. Why Data Mining
    • Credit ratings/targeted marketing:
      • Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
      • Identify likely responders to sales promotions
    • Fraud detection:
      • Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
    • Customer relationship management:
      • Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
  • 19. Classification
    • Classification is defined as a process of finding a set of functions that describe and distinguish data classes.
    [Figure: Training data → Classification algorithm → Classification rules]
    Example rule: If age = "31 …. 40" and income = high then rating = good.
    Training data:
    Name | Age   | Income | Rating
    abc  | 20    | Low    | Fair
    xyz  | 31…40 | Medium | Good
    mny  | 40…50 | High   | Excellent
  • 20. Classification
    • Using this function, we can find the classes of objects whose class labels are not known, based on a set of training data.
    • Training data is data whose class labels are known.
    • The following are the different forms of classification
      • Classification Rules
      • Decision trees
      • Mathematical formula
      • Neural network
  • 21. Classification methods
    • Goal: predict class Ci = f(x1, x2, …, xn)
    • Regression (linear or any other polynomial):
      • a*x1 + b*x2 + c = Ci
    • Decision tree classifier: divide decision space into piecewise constant regions.
    • Neural networks: partition by non-linear boundaries
  • 22. Decision trees
    • Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
    [Figure: example decision tree with internal nodes Salary < 1 M, Prof = teacher and Age < 30, and leaf labels Good, Bad, Bad, Good.]
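A decision tree of this kind can be represented as nested nodes and applied by walking from the root to a leaf. The sketch below is illustrative only: the exact shape of the tree in the slide's figure did not survive, so the arrangement of the three tests, the reading of "1 M" as 1,000,000, and the record field names are all assumptions.

```python
# A node is either a class label (a leaf, stored as a string) or a tuple
# (test, subtree_if_true, subtree_if_false).
# ASSUMPTION: this arrangement of the tests is hypothetical.
TREE = (
    lambda r: r["salary"] < 1_000_000,
    (lambda r: r["profession"] == "teacher", "Good", "Bad"),
    (lambda r: r["age"] < 30, "Bad", "Good"),
)


def classify(node, record):
    """Follow the decision rules from the root until a leaf label is reached."""
    while not isinstance(node, str):
        test, if_true, if_false = node
        node = if_true if test(record) else if_false
    return node


applicant = {"salary": 500_000, "profession": "teacher", "age": 40}
print(classify(TREE, applicant))
```

Each prediction needs only one comparison per level of the tree, which is why applying a trained tree is fast.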
  • 23. Pros and Cons of decision trees
    • Cons
      • Cannot handle complicated relationships between features
      • Simple decision boundaries
      • Problems with lots of missing data
    • Pros
      • Reasonable training time
      • Fast application
      • Easy to interpret
      • Easy to implement
      • Can handle large number of features
  • 24. Neural network
    • Set of nodes connected by directed weighted edges
    [Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3, next to a more typical NN with inputs x1, x2, x3, hidden nodes and output nodes.]
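The basic unit in the figure combines its inputs through weighted edges. A minimal sketch of one such unit, and of a layer built from several units, follows. The sigmoid activation, the bias term and all numeric values are assumptions; the slides only state that nodes are connected by directed weighted edges.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a sigmoid (assumed activation)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows):
    """One layer: every node sees all inputs, each with its own weight vector."""
    return [neuron(inputs, weights) for weights in weight_rows]


x = [1.0, 0.5, -1.0]                           # x1, x2, x3 (hypothetical values)
hidden = layer(x, [[0.2, -0.4, 0.1],           # two hidden nodes
                   [0.7, 0.3, -0.5]])
output = neuron(hidden, [1.0, -1.0])           # single output node
print(hidden, output)
```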
  • 25. Pros and Cons of Neural Network
    • Cons
      • Slow training time
      • Hard to interpret
      • Hard to implement: trial and error for choosing number of nodes
    • Pros
      • Can learn more complicated class boundaries
      • Fast application
      • Can handle large number of features
    Conclusion: Use neural nets only if decision trees/NN fail.
  • 26. Part 4: Clustering
    • Partitioning clustering algorithm
    • Hierarchical clustering algorithm
  • 27. Clustering
    • Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product.
    • Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.
    • Key requirement: a good measure of similarity between instances.
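One possible similarity measure for payment histories is the Euclidean distance between two equal-length series, where a smaller distance means more similar customers. The slides do not name a specific measure, so this choice and the sample data are assumptions.

```python
import math


def euclidean_distance(a, b):
    """Distance between two equal-length numeric series; 0 means identical."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical monthly payment amounts for three customers.
customer_a = [100, 100, 100, 100]
customer_b = [100, 90, 110, 100]
customer_c = [0, 0, 400, 0]

print(euclidean_distance(customer_a, customer_b))
print(euclidean_distance(customer_a, customer_c))
```

With this data, customers a and b are much closer to each other than either is to customer c, so a clustering algorithm using this measure would tend to group a and b together.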
  • 28. Clustering
  • 29. Similarity
  • 30. Prevalent ≠ Interesting
    • Analysts already know about prevalent rules
    • Interesting rules are those that deviate from prior expectation
    • Mining's payoff is in finding surprising phenomena
    [Figure: 1995: "Milk and cereal sell together!" 1998: "Milk and cereal sell together!" Zzzz...]
  • 31. Clustering Algorithm
    • Partition clustering Algorithm
    • Hierarchical clustering algorithm
  • 32. Partition clustering Algorithm
    • A partition clustering algorithm divides the records into k non-overlapping clusters.
    • The number of clusters, k, is given by the user.
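Partition clustering with a user-supplied k can be sketched with k-means, one well-known partitioning method. The slides do not name a specific algorithm, so k-means, the deterministic choice of the first k points as starting centroids, and the sample points are all assumptions.

```python
def kmeans(points, k, iterations=100):
    """Partition points into k clusters; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]  # simple deterministic start (assumption)
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # cluster index for each point
```

The result is a flat partition: each point belongs to exactly one of the k clusters.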
  • 33. Hierarchical clustering algorithm
    • Hierarchical clustering algorithm generates a tree of clusters.
    • That is, in the first step each cluster consists of a single record.
    • In the second step, the two closest clusters are grouped together.
    • In the final step there is a single cluster containing all the records.
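The steps above can be sketched as a bottom-up (agglomerative) procedure that starts with one cluster per record and merges two clusters at a time until one remains. The single-link distance used to pick which clusters to merge is an assumption; the slides do not specify how the pair is chosen.

```python
import math


def agglomerate(records):
    """Return the list of clusterings produced after each merge step."""
    clusters = [[r] for r in records]            # first step: one record per cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest (single link).
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(
                math.dist(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
        history.append([list(c) for c in clusters])
    return history


records = [(0,), (1,), (5,), (7,), (20,)]         # hypothetical one-dimensional records
history = agglomerate(records)
for step in history:
    print(step)
```

Reading `history` from first to last entry traces the tree of clusters from its leaves to its root.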
  • 34. Part 5: Approaches to data mining problems
    • Discovery of sequential patterns
    • Discovery of patterns in time series
    • Discovery of classification rules
    • Regression
  • 35. Discovery of sequential patterns
    • Suppose a customer visits the shop three times and purchases the following sequence of item sets:
      • { milk, bread, juice }
      • { bread, eggs }
      • { cookies, milk, coffee }
    • The problem of discovering sequential patterns is to find all subsequences, from the given set of sequences, that have a user-defined minimum support.
    Trans_id | Time | Item_Purchased
    101      | 6.35 | Milk, bread, juice
    792      | 7.38 | Milk, juice
    1130     | 8.05 | Milk, eggs
    1735     | 8.40 | Bread, cookies, coffee
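The support of a candidate sequential pattern is the fraction of customer sequences that contain it, in order. The sketch below checks a single pattern rather than enumerating all of them. The first sequence is the one from the slide; the other two sequences, the pattern and the function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if the pattern's item sets appear, in order, inside the sequence's item sets."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1                      # matched the next item set of the pattern
    return pos == len(pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


sequences = [
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"cookies", "milk", "coffee"}],
    [{"milk"}, {"bread"}],                              # hypothetical customer
    [{"bread", "eggs"}, {"milk", "juice"}],             # hypothetical customer
]
pattern = [{"bread"}, {"milk"}]                         # bread, then milk on a later visit
pattern_support = sequence_support(sequences, pattern)
print(pattern_support)
```

A pattern is reported only if its support is at least the user-defined minimum; here two of the three sequences contain "bread, then milk".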
  • 36. Discovery of patterns in time series
    • A time series is a sequence of events, each of which is a fixed type of transaction, for example the closing price of a stock on each day.
    • The period during which the stock rose steadily for n days.
    • The longest period over which the stock had a change of not more than 1% over the previous closing price.
    • The quarter of a year during which the stock had the largest percentage gain or loss.
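Two of the patterns listed above, the longest steady rise and the longest period with at most a 1% change per day, can be found with a single scan over the closing prices. This is a minimal sketch; the helper name and the price series are hypothetical.

```python
def longest_run(prices, keep):
    """Longest (start, end) index range in which keep(prev, cur) holds for every consecutive pair."""
    best = (0, 0)
    start = 0
    for i in range(1, len(prices)):
        if not keep(prices[i - 1], prices[i]):
            start = i                     # the run is broken; restart at this day
        elif i - start > best[1] - best[0]:
            best = (start, i)
    return best


closes = [100, 101, 103, 102, 102.5, 103, 104, 106, 105]   # hypothetical closing prices

rising = longest_run(closes, lambda prev, cur: cur > prev)
steady = longest_run(closes, lambda prev, cur: abs(cur - prev) <= 0.01 * prev)
print(rising, steady)
```

Both results are index ranges into `closes`; the length of the period in days is `end - start`.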
  • 37. Discovery of classification rules
    • Classification is the process of defining a function that classifies a given object into one of many possible classes.
  • 38. Example
    • A bank wishes to classify its loan applicants into two groups or classes:
      • A group who are loan worthy (eligible)
      • Another group who are not loan worthy (not eligible)
      • To do the above classification, the bank can use the classification rule given below:
      • If monthly income is greater than 30,000 then
      • they are worthy
      • Else not worthy
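The bank's rule translates directly into a one-line function. The function name and the default threshold parameter are illustrative; the 30,000 cut-off is the one given in the rule above.

```python
def loan_worthy(monthly_income, threshold=30_000):
    """Apply the rule: income greater than the threshold means worthy."""
    return "worthy" if monthly_income > threshold else "not worthy"


for income in (45_000, 30_000, 12_000):
    print(income, loan_worthy(income))
```

Because the rule says "greater than", an applicant earning exactly 30,000 is classified as not worthy.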
  • 39. Regression
    • Regression is defined as a function over variables that gives a target class variable.
  • 40. Example
    • Labtest(Patient id, test1, test2, …, testn)
    • This contains the values of n tests for one patient.
    • The target variable that we wish to predict is p, the probability of survival of the patient.
    • p = f(test1, test2, test3, …, testn)
    • This function is called the regression function.
  • 41. MSPVL Polytechnic college