Knowledge Discovery and Data Mining


A study

Published in: Education, Technology

  1. KDD: A Definition
     • KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
     • Data volumes run from 10^6 to 10^12 bytes: we never see the whole data set, so we put it in the memory of computers.
     • Then run data mining algorithms.
     • What is the knowledge? How do we represent and use it?
  2. Why Do We Need KDD?
     • Some data overload examples:
       – Wal-Mart records 20 million transactions per day
       – Health care transactions: multi-gigabyte databases
       – Mobil Oil: geological data of over 100 terabytes
     • Data is the most important tool to gain a competitive edge by providing improved, customized services.
     [Figure: "Data Overload" surrounded by Science, Retail, Marketing, Healthcare, Finance]
  3. Knowledge Discovery Process
     [Figure: process diagram with stages labeled Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
  4. Knowledge Discovery in Databases
     • Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, potentially useful, and ultimately understandable patterns in data.
     [Figure: flow diagram with labels Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]
  5. Knowledge Discovery Process
     • Goals
     • Data Selection, Acquisition & Integration
     • Data Cleaning
     • Data Reduction & Projection
     • Matching the Goals
     • Exploratory Data Analysis
     • Data Mining
     • Interpretation and Testing
     • Consolidation & Use
  6. Knowledge Discovery Process, Step 1: Identifying the Goal
     • The first step is developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's viewpoint.
  7. Knowledge Discovery Process, Step 2: Creating a Target Data Set
     • Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
  8. Knowledge Discovery Process, Step 3: Data Cleaning and Preprocessing
     • Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
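One of the strategies this step mentions, handling missing data fields, can be sketched as follows. This is only an illustration that assumes mean imputation is appropriate; the record layout and the names `impute_missing`, `cust_id`, and `age` are hypothetical and do not come from the slides.

```python
from statistics import mean


def impute_missing(records, field):
    """Return new records with None in `field` replaced by the observed mean."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


# Hypothetical customer records; None marks a missing data field.
records = [
    {"cust_id": 1, "age": 30},
    {"cust_id": 2, "age": None},
    {"cust_id": 3, "age": 50},
]
cleaned = impute_missing(records, "age")
```

Other strategies (dropping incomplete records, or modeling the missing value) may suit a given goal better; the input list is left unmodified so the choice can be revisited.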
  9. Knowledge Discovery Process, Step 4: Data Reduction and Projection
     • Finding useful features to represent the data depending on the goal of the task.
     • With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.
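As a rough illustration of reducing the number of variables, the sketch below drops columns that never vary. Note that this is a simple variance filter, chosen for brevity, and not a projection method such as principal component analysis; the column names and the function `drop_low_variance` are made up for the example.

```python
from statistics import pvariance


def drop_low_variance(rows, names, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, [names[i] for i in keep]


# Hypothetical data: the second column is constant and carries no information.
rows = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.1], [3.0, 5.0, 0.4]]
names = ["visits", "region_code", "discount"]
reduced, kept = drop_low_variance(rows, names)
```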
  10. Knowledge Discovery Process, Step 5: Matching the Goals
     • Matching the goals of the KDD process to a particular data-mining method such as summarization, classification, regression, clustering, etc.
  11. Knowledge Discovery Process, Step 6: Exploratory Analysis and Model & Hypothesis Selection
     • Choosing the data mining algorithms and selecting methods to be used for searching for data patterns.
     • This process includes deciding which models and parameters might be appropriate and matching a particular data-mining method with the overall criteria of the KDD process.
  12. Knowledge Discovery Process, Step 7: Data Mining
     • Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering.
     • The user can significantly aid the data-mining method by correctly performing the preceding steps.
  13. Knowledge Discovery Process, Step 8: Interpretation & Testing
     • Interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration.
     • This step can also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models.
  14. Knowledge Discovery Process, Step 9: Knowledge Presentation
     • Using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties.
     • This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
  15. Data Warehousing
     • A platform for online analytical processing (OLAP)
     • Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis
     • Sometimes loosely called "data marts" (strictly, a data mart is a smaller, subject-specific subset of a warehouse)
     • A critical component of the decision support system (DSS) of enterprises
     • Some typical DW queries:
       – Which item sells best in each region that has retail outlets?
       – Which advertising strategy is best for Dubai markets?
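The first sample query could be answered along the following lines. This is a minimal sketch that assumes sales facts are available as (region, item, units) tuples; the data and the function name `best_item_per_region` are invented for illustration, and a real warehouse would normally run this as a grouped aggregate query instead.

```python
from collections import defaultdict


def best_item_per_region(sales):
    """Sum units per (region, item) and return the top item for each region."""
    totals = defaultdict(lambda: defaultdict(int))
    for region, item, units in sales:
        totals[region][item] += units
    # On a tie, max() keeps whichever item was seen first.
    return {region: max(items, key=items.get) for region, items in totals.items()}


# Hypothetical sales facts: (region, item, units sold).
sales = [
    ("North", "tea", 10),
    ("North", "coffee", 7),
    ("South", "tea", 8),
    ("South", "coffee", 4),
    ("North", "coffee", 5),
]
best = best_item_per_region(sales)
```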
  16. Data Warehousing
     [Figure: data flow diagram with labels OLTP, Inventory, Data Cleaning, Data Warehouse (OLAP)]
  17. Data Cleaning
     • Performs logical transformation of transactional data to suit the data warehouse
     • Model of operations → model of enterprise
     • Usually a semi-automatic process
     [Figure: schema diagram mapping operational tables into the data warehouse; labels shown: Orders, Order_id, Customers, Price, Products, Cust_id, Inventory, Sales, Time, Prod_id, Cust_profit, Price_change, Total_sales]
  18. Primary Tasks of Data Mining
     • Classification: finding the description of several predefined classes and classifying a data item into one of them.
     • Clustering: identifying a finite set of categories or clusters to describe the data.
     • Regression: mapping a data item to a real-valued prediction variable.
     • Dependency modeling: finding a model which describes significant dependencies between variables.
     • Deviation and change detection: discovering the most significant changes in the data.
     • Summarization: finding a compact description for a subset of data.
  19. Data Mining Algorithm Components
     • Model representation
       – Descriptions of discovered patterns
       – An overly limited representation is unable to capture data patterns; an overly powerful one has potential for overfitting
       – Examples: decision trees, rules, linear/non-linear regression and classification, nearest neighbor and case-based reasoning methods, graphical dependency models
     • Model evaluation criteria
       – How well a pattern (model) meets the goals (fit function), e.g., accuracy, novelty, etc.
  20. Data Mining Algorithm Components
     • Search method
       – Parameter search: optimization of parameters for a given model representation
       – Model search: considers a family of models
     • Different methods suit different problems. Proper problem formulation is crucial.
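The two kinds of search can be contrasted in a small sketch. Here parameter search is a grid search over the slope of a line through the origin, and model search picks between that line and a constant model by squared error on held-out data. The data, the grid, and all function names are assumptions made for illustration.

```python
def sse(predict, data):
    """Sum of squared errors of `predict` over (x, y) pairs (the fit function)."""
    return sum((predict(x) - y) ** 2 for x, y in data)


def fit_constant(data):
    """Constant model y = c, with c set to the mean of the training targets."""
    c = sum(y for _, y in data) / len(data)
    return lambda x: c


def fit_slope(data, grid):
    """Parameter search: choose the slope a in `grid` that minimizes SSE for y = a*x."""
    best_a = min(grid, key=lambda a: sse(lambda x: a * x, data))
    return lambda x: best_a * x


# Hypothetical training and held-out observations.
train = [(1, 2.1), (2, 3.9), (3, 6.2)]
holdout = [(4, 8.1), (5, 9.8)]
grid = [i / 10 for i in range(0, 41)]  # candidate slopes 0.0, 0.1, ..., 4.0

# Model search: compare a small family of fitted models on the held-out data.
candidates = {"constant": fit_constant(train), "linear": fit_slope(train, grid)}
best_name = min(candidates, key=lambda name: sse(candidates[name], holdout))
```

Scoring on held-out data instead of the training data is one way to guard against the overfitting risk of an overly powerful representation.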
  21. Data Mining Techniques
     • Descriptive
       – Clustering
       – Association
       – Sequential Analysis
     • Predictive
       – Classification: Decision Tree, Rule Induction, Neural Networks, Nearest Neighbor Classification
       – Regression
  22. Association Rule: Application
     • Supermarket shelf management
     • Goal: to identify items which are bought together (by sufficiently many customers)
     • Approach: process point-of-sale data (collected with barcode scanners) to find dependencies among items.
     • Consider the discovered rule: {Diapers, Milk … } --> {Baby food}
     • Example:
       – If a customer buys diapers and milk, then they are very likely to buy baby food.
       – So stack baby food next to diapers?
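"Bought together by sufficiently many customers" and "very likely to buy" correspond to the standard support and confidence measures of a rule. The sketch below computes both for {diapers, milk} --> {baby food} over a handful of invented baskets; the transactions are hypothetical and not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent): support of both over support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical point-of-sale baskets.
transactions = [
    {"diapers", "milk", "baby food"},
    {"diapers", "milk", "baby food", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"diapers", "baby food"},
]
rule_support = support(transactions, {"diapers", "milk", "baby food"})
rule_confidence = confidence(transactions, {"diapers", "milk"}, {"baby food"})
```

With these baskets the rule holds in 2 of 5 transactions, and in 2 of the 3 transactions that contain both diapers and milk.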
  23. Sequential Pattern Discovery: Application
     • Sequences in which customers purchase goods/services
     • Understanding long-term customer behavior enables timely promotions.
     • In point-of-sale transaction sequences:
       – Computer bookstore: (Intro to Visual C++) (Java & J2EE) --> (Perl for Dummies, PHP in 24 Hrs)
       – Athletic apparel store: (Shoes) (Racket, Racquetball) --> (Sports Jacket)
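To make the athletic apparel example concrete, the sketch below counts how many customers' purchase histories contain a pattern in order. It assumes each history is a list of baskets (sets of items) ordered by time; the histories and the function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if the pattern's itemsets occur in order, each within a later basket."""
    pos = 0
    for itemset in pattern:
        # Advance to the next basket that includes every item of this itemset.
        while pos < len(sequence) and not set(itemset) <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a strictly later basket
    return True


def pattern_support(customers, pattern):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in customers) / len(customers)


# Hypothetical purchase histories, one list of baskets per customer.
customers = [
    [{"shoes"}, {"racket", "racquetball"}, {"sports jacket"}],
    [{"shoes"}, {"sports jacket"}],
    [{"shoes", "socks"}, {"racket", "racquetball", "bag"}, {"sports jacket"}],
    [{"racket", "racquetball"}, {"shoes"}],
]
pattern = [{"shoes"}, {"racket", "racquetball"}, {"sports jacket"}]
seq_support = pattern_support(customers, pattern)
```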
  24. Clustering (Hierarchical and K-Means): Application
     • Hierarchical clustering: clusters are formed at different levels by merging clusters at a lower level.
     • K-means, shown in the figure, is a partitioning method and not a hierarchical one.
     [Figure: K-means example with K=2, scatter plots on 0-10 axes. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means]
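The K-means loop labeled in the figure on this slide (choose K initial centers, assign each object to the most similar center, update the cluster means, reassign) can be sketched as below. This is a minimal version under stated assumptions: Euclidean distance as the similarity measure, and the first K points as the "arbitrary" initial centers so that the run is reproducible. The points are invented.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Plain K-means; returns (centers, clusters) once the means stop changing."""
    centers = [points[i] for i in range(k)]  # arbitrary choice: the first k objects
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update the cluster means; an empty cluster keeps its old center.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D objects forming two visible groups.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, k=2)
```

Different initial centers can lead to different final clusters, which is why practical implementations usually try several starts.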
  25. Decision Tree Identification: Application
     Decision tree identification example (entries without a Temp value are shown with "-"):

     Outlook    Temp      Play?
     Sunny      Warm      Yes
     Sunny      -         Yes
     Overcast   Chilly    No
     Sunny      Chilly    Yes
     Cloudy     -         Yes/No
     Cloudy     Pleasant  Yes
     Overcast   Pleasant  Yes
     Overcast   -         Yes/No
     Overcast   Chilly    No
     Cloudy     Chilly    No
     Cloudy     Warm      Yes
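  26. Decision Tree Identification: Application
     [Figure: decision tree. Root (Yes/No) splits on Outlook into Cloudy (Yes/No), Overcast (Yes/No), and Sunny (Yes). Cloudy splits on Temp: Pleasant: Yes, Chilly: No, Warm: Yes. Overcast splits on Temp: Chilly: No, Pleasant: Yes.]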
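Reading slides 25 and 26 together, the tree appears to split first on Outlook and then on Temp. Under that reading (an assumption, since the figure did not survive extraction), the tree can be written as nested dictionaries and checked against the eight table entries that list both attributes. The dictionary layout and the `classify` helper are illustrative choices, not something the slides specify.

```python
# Inner nodes name the attribute they test; leaves are the predicted label.
tree = {
    "attr": "Outlook",
    "branches": {
        "Sunny": "Yes",
        "Overcast": {"attr": "Temp", "branches": {"Chilly": "No", "Pleasant": "Yes"}},
        "Cloudy": {
            "attr": "Temp",
            "branches": {"Pleasant": "Yes", "Chilly": "No", "Warm": "Yes"},
        },
    },
}


def classify(node, example):
    """Follow the branch matching the example's attribute values until a leaf."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attr"]]]
    return node


# The table entries from slide 25 that give both Outlook and Temp.
rows = [
    {"Outlook": "Sunny", "Temp": "Warm", "Play": "Yes"},
    {"Outlook": "Overcast", "Temp": "Chilly", "Play": "No"},
    {"Outlook": "Sunny", "Temp": "Chilly", "Play": "Yes"},
    {"Outlook": "Cloudy", "Temp": "Pleasant", "Play": "Yes"},
    {"Outlook": "Overcast", "Temp": "Pleasant", "Play": "Yes"},
    {"Outlook": "Overcast", "Temp": "Chilly", "Play": "No"},
    {"Outlook": "Cloudy", "Temp": "Chilly", "Play": "No"},
    {"Outlook": "Cloudy", "Temp": "Warm", "Play": "Yes"},
]
predictions = [classify(tree, row) for row in rows]
```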
  27. Major Application Areas for Data Mining (Classification)
     • Advertising
     • Bioinformatics
     • Customer Relationship Management (CRM)
     • Database Marketing
     • Fraud Detection
     • E-commerce
     • Health Care
     • Investment/Securities
     • Manufacturing, Process Control
     • Sports and Entertainment
     • Telecommunications
     • Web
  28. Major Application Areas for Data Mining: Marketing
     • Direct marketing: most major direct marketing companies are using modeling and data mining.
     • Customer segmentation: all industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
     • CRM: find other people in similar life stages and determine which customers are following similar behavior patterns.
       – Up-sell
       – Cross-sell
       – Keeping the customers for a longer period of time
       – E.g., Verizon Wireless reduced its churn rate from 2% to 1.5%
  29. Major Application Areas for Data Mining: Fraud Detection
     • Credit card fraud detection
     • Money laundering: FAIS (US Treasury)
     • Securities fraud: NASDAQ Sonar system
     • Phone fraud: AT&T, Bell Atlantic, British Telecom/MCI
     • Bio-terrorism detection at the Salt Lake Olympics 2002
  30. Major Application Areas for Data Mining: Retail
     • Sales forecasting: examining time-based patterns helps retailers make stocking decisions.
     • Database retailing: retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales.
     • Merchandise planning and allocation: when retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics.
  31. Major Application Areas for Data Mining: Banking
     • Credit card marketing: by identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs.
     • Cardholder pricing and profitability: card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers.
  32. Major Application Areas for Data Mining: Telecommunication
     • Call detail record analysis: telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
     • Customer loyalty: some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives offered by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.
  33. Major Application Areas for Data Mining: Manufacturing
     • Manufacturing: through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
     • Warranties: manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
  34. Issues and Challenges
     • Large data
       – Number of variables (features), number of cases (examples)
       – Multi-gigabyte, terabyte databases
       – Efficient algorithms, parallel processing
     • High dimensionality
       – Large number of features: exponential increase in search space
       – Potential for spurious patterns
       – Dimensionality reduction
     • Overfitting
       – Models noise in training data, rather than just the general patterns
     • Changing data, missing and noisy data
     • Use of domain knowledge
       – Utilizing knowledge on complex data relationships, known facts
     • Understandability of patterns
  35. Success Stories
     • Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
       – Won over the (manual) knowledge engineering approach
       – Provides a good detailed description of the entire process
     • Major US bank: customer attrition prediction
       – First segment customers based on financial behavior: found 3 segments
       – Build attrition models for each of the 3 segments
       – 40-50% of attritions were predicted: a factor of 18 increase
     • Targeted credit marketing: major US banks
       – Find customer segments based on 13 months of credit balances
       – Build another response model based on surveys
       – Increased response 4 times -- 2%
  36. Amitava Manna (11DCP007), Amritanshu Mehra (11DCP008), Animesh Ranjan (11DCP009)