Data mining


Published on

Published in: Business
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Data mining

  1. 1. DATA MININGData that has relevance for managerial decisions is accumulating at an incredible rate due to ahost of technological advances. Electronic data capture has become inexpensive and ubiquitousas a by-product of innovations such as the internet, e-commerce, electronic banking, point-of-sale devices, bar-code readers, and intelligent machines.Such data is often stored in data warehouses and data marts specifically intended formanagement decision support. Data mining is a rapidly growing field that is concerned withdeveloping techniques to assist managers to make intelligent use of these repositories.A number of successful applications have been reported in areas such as credit rating, frauddetection, database marketing, customer relationship management, and stock marketinvestments.The field of data mining has evolved from the disciplines of statistics and artificial intelligence.DefinitionTerm for confluence of ideas from statistics and computer science (machinelearning and database methods) applied to large databases in science, engineeringand business.Gartner Group• “Data mining is the process of discovering meaningful newcorrelations, patterns and trends by sifting through large amounts ofdata stored in repositories, using pattern recognition technologies as wellas statistical and mathematical techniques.”Drivers• Market: From focus on product/service to focus on customer• IT: From focus on up-to-date balances to focus on patterns in transactions - DataWarehouses -OLAP• Dramatic drop in storage costs : Huge databases – e.g Walmart: 20 milliontransactions/day, 10 terabyte database, Blockbuster: 36 million households• Automatic Data Capture of Transactions– e.g. Bar Codes , POS devices, Mouse clicks, Locationdata (GPS, cell phones)
  2. 2. • Internet: Personalized interactions, longitudinalDataProcess1. Develop understanding of application, goals2. Create dataset for study (often from DataWarehouse)3. Data Cleaning and Preprocessing4. Data Reduction and projection5. Choose Data Mining task6. Choose Data Mining algorithms7. Use algorithms to perform task8. Interpret and iterate thru 1-7 if necessaryData Mining step 4-89. Deploy: integrate into operational systems.SEMMA Methodology• Sample from data sets, Partition into Training, Validation and Testdatasets• Explore data set statistically and graphically• Modify:Transform variables, Impute missing values• Model: fit models e.g. regression, classfication tree, neural net• Assess: Compare models using Partition, Test datasetsCustomer Relationship Management• Target Marketing• Attrition Prediction/Churn Analysis• Fraud Detection• Credit ScoringTarget marketing• Business problem: Use list of prospects for direct mailing campaign• Solution: Use Data Mining to identify most promising respondentscombining demographic and geographic data with data on past purchase
  3. 3. behavior• Benefit: Better response rate, savings in campaign costExample: Fleet Financial Group• Redesign of customer service infrastructure, including $38 million investment indata warehouse and marketing automation• Used logistic regression to predict response probabilities to home-equity productfor sample of 20,000 customer profiles from 15 million customer base• Used to predict profitable customers and customers who would be unprofitableeven if they respondChurn Analysis: Telcos• Business Problem: Prevent loss of customers, avoid adding churn-pronecustomers• Solution: Use neural nets, time series analysis to identify typical patterns oftelephone usage of likely-to-defect and likely-to-churn customers• Benefit: Retention of customers, more effective promotionsExample: IDEA CELLULAR• CHURN/Customer Profiling System implemented as part of major custom datawarehouse solution• Preventive action based on customer characteristics and known cases of churningand non-churning customers identify significant characteristics for churn• Early detection Customer Profiling Systems based on usage pattern matchingwith known cases of churn customers.Fraud Detection• Business problem: Fraud increases costs or reduces revenue• Solution: Use logistic regression, neural nets to identify characteristicsof fraudulent cases to prevent in future or prosecute more vigorously• Benefit: Increased profits by reducing undesirable customersRisk Analysis
  4. 4. • Business problem: Reduce risk of loans to delinquent customers• Solution: Use credit scoring models using discriminant analysis tocreate score functions that separate out risky customers• Benefit: Decrease in cost of bad debtsFinance• Business problem: Pricing of corporate bonds depends on severalfactors, risk profile of company , seniority of debt, dividends, priorhistory, etc.• Solution Approach: Through Data Mining , develop more accuratemodels of predicting prices.E-commerce and Internet• Collaborative Filtering• From Clicks to CustomersRecommendation Systems• Business opportunity: Users rate items (,, on the web. How to use information from other users to inferratings for a particular user?• Solution: Use of a technique known as collaborative filtering• Benefit: Increase revenues by cross selling, up sellingClicks to Customers• Business problem: 50% of Dell’s clients order their computer through the web.However, the retention rate is 0.5%, i.e. of visitors of Dell’s web page becomecustomers.• Solution Approach: Through the sequence of their clicks, cluster customers anddesign website, interventions to maximize the number of customers who eventuallybuy.• Benefit: Increase revenuesEmerging Major Data Mining applications• Spam• Bioinformatics/Genomics
  5. 5. • Medical History Data – Insurance Claims• Personalization of services in e-commerce• RF Tags• Security :– Container Shipments– Network Intrusion DetectionDATA STORAGECore Concepts• Types of Data:– Numeric• Continuous – ratio and interval• Discrete• Need for Binning– Categorical – order and unordered– Binary• Overfitting and Generalization
  6. 6. • Regularization: Penalty for model complexity• Distance• Curse of Dimensionality• Random and stratified sampling, resampling• Loss FunctionsTypical characteristics of mining data• “Standard” format is spreadsheet:– Row=observation unit, Column=variable• Many rows, many columns• Many rows moderate number of columns (e.g. tel. calls)• Many columns, moderate number of rows (e.g.genomics)• Opportunistic (often by-product of transactions)– Not from designed experiments– Often has outliers, missing dataTechniques• Supervised Techniques– Classification:• k-Nearest Neighbors, Naïve Bayes, Classification Trees• Discriminant Analysis, Logistic Regression, Neural Nets– Prediction (Estimation):• Regression, Regression Trees, k-Nearest Neighbors• Unsupervised Techniques– Cluster Analysis, Principal Components– Association Rules, Collaborative Filtering