
Chapter 1 Initial Description of Data Mining in Business



  1. 1. Chapter 1 Initial Description of Data Mining in Business Prepared by: Dr. Tsung-Nan Tsai
  2. 2. Contents <ul><li>Introduces data mining concepts </li></ul><ul><li>Presents typical business data applications </li></ul><ul><li>Explains the meaning of key concepts </li></ul><ul><li>Gives a brief overview of data mining tools </li></ul><ul><li>Outlines the remaining chapters of the book </li></ul>
  3. 3. Definition <ul><li>DATA MINING : exploration & analysis </li></ul><ul><ul><li>Refers to the analysis of the large quantities of data that are stored in computers. </li></ul></ul><ul><ul><li>by automatic means </li></ul></ul><ul><ul><li>of large quantities of data </li></ul></ul><ul><ul><li>to discover actionable patterns & rules </li></ul></ul><ul><li>Data mining is a way to use massive quantities of data that businesses generate </li></ul><ul><li>GOAL - improve marketing, sales, customer support through better understanding of customers </li></ul>
  4. 4. Retail Outlets <ul><li>Bar coding & scanning generate masses of data </li></ul><ul><ul><li>customer service (Grocery stores can quickly process purchases and accurately determine product prices) </li></ul></ul><ul><ul><li>inventory control (Determine the quantity of items of each product on hand, supply chain management) </li></ul></ul><ul><ul><li>MICROMARKETING </li></ul></ul><ul><ul><li>CUSTOMER PROFITABILITY ANALYSIS </li></ul></ul><ul><ul><li>MARKET-BASKET ANALYSIS </li></ul></ul>
  5. 5. Political Data Mining <ul><li>Grossman et al., 10/18/2004, Time, 38 </li></ul><ul><li>2004 Election </li></ul><ul><ul><li>Republicans: VoterVault </li></ul></ul><ul><ul><ul><li>From Mid-1990s </li></ul></ul></ul><ul><ul><ul><li>About 165 million voters </li></ul></ul></ul><ul><ul><ul><li>Massive get-out-the-vote drive for those expected to vote Republican </li></ul></ul></ul><ul><ul><li>Democrats: Demzilla </li></ul></ul><ul><ul><ul><li>Also about 165 million voters </li></ul></ul></ul><ul><ul><ul><li>Names typically have 200 to 400 information items </li></ul></ul></ul>
  6. 6. Medical Diagnosis <ul><li>J. Morris, Health Management Technology Nov 2004, 20, 22-24 </li></ul><ul><li>Electronic Medical Records </li></ul><ul><ul><li>Associated Cardiovascular Consultants </li></ul></ul><ul><ul><ul><li>31 physicians </li></ul></ul></ul><ul><ul><ul><li>40,000 patients per year, southern New Jersey </li></ul></ul></ul><ul><ul><li>Data mined to identify efficient medical practice </li></ul></ul><ul><ul><li>Enhance patient outcomes </li></ul></ul><ul><ul><li>Reduced medical liability insurance </li></ul></ul>
  7. 7. Mayo Clinic <ul><li>Swartz, Information Management Journal Nov/Dec 2004, 8 </li></ul><ul><li>IBM developed EMR program </li></ul><ul><ul><li>Complete records on almost 4.4 million patients. </li></ul></ul><ul><ul><li>Doctors can ask how the last 100 Mayo patients with the same gender, age, and medical history responded to particular treatments. </li></ul></ul>
  8. 8. Business Uses of Data Mining <ul><li>Toyota mined its data warehouse to determine more efficient transportation routes, reducing time-to-market by an average of 19 days. </li></ul><ul><li>Banks used data mining in soliciting credit card customers. </li></ul><ul><li>Insurance and telecommunication companies used DM to detect fraud. </li></ul><ul><li>Manufacturing firms used DM in quality control. </li></ul><ul><li>Many … </li></ul>
  9. 9. Business Uses of Data Mining <ul><li>Customer profiling </li></ul><ul><ul><ul><li>Identify the profitability of customer subsets </li></ul></ul></ul><ul><li>Targeting </li></ul><ul><ul><ul><li>Determine characteristics of most profitable customers </li></ul></ul></ul><ul><li>Market-Basket Analysis </li></ul><ul><ul><li>Determine correlation of purchases by profile (customers) </li></ul></ul><ul><ul><li>Cross-selling </li></ul></ul><ul><ul><li>Part of Customer Relationship Management </li></ul></ul>
  10. 10. What is needed to do DM? <ul><li>DM requires the identification of a problem, along with data collection that can lead to a better understanding of the market. </li></ul><ul><li>Computer models provide statistical or other means of analysis. </li></ul><ul><li>Two general types of DM studies: </li></ul><ul><ul><li>Hypothesis testing : expresses a theory about the relationship between actions and outcomes. </li></ul></ul><ul><ul><li>Knowledge discovery : no preconceived notion need be present; instead, relationships are identified by looking at the data (correlation analysis). </li></ul></ul>
  11. 11. Reasons why Data Mining is now effective <ul><li>Data are there </li></ul><ul><li>Data are warehoused (computerized) </li></ul><ul><ul><li>Walmart: 35 thousand queries per week </li></ul></ul><ul><li>Computing economically available </li></ul><ul><li>Competitive pressure </li></ul><ul><li>Commercial products available </li></ul>
  12. 12. Trends <ul><li>Every business is service </li></ul><ul><ul><li>hotel chains record your preferences </li></ul></ul><ul><ul><li>car rental companies the same </li></ul></ul><ul><ul><li>service versus price </li></ul></ul><ul><ul><ul><li>credit card companies </li></ul></ul></ul><ul><ul><ul><li>long distance providers </li></ul></ul></ul><ul><ul><ul><li>airlines </li></ul></ul></ul><ul><ul><ul><li>computer retailers </li></ul></ul></ul>
  13. 13. Trends <ul><li>Information as Product </li></ul><ul><ul><li>Custom Clothing Technology Corporation </li></ul></ul><ul><ul><ul><li>fit jeans, other clothing </li></ul></ul></ul><ul><li>INFORMATION BROKERING </li></ul><ul><ul><li>IMS - collects prescription data from pharmacies, sells to drug firms </li></ul></ul><ul><ul><li>AC Nielsen - TV </li></ul></ul>
  14. 14. Trends <ul><li>Commercial Software Available </li></ul><ul><ul><li>using statistical, artificial intelligence tools that have been developed </li></ul></ul><ul><ul><ul><li>Enterprise Miner SAS </li></ul></ul></ul><ul><ul><ul><li>Intelligent Miner IBM </li></ul></ul></ul><ul><ul><ul><li>Clementine SPSS </li></ul></ul></ul><ul><ul><ul><li>PolyAnalyst Megaputer </li></ul></ul></ul><ul><ul><ul><li>Specialty products </li></ul></ul></ul>
  15. 15. Fingerhut’s DM models <ul><li>Fingerhut used segmentation, decision tree, regression analysis, and neural modeling tools, with SAS for regression analysis and SPSS for neural networks. </li></ul><ul><li>The segmentation model combines order and basic demographic data with Fingerhut’s product offerings. </li></ul><ul><li>Neural network models were applied to mailing patterns and order-filling telephone call orders. </li></ul><ul><li>Goal: </li></ul><ul><ul><li>Create new mailings targeted at customers with the greatest potential payoff. </li></ul></ul><ul><ul><li>Create a catalog containing products that those customers are interested in, such as furniture, telephones… </li></ul></ul>
  16. 16. How Data Mining Is Being Used <ul><li>U.S. Government </li></ul><ul><ul><li>track down Oklahoma City bombers, Unabomber, many others </li></ul></ul><ul><ul><li>Treasury department - international funds transfers, money laundering </li></ul></ul><ul><ul><li>Internal Revenue Service </li></ul></ul>
  17. 17. How Data Mining Is Used <ul><li>Firefly </li></ul><ul><ul><li>asks members to rate music and movies </li></ul></ul><ul><ul><li>subscribers clustered </li></ul></ul><ul><ul><li>clusters get custom-designed recommendations </li></ul></ul>
  18. 18. Warranty Claims Routing <ul><li>Diesel engine manufacturer </li></ul><ul><ul><li>stream of warranty claims </li></ul></ul><ul><ul><li>examine each by expert </li></ul></ul><ul><ul><ul><li>determine whether charges are reasonable & appropriate </li></ul></ul></ul><ul><ul><ul><li>think of expert system to automate claims processing </li></ul></ul></ul>
  19. 19. Data mining application area <table>
     <tr><th>Application Area</th><th>Applications</th><th>Specifics</th></tr>
     <tr><td>Retailing</td><td>Affinity positioning, Cross-selling</td><td><ul><li>Position products effectively </li></ul><ul><li>Find more products for customers </li></ul></td></tr>
     <tr><td>Banking</td><td>Customer relationship management</td><td><ul><li>Identify customer value </li></ul><ul><li>Develop programs to maximize revenue </li></ul></td></tr>
     <tr><td>Credit card Management</td><td>Lift, Churn, Fraud detection</td><td><ul><li>Identify effective market segments </li></ul><ul><li>Identify likely customer turnover </li></ul></td></tr>
     <tr><td>Insurance</td><td>Fraud detection</td><td><ul><li>Identify claims meriting investigation </li></ul></td></tr>
     <tr><td>Telecommunications</td><td>Churn</td><td><ul><li>Identify likely customer turnover </li></ul></td></tr>
     <tr><td>Telemarketing</td><td>Online information</td><td><ul><li>Aid telemarketers with easy data access </li></ul></td></tr>
     <tr><td>Human Resource Management</td><td>Churn</td><td><ul><li>Identify potential employee turnover </li></ul></td></tr>
     </table>
  20. 20. Retailing <ul><li>Affinity positioning is based upon the identification of products that the same customer is likely to want. </li></ul><ul><ul><li>Cold medicine and tissues </li></ul></ul><ul><li>Cross-selling: Knowledge of products that go together can be used to market the complementary product. </li></ul><ul><ul><li>Grocery stores do that through product shelf positioning. </li></ul></ul><ul><li>Grocery stores generate mountains of cash register data. Current technology enables grocers to look at customers who have defected from a store, their purchase history, and characteristics of other potential defectors. </li></ul>
  21. 21. Cross-selling <ul><li>USAA </li></ul><ul><ul><li>insurance </li></ul></ul><ul><ul><li>doubled number of products held by average customer due to data mining </li></ul></ul><ul><ul><li>detailed records on customers </li></ul></ul><ul><ul><li>predict products they might need </li></ul></ul><ul><li>Fidelity Investments </li></ul><ul><ul><li>regression - what makes customer loyal </li></ul></ul>
  22. 22. Banking <ul><li>CRM involves the application of technology to monitor customer service, a function that is enhanced through data mining support. </li></ul><ul><li>DM applications in finance include predicting the prices of equities involving a dynamic environment with surprise information, some of which might be inaccurate … </li></ul><ul><li>Only 3% of the customers at Norwest bank provided 44% of their profits. </li></ul><ul><li>CRM products enable banks to define and identify customer and household relationships. </li></ul>
  23. 23. Retaining Good Customers <ul><li>Customer loss : </li></ul><ul><ul><li>Banks - Attrition </li></ul></ul><ul><ul><li>Cellular Phone Companies - Churn </li></ul></ul><ul><ul><ul><li>study who might leave, why </li></ul></ul></ul><ul><ul><ul><li>Southern California Gas </li></ul></ul></ul><ul><ul><ul><ul><li>customer usage, credit information </li></ul></ul></ul></ul><ul><ul><ul><ul><li>direct mail contact - most likely best billing plan </li></ul></ul></ul></ul><ul><ul><ul><ul><li>who is price sensitive </li></ul></ul></ul></ul><ul><li>Who should get incentives, whom to keep </li></ul>
  24. 24. Credit card management <ul><li>Bank credit card marketing promotions typically generate 1,000 responses to mailed solicitations – a response rate of about 1%. The rate is improved significantly through data mining analysis. </li></ul><ul><li>DM tools used by banks include credit scoring, a quantified analysis of credit applicants with respect to predictions of on-time loan repayment. (Data covering deposits, savings, loans, credit card, insurance…). </li></ul><ul><li>These credit scores can be used to make accept/reject recommendations, as well as to establish the size of a credit line. </li></ul><ul><li>ATMs could be rigged up with electronic sales pitches for products that a particular customer is likely to be interested in. </li></ul>
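The response-rate arithmetic on this slide can be sketched in a few lines. The 100,000 mailings figure is only implied by the slide (1,000 responses at about 1%), and the targeted campaign numbers are hypothetical, chosen to show how an improvement would be measured as lift.

```python
def response_rate(responses, mailed):
    """Fraction of mailed solicitations that produced a response."""
    if mailed <= 0:
        raise ValueError("mailed must be positive")
    return responses / mailed

# Figures implied by the slide: 1,000 responses at about 1% means about 100,000 mailings.
baseline = response_rate(1_000, 100_000)

# Hypothetical targeted mailing (illustrative numbers, not from the source).
targeted = response_rate(600, 20_000)

# Lift: how many times better the targeted mailing responds than the baseline.
lift = targeted / baseline
print(f"baseline={baseline:.2%} targeted={targeted:.2%} lift={lift:.1f}x")
```

A lift above 1 means the model-selected group responds at a higher rate than an untargeted mailing.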
  25. 25. Fairbank & Morris <ul><li>Credit card company’s most valuable asset: </li></ul><ul><ul><li>INFORMATION ABOUT CUSTOMERS </li></ul></ul><ul><li>Signet Banking Corporation </li></ul><ul><ul><li>obtained behavioral data from many sources </li></ul></ul><ul><ul><li>built predictive models </li></ul></ul><ul><ul><li>aggressively marketed balance transfer card </li></ul></ul><ul><li>First Union </li></ul><ul><ul><li>who will move soon - improve retention </li></ul></ul>
  26. 26. Telecommunications <ul><li>Retention of customers for telemarketing is very difficult. The phenomenon of a customer switching carriers is referred to as churn, a fundamental concept in telemarketing as well as in other fields. </li></ul><ul><li>One communications company estimated that 1/3 of churn is due to poor call quality, and up to ½ is due to poor equipment. </li></ul><ul><li>A cellular fraud prevention system monitors traffic to spot problems with faulty telephones. When a telephone begins to go bad, telemarketing personnel are alerted to contact the customer and suggest bringing the equipment in for service. </li></ul><ul><li>Another way to reduce churn is to protect customers from subscription and cloning (duplication) fraud. Fraud prevention systems provide verification that is transparent to legitimate subscribers. </li></ul>
  27. 27. Human resource management <ul><li>Business intelligence is a way to truly understand markets, competitors, and processes. </li></ul><ul><li>Software technology such as data warehouses, data marts, online analytical processing (OLAP), and data mining can be used to improve a firm’s profitability. </li></ul><ul><li>In HRM, the analysis can lead to the identification of individuals who are liable to leave the company unless additional compensation or benefits are provided. </li></ul><ul><li>HRM would identify the right people so that organizations could treat them well and retain them (reduce churn). </li></ul>
  28. 28. Methodology and Tools <ul><li>Analyzing data </li></ul><ul><li>Given management goals, and given that management can translate knowledge into action </li></ul>
  29. 29. Basic Styles <ul><li>Top-Down: HYPOTHESIS TESTING </li></ul><ul><ul><li>SUPERVISED </li></ul></ul><ul><ul><li>have a theory, experiment to prove or disprove </li></ul></ul><ul><ul><li>SCIENCE </li></ul></ul><ul><li>Bottom-Up: KNOWLEDGE DISCOVERY </li></ul><ul><ul><li>UNSUPERVISED </li></ul></ul><ul><ul><li>start with data, see new patterns </li></ul></ul><ul><ul><li>CREATIVITY </li></ul></ul>
  30. 30. Hypothesis Testing <ul><li>Generate theory </li></ul><ul><li>Determine data needed </li></ul><ul><li>Get data </li></ul><ul><li>Prepare data </li></ul><ul><li>Build computer model </li></ul><ul><li>Evaluate model results </li></ul><ul><ul><li>confirm or reject hypotheses </li></ul></ul>
  31. 31. Generate Theory <ul><li>Systematically tie different input sources together (MENTAL MODEL) </li></ul><ul><ul><li>What causes sales volume? </li></ul></ul><ul><ul><ul><li>sales rep performance </li></ul></ul></ul><ul><ul><ul><li>economy, seasonality </li></ul></ul></ul><ul><ul><ul><li>product quality, price, promotion, location </li></ul></ul></ul>
  32. 32. Generate Theory <ul><li>Brainstorm: </li></ul><ul><ul><li>diverse representatives for broad coverage of perspectives (electronic) </li></ul></ul><ul><ul><li>keep under control (keep positive) </li></ul></ul><ul><ul><li>generate testable hypotheses </li></ul></ul>
  33. 33. Define Data Needed <ul><li>Determine data needed to test hypothesis </li></ul><ul><ul><li>Lucky - query existing database </li></ul></ul><ul><ul><li>More often - gather </li></ul></ul><ul><ul><ul><li>pull together from diverse databases, survey, buy </li></ul></ul></ul>
  34. 34. Locate Data <ul><li>Usually scattered or unavailable </li></ul><ul><li>Sources: warranty claims </li></ul><ul><ul><li>point-of-sale data (cash register records) </li></ul></ul><ul><ul><li>medical insurance claims </li></ul></ul><ul><ul><li>telephone call detail records </li></ul></ul><ul><ul><li>direct mail response records </li></ul></ul><ul><ul><li>demographic data, economic data </li></ul></ul><ul><li>PROFILE: counts, summary statistics, cross-tabs, cleanup </li></ul>
  35. 35. Prepare Data for Analysis <ul><li>Summarize: too much - no discriminant information; too little - swamped with useless detail </li></ul><ul><li>Process for computer: ASCII, Spreadsheet </li></ul><ul><li>Data encoding: how data are recorded can vary - may have been collected with specific purpose </li></ul><ul><li>Textual data: avoid if possible (may need to code) </li></ul><ul><li>Missing values: missing salary - use mean? </li></ul>
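The missing-salary question on this slide can be illustrated with a minimal sketch of mean imputation. The slide only raises the mean as a possibility; the function name `impute_mean` and the sample salaries are hypothetical, and other treatments (dropping the record, using the median) may suit a given data set better.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical salary column with two missing entries.
salaries = [40000, None, 60000, 50000, None]
filled = impute_mean(salaries)
print(filled)
```

Filling with the mean keeps every record usable but shrinks the variance of the column, which is one reason the slide poses it as a question and not a rule.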
  36. 36. Build and Evaluate Model <ul><li>Build Computer Model </li></ul><ul><ul><li>Choose the appropriate modeling tools and algorithms </li></ul></ul><ul><ul><li>Training and test data sets. </li></ul></ul><ul><li>Determine if hypotheses supported </li></ul><ul><ul><li>statistical practice </li></ul></ul><ul><ul><li>test rule-based systems for accuracy </li></ul></ul><ul><li>Requires both business and analytic knowledge </li></ul>
  37. 37. SUPERVISED <ul><li>Dorn, National Underwriter Oct 18, 2004, 34,39 </li></ul><ul><li>Health care fraud </li></ul><ul><ul><li>Use statistics to identify indicators of fraud or abuse </li></ul></ul><ul><ul><li>Can rapidly sort through large databases </li></ul></ul><ul><ul><ul><li>Identify patterns different from norm </li></ul></ul></ul><ul><ul><li>Moderately successful </li></ul></ul><ul><ul><ul><li>But only effective on schemes already detected </li></ul></ul></ul><ul><ul><ul><li>To benefit firm, need to identify fraud before paying claim </li></ul></ul></ul>
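One simple way to "identify patterns different from norm" is to flag values that sit far from the average. The sketch below is an assumption for illustration only: the slide does not say which statistics were used, and the claim amounts, the function name `flag_unusual`, and the threshold of 2 standard deviations are all hypothetical.

```python
from statistics import mean, stdev

def flag_unusual(amounts, threshold=2.0):
    """Return indexes of amounts more than `threshold` sample standard
    deviations away from the mean of all amounts."""
    centre = mean(amounts)
    spread = stdev(amounts)
    if spread == 0:
        return []  # every amount is identical, nothing stands out
    return [i for i, a in enumerate(amounts)
            if abs(a - centre) / spread > threshold]

# Hypothetical claim amounts; one is far larger than the rest.
claim_amounts = [100, 110, 95, 105, 98, 102, 900, 101]
flagged = flag_unusual(claim_amounts)
print(flagged)
```

A flag like this only marks a claim for review. As the slide notes, the benefit comes from catching it before the claim is paid.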
  38. 38. Knowledge Discovery <ul><li>Machine learning? </li></ul><ul><ul><li>Usually need intelligent analyst </li></ul></ul><ul><li>Directed : explain value of some variable </li></ul><ul><li>Undirected : no dependent variable selected </li></ul><ul><ul><li>identify patterns </li></ul></ul><ul><li>Use undirected to recognize relationships; use directed to explain once found </li></ul>
  39. 39. Directed <ul><li>Goal-oriented </li></ul><ul><li>Examples: If discount applies, impact on products - </li></ul><ul><li>who is likely to purchase credit insurance? </li></ul><ul><li>Predicted profitability of new customer - what to bundle with a particular package </li></ul><ul><li>Identify sources of preclassified data </li></ul><ul><li>Prepare data for analysis </li></ul><ul><li>Build & train computer model </li></ul><ul><li>Evaluate </li></ul>
  40. 40. Identify Data Sources <ul><li>Best - existing corporate data warehouse </li></ul><ul><ul><li>data clean, verified, consistent, aggregated </li></ul></ul><ul><li>Usually need to generate </li></ul><ul><ul><li>most data in form most efficient for designed purpose </li></ul></ul><ul><ul><li>historical sales data often purged for dormant customers (but you need that information) </li></ul></ul>
  41. 41. Prepare Data <ul><li>Put in needed format for computer </li></ul><ul><li>Make consistent in meaning </li></ul><ul><li>Need to recognize what data are missing </li></ul><ul><ul><li>change in balance = new – old </li></ul></ul><ul><ul><li>add missing but known-to-be-important data </li></ul></ul><ul><li>Divide data into training, test, evaluation </li></ul><ul><li>Decide how to treat outliers </li></ul><ul><ul><li>statistically biasing, but may be most important </li></ul></ul>
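Two of the preparation steps on this slide, deriving a missing field (change in balance = new - old) and dividing the data into training, test, and evaluation sets, can be sketched as follows. The field names, the 60/20/20 proportions, and the fixed random seed are assumptions made for the example, not values from the source.

```python
import random

def add_balance_change(record):
    """Return a copy of the record with the derived field new - old."""
    derived = dict(record)
    derived["balance_change"] = record["new_balance"] - record["old_balance"]
    return derived

def split_three_ways(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle a copy of the records and cut it into three disjoint sets.
    Whatever is left after the training and test cuts becomes the evaluation set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]
    return training, test, evaluation

# Hypothetical account records.
accounts = [{"id": i, "old_balance": 100 * i, "new_balance": 100 * i + 25}
            for i in range(10)]
prepared = [add_balance_change(a) for a in accounts]
training, test, evaluation = split_three_ways(prepared)
print(len(training), len(test), len(evaluation))
```

Shuffling before cutting avoids putting, say, all of the oldest accounts in the training set.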
  42. 42. Build & Train Model <ul><li>Regression - human builds (selects IVs) </li></ul><ul><li>Automatic systems train </li></ul><ul><ul><li>give it data, let it hammer </li></ul></ul><ul><li>OVERFITTING: </li></ul><ul><ul><li>model fits the training data too closely </li></ul></ul><ul><ul><li>TEST SET: a means to evaluate model against data not used in training </li></ul></ul><ul><ul><ul><li>tune weights before using to evaluate </li></ul></ul></ul>
  43. 43. Evaluate Model <ul><li>ERROR RATE : proportion of classifications in evaluation set that were wrong </li></ul><ul><li>too little training : poor fit on training data and poor error rate </li></ul><ul><li>optimal training : good fit on both </li></ul><ul><li>too much training : great fit on training data and poor error rate </li></ul>
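The error rate defined on this slide is straightforward to compute. The sketch below uses hypothetical labels and predictions; comparing the rate on training data with the rate on the evaluation set shows the "too much training" pattern, a great fit on training data and a poor error rate.

```python
def error_rate(actual, predicted):
    """Proportion of cases where the predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute an error rate on an empty set")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical results: the model reproduces its training data perfectly...
train_error = error_rate(["churn", "stay", "stay", "churn"],
                         ["churn", "stay", "stay", "churn"])

# ...but misclassifies two of five cases it has not seen before.
actual_eval = ["churn", "stay", "stay", "churn", "stay"]
predicted_eval = ["churn", "stay", "churn", "stay", "stay"]
eval_error = error_rate(actual_eval, predicted_eval)

print(train_error, eval_error)
```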
  44. 44. Undirected Discovery <ul><li>What items sell together? Strawberries & cream </li></ul><ul><ul><li>Directed: What items sell with tofu? tabasco </li></ul></ul><ul><li>Long distance caller market segmentation </li></ul><ul><ul><li>Uniform usage - weekday & weekend, spikes on holidays </li></ul></ul><ul><ul><li>After segmentation: </li></ul></ul><ul><ul><ul><li>high & uniform except for several months of nothing </li></ul></ul></ul>
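The difference between the undirected question (what items sell together?) and the directed one (what sells with tofu?) can be sketched with simple pair counting. The baskets and the function names `pair_counts` and `sells_with` are hypothetical; real market-basket tools also weigh support and confidence, which this sketch leaves out.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Undirected view: count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical order and ignores repeats
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def sells_with(counts, item):
    """Directed view: partners of one chosen item, most frequent first."""
    partners = Counter()
    for (a, b), n in counts.items():
        if a == item:
            partners[b] += n
        elif b == item:
            partners[a] += n
    return partners.most_common()

# Hypothetical point-of-sale baskets.
baskets = [
    ["strawberries", "cream", "milk"],
    ["strawberries", "cream"],
    ["tofu", "tabasco"],
    ["tofu", "tabasco", "milk"],
    ["milk", "bread"],
]
counts = pair_counts(baskets)
print(counts.most_common(2))
print(sells_with(counts, "tofu"))
```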
  45. 45. UNSUPERVISED <ul><li>Dorn, National Underwriter Oct 18, 2004, 34,39 </li></ul><ul><li>Health care fraud </li></ul><ul><ul><li>Look at historical claim submissions </li></ul></ul><ul><ul><ul><li>Build ad hoc model to compare with current claims </li></ul></ul></ul><ul><ul><li>Assign similarity score to fraudulent claims </li></ul></ul><ul><ul><li>Predict fraud potential </li></ul></ul>
  46. 46. Undirected Process <ul><li>Identify data sources </li></ul><ul><li>Prepare data </li></ul><ul><li>Build & train computer model </li></ul><ul><li>Evaluate model </li></ul><ul><li>Apply model to new data </li></ul><ul><li>Identify potential targets for undirected </li></ul><ul><li>Generate new hypotheses to test </li></ul>
  47. 47. Generate hypotheses <ul><li>Any commonalities in data? </li></ul><ul><li>Are they useful? </li></ul><ul><ul><li>Many adults watch children’s movies </li></ul></ul><ul><ul><ul><li>chaperones are an important market segment </li></ul></ul></ul><ul><ul><ul><li>they probably make final decision </li></ul></ul></ul><ul><li>When hypothesis is generated, that determines data needed </li></ul>
  48. 48. Bank Case Study <ul><li>Directed knowledge discovery to recognize likely prospects for home equity loan </li></ul><ul><ul><ul><li>training set - current loan holders </li></ul></ul></ul><ul><ul><ul><li>developed model for propensity to borrow </li></ul></ul></ul><ul><ul><ul><li>got continuous scores, ranked customers </li></ul></ul></ul><ul><ul><ul><li>sent top 11% material </li></ul></ul></ul><ul><li>Undirected: segmented market into clusters </li></ul><ul><ul><ul><li>in one, 39% had both business & personal accounts </li></ul></ul></ul><ul><ul><ul><li>cluster had 27% of the top 11% </li></ul></ul></ul><ul><li>Hypothesis: people use home equity to start business </li></ul>
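The ranking step in this case study, scoring customers, keeping the top 11%, and checking how much of that group falls in one cluster, can be sketched as below. The scores and cluster assignments are synthetic and were chosen so the share lands near the slide's 27%. The bank's actual model and data are not given in the source.

```python
def top_percent(scores, percent):
    """Return the ids with the highest scores, keeping the given whole percentage."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return ranked[:keep]

def cluster_share(selected, cluster_of, cluster):
    """Share of the selected customers that belong to one cluster."""
    in_cluster = sum(1 for c in selected if cluster_of[c] == cluster)
    return in_cluster / len(selected)

# Synthetic propensity-to-borrow scores for 100 hypothetical customers.
scores = {f"c{i}": i / 100 for i in range(100)}

# Synthetic clusters: cluster "A" holds 23 customers, 3 of them high scorers.
cluster_of = {f"c{i}": ("A" if i < 20 or i >= 97 else "B") for i in range(100)}

selected = top_percent(scores, 11)
share = cluster_share(selected, cluster_of, "A")
print(len(selected), round(share, 2))
```

A cluster that is over-represented among the top scorers is what prompted the hypothesis on the slide about home equity loans and starting a business.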
  49. 49. Data mining products and data sets <ul><li>A good source to view current DM products is . </li></ul><ul><li>The UCI Machine Learning Repository is a source of very good data mining datasets at . </li></ul><ul><li>Weka DM software at </li></ul><ul><li>Tanagra DM software at </li></ul>