DATA MINING

  1. 1. Data Mining. David L. Olson, James & H.K. Stuart Professor in MIS, University of Nebraska-Lincoln
  2. 2. Definition <ul><li>DATA MINING: exploration & analysis </li></ul><ul><ul><li>by automatic means </li></ul></ul><ul><ul><li>of large quantities of data </li></ul></ul><ul><ul><li>to discover actionable patterns & rules </li></ul></ul><ul><li>Data mining is a way to utilize the massive quantities of data that businesses generate </li></ul>
  3. 3. Retail Outlets <ul><li>Bar coding & Scanning generate masses of data </li></ul><ul><ul><li>customer service </li></ul></ul><ul><ul><li>inventory control </li></ul></ul><ul><ul><li>MICROMARKETING </li></ul></ul><ul><ul><li>CUSTOMER PROFITABILITY ANALYSIS </li></ul></ul><ul><ul><li>MARKET BASKET ANALYSIS </li></ul></ul>
  4. 4. FINGERHUT <ul><li>Founded 1948 </li></ul><ul><ul><li>today sends out 130 different catalogs </li></ul></ul><ul><ul><li>to over 65 million customers </li></ul></ul><ul><ul><li>6 terabyte data warehouse </li></ul></ul><ul><ul><li>3000 variables of 12 million most active customers </li></ul></ul><ul><ul><li>over 300 predictive models </li></ul></ul><ul><li>Focused marketing </li></ul>
  5. 5. Fingerhut <ul><li>Purchased by Federated Department Stores for $1.7 billion in 1999 (for database) </li></ul><ul><li>Fingerhut had $1.6 to $2 billion business per year, targeted at lower-income households </li></ul><ul><li>Can mail 400,000 packages per day </li></ul><ul><li>Each product line has its own catalog </li></ul>
  6. 6. Fingerhut <ul><li>Uses segmentation, decision tree, regression, and neural network tools from SAS and SPSS </li></ul><ul><li>Segmentation - combines order & demographic data with product offerings </li></ul><ul><ul><li>can target mailings for the greatest payoff </li></ul></ul><ul><ul><ul><li>customers who had recently moved tripled their purchasing in the 12 weeks after the move </li></ul></ul></ul><ul><ul><ul><li>send them furniture, telephone, and decoration catalogs </li></ul></ul></ul>
  7. 7. Data for SEGMENTATION <ul><li>cluster indices </li></ul><ul><li>subject | age | income | marital | grocery | dine out | savings </li></ul><ul><li>1001 | 53 | 80000 | wife | 180 | 90 | 30000 </li></ul><ul><li>1002 | 48 | 120000 | husband | 120 | 110 | 20000 </li></ul><ul><li>1003 | 32 | 90000 | single | 30 | 160 | 5000 </li></ul><ul><li>1004 | 26 | 40000 | wife | 80 | 40 | 0 </li></ul><ul><li>1005 | 51 | 90000 | wife | 110 | 90 | 20000 </li></ul><ul><li>1006 | 59 | 150000 | wife | 160 | 120 | 30000 </li></ul><ul><li>1007 | 43 | 120000 | husband | 140 | 110 | 10000 </li></ul><ul><li>1008 | 38 | 160000 | wife | 80 | 130 | 15000 </li></ul><ul><li>1009 | 35 | 70000 | single | 40 | 170 | 5000 </li></ul><ul><li>1010 | 27 | 50000 | wife | 130 | 80 | 0 </li></ul>
  8. 8. Initial Look at Data <ul><li>Want to know features of those who spend a lot dining out </li></ul><ul><li>INCLUDE AS MANY ACTIONABLE VARIABLES AS POSSIBLE </li></ul><ul><ul><li>things you can identify </li></ul></ul><ul><li>Manipulate data </li></ul><ul><ul><li>sort on most likely indicator ( dine out ) </li></ul></ul>
  9. 9. Sorted by Dine Out <ul><li>cluster indices </li></ul><ul><li>subject | age | income | marital | grocery | dine out | savings </li></ul><ul><li>1004 | 26 | 40000 | wife | 80 | 40 | 0 </li></ul><ul><li>1010 | 27 | 50000 | wife | 130 | 80 | 0 </li></ul><ul><li>1001 | 53 | 80000 | wife | 180 | 90 | 30000 </li></ul><ul><li>1005 | 51 | 90000 | wife | 110 | 90 | 20000 </li></ul><ul><li>1002 | 48 | 120000 | husband | 120 | 110 | 20000 </li></ul><ul><li>1007 | 43 | 120000 | husband | 140 | 110 | 10000 </li></ul><ul><li>1006 | 59 | 150000 | wife | 160 | 120 | 30000 </li></ul><ul><li>1008 | 38 | 160000 | wife | 80 | 130 | 15000 </li></ul><ul><li>1003 | 32 | 90000 | single | 30 | 160 | 5000 </li></ul><ul><li>1009 | 35 | 70000 | single | 40 | 170 | 5000 </li></ul>
  10. 10. Analysis <ul><li>Best indicators </li></ul><ul><ul><li>marital status </li></ul></ul><ul><ul><li>groceries </li></ul></ul><ul><li>Available </li></ul><ul><ul><li>marital status might be easier to get </li></ul></ul>
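The sort-and-inspect step on the segmentation slides can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the "Data for SEGMENTATION" slide, while the helper names (`sort_by_dine_out`, `mean_dine_out_by_marital`, `pearson`) are hypothetical and not from the deck.

```python
from collections import defaultdict

# Rows from the segmentation slide:
# (subject, age, income, marital, grocery, dine_out, savings)
ROWS = [
    (1001, 53, 80000, "wife", 180, 90, 30000),
    (1002, 48, 120000, "husband", 120, 110, 20000),
    (1003, 32, 90000, "single", 30, 160, 5000),
    (1004, 26, 40000, "wife", 80, 40, 0),
    (1005, 51, 90000, "wife", 110, 90, 20000),
    (1006, 59, 150000, "wife", 160, 120, 30000),
    (1007, 43, 120000, "husband", 140, 110, 10000),
    (1008, 38, 160000, "wife", 80, 130, 15000),
    (1009, 35, 70000, "single", 40, 170, 5000),
    (1010, 27, 50000, "wife", 130, 80, 0),
]


def sort_by_dine_out(rows):
    """Sort ascending on the dine-out column (index 5); ties keep input order."""
    return sorted(rows, key=lambda r: r[5])


def mean_dine_out_by_marital(rows):
    """Average dine-out spending for each marital category."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[3]].append(r[5])
    return {status: sum(v) / len(v) for status, v in groups.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


sorted_rows = sort_by_dine_out(ROWS)
by_marital = mean_dine_out_by_marital(ROWS)
grocery_vs_dine = pearson([r[4] for r in ROWS], [r[5] for r in ROWS])

print([r[0] for r in sorted_rows])
print(by_marital)
print(round(grocery_vs_dine, 2))
```

On these ten rows the single subjects have the highest average dine-out spending, and grocery spending moves in the opposite direction to dining out, which is consistent with the two indicators the Analysis slide names.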
  11. 11. Fingerhut <ul><li>Mailstream optimization </li></ul><ul><ul><li>identifies which customers are most likely to respond to existing catalog mailings </li></ul></ul><ul><ul><li>saves nearly $3 million per year </li></ul></ul><ul><ul><li>reversed the trend of the catalog sales industry in 1998 </li></ul></ul><ul><ul><li>reduced mailings by 20% while increasing net earnings to over $37 million </li></ul></ul>
  12. 12. Banking <ul><li>Among the first users of data mining </li></ul><ul><li>Used to find out what motivates their customers (reduce churn) </li></ul><ul><li>Loan applications </li></ul><ul><li>Target marketing </li></ul><ul><li>Norwest: 3% of customers provided 44% of profits </li></ul><ul><li>Bank of America: program cultivating the top 10% of customers </li></ul>
  13. 13. CREDIT SCORING <ul><li>Bank Loan Applications </li></ul><ul><li>Age | Income | Assets | Debts | Want | On-time </li></ul><ul><li>24 | 55557 | 27040 | 48191 | 1500 | 1 </li></ul><ul><li>20 | 17152 | 11090 | 20455 | 400 | 1 </li></ul><ul><li>20 | 85104 | 0 | 14361 | 4500 | 1 </li></ul><ul><li>33 | 40921 | 91111 | 90076 | 2900 | 1 </li></ul><ul><li>30 | 76183 | 101162 | 114601 | 1000 | 1 </li></ul><ul><li>55 | 80149 | 511937 | 21923 | 1000 | 1 </li></ul><ul><li>28 | 26169 | 47355 | 49341 | 3100 | 0 </li></ul><ul><li>20 | 34843 | 0 | 21031 | 2100 | 1 </li></ul><ul><li>20 | 52623 | 0 | 23054 | 15900 | 0 </li></ul><ul><li>39 | 59006 | 195759 | 161750 | 600 | 1 </li></ul>
  14. 14. Characteristics of Not On-time <ul><li>Age | Income | Assets | Debts | Want | On-time </li></ul><ul><li>28 | 26169 | 47355 | 49341 | 3100 | 0 </li></ul><ul><li>20 | 52623 | 0 | 23054 | 15900 | 0 </li></ul><ul><li>Here, debts exceed assets </li></ul><ul><li>Age: young </li></ul><ul><li>Income: low </li></ul><ul><li>BETTER: base conclusions on statistics from a large sample </li></ul><ul><li>supplement the data with other relevant variables </li></ul>
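A quick check of the "debts exceed assets" observation against all ten loan applications shows why the slide asks for statistics on a large sample. The sketch below uses the rows from the CREDIT SCORING slide; the function name `debts_exceed_assets` and the variable names are hypothetical.

```python
# Rows from the loan application slide:
# (age, income, assets, debts, want, on_time)
APPLICATIONS = [
    (24, 55557, 27040, 48191, 1500, 1),
    (20, 17152, 11090, 20455, 400, 1),
    (20, 85104, 0, 14361, 4500, 1),
    (33, 40921, 91111, 90076, 2900, 1),
    (30, 76183, 101162, 114601, 1000, 1),
    (55, 80149, 511937, 21923, 1000, 1),
    (28, 26169, 47355, 49341, 3100, 0),
    (20, 34843, 0, 21031, 2100, 1),
    (20, 52623, 0, 23054, 15900, 0),
    (39, 59006, 195759, 161750, 600, 1),
]


def debts_exceed_assets(row):
    """True when the applicant owes more than they own."""
    return row[3] > row[2]


late = [r for r in APPLICATIONS if r[5] == 0]
on_time = [r for r in APPLICATIONS if r[5] == 1]

# How many applicants in each group satisfy the candidate rule?
late_flagged = sum(debts_exceed_assets(r) for r in late)
on_time_flagged = sum(debts_exceed_assets(r) for r in on_time)

print(f"late payers flagged: {late_flagged} of {len(late)}")
print(f"on-time payers flagged: {on_time_flagged} of {len(on_time)}")
```

Both late payers are flagged, but so are five of the eight on-time payers, so the rule alone would reject many good customers. Ten rows are far too few to settle the question either way.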
  15. 15. CHURN <ul><li>Customer turnover </li></ul><ul><li>critical to: </li></ul><ul><ul><li>telecommunications </li></ul></ul><ul><ul><li>banks </li></ul></ul><ul><ul><li>human resource management </li></ul></ul><ul><ul><li>retailers </li></ul></ul>
  16. 16. Identify characteristics of those who leave <ul><li>Age Time-job Time-town min bal checking savings card loan </li></ul><ul><li>years months months $ </li></ul><ul><li>27 12 12 549 x x </li></ul><ul><li>41 18 41 3259 x x x </li></ul><ul><li>28 9 15 286 x x </li></ul><ul><li>55 301 5 2854 x x x </li></ul><ul><li>43 18 18 1112 x x x </li></ul><ul><li>29 6 3 0 x </li></ul><ul><li>38 55 20 321 x x x </li></ul><ul><li>63 185 3 2175 x x x </li></ul><ul><li>26 15 15 386 x x </li></ul><ul><li>46 13 12 1187 x x x </li></ul><ul><li>37 32 25 1865 x x x </li></ul>
  17. 17. Analysis <ul><li>What are the characteristics of those who leave? </li></ul><ul><ul><li>Correlation analysis </li></ul></ul><ul><li>Which customers do you want to keep? </li></ul><ul><ul><li>Customer value - net present value of customer to the firm </li></ul></ul>
  18. 18. Correlation <ul><li> | Age | Time-Job | Time-Town | Min-Bal | Check | Saving | Card | Loan </li></ul><ul><li>Age | 1.0 | 0.6 | 0.4 | -0.4 | 0.0 | 0.4 | 0.2 | 0.3 </li></ul><ul><li>Time-Job | | 1.0 | 0.9 | -0.6 | 0.1 | 0.6 | 0.9 | -0.2 </li></ul><ul><li>Time-Town | | | 1.0 | -0.5 | -0.1 | 0.3 | 0.5 | 0.4 </li></ul><ul><li>Min-Bal | | | | 1.0 | -0.2 | 0.3 | 0.6 | -0.1 </li></ul><ul><li>Check | | | | | 1.0 | 0.5 | 0.2 | 0.2 </li></ul><ul><li>Saving | | | | | | 1.0 | 0.9 | 0.3 </li></ul><ul><li>Card | | | | | | | 1.0 | 0.5 </li></ul><ul><li>Loan | | | | | | | | 1.0 </li></ul>
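A correlation matrix like the one on this slide can be computed with a few lines of Python. The sketch below uses only the four numeric columns that survive from the churn data slide (age, time on job, time in town, minimum balance); the account-type columns lost their positions in this transcript and are left out. The coefficients it produces need not match the slide's matrix, whose underlying data are not shown. All function names are hypothetical.

```python
# Numeric columns from the churn slide:
# (age in years, time on job in months, time in town in months, min balance in $)
CUSTOMERS = [
    (27, 12, 12, 549),
    (41, 18, 41, 3259),
    (28, 9, 15, 286),
    (55, 301, 5, 2854),
    (43, 18, 18, 1112),
    (29, 6, 3, 0),
    (38, 55, 20, 321),
    (63, 185, 3, 2175),
    (26, 15, 15, 386),
    (46, 13, 12, 1187),
    (37, 32, 25, 1865),
]
NAMES = ["age", "time_job", "time_town", "min_bal"]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def correlation_matrix(rows):
    """Return a square list-of-lists of pairwise correlations between columns."""
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]


matrix = correlation_matrix(CUSTOMERS)
for name, row in zip(NAMES, matrix):
    print(f"{name:>10}", " ".join(f"{value:6.2f}" for value in row))
```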
  19. 19. Mortgage Market <ul><li>Early 1990s - massive refinancing </li></ul><ul><li>need to keep customers happy to retain them </li></ul><ul><li>contact current customers who have rates significantly higher than market </li></ul><ul><ul><li>a major change in practice </li></ul></ul><ul><ul><li>data mining & telemarketing increased Crestar Mortgage’s retention rate from 8% to over 20% </li></ul></ul>
  20. 20. Banking <ul><li>Fleet Financial Group </li></ul><ul><ul><li>$30 million data warehouse </li></ul></ul><ul><ul><li>hired 60 database marketers, statistical/quantitative analysts & DSS specialists </li></ul></ul><ul><ul><li>expect to add $100 million in profit by 2001 </li></ul></ul>
  21. 21. Banking <ul><li>First Union </li></ul><ul><ul><li>concentrated on contact-point </li></ul></ul><ul><ul><li>previously had very focused product groups, little coordination </li></ul></ul><ul><ul><li>Developed offers for customers </li></ul></ul>
  22. 22. CREDIT SCORING <ul><li>Data warehouse including demand deposits, savings, loans, credit cards, insurance, annuities, retirement programs, securities underwriting, other </li></ul><ul><li>Statistical & mathematical models ( regression ) to predict repayment </li></ul>
  23. 23. CUSTOMER RELATIONSHIP MANAGEMENT ( CRM ) <ul><li>understanding value customer provides to firm </li></ul><ul><ul><li>Kathleen Khirallah - The Tower Group </li></ul></ul><ul><ul><ul><li>Banks will spend $9 billion on CRM by end of 1999 </li></ul></ul></ul><ul><ul><li>Deloitte </li></ul></ul><ul><ul><ul><li>only 31% of senior bank executives confident that their current distribution mix anticipated customer needs </li></ul></ul></ul>
  24. 24. Customer Value <ul><li>Middle aged (41-55), 3-9 years on job, 3-9 years in town, savings account </li></ul><ul><li>year | annual purchases | profit | discounted (1.3 rate) | net </li></ul><ul><li>1 | 1000 | 200 | 153 | 153 </li></ul><ul><li>2 | 1000 | 200 | 118 | 272 </li></ul><ul><li>3 | 1000 | 200 | 91 | 363 </li></ul><ul><li>4 | 1000 | 200 | 70 | 433 </li></ul><ul><li>5 | 1000 | 200 | 53 | 487 </li></ul><ul><li>6 | 1000 | 200 | 41 | 528 </li></ul><ul><li>7 | 1000 | 200 | 31 | 560 </li></ul><ul><li>8 | 1000 | 200 | 24 | 584 </li></ul><ul><li>9 | 1000 | 200 | 18 | 603 </li></ul><ul><li>10 | 1000 | 200 | 14 | 618 </li></ul>
  25. 25. Younger Customer <ul><li>Young (21-29), 0-2 years on job, 0-2 years in town, no savings account </li></ul><ul><li>year | annual purchases | profit | discounted (1.3 rate) | net </li></ul><ul><li>1 | 300 | 60 | 46 | 46 </li></ul><ul><li>2 | 360 | 72 | 43 | 89 </li></ul><ul><li>3 | 432 | 86 | 39 | 128 </li></ul><ul><li>4 | 518 | 104 | 36 | 164 </li></ul><ul><li>5 | 622 | 124 | 34 | 198 </li></ul><ul><li>6 | 746 | 149 | 31 | 229 </li></ul><ul><li>7 | 896 | 179 | 29 | 257 </li></ul><ul><li>8 | 1075 | 215 | 26 | 284 </li></ul><ul><li>9 | 1290 | 258 | 24 | 308 </li></ul><ul><li>10 | 1548 | 310 | 22 | 331 </li></ul>
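The two customer value tables appear to follow a simple net present value calculation: each year's profit is divided by 1.3 raised to the year number, and the results are accumulated. The sketch below reproduces that arithmetic. The 20% profit margin and the 20% yearly purchase growth for the younger customer are assumptions inferred from the table values, not stated on the slides, and the function name is hypothetical.

```python
def discounted_profits(profits, factor=1.3):
    """Return (year, discounted profit, running total) for each year.

    Year t's profit is divided by factor**t, using the slides' 1.3 rate.
    """
    rows, total = [], 0.0
    for year, profit in enumerate(profits, start=1):
        discounted = profit / factor ** year
        total += discounted
        rows.append((year, discounted, total))
    return rows


MARGIN = 0.20  # assumption: profit / purchases = 200 / 1000 in the tables
GROWTH = 1.20  # assumption: younger customer's purchases grow 20% a year

established = [1000 * MARGIN] * 10
young = [300 * GROWTH ** t * MARGIN for t in range(10)]

established_rows = discounted_profits(established)
young_rows = discounted_profits(young)

for label, rows in (("established", established_rows), ("young", young_rows)):
    year, _, total = rows[-1]
    print(f"{label}: value after {year} years = {total:.1f}")
```

The ten-year totals come out near 618 and 331, in line with the last rows of the two tables. The slide figures differ from exact values by up to one unit because they are shown as whole numbers.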
  26. 26. Credit Card Management <ul><li>Very profitable industry </li></ul><ul><li>Card surfing - pay old balance with new card </li></ul><ul><li>promotions typically generate 1000 responses, about 1% </li></ul><ul><li>in early 1990s, almost all mass-marketing </li></ul><ul><li>data mining improves response rates (lift) </li></ul>
  27. 27. LIFT <ul><li>LIFT = probability in class by sample divided by probability in class by population </li></ul><ul><ul><li>if population probability is 20% and </li></ul></ul><ul><ul><li>sample probability is 30%, </li></ul></ul><ul><ul><li>LIFT = 0.3/0.2 = 1.5 </li></ul></ul><ul><li>best lift not necessarily best </li></ul><ul><ul><li>need sufficient sample size </li></ul></ul><ul><ul><li>as confidence increases, longer list but lower lift </li></ul></ul>
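The lift definition on this slide is a single ratio, shown here as a small Python helper using the slide's own numbers. The function name `lift` is hypothetical.

```python
def lift(sample_rate, population_rate):
    """Lift = response rate in the selected sample / response rate in the population."""
    if population_rate <= 0:
        raise ValueError("population_rate must be positive")
    return sample_rate / population_rate


# Slide example: 30% of the sample is in the class versus 20% of the population.
example = lift(0.30, 0.20)
print(round(example, 2))
```

A lift above 1 means the selected sample contains a higher share of the target class than the population as a whole.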
  28. 28. Lift Example <ul><li>Product to be promoted </li></ul><ul><li>Sampled over 10 identifiable segments of potential buying population </li></ul><ul><ul><li>Profit $50 per item sold </li></ul></ul><ul><ul><li>Mailing cost $1 </li></ul></ul><ul><ul><li>Sorted by Estimated response rates </li></ul></ul>
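The data behind this example did not survive in the transcript, so the sketch below uses made-up numbers purely to show the mechanics. The $50 profit per sale and $1 mailing cost come from the slide; the segment size, the responder counts, the function names, and the reading of the $50 as profit before mailing cost are all hypothetical assumptions.

```python
PROFIT_PER_SALE = 50   # from the slide
COST_PER_MAILING = 1   # from the slide

# Hypothetical: 10 segments of 1000 prospects each, already sorted by
# estimated response rate. Values are responders per segment if mailed.
SEGMENT_SIZE = 1000
RESPONDERS = [100, 80, 60, 40, 30, 20, 15, 10, 5, 2]


def cumulative_profit(responders, segment_size=SEGMENT_SIZE):
    """Running profit when mailing the best 1, 2, ... segments."""
    totals, running = [], 0
    for sold in responders:
        running += sold * PROFIT_PER_SALE - segment_size * COST_PER_MAILING
        totals.append(running)
    return totals


def cumulative_lift(responders, segment_size=SEGMENT_SIZE):
    """Lift of the best 1, 2, ... segments relative to mailing everyone."""
    overall = sum(responders) / (segment_size * len(responders))
    lifts, sold_so_far = [], 0
    for depth, sold in enumerate(responders, start=1):
        sold_so_far += sold
        lifts.append((sold_so_far / (segment_size * depth)) / overall)
    return lifts


profits = cumulative_profit(RESPONDERS)
lifts = cumulative_lift(RESPONDERS)
best_depth = profits.index(max(profits)) + 1  # first depth reaching the maximum

print(profits)
print([round(x, 2) for x in lifts])
print(best_depth)
```

With these made-up figures, a segment is worth mailing only while its response rate is above the break-even rate of $1 / $50 = 2%. Cumulative profit peaks before the whole list is mailed, even though lift keeps falling toward 1 as the list grows.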
  29. 29. Lift Data
  30. 30. Lift Chart
  31. 31. Profit Impact
  32. 32. INSURANCE <ul><li>Marketing, as retailing & banking </li></ul><ul><li>Special: </li></ul><ul><ul><li>Farmers Insurance Group - underwriting system generating $ millions in higher revenues, lower claims </li></ul></ul><ul><ul><ul><li>7 databases, 35 million records </li></ul></ul></ul><ul><ul><li>better understanding of market niches </li></ul></ul><ul><ul><ul><li>lower rates on sports cars, increasing business </li></ul></ul></ul>
  33. 33. Insurance Fraud <ul><li>Specialist criminals - multiple personas </li></ul><ul><li>InfoGlide specializes in fraud detection products </li></ul><ul><ul><li>similarity search engine </li></ul></ul><ul><ul><ul><li>link names, telephone numbers, streets, birthdays, variations </li></ul></ul></ul><ul><ul><ul><li>identify 7 times more fraud than exact-match systems </li></ul></ul></ul>
  34. 34. Insurance Fraud - Link Analysis <ul><li>claim type | amount | physician | attorney </li></ul><ul><li>back | 50000 | Welby | McBeal </li></ul><ul><li>neck | 80000 | Frank | Jones </li></ul><ul><li>arm | 40000 | Barnard | Fraser </li></ul><ul><li>neck | 80000 | Frank | Jones </li></ul><ul><li>leg | 30000 | Schmidt | Mason </li></ul><ul><li>multiple | 120000 | Heinrich | Feiffer </li></ul><ul><li>neck | 80000 | Frank | Jones </li></ul><ul><li>back | 60000 | Schwartz | Nixon </li></ul><ul><li>arm | 30000 | Templer | White </li></ul><ul><li>internal | 180000 | Weiss | Richards </li></ul>
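A very simple form of link analysis on this table is to count how often the same physician and attorney appear together. The sketch below does only that, using the slide's rows; it is not how any commercial fraud product works, and the function name `repeated_links` is hypothetical.

```python
from collections import Counter

# Rows from the link analysis slide: (claim type, amount, physician, attorney)
CLAIMS = [
    ("back", 50000, "Welby", "McBeal"),
    ("neck", 80000, "Frank", "Jones"),
    ("arm", 40000, "Barnard", "Fraser"),
    ("neck", 80000, "Frank", "Jones"),
    ("leg", 30000, "Schmidt", "Mason"),
    ("multiple", 120000, "Heinrich", "Feiffer"),
    ("neck", 80000, "Frank", "Jones"),
    ("back", 60000, "Schwartz", "Nixon"),
    ("arm", 30000, "Templer", "White"),
    ("internal", 180000, "Weiss", "Richards"),
]


def repeated_links(claims, min_count=2):
    """Return physician/attorney pairs that co-occur on at least min_count claims."""
    counts = Counter((physician, attorney) for _, _, physician, attorney in claims)
    return {pair: n for pair, n in counts.items() if n >= min_count}


suspicious = repeated_links(CLAIMS)
print(suspicious)
```

Here the only repeated pair is Frank and Jones, who appear together on three identical neck claims. A repeated pair is a prompt for investigation, not proof of fraud.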
  35. 35. Insurance Fraud <ul><li>Analytics’ NetMap for Claims </li></ul><ul><ul><li>uses industry-wide database </li></ul></ul><ul><ul><li>creates data mart of internal, external data </li></ul></ul><ul><ul><li>unusual activity for specific chiropractors, attorneys </li></ul></ul><ul><li>HNC Insurance Solutions </li></ul><ul><ul><li>workers compensation fraud </li></ul></ul><ul><li>VeriComp - predictive software ( neural nets ) </li></ul><ul><ul><li>saved Utah over $2 million </li></ul></ul>
  36. 36. TELECOMMUNICATIONS <ul><li>Deregulation - widespread competition </li></ul><ul><ul><li>churn </li></ul></ul><ul><ul><ul><li>1/3rd poor call quality, 1/2 poor equipment </li></ul></ul></ul><ul><ul><li>wireless performance monitor tracking </li></ul></ul><ul><ul><ul><li>reduced churn about 61%, $580,000/year </li></ul></ul></ul><ul><ul><li>cellular fraud prevention </li></ul></ul><ul><ul><li>spot problems when cell phones begin to go bad </li></ul></ul>
  37. 37. Telecommunications <ul><li>Metapath’s Communications Enterprise Operating System </li></ul><ul><ul><li>help identify telephone customer problems </li></ul></ul><ul><ul><ul><li>dropped calls, mobility patterns, demographics </li></ul></ul></ul><ul><ul><ul><li>to target specific customers </li></ul></ul></ul><ul><ul><li>reduce subscription fraud </li></ul></ul><ul><ul><ul><li>$1.1 billion </li></ul></ul></ul><ul><ul><li>reduce cloning fraud </li></ul></ul><ul><ul><ul><li>cost $650 million in 1996 </li></ul></ul></ul>
  38. 38. Telecommunications <ul><li>Churn Prophet , ChurnAlert </li></ul><ul><ul><li>data mining to predict subscribers who cancel </li></ul></ul><ul><li>Arbor/Mobile </li></ul><ul><ul><li>set of products, including churn analysis </li></ul></ul>
  39. 39. TELEMARKETING <ul><li>MCI uses data marts to extract data on prospective customers </li></ul><ul><ul><li>typically a 2 month program </li></ul></ul><ul><ul><li>20% improvement in sales leads </li></ul></ul><ul><ul><li>multimillion investment in data marts & hardware </li></ul></ul><ul><ul><li>staff of 45 </li></ul></ul><ul><ul><li>trend spotting (which approaches specific customers like) </li></ul></ul>
  40. 40. Telemarketing <ul><li>Australian Tourist Commission </li></ul><ul><ul><li>maintained database since 1992 </li></ul></ul><ul><ul><ul><li>responses to travel inquiries on tours, hotels, airlines, travel agents, consumers </li></ul></ul></ul><ul><ul><ul><li>data mine to identify travel agents & consumers responding to various media </li></ul></ul></ul><ul><ul><ul><li>sales closure rate at 10% and up </li></ul></ul></ul><ul><ul><ul><li>lead lists faxed weekly to productive travel agents </li></ul></ul></ul>
  41. 41. Telemarketing <ul><li>Segmentation </li></ul><ul><ul><li>which customers respond to new promotions, to discounts, to new product offers </li></ul></ul><ul><ul><li>Determine who </li></ul></ul><ul><ul><ul><li>to offer new service to </li></ul></ul></ul><ul><ul><ul><li>those most likely to commit fraud </li></ul></ul></ul>
  42. 42. Human Resource Management <ul><li>Identify individuals likely to leave the company without additional compensation or benefits </li></ul><ul><li>Firm may already know 20% use 80% of offered services </li></ul><ul><ul><li>don’t know which 20% </li></ul></ul><ul><ul><li>data mining (business intelligence) can identify them </li></ul></ul><ul><li>Use the most talented people in the highest-priority (or most profitable) business units </li></ul>
  43. 43. Human Resource Management <ul><li>Downsizing </li></ul><ul><ul><li>identify right people, treat them well </li></ul></ul><ul><ul><li>track key performance indicators </li></ul></ul><ul><ul><li>data on talents, company needs, competitor requirements </li></ul></ul><ul><li>State of Mississippi’s MERLIN network </li></ul><ul><ul><li>30 databases (finance, payroll, personnel, capital projects) </li></ul></ul><ul><ul><li>Cognos Impromptu system - 230 users </li></ul></ul>
  44. 44. CASINOS <ul><li>Casino gaming is one of the richest data sets known </li></ul><ul><li>Harrah’s - incentive programs </li></ul><ul><ul><li>about 8 million customers hold Total Gold cards, used whenever the customer spends money in the casino </li></ul></ul><ul><ul><li>comprehensive data collection </li></ul></ul><ul><li>Trump’s Taj Card similar </li></ul>
  45. 45. Casinos <ul><li>Bellagio & Mandalay Bay </li></ul><ul><ul><li>strategy of luxury visits </li></ul></ul><ul><ul><li>child entertainment </li></ul></ul><ul><ul><li>change from old strategy - cheap food </li></ul></ul><ul><li>Identify high rollers - cultivate </li></ul><ul><ul><li>identify those to discourage from play </li></ul></ul><ul><ul><li>estimate lifetime value of players </li></ul></ul>
  46. 46. ARTS <ul><li>computerized box offices lead to high volumes of data </li></ul><ul><li>Identify potential consumers for shows </li></ul><ul><li>software to manage shows </li></ul><ul><ul><li>similar to airline seating chart software </li></ul></ul>
  47. 47. Research Projects <ul><li>Techniques </li></ul><ul><ul><li>Statistics (difference between data mining, conventional statistics) </li></ul></ul><ul><li>Data Management </li></ul><ul><ul><li>How to beat data into usable form </li></ul></ul><ul><li>Visualization </li></ul><ul><ul><li>Manny Parzen </li></ul></ul><ul><li>Applications </li></ul>
  48. 48. Class Projects <ul><li>Application </li></ul><ul><ul><li>Gallup: rehabilitation of drug-using women </li></ul></ul><ul><ul><li>Relationship between strengths-based counseling, success </li></ul></ul><ul><ul><li>Finding: good counselor relationship key </li></ul></ul><ul><li>Data: limited (but Gallup tends to agglomerate thousands over time) </li></ul><ul><li>Technique </li></ul><ul><ul><li>Regression </li></ul></ul><ul><ul><ul><li>On ordinal data </li></ul></ul></ul>
  49. 49. Sleep Disorder Prediction <ul><li>OSHA data </li></ul><ul><ul><li>11 Nebraska plants </li></ul></ul><ul><ul><li>Demographic data </li></ul></ul><ul><ul><li>Epworth sleepiness scale </li></ul></ul><ul><ul><li>21 sleep disorder variables </li></ul></ul><ul><li>Applied Clementine models </li></ul><ul><ul><li>Trained on 1500 </li></ul></ul><ul><ul><li>Tested 214 </li></ul></ul><ul><li>If 4 of 6 models predicted problem, assigned </li></ul><ul><ul><li>Increased prediction accuracy </li></ul></ul>
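The "4 of 6 models" rule on this slide is a voting ensemble. A minimal sketch follows, assuming that "4 of 6" means at least four of the six models must predict a problem; the function name and the sample predictions are hypothetical, and the six models themselves are not reproduced here.

```python
def ensemble_flag(predictions, min_votes=4):
    """Flag a case when at least min_votes of the model predictions are positive."""
    votes = sum(1 for p in predictions if p)
    return votes >= min_votes


# Hypothetical outputs of six models for two employees (1 = problem predicted).
case_a = [1, 1, 0, 1, 1, 0]  # four positive votes
case_b = [1, 0, 0, 1, 1, 0]  # three positive votes

print(ensemble_flag(case_a))
print(ensemble_flag(case_b))
```

Requiring agreement among several models trades some sensitivity for fewer false alarms, which is one plausible reason the combined prediction was more accurate than individual models.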
  50. 50. Test Bank Analysis <ul><li>Effort to develop on-line test </li></ul><ul><ul><li>Math department </li></ul></ul><ul><ul><li>Service course – freshmen, entire University </li></ul></ul><ul><ul><ul><li>Thousands of cases </li></ul></ul></ul><ul><li>Data manipulation problem </li></ul><ul><ul><li>Some link analysis </li></ul></ul><ul><ul><ul><li>Early prediction of performance </li></ul></ul></ul><ul><li>Identified which questions predicted results </li></ul><ul><ul><li>Used to take corrective action early </li></ul></ul>
  51. 51. Survey Mode Effects <ul><li>Gallup </li></ul><ul><li>Surveys via telephone or internet </li></ul><ul><ul><li>Effect of Interviewer </li></ul></ul><ul><li>DATA </li></ul><ul><ul><li>2,979 Internet, 900 telephone </li></ul></ul><ul><li>DATA MINING </li></ul><ul><ul><li>Decision trees & neural networks </li></ul></ul><ul><ul><li>Provided valuable information when traditional models limited by missing data </li></ul></ul>
  52. 52. IT R&D & Economy <ul><li>Proposal: more IT R&D, better economy </li></ul><ul><ul><li>Apparently not reverse </li></ul></ul><ul><li>Data: 30 thousand cases, COMPUSTAT </li></ul><ul><ul><li>28 quarter differences </li></ul></ul><ul><li>Technique: </li></ul><ul><ul><li>Decision Tree, SQL Server </li></ul></ul><ul><li>Results </li></ul><ul><ul><li>Weak </li></ul></ul><ul><ul><li>Look at more variables </li></ul></ul>
  53. 53. IT Effect on Firm Size <ul><li>IT reduces transaction costs, reducing firm size </li></ul><ul><li>IT reduces coordination costs, increasing firm size </li></ul><ul><li>DATA: </li></ul><ul><ul><li>Private fixed investment on IT </li></ul></ul><ul><ul><li>Firm size </li></ul></ul><ul><ul><li>Compustat </li></ul></ul><ul><li>DATA MINING: </li></ul><ul><ul><li>Rules </li></ul></ul><ul><li>Source of hypotheses </li></ul>
  54. 54. Online Auction Fraud Prediction <ul><li>eBay </li></ul><ul><ul><li>Over 16 million items per day </li></ul></ul><ul><ul><li>Fraud: 3,700 in 1999, 6,600 in 2000 </li></ul></ul><ul><li>Purpose: </li></ul><ul><ul><li>Predict seller fraud profile, which products </li></ul></ul><ul><li>User Trust </li></ul><ul><li>DATA: golf clubs, humidifiers </li></ul><ul><li>Initial results inconclusive – more work </li></ul>
