Data Mining
Transcript

  • 1. Data Mining: Industrial Projects and Case Studies. Kwok-Leung Tsui, Industrial and Systems Engineering, Georgia Institute of Technology
  • 2. Industrial Projects 1. AT&T business data mining 2. Inventory management in military maintenance 3. Sea cargo demand forecasting 4. SMARTRAQ project in transportation policies 5. Location problem of letterbox 6. Home improvement store shrinkage analysis 7. Hotels & resorts chain data mining 8. Used car auction sales data mining 9. Fast food restaurant call center
  • 3. Data Mining in Telecom. (Funded AT&T project) A ~$160 billion per year industry (~$70B long distance & ~$90B local) 100 million+ customers/accounts/lines >1 billion phone calls per day Book closing (estimating this month's price/usage/revenue) Budgeting (forecasting next year's price/usage/revenue) Segmentation (clustering of usage, growth, …) Cross selling (association rules) Churn (disconnect prediction & tracking) Fraud (detection of unusual usage time series behavior) Each of these problems is worth hundreds of millions of dollars
  • 4. Inventory Management in Air Force (Funded project) A contractor manages parts inventory for aircraft maintenance Characterization and forecasting of demand and lead time distributions 60,000 different parts and 500 bench locations Data tracked by an automated system Demand data not available & stockout penalty
  • 5. Data Mining in Sea Cargo Application (Funded TLIAP project) Sea cargo network optimization Contract planning & booking control Characterize & forecast sea cargo demand distribution & cost structure Improve ocean carrier and terminal operation efficiency
  • 6. SMARTRAQ Project for Transportation Policies Strategies for Metropolitan Atlanta’s Regional Transportation & Air Quality Five-year project sponsored by Transportation Dept., Federal Highway Admin., EPA, CDC, etc. Assess air quality, travel behavior, land use & transportation policies Reduce auto-dependence and vehicle emissions
  • 7. Mining of Letter Box Transaction Data Improve performance of express mail dropoff letter boxes 50,000 letter boxes & 8 months of transaction data Relate performance to important factors, e.g. region, demographics, adjacent competition, pick-up schedule Comparison with direct competitors Customer demand analysis and forecast
  • 8. Data Mining for Shrinkage Analysis in Retail Industry Inventory shrinkage costs US retailers $32 billion Shrinkage = book inventory - inventory on hand Working with a home improvement store's Loss Prevention Group Develop predictive models to relate shrinkage to important variables Extract hidden knowledge to reduce loss and improve operational efficiency
  • 9. Data Mining for Hotels and Resorts Chain Business Manage chain hotels and resorts of different scales Evaluate impact of promotional programs Forecast customer behavior in the frequent stay program Monitor performance in customer surveys Predict performance from important factors
  • 10. Data Mining of Used Car Auction Data Maintain all used car auction data from the last 20 years Provide service to customers and dealers on auction price projection Model price depreciation by year; develop methods for mileage, seasonal, and regional adjustments
  • 11. Fast Food Restaurant Call Center Centralized call center for drive through customers of over 50 chain restaurants Contractor manages call center with constraints on time to answer customers Scheduling and management of human resources Simulation and optimization algorithms Data mining and forecasting on aggregate and individual demand
  • 12. Data Mining Case Studies 1. A Medical Case Study 2. Profile Monitoring in Telecommunication 3. Letterbox Transaction Data Mining 4. A Market Analysis Case Study 5. Air Force Parts Inventory Data Mining
  • 13. More DM Case Studies (Berry & Linoff) 1. Telecommunication Data Mining 2. Churn Modeling in Wireless Industry 3. Market Basket Analysis 4. Supermarket Mining I 5. Supermarket Mining II 6. Banking and Finance
  • 14. A Medical Case Study using MTS and DM Methods A Review & Analysis of MTS (Technometrics, 2003) W. H. Woodall and R. Koudelik, Virginia Tech K.-L. Tsui and S. B. Kim, Georgia Tech Z. G. Stoumbos, Rutgers University Christos P. Carvounis, MD, State University at Stony Brook
  • 15. Primary MTS References Taguchi, G., and Rajesh, J. (2000), "New Trends in Multivariate Diagnosis," Sankhya: The Indian Journal of Statistics, 62, 233-248. Taguchi, G., Chowdhury, S., and Wu, Y. (2001), The Mahalanobis-Taguchi System, New York: McGraw Hill. Taguchi, G., and Rajesh, J. (2002), a new book on MTS.
  • 16. P.C. Mahalanobis Very influential in large-scale sample survey methods Founder of the Indian Statistical Institute in 1931 Architect of India’s industrial strategy Advisor to Nehru and friend of R.A. Fisher
  • 17. Genichi Taguchi Japanese Quality Engineer Deming prize in Japan: 4 times Rockwell Medal (1986) Citation: Combine engineering & statistical methods to achieve rapid improvements in costs and quality by optimizing product design and manufacturing processes. 1978-79: Ford / Bell Labs Teams "Discover" Method 1980: First US Experiences (Xerox / Bell Labs) 1990 - : Taguchi Methods or DOE well recognized by all industries for improving product or manufacturing process design.
  • 18. MTS is said to be … A groundbreaking new philosophy for data mining from multivariate data A process of recognizing patterns and forecasting results Used by Fuji, Nissan, Sharp, Xerox, Delphi Automotive Systems, Ford, GE and others Beyond theory Intended to create an atmosphere of excitement for management, engineering and academia.
  • 19. Applications include the following: Patient monitoring Medical diagnosis Weather and earthquake forecasting Fire detection Manufacturing inspection Clinical trials Credit scoring
  • 20. MTS Overview Similar to a classification method using a discriminant-type function. Based on multivariate observations from a "normal" and an "abnormal" group. Used to develop a scale to measure how abnormal an item is while matching a pre-specified or estimated scale. The MTS scale is used for variable selection, diagnosis, forecasting, and classification.
  • 21. MTS Procedure: Stage 1 Identify p variables, Vi , i = 1, 2, …, p that measure the “normality” of an item. Collect multivariate data on the normal group, Xj , j = 1, 2, …, m. Standardize each variable to obtain Zi vectors. Calculate the Mahalanobis distances (MD) for the m observations.
  • 22. $MD_i = \frac{1}{p} Z_i^T S^{-1} Z_i$, $i = 1, \ldots, m$, where $S$ is the sample correlation matrix of the $Z$'s for the normal group.
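A minimal Python sketch of Stages 1 and 2, directly implementing the MD formula above; the function names and the simulated data are illustrative assumptions, not part of the original case study.

```python
import numpy as np

def mts_stage1(X_normal):
    """Stage 1: standardize the normal group and compute its scaled
    Mahalanobis distances, MD_i = (1/p) * z_i' S^{-1} z_i."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0, ddof=1)
    Z = (X_normal - mean) / std                 # standardized variables
    S_inv = np.linalg.inv(np.corrcoef(Z, rowvar=False))
    p = X_normal.shape[1]
    md = np.einsum("ij,jk,ik->i", Z, S_inv, Z) / p
    return md, mean, std, S_inv

def md_for_abnormal(X_abnormal, mean, std, S_inv):
    """Stage 2: standardize abnormal items with the *normal-group*
    mean/std and compute their MD values."""
    Z = (X_abnormal - mean) / std
    return np.einsum("ij,jk,ik->i", Z, S_inv, Z) / X_abnormal.shape[1]

# Toy illustration (m = 200 normal, t = 17 abnormal, p = 5 variables)
rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 5))
abnormal = rng.normal(loc=3.0, size=(17, 5))
md_n, mean, std, S_inv = mts_stage1(normal)
md_a = md_for_abnormal(abnormal, mean, std, S_inv)
print(md_n.mean(), md_a.mean())   # abnormal MDs should be clearly larger
```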
  • 23. Stage 2 Collect data on t abnormal items, Xi, i = m + 1, m + 2, …, m + t. Standardize each variable using the normal group means and standard deviations. Calculate MD values MDi , i = m + 1, m + 2, …, m + t.
  • 24. According to the MTS, the scale is good if the MD values for the abnormal items are higher than those for the normal items (good separation).
  • 25. Stage 3 Identify the useful variables using orthogonal arrays (OAs) and signal to noise (S/N) ratios. The MTS uses a design of experiments approach as an optimization tool to choose the variables that maximize the average S/N ratio.
  • 26. Use of DOE for Variable Selection Design an OA experiment using all variables. For each row of the OA (a given set of variables) Compute MDi for each observation in abnormal groups; Determine a Mi value (the true severity level or working average) for each abnormal group; Compute S/N ratio based on MDi and Mi. Determine significant variables using main effect analysis with S/N ratio as response.
  • 27. An Example of an OA (+ = variable included; - = variable excluded)
    Run | V1 | V2 | V3 | … | V17 | S/N Ratio
    1 | + | + | + | … | + | SN1
    2 | - | + | + | … | + | SN2
    3 | + | - | + | … | + | SN3
    4 | - | - | + | … | + | SN4
    5 | + | + | - | … | + | SN5
    6 | - | + | - | … | + | SN6
    … | | | | … | |
    32 | - | - | - | … | - | SN32
  • 28. Dynamic S/N ratio (multiple abnormal groups) First regress $Y_i = \sqrt{MD_i}$ on $M_i$ to obtain the slope estimate $\hat{\beta}$, then define the S/N ratio: $10 \log \left[ \frac{1}{r} \cdot \frac{SSR - MSE}{MSE} \right] = 10 \log \left[ \frac{\hat{\beta}^2}{MSE} \right]$
  • 29. Larger-is-better S/N Ratio (single abnormal group) For $t$ abnormal observations, the larger-is-better S/N ratio is $-10 \log \left[ \frac{1}{t} \sum_{i=1}^{t} \frac{1}{MD_i} \right]$
  • 30. Main Effect Analysis Compute level averages of S/N ratios (+ and -) for each variable, $\overline{SN}_i^{+} - \overline{SN}_i^{-}$. Keep only variables with positive (significant) estimated main effects.
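Continuing the earlier sketch, the DOE-based selection step might look like the following; `runs` would be the rows of an orthogonal array such as OA32 (1 = variable included, 0 = excluded), and `mts_stage1` / `md_for_abnormal` are the hypothetical helpers defined after slide 22.

```python
import numpy as np

def sn_larger_is_better(md):
    """Larger-is-better S/N ratio: -10 log10( (1/t) * sum_i 1/MD_i )."""
    return -10 * np.log10(np.mean(1.0 / md))

def select_variables(X_normal, X_abnormal, runs):
    """For each OA run, rebuild the MD scale from the included variables
    only, score it with the S/N ratio, then keep variables whose main
    effect (average S/N when included minus excluded) is positive."""
    sn = []
    for run in runs:
        cols = np.flatnonzero(run)
        md_n, mean, std, S_inv = mts_stage1(X_normal[:, cols])
        md_a = md_for_abnormal(X_abnormal[:, cols], mean, std, S_inv)
        sn.append(sn_larger_is_better(md_a))
    sn = np.asarray(sn)
    effects = np.array([sn[runs[:, v] == 1].mean() - sn[runs[:, v] == 0].mean()
                        for v in range(runs.shape[1])])
    return np.flatnonzero(effects > 0), effects   # kept variables, main effects
```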
  • 31. Stage 4 Based on the chosen variables, use the MD scale for diagnosis and forecasting. A threshold is given such that the losses due to the two types of classification errors are balanced in some sense.
  • 32. A Medical Case Study Medical diagnosis of liver disease. 200 healthy patients and 17 unhealthy patients (10 with a mild level of disease and 7 with a medium level). Age, gender, and 15 blood test variables. (Data is made available.)
  • 33. Case Study Blood Test Variables with Normal Ranges
    Variable | Symbol | Acronym | Normal Range | Taguchi et al. (2001) Normal Range
    Total Protein in Blood | V3 | TP | 6.0-8.3 gm/dL | 6.5-7.5 gm/dL
    Albumin in Blood | V4 | Alb | 3.4-5.4 gm/dL | 3.5-4.5 gm/dL
    Cholinesterase (Pseudocholinesterase) | V5 | ChE | 8-18 U/mL (depends on technique) | 0.60-1.00 dpH
    Glutamate O Transaminase (Aspartate Aminotransferase) | V6 | GOT | 10-34 IU/L | 2-25 Units
    Glutamate P Transaminase (Alanine Transaminase) | V7 | GPT | 6-59 U/L | 0-22 Units
    Lactic Dehydrogenase | V8 | LDH | 105-333 IU/L | 130-250 Units
    Alkaline Phosphatase | V9 | Alp | 0-250 U/L normal; 250-750 U/L moderate elevation | 2.0-10.0 Units
    r-Glutamyl Transpeptidase (gamma-Glutamate Transferase) | V10 | r-GPT | 0-51 IU/L | Serum: 0-68 Units
    Leucine Aminopeptidase | V11 | LAP | Male: 80-200 U/mL; Female: 75-185 U/mL | n/a
    Total Cholesterol | V12 | TCh | <200 desirable; 200-239 borderline high; 240+ high | n/a
    Triglyceride | V13 | TG | 10-190 mg/dL | n/a
    Phospholipid | V14 | PL | Platelet: 150,000-400,000/mm3 | n/a
    Creatinine | V15 | Cr | 0.8-1.4 mg/dL | n/a
    Blood Urea Nitrogen | V16 | BUN | 7-20 mg/dL | n/a
    Uric Acid | V17 | UA | 4.1-8.8 mg/dL | n/a
  • 34. Some results and conclusions Largest MD in healthy group: 2.36. Lowest MD in unhealthy group: 7.73. Thus, there is a lot of separation between the healthy and unhealthy groups. The Mi values are estimated from averages of MD values.
  • 35. OA32 (+ = variable included; - = variable excluded): the same 32-run orthogonal array shown on slide 27.
  • 36. Average S/N ratio: all variables -6.25; MTS combination -4.27; OA optimal combination -3.34; overall optimal combination -1.76. Thus, the proposed method does not yield the optimum combination. The MTS average S/N ratio was at about the 95th percentile.
  • 37. MDs for Unhealthy Group for Various Combinations of Variables
    Subject | Disease Level | All | MTS | OA Optimal | Optimal
    1 | Mild | 7.727 | 13.937 | 8.058 | 13.329
    2 | Mild | 8.416 | 14.726 | 7.485 | 8.616
    3 | Mild | 10.291 | 17.342 | 9.498 | 8.002
    4 | Mild | 7.204 | 10.804 | 4.951 | 12.311
    5 | Mild | 10.590 | 18.379 | 9.367 | 12.042
    6 | Mild | 10.557 | 8.605 | 6.643 | 6.139
    7 | Mild | 13.317 | 13.896 | 7.794 | 6.139
    8 | Mild | 14.812 | 27.910 | 8.162 | 22.666
    9 | Mild | 15.693 | 28.110 | 10.278 | 26.000
    10 | Mild | 18.911 | 35.740 | 20.992 | 14.422
    11 | Medium | 12.610 | 20.828 | 16.517 | 20.833
    12 | Medium | 12.256 | 18.578 | 14.607 | 19.312
    13 | Medium | 19.655 | 34.127 | 35.229 | 44.614
    14 | Medium | 43.039 | 85.564 | 13.105 | 32.720
    15 | Medium | 78.639 | 74.175 | 9.560 | 28.560
    16 | Medium | 97.268 | 104.424 | 29.201 | 31.810
    17 | Medium | 135.698 | 123.022 | 44.742 | 57.226
  • 38. Plots of MDs for Unhealthy Group for Various Combinations of Variables [Figure: dotplots of MD values for the Mild and Medium groups under each variable combination (All, MTS, OA Optimal, Optimal)]
  • 39. Case Study Blood Test Variables with Normal Ranges (table repeated from slide 33)
  • 40. Variables for Unhealthy Patients Well Outside Normal Ranges
    Subject | Variables well outside normal ranges
    1 | V12, V13
    2 | none
    3 | none
    4 | V13
    5 | V10
    6 | V7
    7 | V7
    8 | V13
    9 | V12, V13
    10 | V4, V12
    11 | V10, V12
    12 | V10
    13 | V10
    14 | V10, V13
    15 | V6, V7, V13
    16 | V3, V6, V7, V10, V12
    17 | V6, V7, V8, V10, V13
  • 41. Medical Analysis V4, V6, V7, V9, and V10 are crucial for liver disease diagnosis and classification. Medical diagnosis shows that patients 15-17 exhibit some chronic liver disorder. Cluster analysis on V4, V6, V7, V9, and V10 yields only two groups; only patients 15-17 are classified as "abnormal". This result is consistent with the medical diagnosis.
  • 42. Dotplot for V4 (Alb) [Figure: Normal, Mild, and Medium groups; patients 15-17 marked]
  • 43. Dotplot for V6 (GOT) [Figure: Normal, Mild, and Medium groups; patients 15-17 marked]
  • 44. Dotplot for V7 (GPT) [Figure: Normal, Mild, and Medium groups; patients 15-17 marked]
  • 45. Dotplot for V9 (Alp) [Figure: Normal, Mild, and Medium groups; patients 15-17 marked]
  • 46. Dotplot for V10 (r-GPT) [Figure: Normal, Mild, and Medium groups; patients 15-17 marked]
  • 47. Tree Classification Methods
  • 48. Classification Trees The CART (Classification And Regression Tree) methodology is known as binary recursive partitioning. For more detail on CART, see Breiman, Friedman, Olshen, & Stone (1984): Classification and Regression Trees. C4.5 is a decision tree learning system introduced by Quinlan (Quinlan, J. Ross (1993): C4.5: Programs for Machine Learning). The software is available at: http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
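The original analysis used the S-Plus tree function and C4.5; as a rough modern stand-in, scikit-learn's CART implementation fits the same kind of tree. The data below is simulated with the case study's class sizes (200 normal / 10 mild / 7 medium), so the splits shown will not match those on the following slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulated stand-ins for the 217 x 17 liver-disease data set.
rng = np.random.default_rng(1)
X = rng.normal(size=(217, 17))
y = np.r_[np.ones(200), np.full(10, 2), np.full(7, 3)]   # 1=normal, 2=mild, 3=medium

tree = DecisionTreeClassifier(criterion="gini",   # CART-style splits
                              max_leaf_nodes=4)   # ~4 terminal nodes, as in the case study
tree.fit(X, y)
print(export_text(tree, feature_names=[f"V{i+1}" for i in range(17)]))
print("training error:", 1 - tree.score(X, y))
```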
  • 49. Tree from Splus
    Root split: V5 < 381.5
      Yes branch: split on V10 < 63
      No branch: split on V6 < 37.5
    Leaves, as printed, class (count): 2 (8), 2 (2), 1 (196), 1 (4), 3 (6), 3 (1)
  • 50. Tree from Splus Variables actually used in tree construction: V5, V10, and V6. Number of terminal nodes: 4. Misclassification error rate: 0.01382 = 3/217. Classification matrix based on the learning sample:
    Actual \ Predicted | 1 | 2 | 3
    1 | 200 | 0 | 0
    2 | 0 | 8 | 2
    3 | 1 | 0 | 6
  • 51. Tree from C4.5
    Root split: V5 <= 364
      No branch: class 1 (200)
      Yes branch: split on V10 <= 63, then on V6 <= 26
    Leaves, as printed, class (count): 1 (200), 3 (1), 2 (8), 2 (2), 3 (6)
  • 52. Tree from C4.5 Variables actually used in tree construction: V5, V10, and V6. Number of terminal nodes: 4. Misclassification error rate: 0.0046 = 1/217. Classification matrix based on the learning sample:
    Actual \ Predicted | 1 | 2 | 3
    1 | 200 | 0 | 0
    2 | 0 | 10 | 0
    3 | 1 | 0 | 6
  • 53. Scatter Plot of V5 vs. V10 vs. V6 [Figure: 3-D scatter of V5 (ChE), V10 (r-GPT), and V6 (GOT); Normal, Mild, and Medium groups; patients 15-17 marked]
  • 54. Scatter Plot of V5 vs. V6 [Figure: V6 (GOT) against V5 (ChE); patients 15-17 marked]
  • 55. Scatter Plot of V5 vs. V10 [Figure: V10 (r-GPT) against V5 (ChE); patients 15-17 marked]
  • 56. Scatter Plot of V10 vs. V6 [Figure: V6 (GOT) against V10 (r-GPT); patients 15-17 marked]
  • 57. Dotplot for V5 (ChE) [Figure: Normal, Mild, and Medium groups; patients 15-17 marked]
  • 58. Dotplot for V6 (GOT) [Figure: Normal, Mild, and Medium groups; patients 15-17 marked]
  • 59. Dotplot for V10 (r-GPT) [Figure: Normal, Mild, and Medium groups; patients 15-17 marked]
  • 60. Comparison with Taguchi Approaches All variables: V1-V17. MTS: V4, V5, V10, V12, V13, V14, V15, V17. OA Optimal: V1, V4, V5, V10, V11, V14, V15, V16, V17. Optimal: V3, V5, V10, V11, V12, V13, V17. Classification Trees: V5, V6, V10.
  • 61. MDs for Unhealthy Group for Various Combinations of Variables
    Disease Level | All | MTS | OA Optimal | Optimal | Trees
    Mild | 7.727 | 13.937 | 8.058 | 13.329 | 7.366
    Mild | 8.416 | 14.726 | 7.485 | 8.616 | 18.789
    Mild | 10.291 | 17.342 | 9.498 | 8.002 | 9.068
    Mild | 7.204 | 10.804 | 4.951 | 12.311 | 6.517
    Mild | 10.590 | 18.379 | 9.367 | 12.042 | 29.864
    Mild | 10.557 | 8.605 | 6.643 | 6.139 | 10.869
    Mild | 13.317 | 13.896 | 7.794 | 6.139 | 10.869
    Mild | 14.812 | 27.910 | 8.162 | 22.666 | 8.222
    Mild | 15.693 | 28.110 | 10.278 | 26.000 | 9.155
    Mild | 18.911 | 35.740 | 20.992 | 14.422 | 16.420
    Medium | 12.610 | 20.828 | 16.517 | 20.833 | 42.681
    Medium | 12.256 | 18.578 | 14.607 | 19.312 | 38.523
    Medium | 19.655 | 34.127 | 35.229 | 44.614 | 86.796
    Medium | 43.039 | 85.564 | 13.105 | 32.720 | 28.252
    Medium | 78.639 | 74.175 | 9.560 | 28.560 | 208.102
    Medium | 97.268 | 104.424 | 29.201 | 31.810 | 228.428
    Medium | 135.698 | 123.022 | 44.742 | 57.226 | 199.304
  • 62. [Figure: dotplots of MD values (scale 0-250) for the Mild and Medium groups under each variable combination (All, MTS, OA Optimal, Optimal, Trees)]
  • 63. Conclusion The MD values and dotplots show that only the MD scale based on the variables used by classification trees, i.e., V5, V6 and V10, does a good job discriminating between patients with mild level disease and patients with medium level disease. (Maybe MD is a good measure for multivariate data.)
  • 64. Comparison with Medical Analysis V4, V6, V7, V9, and V10 are crucial for liver disease diagnosis and classification. Medical diagnosis shows that patients 15-17 exhibit some chronic liver disorder. Cluster analysis on V4, V6, V7, V9, and V10 yields only two groups; only patients 15-17 are classified as "abnormal". This result is consistent with the medical diagnosis.
  • 65. Correlations between the variables used in the classification trees (columns) and the variables crucial for medical diagnosis (rows)
    | V5 | V6 | V10
    V4 | 0.501 | -0.505 | -0.184
    V6 | -0.370 | 1 | 0.507
    V7 | -0.365 | 0.905 | 0.485
    V9 | -0.305 | 0.197 | 0.269
    V10 | -0.189 | 0.507 | 1
  • 66. Dotplot for V4 (Alb) [Figure: repeated from slide 42]
  • 67. Dotplot for V7 (GPT) [Figure: repeated from slide 44]
  • 68. Dotplot for V9 (Alp) [Figure: repeated from slide 45]
  • 69. Case Study Summary OA & main effect analysis do not give the overall optimum. The MTS discriminant function (S/N ratios) does not separate the two unhealthy groups. The variables selected by MTS are not appropriate for detecting liver disease based on medical diagnosis. Tree methods separate the two unhealthy groups. MD may be a good distance measure for multivariate data. Results are based on the current data and training error.
  • 70. Discussions The MTS ignores considerable previous work in application areas such as medical diagnosis and classification methods. The MTS ignores sampling variation and discounts variation between units. The use of OAs cannot be justified. The MTS is not a well-defined approach. Traditional statistical approaches may work better in many cases. Despite its flaws, we expect the MTS to be used in many companies.
  • 71. Correlation (V6, V7) = 0.905 [Figure: scatter of V7 (GPT) against V6 (GOT); patients 15-17 marked]
  • 72. Correlation (V12, V14) = 0.807 [Figure: scatter of V14 (PL) against V12 (TCh); patients 15-17 marked]
  • 73. Correlation (V10, V11) = 0.646 [Figure: scatter of V11 (LAP) against V10 (r-GPT); patients 15-17 marked]
  • 74. Correlation (V13, V14) = 0.616 [Figure: scatter of V14 (PL) against V13 (TG); patients 15-17 marked]
  • 75. Correlation (V3, V4) = 0.604 [Figure: scatter of V4 (Alb) against V3 (TP); patients 15-17 marked]
  • 76. A Telecommunication Case Study A SPC Approach for Business Activity Monitoring (IIE Transactions, 2006) W. Jiang, Stevens Institute of Technology T. Au, AT&T K.-L. Tsui, Georgia Institute of Technology
  • 77. A General Framework for Modeling & Monitoring of Dynamic Systems
  • 78. Dynamic Monitoring (A General Framework)
    Problem: a profile (time-domain profile; profile with controllable predictors; profile with uncontrollable predictors)
    Objective: detection/classification; interpretation; forecasting/prediction
    Segmentation & Model Selection: segmentation known or unknown; model global without segmentation, global with segmentation, or local within segment
    Monitoring: Phase I (estimating unknown parameters); Phase II (monitoring and detecting)
    Dynamic Update: anticipated drifts vs. unanticipated changes
    Actions
  • 79. Applications Manufacturing Processes Stamping Tonnage Signal Data (functional data) Nortel’s Antenna Signal Data (functional data) Mass Flow Controller (MFC) Calibration (linear profile) Vertical Density Profile (VDP) Data (nonlinear profile) Service Operations Used Car Price Mining and Prediction Telecom. Customer Usage Hotel Performance Monitoring Fast food drive through call center forecasting & scheduling
  • 80. Manufacturing: Stamping Tonnage Signal Data Figure 2: A Tonnage Signal and Some Possible Faults (Jin and Shi 1999)
  • 81. Stamping Tonnage Signal Data Problem Time domain profile (a tonnage signal represents the stamping force in a process cycle). Objective Fault detection and classification Segmentation & Model Selection Known segmentation: most process faults occur only in specific working stages. Boundaries and sizes of segments are determined by process knowledge. (Jin and Shi 1999) Global model: wavelet transforms Monitoring For each segment, use T2 charts based on selected wavelet coefficients to conduct monitoring. (Jin and Shi 2001) Dynamic Update Classify a new signal as normal, a known fault, or a new fault, and update the wavelet coefficient selection and parameter estimates (e.g., μ, Σ) using all available data. Actions Identify and remove assignable causes.
  • 82. Service: Telecom. Customer Usage Problem Profile with uncontrollable predictors Objective Abnormal behavior detection and classification Forecasting/prediction Segmentation & Model Selection Unknown segmentation: segment customers based on demographic, geographic, psychographic and/or behavioral information. Segmental: fit model for each customer segment, e.g. linear regression. Monitoring Use the model built for each segment to monitor customer behaviors, e.g. monitor linear regression parameter vector β using T2 chart. Dynamic Update Update customer segmentation, segmental model fitting and/or parameter monitoring, e.g. parameters update based on known trend. Actions Service improvement, customer approval, etc.
  • 83. Telecom. Customer Usage Profile: profile with uncontrollable predictors Objective – Abnormal behavior detection and classification – Forecasting/prediction Segmentation – Unknown (segments are defined by customer information.) Model Selection – segmental (e.g. linear regression on uncontrollable predictors for each segment) Monitoring – Phase I: unknown control chart parameters estimated from data – Phase II: monitoring by control charts, like T2 chart, EWMA chart, etc. Dynamic Update – Update segmentation, model selection and/or parameter monitoring Actions: service improvement, customer approval, etc.
  • 84. A SPC Approach for Business Activity Monitoring Jiang, Au, and Tsui (2006), to appear in IIE Transactions
  • 85. Churn Detection via Customer Profiling Qian, Jiang, and Tsui (2006), to appear in International Journal of Production Research
  • 86. Activity Monitoring Activity monitoring for interesting events that require actions (Fawcett and Provost, 1999) Examples: Credit card or insurance fraud detection Churn modeling and detection Computer intrusion detection Network performance monitoring Objective: Trigger alarms for action accurately and as quickly as possible once activity occurs
  • 87. Activity Monitoring Profiling Approach (SPC & hypothesis testing): Characterize populations of key variables that describe normal activity Trigger alarm on activity that deviates from normal Discriminating Approach (classification): Establish models & patterns of abnormal activity w.r.t. normal Apply pattern recognition to identify abnormal activity Other Approaches: Hypothesis testing vs. classification Neural networks for SPC problems (Hwarng et al.) Apply other classification methods to SPC DOE for variable selection in discrimination Detect complex patterns in SPC
  • 88. Activity Monitoring Objective of Activity monitoring is similar to that of statistical process control (SPC) Multivariate control chart methods for continuous and attribute data may be needed More sophisticated tools are needed
  • 89. STATISTICAL PROCESS CONTROL Widely used in the manufacturing industry for variation reduction by discriminating: Common causes Assignable causes Evaluation: in-control vs. out-of-control Performance: False alarm rate Average run length (ARL) Techniques: Shewhart chart, EWMA chart, CUSUM chart
  • 90. STATISTICAL PROCESS CONTROL Two stages of implementation: Phase 1 (retrospective): off-line modeling Identify and clear outliers Estimate in-control models Phase 2 (prospective): on-line deployment Trigger out-of-control conditions Isolate and remove causes of signals
  • 91. AN EXAMPLE [Figure: Shewhart, EWMA, and CUSUM charts of the same simulated series of 100 observations]
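For reference, a small sketch of how the EWMA and CUSUM statistics behind such charts can be computed; the smoothing constant, reference value, and simulated shift are illustrative choices, not taken from the slide.

```python
import numpy as np

def ewma(x, lam=0.2):
    """EWMA statistic: z_t = lam * x_t + (1 - lam) * z_{t-1}, z_0 = 0."""
    z, prev = np.zeros(len(x)), 0.0
    for t, xt in enumerate(x):
        prev = lam * xt + (1 - lam) * prev
        z[t] = prev
    return z

def cusum(x, k=0.5):
    """One-sided upper CUSUM: C_t = max(0, C_{t-1} + x_t - k), C_0 = 0."""
    c, prev = np.zeros(len(x)), 0.0
    for t, xt in enumerate(x):
        prev = max(0.0, prev + xt - k)
        c[t] = prev
    return c

rng = np.random.default_rng(2)
x = np.r_[rng.normal(0, 1, 70), rng.normal(1.0, 1, 30)]  # mean shift at t = 70
print("EWMA tail:", ewma(x)[-5:])
print("CUSUM tail:", cusum(x)[-5:])   # grows steadily after the shift
```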
  • 92. KEY CHALLENGES TO SPC
    Off-line modeling: robust models with outliers and change points; automatic model building
    Scalability: a single algorithm tracking millions of data streams
    Importance of early signals: interpretation is mostly qualitative; sacrificing accuracy for speed is acceptable
    Diagnosis and updating: business rules
    Online fashion: incomplete data (censored and/or truncated)
  • 93. SPC Approach for CRM Monitoring: Phase 1 (automatic modeling & profiling) → Phase 2 (profile monitoring & updating) → Phase 3 (event diagnosis)
  • 94. CRM MONITORING PROCESS Business event definition → customer profiling → profile updating → event monitoring and triggering → customer diagnosis → a small set of interesting customers
  • 95. SPC FOR CRM - PHASE 1 Off-line Modeling: building customer profile robustly - time consuming Requirements A single, time variant model capturing most customers’ behavior Automatic modeling, less human intervention Techniques Robust and efficient estimation methods Change-point modeling Parameter Selection MSE/AIC/BIC Business Requirement/Domain Knowledge
  • 96. SPC FOR CRM - PHASE 2 On-line customer profile updating and monitoring, in search for interesting events requiring action Requirements: Recursive vs. time window Signal accurately and as quickly as possible Techniques: Markovian Type Updating – storage space & time State Space control models
  • 97. SPC FOR CRM - PHASE 3 Diagnosis and Re-profiling Requirements Following signals Robust - outliers, trends, … Attribute identification Techniques: Bayesian models Nonlinear filtering methods
  • 98. PHASE 1: CUSTOMER PROFILE Dynamic Linear Model (West and Harrison, 1997). For each customer i with usage series {X_t(i)}, maintain a profile P_t(i) = [M_t(i), T_t(i), V_t(i)]': size/level M_t(i), trend T_t(i), variability/variance V_t(i), and (optionally) seasonality S_t(i).
  • 99. Estimation Methods Least Square Estimation (LSE) Least Absolute Deviation (LAD) Dummy Change Point Model with LSE Dummy Change Point Model with LAD
  • 100. LSE and LAD
  • 101. A DUMMY CHANGE-POINT MODEL
  • 102. A DUMMY CHANGE-POINT MODEL Solve global models assuming 1 dummy change point: $a(p) = \arg\min_a \sum_{k=0}^{p-1} [X_{t-k} - (a_0 + a_1 k)]^2$. $a(p)$ can be obtained recursively by reversing the DES method with $\lambda = 1$. Combine forecasts with exponential weights, $\sum_{p=2}^{t} w_p \, a(p)$. The local variance can be estimated via bootstrap resampling.
  • 103. A DUMMY CHANGE-POINT MODEL
  • 104. PHASE 2: CUSTOMER PROFILE UPDATING AND MONITORING History data cleaning and profiling. Forecasting: $\hat{M}_{t+1}(i) = M_t(i) + T_t(i)$. Online monitoring: signal if $|X_{t+1} - \hat{M}_{t+1}| > K \sqrt{V_t}$. Markovian updating: $M_{t+1}(i) = (1-\lambda_M)\hat{M}_{t+1}(i) + \lambda_M X_{t+1}(i)$, $T_{t+1}(i) = (1-\lambda_T) T_t(i) + \lambda_T (M_{t+1}(i) - M_t(i))$, $V_{t+1}(i) = (1-\lambda_V) V_t(i) + \lambda_V (X_{t+1}(i) - \hat{M}_{t+1}(i))^2$.
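A compact sketch of this recursion in Python. The level update is written in Holt form, since the slide's exact expression is ambiguous, and the smoothing constants and data are illustrative assumptions.

```python
def update_profile(M, T, V, x_new, lam_M=0.2, lam_T=0.1, lam_V=0.1, K=3.0):
    """One step of profile forecasting, monitoring, and EWMA-style updating
    for a single customer (Holt-type level/trend recursion assumed)."""
    M_hat = M + T                                   # forecast
    alarm = abs(x_new - M_hat) > K * V ** 0.5       # online monitoring
    M_new = (1 - lam_M) * M_hat + lam_M * x_new     # level update
    T_new = (1 - lam_T) * T + lam_T * (M_new - M)   # trend update
    V_new = (1 - lam_V) * V + lam_V * (x_new - M_hat) ** 2  # variance update
    return M_new, T_new, V_new, alarm

# Track one customer's usage series; the last value is anomalous.
M, T, V = 100.0, 0.0, 25.0
for x in [102, 99, 104, 101, 160]:
    M, T, V, alarm = update_profile(M, T, V, x)
    print(f"x={x:5.1f}  level={M:7.2f}  alarm={alarm}")
```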
  • 105. Comparisons Objectives: robust at Phase 1, sensitive at Phase 2. Four methods: 1. LSE 2. LAD 3. Dummy change point model with LSE 4. Dummy change point model with LAD
  • 106. Case Study Data Mining in Telecommunications Industry (Source: AT&T; Mastering Data Mining by Berry & Linoff.)
  • 107. Outline Background Dataflows Business problems Data A voyage of discovery Summary
  • 108. Telecommunication Industry A ~$160 billion per year industry (~$70B long distance & ~$90B local) 100 million+ customers/accounts/lines >1 billion phone calls per day Book closing (estimating this month's price/usage/revenue) Budgeting (forecasting next year's price/usage/revenue) Segmentation (clustering of usage, growth, …) Cross selling (association rules) Churn (disconnect prediction & tracking) Fraud (detection of unusual usage time series behavior) Each of these problems is worth hundreds of millions of dollars
  • 109. Information Sources Customer events (adding a phone, making a call) feed several data sources: the ordering system (win/loss/new/no-further-use orders; daily, delayed), the billing system (call details, revenue, price; monthly, delayed), customer web access (real time), the network (call details; real time), FCC official high-level competitive reports (annually/quarterly), and external sources such as the Census and Dun & Bradstreet (annually/quarterly). (Terabytes of interesting information)
  • 110. Customer Focus Telecommunication companies want to meet all the needs of their customers: Local, long distance, and international voice telephone services Wireless voice communications Data communications Gateways to the Internet Data networks between corporations Entertainment services, cable and satellite television Instead of miles of cable and numbers of switches, customers are becoming the biggest asset of a telephone company.
  • 111. Dataflows Customer behavior is in the data. Over a billion phone calls every day. A dataflow is a way of visually representing transformations on data. A dataflow graph consists of nodes and edges. Data flows along edges and gets processed at nodes. A basic dataflow to read a file, uncompress it, and write it out: compressed input file (in.z) → uncompress → uncompressed output file (out.text)
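In Python, this kind of node-and-edge pipeline can be mimicked with generators, each node consuming the previous node's stream. The file names, gzip compression, and filter predicate below are placeholders, not details from the original system.

```python
import gzip

def read_lines(path):
    """Source node: stream records from a compressed file (gzip assumed)."""
    with gzip.open(path, "rt") as f:
        for line in f:
            yield line.rstrip("\n")

def select(records, predicate):
    """Select node: pass through only records matching a predicate."""
    return (r for r in records if predicate(r))

def write_lines(records, path):
    """Sink node: write records to an uncompressed output file."""
    with open(path, "w") as f:
        for r in records:
            f.write(r + "\n")

# in.z -> uncompress -> select -> out.text (hypothetical file names):
# write_lines(select(read_lines("in.z"), lambda r: "INTL" in r), "out.text")
```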
  • 112. Why are Dataflows efficient? Dataflows dispense with most of the overhead that traditional databases have, like transaction logging, indexes, pages, etc. Dataflows can be run in parallel, taking advantage of multiple processors and disks. Dataflows provide a much richer set of transformations than traditional SQL.
  • 113. Basic Operations in Dataflows Compress and uncompress Reformat Select Sort Aggregate and hash aggregate Merge/Join Very important steps: the data is very, very large; supercomputer power is needed
  • 114. Business Problems The telecommunication business has shifted from an infrastructure business to a customer business. Understanding customer behavior becomes critical (market segmentation). Revenue forecasting, churn prediction, fraud detection, new business customer identification. The detailed transaction data contains a wealth of information, but it goes unexploited due to its huge volume.
  • 115. Important Marketing Questions Discussions with business users highlight the areas for analysis: Understanding the behavior of individual customers Regional differences in calling patterns High-margin services Supporting marketing and new sales initiatives
  • 116. Data Call detail data Customer data Auxiliary files
  • 117. Call Detail Data Definition: A call detail data is a single record for every call made over the telephone network. Three sources of call detail data: Direct network/switch recordings Switch records: the least clean, but the most informative. Inputs into the billing system Billing records: cleaner, but not complete. Data warehouse feeds Rather clean, but limited by the needs of the data warehouse.
  • 118. Network Call Details Hundred million calls a day >100 bytes per call record (>10 gigabytes per day) Originating number Terminating number Day/time of the call Length of the call Type of call, … 2 years of data online ??? ---> statistical compression >70 billion records (>7 terabytes) Currently on tape, batch processing Real time, low-level details +++ Raw data, massive data processing --- Key applications: book closing, fraud detection, early warning, …
  • 119. Billing Details Millions of customers/accounts Tons of other information about the customers/accounts 100+ services (regular long distance, Digital 1 rate, easylink, Readyline, VTNS, …) 5 jurisdictions (international, interstate, …) 50 states NPA-NXX 24-36 months of message, minute, revenue Length of call, average revenue per minute ~billions of observations $, detailed +++ Dirty, delayed ---- Key applications: budgeting/forecasting, segmentation/clustering.
  • 120. Call Detail Data Record format Important fields in a call detail record include: from_number to_number duration_of_call start_time band service_field
  • 121. Customer Data Customers can have multiple telephone lines. Customer data is needed to match telephone numbers to information about customers. Telecommunication companies have made significant investments in building and populating data models for their customers.
  • 122. Customer Ordering Data Hundreds of thousands of add/disconnect orders weekly Add a line or disconnect a line, … Tons of other information about the customers/accounts 4+ order types (Add, Win, Loss, No Further Use) 100+ services Related carrier Requires minute/revenue estimation/prediction Summarizing the historical usage of a loss/NFU into one number Predicting the future usage of a win/new (growth curve) 5 years online, a few hundred million records Timely, small volume +++ Missing information, massive data integration --- Major applications: customer churn, early warning, predicting disconnects
  • 123. Auxiliary Files ISP access numbers A list of access numbers of Internet Service Providers Fax numbers A list of known fax machines Wireless exchanges A list of exchanges that correspond to mobile carriers Exchange geography A list of geographic areas represented by the phone number exchange International A list of country codes and the names of the corresponding countries.
  • 124. Discovery Call duration Calls by time of day Calls by market segment International calling patterns When are customers at home Internet service providers Private networks Concurrent calls Broad band customers
  • 125. Call Duration
  • 126. Call Duration
  • 127. Calls by Time of Day In call detail data, the field band is a number representing how the call should be charged. This provides a breakdown: local regional national international fixed-to-mobile other Unknown Question: when are different types of calls being made?
  • 128. Calls by Time of Day
  • 129. Calls by Time of Day
  • 130. Calls by Time of Day
  • 131. Calls by Market Segment The market segment is a broad categorization of customers: Residential Small business Medium business Large business Global Named accounts Government Question: Are customers within market segments similar to each other? What are the calling patterns between market segments?
  • 132. Calls by Market Segment Solution approach: join the call detail records with customer data to attach a market segment to each from_number and to_number (from_market_segment, to_market_segment), then aggregate to produce the results.
  • 133. Calls by Market Segment
  • 134. Calls by Market Segment
  • 135. Calls by Market Segment
  • 136. International Calling Patterns International calls are highly profitable, but highly competitive. Questions: where are calls going to? how do calling patterns change over time? how do calling patterns change during the day? what are differences between business and consumer usage? which customers primarily call one country? which customers call a wider variety of international numbers?
  • 137. International Calling Patterns
  • 138. When are Customers at Home?
  • 139. Internet Providers Question: which customers own modems? which Internet service providers (ISPs) are customers using? do different segments of customers use different ISPs?
  • 140. Internet Providers
  • 141. Private Networks Special customers: Businesses that operate from multiple sites likely make large volumes of phone calls and data transfers between the sites. Some businesses must exchange large volumes of data with other businesses. A virtual private network (VPN) is a telephone product designed for this situation. For large volumes of phone calls, it provides less expensive service than pay-by-call service. Question: Which customers are good candidates for VPN? Result: A list of businesses that have multiple offices and make phone calls between them.
  • 142. Concurrent Calls For businesses having a limited number of outbound lines connected to a large number of extensions, the following questions are of interest: When does a customer need additional outside lines? When is the right time to offer upgrades to their phone systems? One measure of a customer's need for new lines is the maximum number of lines that are used concurrently, as in the sketch below.
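A hypothetical sketch of that measure: given each call as a (start, end) interval from the call detail data, a sweep over the sorted start/end events yields the peak number of simultaneously active lines.

```python
def max_concurrent(calls):
    """calls: list of (start_time, end_time) pairs for one customer.
    Sweep line: +1 at each start, -1 at each end, track the running max."""
    events = []
    for start, end in calls:
        events.append((start, +1))
        events.append((end, -1))
    events.sort()          # at equal times, ends (-1) sort before starts (+1)
    peak = active = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak

calls = [(0, 10), (2, 6), (3, 12), (11, 15)]   # hypothetical call intervals
print(max_concurrent(calls))                    # -> 3 lines in use at once
```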
  • 143. Concurrent Calls
  • 144. Identify Broad Band Customers Objective: Identify customers who use their telephone lines for data/computer access (potential broadband customers) Collect a sample of 4000 lines for which voice or data/computer access information is available Divide into two halves for training and testing Define hundreds of call behavior variables Run neural network, logistic regression, and tree models
  • 145. Identify Broad Band Customers Key predictive drivers: length of call (10+ min.), number of repeat phone calls to the same number (5+), calls by time of day (at night), calls by day of the week (weekend). The neural network performed the best; the tree is the most intuitive.
  • 146. Summary Call detail records contain rich information about customers:
    Customer behavior varies from one region of a country to another.
    Thousands of companies place calls to ISPs; they own modems and have the ability to respond to web-based marketing.
    Residential customers indicate when they are home by using the phone. These patterns can be important, both for customer contact and for customer segmentation.
    The market share of ISPs differs by market segment.
    International calls show regional variations. The length of calls varies considerably depending on the destination. International calls made during the evening and early morning are longer than international calls made during the day.
    Companies making calls between their different sites are candidates for private networking.
  • 147. Case Study: Churn Modeling in Wireless Communications This case study took place at the largest mobile telephone company in a newly developed country. The primary data source is the prototype of an ongoing data warehousing effort. (Source: “Mastering Data Mining” by Berry & Linoff)
  • 148. Outline The Wireless Telephone Industry Three Goals Approach to Building the Churn Model Churn Model Building The Data Lessons about Churn Model Building Summary
  • 149. The Wireless Telephone Industry The rapid maturing of the wireless market makes the number of churners and the effect of churn on the customer base grow significantly. The business shifts away from signing on nonusers and focuses on existing customers. (See Figure 11.2 and Figure 11.3.) The wireless telephone industry has differences from other industries: Sole service providers Relatively high cost of acquisition No direct customer contact Little customer mindshare The handset
  • 150. Three Goals Near-term goal: identify a list of probable churners for a marketing intervention. Discussions with the marketing group defined the near-term goal: by the 24th of the month, provide the marketing department with a list of the 10,000 club members most likely to churn. Medium-term goal: build a churn management application (CMA). Besides running churn models, the CMA also needed to: Manage models Provide an environment for data analysis before and after modeling Import data and transform it into the input for churn models Export the churn scores developed by the models Long-term goal: complete customer relationship management
  • 151. Approach to Building the Churn Model Define churn Involuntary churn refers to cancellation of a customer’s service due to nonpayment. Voluntary churn is everything that is not involuntary churn. The model is for the latter. Inventory available data A basic set of data includes data from the customer information file, data from the service account file, and data from billing system. Build models Deploy scores Churn scores can be used for marketing intervention campaigns, prioritizing of customers for different campaigns, and estimating customer longevity in computing estimated lifetime customer value. Measure the scores against what really happens How close are the estimated churn probabilities to the actual churn probabilities? Are the churn scores “relatively” true, i.e., higher scores imply higher probabilities?
  • 152. Churn Model Building A churn modeling effort necessitates a number of decisions: The choice of data mining tool SAS Enterprise Miner Version 2 was used for this project. Segmenting the model set Three models were built for three segments of customers: club members, non-club members, and recent customers who had joined in the previous eight or nine months. The final four models on four different segments In order to investigate whether customers joining at about the same time have similar reasons for churn, the club model set was split into two segments: customers who joined in the previous two years, and the rest.
  • 153. Churn Model Building (continued) Choice of modeling algorithm Decision tree models were used for churn modeling due to their ability to handle hundreds of fields in the data, their explanatory power, and their ease of automation. This project built six trees for each model set (using Gini and entropy as split functions, and allowing 2-, 3- and 4-way splits) in order to see which performs best and to have them verify each other. Three parameters need to be set: minimum size of a leaf node, minimum size of a node to split, and maximum depth of the tree. The resulting tree needs to be pruned. The size and churner density of the model set Experiments with different model sets show that a model set with 30% churners and 50k records works best. (Table 11.3) The effect of latency (Figure 11.12) Translating models in time (Figure 11.13)
  • 154. The Data Historical churn rates Historical churn rate was calculated along different dimensions: handset, demographic, dealer, and ZIP code. Data at the customer and account level SSN, ZIP code of residence, market ID, age and gender, pager indication flag, etc. Data at the service level Activation date and reason, features ordered, billing plan, handset, and dealer, etc. Billing history data Total amount billed, late charges and amount overdue, all calls, fee-paid services, etc. Rejecting some variables Variables that cheat, identifiers, categoricals with too many values, absolute dates, and untrustworthy values, etc. Derived variables
  • 155. Lessons about Churn Model Building Finding the most significant variables: handset churn rate, other churn rate, number of phones in use by a customer, low usage Listening to the business users to define the goals Listening to the data Including historical churn rates The past is the best predictor of the future. For churn, the past is historical churn rates: churn rate by handset, by demographics, by area, and by usage patterns. (Figure 11.17) Composing the model set Important factors: historical data availability, size, and churner density. (Figure 11.18) Building a model for the churn management application Listening to the data to determine model parameters Understanding the algorithm and the tool
  • 156. Summary Four critical success factors for building a churn model: Defining churn, especially differentiating between interesting churn (such as customers who leave for a competitor) and uninteresting churn (customers whose service has been cut off due to nonpayment). Understanding how the churn results will be used. Identifying data requirements for the churn model, being sure to include historical predictors of churn, such as churn rate by handset and churn rate by demographics. Designing the model set so the resultant models can slide through different time windows and are not obsolete as soon as they are built.
  • 157. Case Study Market Basket Analysis Who buys meat at the health food store? (Source: Mastering Data Mining by Berry & Linoff.)
  • 158. Purpose Who buys meat at the health food store? Understand customer behavior.
  • 159. DM Tools Association Rules of Market Basket Analysis. Customer clustering. Decision tree.
  • 160. Market Basket Analysis Customer Analysis Market Basket Analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases. Product Analysis Market Basket Analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase. Source: E. Wegman
  • 161. Market Basket Analysis Given A database of transactions. Each transaction contains a set of items. Find all rules X →Y that correlate the presence of one set of items X with another set of items Y. Example: When a customer buys bread and butter, they buy milk 85% of the time. Source: E. Wegman
  • 162. Market Basket Analysis While association rules are easy to understand, they are not always useful. Useful: On Friday, convenience store customers often purchase diapers and beer together. Trivial: Customers who purchase maintenance agreements are very likely to purchase large appliances. Inexplicable: When a new Super Store opens, one of the most commonly sold items is light bulbs. Source: E. Wegman
  • 163. Measures for Market Basket Analysis Confidence: Probability that the right-hand product is present given that the left-hand product is in the basket. Support: Percentage of baskets that contain both the left-hand side and the right-hand side of the association. Lift (correlation): Compare the likelihood of finding the right-hand product in a basket known to contain the left-hand product to the likelihood of finding the right-hand product in any random basket.
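These three measures are easy to compute directly from a list of baskets; the following is a small illustrative sketch (the function name and item names are invented):

```python
def rule_measures(baskets, lhs, rhs):
    """Support, confidence, and lift for the rule lhs -> rhs,
    where `baskets` is a list of sets of items."""
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs <= b)
    n_rhs = sum(1 for b in baskets if rhs <= b)
    n_both = sum(1 for b in baskets if (lhs | rhs) <= b)
    support = n_both / n
    confidence = n_both / n_lhs          # P(rhs | lhs)
    lift = confidence / (n_rhs / n)      # P(rhs | lhs) / P(rhs)
    return support, confidence, lift

baskets = [{"bread", "butter", "milk"}, {"bread", "butter"},
           {"milk"}, {"bread", "butter", "milk"}, {"beer", "diapers"}]
print(rule_measures(baskets, {"bread", "butter"}, {"milk"}))
# -> support 0.40, confidence 0.67, lift 1.11
```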
  • 164. Example "Caviar implies Vodka" High confidence: Given that we know someone bought caviar, the probability that the person buys vodka is very high. Low support: The percentage of baskets that contain both vodka and caviar is very low, since those products are not widely purchased. High lift: Pr(vodka in the basket | caviar is already in the basket) / Pr(vodka in any random basket) is high.
  • 165. Association Results
    Rule | Items | Support (%) | Confidence (%) | Lift
    1. Red pepper -> Yellow pepper & Bananas & Bakery | 4 | 2.47 | 33.72 | 3.23
    2. Red pepper -> Yellow pepper & Bananas | 3 | 2.24 | 49.21 | 4.75
    … | … | … | … | …
    50. Green peppers -> Bananas | 2 | 1.37 | 85.96 | 3.77
    For rule 50, Lift = Pr(bananas | green peppers already in the basket) / Pr(bananas in any random basket) = high / high, so the lift is relatively low even though the confidence is high.
  • 166. Clustering Variables Gender Meat buying Total Spending
  • 167. Customer Clusters The height of the pies: total spending. Shaded pie slice: the percentage of people in the cluster who buy meat. Top row: women; bottom row: men.
  • 168. Decision Tree The most meat-buying branches spend the most money and buy the largest number of items. Although only about 5% of shoppers buy meat, they are among the most valuable shoppers!
  • 169. Decision Tree for More about Meat
  • 170. Conclusion Data mining can be used to improve shelf placement decisions. Data mining can be used to identify a small but very profitable group of customers.
  • 171. Case Study Supermarket Mining Analyzing Ethnic Purchasing Patterns (Source: Mastering Data Mining by Berry & Linoff.)
  • 172. Overview Describe how the manufacturer learned about ethnic purchasing patterns. Aimed at Spanish-speaking shoppers in Texas. Collected data from a supermarket chain in Texas. Employed data mining tools from MineSet (SGI).
  • 173. Purpose Discover whether the data provided revealed any differences between the stores with a high percentage of Spanish-speaking customers and those having fewer. Hispanic percentage for each specific item. Identify which products sell well with Hispanic consumers. Scatter plot showing variability of Hispanic appeal by category.
  • 174. Data Consists of weekly sales figures for products from five basic categories (ready-to-eat cereals, desserts, snacks, main meals, pancake and variety baking mixes). Within each category, subcategories were assigned (actual units sold, dollar volume, and equivalent case sales). For each store: store size, % of Hispanic shoppers, and % of African-American shoppers.
  • 175. Transformation of Data Decode variables that carried more than one piece of information. HISPLVL and AALEVEL: % of Hispanic and African-American shoppers. HISPLVL runs from 1 to 15: 1 = store outside San Antonio with 90% or more Hispanic shoppers; 10 = little or no Hispanic shoppers. Normalize values by sales volume to compare stores of different sizes. Hispanic score = average value for the most Hispanic stores - average value for the least Hispanic stores. A large positive value indicates a product that sells much better in the heavily Hispanic stores.
  • 176. Transformation of Data The most valuable part of the project was preparing the data and getting familiar with it, rather than running fancy data mining algorithms.
  • 177. DM Tools Association rule visualization for Hispanic percentage. Scatter plot showing which products sell well in Hispanic neighborhoods. Scatter plot showing variability of Hispanic appeal by category.
  • 178. Case Study Supermarket Mining Transactions & Customer Analysis (Source: Mastering Data Mining by Berry & Linoff.)
  • 179. Overview A collaboration between a manufacturer and one of its retailer chains. Grocery market analysis that usually belongs to the retailer was actually performed by a supplier.
  • 180. Purpose Effective use of sales data to make the category as a whole more profitable for the retailer. Identify customer behavior. Find clusters of customers.
  • 181. Transaction Detail Fields
    Field | Description
    Date | YYYY-MM-DD
    Store | CCCSSSS, where CCC = chain, SSSS = store
    Lane | Lane of transaction
    Time | The time-stamp of the order start time
    Customer ID | The loyalty card number presented by the customer; an ID of 0 means the customer did not present a card
    Tender Type | Payment type, i.e. 1 = cash, 2 = check, …
    UPC | The universal product code for the item purchased
    Quantity | The total quantity of this item
    Dollar Amount | The total $ amount for the quantity of a particular UPC purchased
  • 182. Universal Product Code The numbers, encoded as machine-readable bar codes, that identify nearly every product that might be sold in a grocery store. Organizations: Uniform Code Council (www.uc-council.org): US and Canada; European Article Numbering Association (www.ean.be): Europe and the rest of the world. North America: consists of 12 digits. The code itself fits in 11 digits; the twelfth is a checksum.
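The UPC-A check digit can be validated as follows: three times the sum of the odd-position digits plus the sum of the even-position digits, with the check digit added, must be a multiple of 10. A small sketch:

```python
def upc_check_digit(code11):
    """Compute the 12th (check) digit of a UPC-A code from its
    first 11 digits."""
    digits = [int(c) for c in code11]
    odd = sum(digits[0::2])    # positions 1, 3, ..., 11
    even = sum(digits[1::2])   # positions 2, 4, ..., 10
    return (10 - (3 * odd + even) % 10) % 10

def is_valid_upc(code12):
    """True if the 12-digit code's last digit matches its checksum."""
    return len(code12) == 12 and upc_check_digit(code12[:11]) == int(code12[-1])

print(is_valid_upc("036000291452"))   # a well-known example UPC -> True
```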
  • 183. From the Transaction Detail Fields we can calculate: The % of each shopper's total spending that went to each category. The total number of trips. The total dollar amount spent for the year, along with the total number of items purchased and the total number of distinct items purchased. The % of the items purchased that carried high, medium, and low profit margins for the store.
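With the transaction fields from slide 181 loaded into a pandas DataFrame, these per-shopper quantities are one groupby away; the column names and the `category` field are assumptions about how the data might be laid out.

```python
import pandas as pd

# One row per line item; all names and values here are illustrative.
tx = pd.DataFrame({
    "customer_id":   [1, 1, 1, 2, 2],
    "date":          ["2024-01-02", "2024-01-02", "2024-01-09",
                      "2024-01-03", "2024-01-03"],
    "upc":           ["036000291452", "012345678905", "036000291452",
                      "012345678905", "096619756803"],
    "category":      ["cereal", "snacks", "cereal", "snacks", "meat"],
    "quantity":      [2, 1, 1, 3, 1],
    "dollar_amount": [7.98, 2.49, 3.99, 7.47, 9.99],
})

per_customer = tx.groupby("customer_id").agg(
    total_spend=("dollar_amount", "sum"),
    total_items=("quantity", "sum"),
    distinct_items=("upc", "nunique"),
    trips=("date", "nunique"),          # one trip per distinct shopping date
)
# Share of each shopper's spending that went to one category
cereal = tx[tx["category"] == "cereal"].groupby("customer_id")["dollar_amount"].sum()
per_customer["cereal_share"] = (cereal / per_customer["total_spend"]).fillna(0)
print(per_customer)
```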
  • 184. Finding Clusters of Customers Finding groups of customers with similar behavior. K-means clustering: Set a certain number k. Select k records as candidate cluster centers. Assign each record to the cluster whose center it is nearest. Recalculate the centers of the clusters and reassign the records based on their proximity to the new cluster centers. Repeat.
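A bare-bones version of that loop, using the clustering variables from slide 166 as a toy example; a real run would standardize the columns first, and the fixed iteration count is a simplification.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means as described: pick k candidate centers, assign each
    record to the nearest center, recompute centers, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every record to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# e.g., cluster shoppers on (total spending, share of meat items, gender code)
X = np.array([[500, 0.10, 0], [520, 0.12, 1], [90, 0.00, 0], [80, 0.01, 1]],
             dtype=float)
labels, centers = kmeans(X, k=2)
print(labels)
```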
  • 185. Main Ways to Use Clusters To get insight into customer behavior by understanding what differentiates one cluster from another. To build further models within clusters. To use as additional input variables to other models.
  • 186. Case Study Who Gets What? Building a Best Next Offer Model for an Online Bank (Source: Mastering Data Mining by Berry & Linoff.)
  • 187. Who Gets What? Building a Best Next Offer Model for an Online Bank The use of data mining by the online division of a major bank to improve its ability to perform cross-selling. Cross-selling: the activity of selling additional services to the customers you already have.
  • 188. Outline Background on the Banking Industry The Business Problem The Data Approach to the Problem Model Building Lessons Learned
  • 189. Background on the Banking Industry The challenge for today's large banks is to shift their focus from market share to wallet share. That is, instead of merely increasing the number of customers, banks need to increase the profitability of the ones they already have.
  • 190. Background on the Banking Industry Why use data mining? A bank knows much more about current customers than external prospects. The information gathered on customers in the course of normal business operations is much more reliable than the data purchased on external prospects.
  • 191. The Business Problem The project had immediate, short-term, and long-term goals. Long-term: increase the bank’s share of each customer’s financial business by cross-selling appropriate products. Short term: support a direct e-mail campaign for four selected products (brokerage accounts, money market accounts, home equity loans, and a particular type of saving account). Immediate: take advantage of a data mining platform on loan from SGI to demonstrate the usefulness of data mining to the marketing of online banking services.
  • 192. The Data The initial data comprised 1,122,692 account records extracted from the Customer Information System (CIS). Before starting data mining, a SAS data set was created containing an enriched version of the extracted data.
  • 193. The Data From accounts to customers Defining the products to be offered.
  • 194. The Data From accounts to customers The data extracted from the CIS had one row per account, which reflects the usual product-centric organization of a bank, where managers are responsible for the profitability of particular products rather than the profitability of customers or households. The best next offer project required pivoting the data to build customer-centric models. The account-level records from the CIS were transformed into around a quarter million household-level records.
  • 195. The Data Defining the products to be offered 45 product types are used for the best next offer model. Of these, 25 products are ones that may be offered to a customer. Information on the remaining products is used only as input variables when building the models.
  • 196. Approach to the Problem The approach to the problem: a propensity-to-buy model is built for each product individually, giving each customer a score for the modeled product. The scores for the four products are then combined to yield the best next offer model: each customer is offered the product for which he or she has the highest score.
  • 197. Approach to the Problem Comparable scores How to score? Pitfalls of this approach
  • 198. Approach to the Problem Comparable scores Three requirements must be met for scores from the various product propensity models to be comparable: All scores must fall into the same range, zero to one. Anyone who already has a product should score zero for it. The relative popularity of products should be reflected in the scores.
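  One hedged reading of the three rules in code; the rescaling used for the popularity rule is an assumption, since the case study does not spell out the mechanics:

```python
# Speculative sketch of the comparability rules. `raw` is a model's output
# per customer, `owns` a boolean ownership mask, `penetration` the product's
# share of the customer base, `mean_raw` the mean raw score for the product.
import numpy as np

def comparable_score(raw, owns, penetration, mean_raw):
    score = np.clip(raw, 0.0, 1.0)                             # rule 1: scores in [0, 1]
    score = np.clip(score * penetration / mean_raw, 0.0, 1.0)  # rule 3: popular products average higher
    score[owns] = 0.0                                          # rule 2: existing holders score zero
    return score
```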
  • 199. Approach to the Problem How to score? With a product propensity model, prospects are given a score based on the extent to which they look like the existing account holders for that product. This project used a decision tree-based approach, which uses the percentage of existing customers at each leaf to assign a score for the product. The approach can be summed up in the words of Richard C. Cushing: “When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.”
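  A scikit-learn decision tree makes the same leaf-percentage idea concrete (the case study used MineSet; this stand-in and the variable names are assumptions):

```python
# Score prospects by the fraction of existing brokerage holders at the
# leaf they fall into; predict_proba returns exactly that leaf fraction.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(min_samples_leaf=50).fit(X_model, y_has_brokerage)
propensity = tree.predict_proba(X_prospects)[:, 1]  # the "looks like a duck" score
```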
  • 200. Approach to the Problem Pitfalls of this approach Becoming a customer may change people’s behavior, so the best approach is to build models based on the way current customers looked just before they became customers; however, the data this approach requires is not easy to get. Current customers also reflect past policy, which can perpetuate “past discrimination”.
  • 201. Model Building Build an individual propensity model for each product: finding important variables; building a decision tree model; model performance in a controlled test. Then get to a cross-sell model by combining the individual propensity models.
  • 202. Model Building Start with brokerage accounts.
  • 203. Finding important variables Using the column importance tool: find a set of variables which, taken together, do a good job of differentiating the classes (people with brokerage accounts and people without): whether they are a private banking customer; the length of time they have been with the bank; the value of certain lifestyle codes assigned to them by Microvision (a marketing statistics company). Using the evidence classifier: this tool uses the naïve Bayes algorithm to build a predictive model. Naïve Bayes models treat each variable independently and measure each variable’s contribution to a prediction; these independent contributions are then combined to make a classification.
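  A rough analogue of the evidence classifier using scikit-learn’s naïve Bayes (MineSet’s actual tool is not reproduced here; `X_binary` and `feature_names` are hypothetical):

```python
# Naïve Bayes treats each variable independently, so per-variable
# log-likelihood ratios show how much each field pushes a customer
# toward "has brokerage".
from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB().fit(X_binary, y_has_brokerage)  # X_binary: 0/1 indicator features
evidence = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
for name, contrib in sorted(zip(feature_names, evidence), key=lambda t: -abs(t[1])):
    print(f"{name:30s} {contrib:+.3f}")            # largest contributions first
```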
  • 204. Building a decision tree model for brokerage MineSet’s decision tree tool: leaves in the tree are either mostly nonbrokerage or mostly brokerage. Each path through the tree to a leaf containing mostly brokerage customers can be thought of as a “rule” for predicting an unclassified customer: customers meeting the conditions of the rule are likely to have, or be interested in, a brokerage account. In our data, only 1.2 percent of customers had brokerage accounts, so oversampling was used to increase the percentage of brokerage customers in the model set. The final tree was built on a model set containing about one quarter brokerage accounts.
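  Building the quarter-brokerage model set might look like this sketch, which keeps every rare record and samples down the common class (matching the weighting advice on a later slide; column names are assumptions):

```python
# Build a model set that is ~25% brokerage by keeping all rare records
# and drawing three times as many non-brokerage records.
import pandas as pd

rare = customers[customers["has_brokerage"] == 1]
common = customers[customers["has_brokerage"] == 0].sample(n=3 * len(rare), random_state=0)
model_set = pd.concat([rare, common]).sample(frac=1, random_state=0)  # shuffle rows
```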
  • 205. Building a decision tree model for brokerage Record weights in place of oversampling Allowing one-off splits Grouping categories Influencing the pruning decisions Backfitting the model for comparable scores
  • 206. Building a decision tree model for brokerage Record weights in place of oversampling Record weighting can achieve the effect of oversampling by increasing the relative importance of the rare records: the splitting decision is based on the total weight of records in each class rather than the total number of records. Instead of increasing the weight of records in the rare class, the proper approach is to lower the weight of records in the common class. Bringing the weight of rare records up to 20-25% of the total works well.
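  The same effect with record weights instead of sampling, as a sketch (scikit-learn stands in for MineSet):

```python
# Lower the weight of common-class records so rare records carry ~25%
# of the total weight, then let the splitter use weights, not counts.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_rare = int((y == 1).sum())
w_common = 3 * n_rare / int((y == 0).sum())  # so rare / (rare + w*common) = 0.25
weights = np.where(y == 1, 1.0, w_common)
tree = DecisionTreeClassifier(min_samples_leaf=50).fit(X, y, sample_weight=weights)
```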
  • 207. Building a decision tree model for brokerage Allowing one-off splits By default, MineSet’s tree-building algorithm splits a categorical variable on every single value, or does not split on it at all; users can control through a parameter whether one-off splits (splits based on a single value of a categorical variable) are considered. Grouping categories By design, MineSet’s tree-building algorithm is unlikely to make good splits on a categorical variable taking on hundreds of values. Some variables rejected by MineSet nevertheless seemed very predictive; although they had hundreds of values in the data, only a few of those values appeared frequently. The approach is to lump all values below a certain frequency threshold into a catch-all “other” category and make splits on the more populous ones.
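  Lumping sparse categorical values into a catch-all category takes only a few lines; the threshold and the column name below are assumptions:

```python
# Keep only values that appear at least `min_count` times; everything
# else becomes "other", so splits are attempted on populous values only.
import pandas as pd

def lump_rare(col: pd.Series, min_count: int = 100) -> pd.Series:
    keep = col.value_counts().loc[lambda c: c >= min_count].index
    return col.where(col.isin(keep), other="other")

customers["branch_grouped"] = lump_rare(customers["branch"])  # hypothetical column
```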
  • 208. Building a decision tree model for brokerage Influencing the pruning decisions Users can control the size, depth, and bushiness of the tree. Good settings: a minimum of 50 records in a node, a pruning factor of 0.1, and no explicit limit on depth. Backfitting the model for comparable scores Backfitting runs the original (non-oversampled) data through the tree; the score for each leaf is the percentage of brokerage customers at that leaf. The more brokerage customers at a leaf, the higher the scores the non-brokerage customers at that leaf receive, and the more likely they are to open a brokerage account.
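  Backfitting can be sketched as re-scoring each leaf on the original, unbalanced data; `tree`, `X_original`, and `y_original` are the hypothetical names from the earlier sketches:

```python
# Run the original records through the trained tree and score each leaf
# by the true percentage of brokerage customers landing there.
import numpy as np

leaves = tree.apply(X_original)  # leaf id for each original record
leaf_score = {l: y_original[leaves == l].mean() for l in np.unique(leaves)}
prospect_scores = np.array([leaf_score[l] for l in tree.apply(X_prospects)])
```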
  • 209. Brokerage model performance in a controlled test

  Group     Size     Selection    Emailed   Response Rate
  Model     10,000   High score   Yes       0.7
  Control   10,000   Random       Yes       0.3
  Hold-out  10,000   Random       No        0.05

  “High score” means any score higher than the density of brokerage customers in the population, not necessarily a large number.
  • 210. Getting to a cross-sell model The propensity models for the remaining products are built following the same procedure, and the individual propensity models are combined into a cross-sell model to find the best next offer. (Diagram: one customer’s scores, 0.10 for product A, 0.72 for B, 0.31 for C, 0.47 for D; product B wins the vote as the best next offer.)
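  The “vote” in the diagram is just an argmax over the comparable scores; a toy example matching the numbers above:

```python
# One customer's propensity scores for products A-D; owned products would
# already be zeroed out, so the highest score wins the best next offer.
import numpy as np

products = ["A", "B", "C", "D"]
scores = np.array([0.10, 0.72, 0.31, 0.47])
best_next_offer = products[int(scores.argmax())]  # -> "B"
```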
  • 211. Summary of the Procedure Determine whether cross-selling makes sense. Determine whether sufficient data exists to build a good cross-sell model. Build propensity models for each product individually. Combine the individual propensity models to construct a cross-sell model.
  • 212. Lessons Learned Before building customer-centric models, data needs to be transformed from product-centric to customer-centric. Having a particular product may change a customer’s behavior; the best way to handle this is to build models based on behavior before the product was bought. The current composition of the customer population is largely a reflection of past marketing policy. Oversampling and record weighting can be used to handle rare events.
  • 213. References Berry & Linoff, Mastering Data Mining, Wiley, 2000. Han & Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001. Hastie, Tibshirani, & Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001. Taguchi & Jugulum, The Mahalanobis-Taguchi Strategy, Wiley, 2002.