
- 1. IBM Research: Data Analytics for Marketing Decision Support
  Saharon Rosset, Naoki Abe*, IBM T.J. Watson Research Center
  *Acknowledgements to: Andrew Arnold, Chid Apte, John Langford, Rick Lawrence, Srujana Merugu, Edwin Pednault, Claudia Perlich, Rikiya Takahashi and Bianca Zadrozny
  © 2007 IBM Corporation
- 2. Tutorial outline
  Challenges of marketing analytics (SR)
  – Integrating the “marketing” and “data mining” approaches
  – Customizing data mining approaches to the challenges of marketing decision support
  Survey of some useful ML methodologies (NA)
  – Bayesian network modeling
  – Utility-based classification (cost-sensitive learning)
  – Reinforcement learning and Markov decision processes
  Detailed analysis and case studies:
  – Customer lifetime value modeling (NA)
  – Customer wallet estimation (SR)
- 3. The grand challenges of marketing
  Maximize profits (duh)
  Initiate, maintain and improve relationships with customers:
  – Acquire customers
  – Create loyalty, prevent churn
  – Improve profitability (lifetime value)
  Optimize use of resources:
  – Sales channels
  – Advertising
  – Customer targeting
- 4. Some of the concrete modeling problems
  Channel optimization
  Cross/up-sell (customer targeting)
  New customer acquisition
  Churn analysis
  Product life-cycle analysis
  Customer lifetime value modeling
  – Effect of marketing actions on LTV?
  Advertising allocation
  RFM (Recency, Frequency, Monetary) analysis
  ...
- 5. Data analytics for decision support: the grand challenge
  Beyond “modeling” the current situation, we need to offer insight about the effect or potential of possible actions and decisions:
  – How would different channels / incentives affect the LTV of our customers?
  – How much more money could this customer be spending with us (customer wallet)?
  – Can we predict the effects of new actions that have never been tried in historical data? What if they have been tried only on a non-representative set?
  – Can we be confident our results are actionable? Can we differentiate causality from correlation in our models?
- 6. Tutorial outline
  Challenges of marketing analytics
  – Integrating the “marketing” and “data mining” approaches
  – Customizing data mining approaches to the challenges of marketing decision support
  Survey of some useful ML methodologies
  – Bayesian network modeling
  – Utility-based classification (cost-sensitive learning)
  – Reinforcement learning and Markov decision processes
  Detailed analysis and case studies:
  – Customer lifetime value modeling
  – Customer wallet estimation
- 7. Typical marketing analytics vs. data mining
  CRM analytics:
  – Relies on primary research (= surveys) to understand needs and wants
  – Relies on (more or less) detailed models of customer behavior ⇒ usually parametric statistical models
  – Often estimates customer-level parameters
  Data mining:
  – Typically relies on data in the Data Warehouse / Mart
  – Uses a minimum of parametric assumptions
  – Often attempts to fit the problem into a “standard” modeling framework: classification, regression, clustering...
- 8. Comparison of approaches
  Criterion                                                                 | Marketing | DM
  Parametric models formalize knowledge of domain and problems              |     +     |  -
  Robust against incorrect assumptions about domain and problems            |     -     |  +
  Actively collect the data to estimate model quantities (active learning)  |     +     |  -
  Rely on existing, abundant data in Corporate Data Warehouses              |     -     |  +
  Integrate expert input from managers and customers (“wants and needs”)    |     +     |  -
  Use data to learn new, surprising patterns about customer behavior        |     -     |  +
- 9. Example 1: modeling and improving LTV
  Rust, Lemon and Zeithaml (2004), “Return on Marketing: Using Customer Equity to Focus Marketing Strategy”, J. of Marketing
  Modeling customer equity / lifetime value:
  – Combine several previous approaches
  – Model the brand “switching matrix” as a function of customer preference, history and product properties
  – Want to identify drivers of satisfaction (levers)
  – Calculate the effect (ROI) of marketing actions (pulling levers)
  Mostly relies on primary research collected specifically for this study:
  – Interviews with managers
  – Survey of consumer preferences
- 10. Simplified version of the paper’s business model
  [Diagram: marketing investment (costs) → pulling levers → increased equity → return on marketing investment]
  Main goals:
  – Identify relevant levers
  – Quantify their effect
- 11. Analytic setup (main components only)
  logit(p_ijk) = β0k LAST_ijk + x_ik β_k
  – p_ijk is the probability that customer i buys item k given they bought item j previously
  – LAST is a dummy variable for “inertia”
  – x_ik is a feature vector for customer i, product k
  This is used to compute the brand switching matrix {p_ijk}, and customer lifetime value is calculated as:
  CLV_ij = Σ_t PROF_ij B_ijt
  – PROF is a profit measure considering discounting, price and cost (assumed known)
  – B_ijt is the probability customer i buys product j at time t, calculated from the stochastic matrix {p_ijk}
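The CLV computation above can be sketched numerically: starting from a customer's current brand, propagate the brand distribution through the switching matrix and accumulate discounted profit. All numbers below (the matrix, profits, discount factor) are invented for illustration, not from the paper.

```python
# Hypothetical 2-brand switching matrix: P[j][k] = probability of buying
# brand k next, given brand j was bought last (rows sum to 1).
P = [[0.7, 0.3],
     [0.4, 0.6]]
profit = [10.0, 8.0]   # per-purchase profit for each brand (assumed known)
gamma = 0.9            # per-period discount factor, folded into PROF on the slide
horizon = 20

b = [1.0, 0.0]         # customer currently holds brand 0
clv = 0.0
for t in range(horizon):
    # accumulate discounted expected profit at time t
    clv += gamma ** t * sum(bj * pj for bj, pj in zip(b, profit))
    # advance the brand distribution one step: b <- b P
    b = [sum(b[j] * P[j][k] for j in range(2)) for k in range(2)]

print(round(clv, 2))
```

The same recursion extends directly to more brands and to customer-specific matrices {p_ijk} estimated from the logit model.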
- 12. Data definitions
  Potential drivers (marketing activities) are reflected in the components of x_i
  – Price
  – Quality of service, etc.
  The data to estimate the logit model is based on:
  – Expert (manager) input
  – Questionnaires of customers
  – Corporate data warehouse (not implemented in their case study...)
- 13. Results: important drivers for the airline industry?
  Driver      | Coefficient | Std error | Z score (coeff/std)
  Inertia     | .849        | .075      | 11.34
  Quality     | .441        | .041      | 10.87
  Price       | .199        | .020      | 9.86
  Convenience | .609        | .093      | 6.56
  ...
  Etc. (all factors deemed important)
- 14. What would a data miner do?
  Count more (or only) on historical data in the data warehouse
  – Variables would have different meaning
  – Identify correlations, not necessarily drivers
  Could use the same analytic formulation, but also try alternative approaches
  – Relate LTV directly to observed variables?
  – Model transaction sizes in addition to switching?
  – Use non-parametric modeling tools? Etc.
- 15. Example 2: the segmentation approach
  Common practice in marketing: define static, fixed customer segments
  – Supposed to capture the “true essence” of customers’ behaviors, needs and wants
  – Often given catchy names: “Upwardly mobile businessmen” representing the “average” profile
  Make marketing decisions at the segment level, based on understanding of needs and wants
- 16. A market segmentation methodology
  Based on Kotler (2000), Marketing Management, Prentice-Hall
  1. Survey stage: primary research to capture motivations, attitudes, behaviors
  2. Analysis stage: factor analysis, then clustering of survey data ⇒ identify segments
  3. Profiling stage: analyze segments and give them names
  An additional stage often taken is to assign all customers to the defined segments:
  4. Assignment stage: build a classification model to assign all customers to the learned segments
- 17. What would a data miner do?
  Option 1: clustering
  – Replace primary research by warehouse data
  – Cluster all customers
  – Lose the “needs and wants” aspect
  Option 2: supervised learning
  – Treat each decision problem as a separate modeling task, e.g., find “positive” and “negative” examples for each binary decision and learn a model
  – Advantage: customized
  – Disadvantages:
    • May not have the right data to model the decisions we want to make
    • Past correlations may not be indicative of future outcomes
- 18. Comparison of approaches (revisited)
  Criterion                                                                 | Marketing | DM
  Parametric models formalize knowledge of domain and problems              |     +     |  -
  Robust against incorrect assumptions about domain and problems            |     -     |  +
  Actively collect the data to estimate model quantities (active learning)  |     +     |  -
  Rely on existing, abundant data in Corporate Data Warehouses              |     -     |  +
  Integrate expert input from managers and customers (“wants and needs”)    |     +     |  -
  Use data to learn new, surprising patterns about customer behavior        |     -     |  +
- 19. An integrated approach
  Count on historical data as much as possible
  Avoid complex parametric models
  – Let the data guide us
  – Still want to integrate domain knowledge
  Analyze and understand the special aspects of marketing modeling problems
  – Importance of the long-term relationship (lifetime value, loyalty)
  – Effects of competition (customer wallet vs. customer spending)
  Modify existing, or develop new, data analytics approaches to address these problems properly
- 20. Tutorial outline
  Challenges of marketing analytics
  – Integrating the “marketing” and “data mining” approaches
  – Customizing data mining approaches to the challenges of marketing decision support
  Survey of some useful ML methodologies
  – Bayesian network modeling
  – Utility-based classification (cost-sensitive learning)
  – Reinforcement learning and Markov decision processes
  Detailed analysis and case studies:
  – Customer lifetime value modeling
  – Customer wallet estimation
- 21. Moving beyond revenue modeling
  To really understand the profitability and potential of our customers, we need to move beyond modeling their short-term revenue contribution
  Revenue over time: lifetime value modeling
  – How much can we expect to gain from a customer over time?
  – Incorporates loyalty/churn, prediction of future customer revenue
  – LTV = ∫_t S(t) v(t) D(t) dt, where S(t) is the customer survival function, v(t) the customer value over time, and D(t) the discounting factor
  Potential revenue: customer wallet estimation
  – How much revenue could we be generating from this customer?
  – Incorporates competition, brand switching, etc.
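The LTV integral above can be evaluated numerically once concrete S(t), v(t) and D(t) are chosen. As a minimal sketch (the exponential survival, constant value and continuous discount rate below are illustrative assumptions, not from the slides), trapezoidal integration recovers the known closed form v / (λ + r):

```python
import math

# Illustrative assumptions: exponential survival with churn rate lam,
# constant value rate v, continuous discounting at rate r.
lam, v, r = 0.2, 100.0, 0.05

def S(t): return math.exp(-lam * t)   # survival function S(t)
def val(t): return v                  # customer value over time v(t)
def D(t): return math.exp(-r * t)     # discounting factor D(t)

# Trapezoidal integration of S(t) v(t) D(t) over [0, T]; the tail past
# T = 200 is negligible for these parameters.
T, n = 200.0, 20000
h = T / n
f = lambda t: S(t) * val(t) * D(t)
ltv = h * (0.5 * f(0) + sum(f(i * h) for i in range(1, n)) + 0.5 * f(T))
print(round(ltv, 2))  # closed form here is v / (lam + r) = 400.0
```

With empirically estimated survival and value curves, the same quadrature applies; only the three functions change.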
- 22. LTV and Wallet: beyond standard modeling
  [Diagram: tasks arranged by Time (now → future) and Revenue (actual sales → potential sales). Sales/revenue modeling covers actual sales now; sales forecasting covers actual sales next year; wallet estimation covers potential sales now; LTV modeling covers future sales.]
- 23. Types of decision support
  Passive decision support
  – Understand more about problems and causes
  – Identify areas of need, under-performance, etc.
  – Help in making better decisions
  Active decision support
  – Model the effect of actions
  – Actively help in deciding between alternative actions
  Active decision support is typically more challenging in terms of the data needed to learn models
- 24. Depth and actionability of insights
  [Diagram: insights arranged by depth (basic concepts → real insight) and actionability (passive/correlation → active/causality). Revenue modeling and revenue forecasting are basic and passive; lever identification is basic and active; wallet estimation/attainment and LTV modeling are deeper; understanding the effect of potential actions on LTV and Wallet is both deep and active.]
- 25. The causality challenge
  Predictive models discover correlation
  – Example: linear regression. A significant t-statistic for a coefficient implies it has a significant association with the response, not that it is actually causing the response
  For active decision support we need to identify levers to pull to affect the outcome
  – This only works with causality
  Causality is difficult to find or prove from observational data
  – If we have knowledge about causality, we can formalize it as (say) a Bayesian network and use it in our models
  – We can get closer to causality with case-control experiments
- 26. Illustration: predictive power is not causality
  Assume we observe for some companies: X = company’s marketing budget, Y = company’s sales, and we want to understand how to affect Y by controlling X
  Assume we find that X is very “predictive” of Y. Possible scenarios:
  – X → Y: causality ⇒ successfully identified a “lever”
  – Y → X: a fixed percent of revenue goes to marketing?
  – X ← Z → Y: Z = company size, independently determining both quantities?
- 27. Some other challenges
  Modeling the effects of new/unobserved actions
  – Critical for active support, often difficult or impossible
  – Even established actions may have been applied in a different context than our planned campaign
  Integrating expert knowledge into the process
  – Can be done formally via graphical models
  Handling data issues: matching, leaks, cleaning
  – Always critical
  Delivering solutions and results
- 28. Example: telecom churn management
  A cell phone company has a set of customers; some leave (churn) every month
  The goals of a churn management system:
  Analyze the process of churn
  – Causes
  – Dynamics
  – Effects on the company
  Design policies and actions to improve the situation
  – Marketing campaigns
  – Incentive allocation (offer new features or presents)
  – Changes in plans to contend with competition
- 29. First step: understand the current situation
  Who is likely to churn (predictive patterns)?
  – Phone features / plans
  – Usage patterns
  – Demographics
  Tools: segmentation, classification, etc.
  Which of these patterns are causal? Tools: expert knowledge, Bayesian networks, etc.
  Which causal effects are not in the data? Competition, the economy, etc.
  Which of these customers are profitable?
  – Short term: customer value
  – Long term: lifetime value
  – Growth potential: customer wallet
- 30. Second step: design actions
  Can we affect causal churn patterns?
  – For example, by improving customer service
  Given possible incentives and marketing actions, what effect will they have on:
  – Loyalty and relationship
  – Current customer value and wallet attainment
  – Customer lifetime value
  – Cost to the company
  How can we optimize the use of our marketing resources?
  – Identify segments we want to retain
  – Identify effective marketing actions
- 31. Tutorial outline
  Challenges of marketing analytics
  – Integrating the “marketing” and “data mining” approaches
  – Customizing data mining approaches to the challenges of marketing decision support
  Survey of some useful ML methodologies
  – Bayesian network modeling
  – Utility-based classification (cost-sensitive and active learning)
  – Reinforcement learning and Markov decision processes
  Detailed analysis and case studies:
  – Customer lifetime value modeling
  – Customer wallet estimation
- 32. Survey of useful methodologies
  Bayesian networks
  – Motivation: need to address the causality vs. correlation issue; need to formalize domain knowledge about relationships in the data
  – Example domain: customer wallet estimation
  Utility-based classification* (cost-sensitive learning)
  – Motivation: need to handle the utility of decisions and the cost of data acquisition in marketing decision problems
  – Example domains: targeted marketing, brand-switch modeling
  Markov decision processes (MDP) and reinforcement learning
  – Motivation: need to consider long-term profit maximization
  – Example domain: customer lifetime value modeling
  *Cf. the Utility-Based Data Mining Workshops at KDD’05 and KDD’06
- 33. Bayesian network, a.k.a. graphical model
  A Bayesian network is a directed acyclic graphical model and defines a probability model. A simple example:
  Graph: Economy → Marketing, Economy → Competition, Marketing → Revenue, Competition → Revenue
  P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C)
  Conditional probability tables:
  P(E) = 0.3
  P(M|E):  E=F: 0.3;  E=T: 0.9
  P(C|E):  E=F: 0.4;  E=T: 0.7
  P(R|M,C):  M,C = F,F: 0.3;  T,F: 0.9;  F,T: 0.2;  T,T: 0.6
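The factorization on this slide can be made concrete with a few lines of code: multiply the CPT entries to get the joint, then do inference by enumeration. The CPT values are the ones given on the slide; the query P(R=T | M=T) is an illustrative choice.

```python
from itertools import product

# CPTs from the slide (each variable is True/False)
P_E = {True: 0.3, False: 0.7}
P_M_given_E = {True: 0.9, False: 0.3}   # P(M=T | E)
P_C_given_E = {True: 0.7, False: 0.4}   # P(C=T | E)
P_R_given_MC = {(True, True): 0.6, (True, False): 0.9,
                (False, True): 0.2, (False, False): 0.3}  # P(R=T | M, C)

def joint(e, m, c, r):
    """P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C)."""
    p = P_E[e]
    p *= P_M_given_E[e] if m else 1 - P_M_given_E[e]
    p *= P_C_given_E[e] if c else 1 - P_C_given_E[e]
    p *= P_R_given_MC[(m, c)] if r else 1 - P_R_given_MC[(m, c)]
    return p

# Sanity check: the joint sums to 1 over all 16 assignments
total = sum(joint(*v) for v in product([True, False], repeat=4))

# Inference by enumeration: P(R=T | M=T), marginalizing out E and C
num = sum(joint(e, True, c, True) for e, c in product([True, False], repeat=2))
den = sum(joint(e, True, c, r) for e, c, r in product([True, False], repeat=3))
print(round(total, 6), round(num / den, 4))
```

Enumeration is exponential in the number of variables, which is why the later slides discuss tractable sub-classes; for a four-node network it is instant.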
- 34. Bayesian network as a general unifying framework
  The Bayesian network provides a general framework that subsumes numerous known classes of probabilistic models, e.g.
  – Naïve Bayes classification
  – Clustering (mixture models)
  – Autoregressive models
  – Hidden Markov models, etc.
  It also provides a framework for discussing modeling, inference, causality, hidden variables, etc.
  [Diagrams: naïve Bayes classification (class → variables 1..N), clustering/mixture models (unobserved class → variables 1..N), hidden Markov model (unobserved states → symbols)]
- 35. Estimation and inference problems for Bayesian networks
  Parameter estimation from data, given structure
  – Given a graphical structure and a model class as input, estimate the parameters of the models
  Inference, given a model
  – Given a full Bayesian network (i.e., graph and model parameters) and partial information on the realized values, infer the unknown values
  – Useful for business scenario analyses
  Latent variable estimation, given structure
  – Given a full Bayesian network and data for the observed variables, infer the values of the unobserved (latent) variables
  Bayesian network structure learning from data
  – Given data only, infer the best Bayesian network, including both the graphical structure and the model parameters
  Inferring causal structure from data
  – Given data only, infer not only the underlying Bayesian network but also the causality between variables
- 36. Parameter estimation (for linear Gaussian models)
  Parameter estimation given the graph structure reduces to a standard estimation problem (e.g., maximum likelihood estimation) for the underlying model class
  For example, for linear Gaussian models it is solvable by linear regression:
  – P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C)
  – M = α1 E + ε1, ε1 ~ N(0, σ1)
  – C = α2 E + ε2, ε2 ~ N(0, σ2)
  – R = α3 M + α4 C + ε3, ε3 ~ N(0, σ3)
  There is active research on many other underlying model classes
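A minimal sketch of the reduction to linear regression: generate data from the three structural equations above with known (invented) coefficients, then recover each coefficient by ordinary least squares on that node's parents. The true coefficient values and noise levels are assumptions for the demo.

```python
import random

random.seed(0)
a1, a2, a3, a4 = 0.8, -0.5, 1.2, 0.6   # true structural coefficients (illustrative)
n = 5000

# Simulate the linear Gaussian network E -> M, E -> C, (M, C) -> R
E = [random.gauss(0, 1) for _ in range(n)]
M = [a1 * e + random.gauss(0, 0.1) for e in E]
C = [a2 * e + random.gauss(0, 0.1) for e in E]
R = [a3 * m + a4 * c + random.gauss(0, 0.1) for m, c in zip(M, C)]

# OLS for M = a1*E (single regressor, no intercept): a1_hat = <E,M> / <E,E>
a1_hat = sum(e * m for e, m in zip(E, M)) / sum(e * e for e in E)

# OLS for R = a3*M + a4*C: solve the 2x2 normal equations directly
Smm = sum(m * m for m in M); Scc = sum(c * c for c in C)
Smc = sum(m * c for m, c in zip(M, C))
Smr = sum(m * r for m, r in zip(M, R)); Scr = sum(c * r for c, r in zip(C, R))
det = Smm * Scc - Smc * Smc
a3_hat = (Scc * Smr - Smc * Scr) / det
a4_hat = (Smm * Scr - Smc * Smr) / det
print(round(a1_hat, 2), round(a3_hat, 2), round(a4_hat, 2))
```

Each node's parameters are estimated from its parents alone, which is exactly why the graph structure decomposes the global estimation problem into small regressions.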
- 37. Inference in a given model
  Given an estimated model and realized values for a subset of the variables, it may be possible to compute the most likely values of the unknown variables.
  Inference in unrestricted Bayesian networks is intractable (#P-hard)
  For restricted classes, inference can be performed efficiently, e.g. dependency trees:
  – P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M)
  – P(M|E,C,R) = α P(M|E,C) P(R|M) = α P(M|E) P(R|M)
  – Simplified due to the conditional independence (d-separation) between M and C implied by the graph structure
  But there are considerable challenges for graph structures containing undirected cycles (e.g., the original graph for P(M,E,C,R))
- 38. Structure learning given data
  For unrestricted classes, structure learning is known to be intractable
  – Even for the class of poly-trees, robust estimation (i.e., when the true distribution may not be in the target class) is NP-hard
  For some restricted classes, structure learning is efficient
  – Dependency trees can be efficiently and robustly learned
  – Poly-trees can be efficiently learned, given the assumption that the true model is in the target class
  [Diagrams: a dependency tree and a poly-tree over Economy, Marketing, Competition and Revenue]
  There is active research on proving conditions under which variable selection methods in regression (e.g., Lasso) can provably learn the structure of general graphs
- 39. Inferring causality from data
  Case M ⊥ R | E: the causal structure cannot be determined from the data!
  – Marketing ← Economy → Revenue: P(M,E,R) = P(E) P(M|E) P(R|E)
  – Marketing → Economy → Revenue: P(M,E,R) = P(M) P(E|M) P(R|E)
  – Revenue → Economy → Marketing: P(M,E,R) = P(R) P(E|R) P(M|E)
  All three graphs imply the same independence, so they are indistinguishable from observational data.
  Case M ⊥ C (collider): the causal structure can be determined from the data!
  – Marketing → Revenue ← Competition: P(M,C,R) = P(M) P(C) P(R|M,C)
  – It can be inferred that Marketing can be a “lever” for controlling Revenue!
  Cf. P. Spirtes, C. Glymour, and R. Scheines (2000)
- 40. Summary: estimation and inference with Bayesian networks
  Parameter estimation from data, given structure
  – Efficiently solvable for many model classes
  Inference, given a model
  – Exact inference is known to be intractable for the sub-class containing undirected cycles
  – Efficiently solvable for tree structures and many models used in practice
  Latent variable estimation, given structure
  – Locally optimal estimation is often possible via EM algorithms
  Bayesian network structure learning from data
  – Known to be “intractable” for general classes
  – Even robustly estimating poly-trees is NP-complete
  Inferring causal structure from data
  – Sometimes possible, but in general not
  Given these facts, determining the network structure using domain knowledge, then using it for parameter estimation and inference, is common practice
- 41. Tutorial outline
  Challenges of marketing analytics
  – Integrating the “marketing” and “data mining” approaches
  – Customizing data mining approaches to the challenges of marketing decision support
  Survey of some useful ML methodologies
  – Bayesian network modeling
  – Utility-based classification (cost-sensitive learning)
  – Reinforcement learning and Markov decision processes
  Detailed analysis and case studies:
  – Customer lifetime value modeling
  – Customer wallet estimation
- 42. Cost-sensitive learning for marketing decision support
  The use of basic machine learning (e.g., classification and regression) in marketing decision support is well accepted
  – Example applications include targeted marketing, credit rating, and others
  – But are they the best we have to offer?
  Regression is an inherently harder problem than is required
  – One does not necessarily need to predict the business outcome, customer behavior, etc., but is merely required to make business decisions
  – Regression may fail to detect significant patterns, especially when the data is noisy
  Classification is an over-simplification
  – By mapping to classification, one loses the information on the degree of goodness/badness of a business decision in the past data
  Cost-sensitive classification provides the desired middle ground
  – It simplifies the problem almost to classification and thus allows discovery of significant patterns
  – Yet it retains and exploits the information on the degree of goodness of business decisions, in a way that is motivated by utility theory
- 43. Cost-sensitive learning, a.k.a. utility-based classification
  In regression: given (x, r) ∈ X × R generated from a sampling distribution, find F such that
  – F(x) ≈ r
  – E.g., r = profit obtained by targeting customer x
  In classification: given (x, y) ∈ X × {0,1} generated from a sampling distribution, find F such that
  – F(x) ≈ y
  – E.g., y = 1 if customer x is “good”, 0 otherwise
  In utility-based classification: given a (stochastic) utility function U and (x, y) ∈ X × {0,1} generated from a sampling distribution, find F such that
  – E[U(x, y, F(x))] is maximized (or, equivalently, E[C(x, y, F(x))] is minimized)
  – E.g., U(x, 1, 1) = Profit(x) = profit obtained by targeting customer x, when x is indeed a “good” customer
- 44. Example cost and utility functions
  Simple formulation (cost/benefit matrix):
  Classification utility matrix:
    True \ Predicted |  0  |  1
    0                |  1  |  0
    1                |  0  |  1
  Misclassification cost matrix:
    True \ Predicted |  0  |  1
    0                |  0  |  1
    1                |  1  |  0
  More realistic formulation (utility/cost dependent on individuals):
  “Targeted marketing” utility:
    True \ Predicted | bad | good
    bad              |  0  | -C
    good             |  0  | Profit - C
  “Credit rating” utility:
    True \ Predicted | bad | good
    bad              |  0  | -Default Amt
    good             |  0  | Interest
- 45. Bayesian approach with regression
  For each example x, choose the class that minimizes the expected cost:
  i*(x) = argmin_i Σ_j P(j|x) C(x, i, j)
  (P(j|x) needs to be estimated!)
  Problem: requires conditional density estimation and regression to solve a classification problem
  – The price is high computational and sample complexity
  Merit: more flexibility and general applicability
  – Business constraints
  – Variability in fixed costs
  – But is it necessary?
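The decision rule i*(x) = argmin_i Σ_j P(j|x) C(x, i, j) is easy to sketch for the targeted-marketing cost matrix from the previous slide. The mailing cost and profit figures are invented for illustration; P(j|x) would come from an estimated model in practice.

```python
# Hypothetical targeted-marketing setup: j = 1 means the customer responds,
# i = 1 means we mail them. C(x, i, j) = cost of deciding i when truth is j.
mail_cost, profit = 2.0, 50.0

def cost(i, j):
    if i == 0:
        return 0.0                           # no mailing: no cost, no profit
    return mail_cost - (profit if j == 1 else 0.0)

def best_action(p_respond):
    """argmin_i sum_j P(j|x) C(x, i, j), given an estimate of P(j=1|x)."""
    probs = {0: 1 - p_respond, 1: p_respond}
    exp_cost = {i: sum(probs[j] * cost(i, j) for j in (0, 1)) for i in (0, 1)}
    return min(exp_cost, key=exp_cost.get)

# Mailing pays exactly when p * profit > mail_cost, i.e. p > 0.04 here
print(best_action(0.01), best_action(0.10))
```

The rule reduces to a probability threshold here, which is why the next slide asks whether full conditional density estimation is really necessary.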
- 46. A classification approach: reducing cost-sensitive learning to weighted classification via “Costing” [ZLA’03]
  If Y is {0,1}, then minimizing cost is equivalent to minimizing
  E_{X×Y}[ I(h(x) ≠ y) w(x, y) ], where w(x, y) = C(x, 1-y) - C(x, y)
  Even though we have a 2 × 2 cost matrix, its minimization can be done using one weight per labeled example
  Given a distributional assumption on w(x, y), minimizing the above weighted error on the training data will generalize to unseen test instances!
  The “Costing” algorithm repeatedly performs weighted rejection sampling with w(x, y) to obtain an ensemble of hypotheses
  Similar approaches have been applied to class probability estimation (“probing” [LZ’05]) and quantile regression (“quanting” [LOZ’06])
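A minimal sketch of the rejection-sampling step: each example is kept with probability proportional to its weight w(x, y), and an unweighted learner is trained on each resampled set. Everything here is invented for illustration (a 1-D Gaussian toy dataset, a midpoint-threshold "learner", targeted-marketing costs of $2 mailing / $50 profit); a real base learner would replace learn_threshold.

```python
import random

random.seed(1)

def w(x, y):
    """Per-example weight w(x, y) = |C(x, 1-y) - C(x, y)| for the toy costs:
    mailing costs 2, and a responder (y = 1) yields profit 50 if mailed."""
    C = lambda pred: (2.0 - 50.0 * y) if pred == 1 else 0.0
    return abs(C(1 - y) - C(y))

# Toy data: 90 non-responders around 0, 10 responders around 1
data = [(random.gauss(y, 1.0), y) for y in [0] * 90 + [1] * 10]
wmax = max(w(x, y) for x, y in data)

def rejection_sample(S):
    """Keep (x, y) with probability w(x, y) / wmax (Costing's resampling step)."""
    return [(x, y) for x, y in S if random.random() < w(x, y) / wmax]

def learn_threshold(S):
    # Toy base learner: threshold at the midpoint of the class means;
    # fall back to the full data if a class was entirely rejected.
    m0 = [x for x, y in S if y == 0] or [x for x, y in data if y == 0]
    m1 = [x for x, y in S if y == 1] or [x for x, y in data if y == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2

# Ensemble of hypotheses from repeated rejection sampling, majority-voted
ensemble = [learn_threshold(rejection_sample(data)) for _ in range(11)]
predict = lambda x: int(sum(x > t for t in ensemble) > len(ensemble) / 2)
print(predict(-5.0), predict(5.0))
```

Because responders carry weight 48 versus 2 for non-responders, the resampled sets are far more balanced than the raw data, which is the point of the reduction.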
- 47. Empirical evaluation with targeted marketing data sets
  Test set profits. Methods compared: Costing (rejection sampling, 200 rounds), Transparent Box (feeding weights), Resampling (100k), and no weighting.
  KDD-98 (charity donation):
  Method     | Costing | Transparent Box | Resampling (100k) | No weight
  NB         | $13163  | $12367          | $12026            | $0.24
  Boosted NB | $14714  | $14489          | $13135            | -$1.36
  C4.5       | $15016  | -$118           | $2259             | $0
  SVMLight   | $13152  | $13683          | $12808            | $0
  DMEF-2 (targeted marketing):
  Method     | Costing | Transparent Box | Resampling (100k) | No weight
  NB         | $37629  | $32608          | $12026            | $16462
  Boosted NB | $37891  | $36381          | $13135            | $121
  C4.5       | $37500  | $478            | $2259             | $0
  SVMLight   | $35290  | $36443          | $12808            | $0
  C4.5 with Costing shows exceptional performance
  *Costing is state-of-the-art, but is restricted to 2-class problems
- 48. A closer look: “Costing” (cost-based bagging) [ZLA’03]
  Costing(Learner A, Data S, count T):
  (1) For all (x, y) ∈ S set w_{x,y} = C(x, 1-y) - C(x, y)   (the same weight in every iteration)
  (2) For t = 1 to T do: let h_t = A(S, w)
  (3) Output H: H(x) = argmax_y Σ_t 1[h_t(x) = y]
  It only makes sense for 2-class problems
- 49. A multi-class extension: the cost-sensitive boosting algorithm [AZL 2004]
  Define the “expanded sample” S' as: S' = {(x, y) | ∃ y' such that (x, y') ∈ S, y ∈ Y}
  GBSE(Learner A, Expanded data S', count T):
  (1) For all (x, y) ∈ S' initialize H_0(x, y) = 1 / |Y|
  (2) For all (x, y) ∈ S' initialize weight w_{x,y} = E_{y'~H_0}[C(x, y')] - C(x, y)   (the difference between the average cost under the current ensemble and the cost of y)
  (3) For t = 1 to T do:
    (a) For all (x, y) in S' update weight w_{x,y} = E_{y'~H_{t-1}}[C(x, y')] - C(x, y)   (the weight is updated in each iteration)
    (b) Let T_t = {((x, y), I(w_{x,y} > 0)) | (x, y) ∈ S'}
    (c) Let h_t = A(T_t, |w|)
    (d) f_t = Stochastic(h_t)
    (e) F_t = (1 - α) F_{t-1} + α f_t
  (4) Output h(x) = argmax_y Σ_t α_t · h_t(x, y)
- 50. Gradient boosting with stochastic ensembles: illustration
  The difference between the current average cost and the cost associated with a particular label, E[C(x,y)] - C(x,y), is the boosting weight
  The sign of the weight is the training label
  [Diagram: per-label costs C(x,y) against the average cost E[C(x,y)], with the resulting +/- training labels, shown at learning iterations t and t+1]
- 51. Cost-sensitive boosting outperforms existing methods of cost-sensitive learning, as well as classification and regression
  Average test set cost (± SE); Bagging, AvgCost and MetaCost are the existing methods:
  Data Set  | Bagging  | AvgCost | MetaCost | GBSE
  Annealing | 1059±174 | 127±12  | 207±42   | 34±4
  Solar     | 5403±397 | 237±38  | 5317±390 | 48±10
  KDD-99    | 319±42   | 42±8    | 49±9     | 2±1
  Letter    | 151±3    | 92±1    | 130±2    | 85±2
  Splice    | 64±5     | 61±4    | 50±3     | 58±4
  Satellite | 190±10   | 108±6   | 104±6    | 93±6
- 52. Tutorial outline
  Challenges of marketing analytics
  – Integrating the “marketing” and “data mining” approaches
  – Customizing data mining approaches to the challenges of marketing decision support
  Survey of some useful ML methodologies
  – Utility-based classification (cost-sensitive learning)
  – Reinforcement learning and Markov decision processes
  – Bayesian network modeling
  Detailed analysis and case studies:
  – Customer lifetime value modeling
  – Customer wallet estimation
- 53. Sequential cost-sensitive decision making by reinforcement learning
  Cost-sensitive classification provides an adequate framework for single marketing decisions
  – Real-world marketing decisions are rarely made in isolation; they are made sequentially
  – Need to address the sequential dependency in decision making
  Cost-sensitive classification maximizes E[U(x, h(x))]
  We now wish to maximize Σ_t E[U(x_t, h(x_t))], where x_t may depend on earlier decisions...
  This is nothing but reinforcement learning, if we view x as the “state”:
  – Maximize Σ_t E[U(s_t, π(s_t))], where s_t is determined stochastically according to a transition probability determined by s_{t-1} and π(s_{t-1})
- 54. Review: Markov decision process (MDP)
  At any given time t, the agent is in some state s. It takes an action a and makes a transition to the next state s', dictated by the transition probability T(s, a)
  It then receives a “reward”, or utility, U(s, a), which also depends on the state s and action a
  The goal of a reinforcement learner in an MDP is to learn a policy π: S → A, mapping states to actions, so as to maximize the cumulative discounted reward:
  R = Σ_{t=0}^∞ γ^t · U(s_t, a_t)
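The discounted-reward sum above is just a geometric weighting of per-step utilities. A tiny worked example, with an invented reward trajectory:

```python
# Cumulative discounted reward R = sum_t gamma^t * U(s_t, a_t)
gamma = 0.9
rewards = [5.0, 0.0, 10.0, 2.0]   # hypothetical U(s_t, a_t) along one trajectory

R = sum(gamma ** t * u for t, u in enumerate(rewards))
print(round(R, 3))  # 5 + 0 + 0.81*10 + 0.729*2 = 14.558
```

Discounting (γ < 1) keeps the infinite-horizon sum finite and encodes that near-term profit is worth more than distant profit, which matters for lifetime value.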
- 55. MDP and reinforcement learning provide an advanced framework for modeling customer lifetime value
  Modeling the CRM process using a Markov decision process (MDP):
  – The customer is in some “state” (his/her attributes) at any point in time
  – The retailer’s action will move the customer into another state
  – The retailer’s goal is to take a sequence of actions guiding the customer’s path so as to maximize the customer’s lifetime value
  Reinforcement learning produces optimized targeting rules of the form: if the customer is in state “s”, then take marketing action “a”
  – The customer state “s” is represented by the current customer attribute vector; the learner estimates LTV(s, a), and the best policy is to choose a to maximize LTV(s, a)
  [Diagram: a typical CRM process in which campaigns move customers between states such as one-timer, repeater, bargain hunter, potentially valuable, loyal, valuable and defector]
- 56. MDP enables genuine lifetime value modeling, in contrast to existing approaches that use observed lifetime value
  Observed lifetime value reflects only the customer’s lifetime value attained under the current marketing policy, and therefore fails to capture their potential lifetime value
  MDP-based lifetime value modeling allows modeling of lifetime value based on the optimized marketing policy (= the output of the system!)
  – The estimated (potential) lifetime value will be based on the optimal path
  – The output policy will lead the customer through that same path
  [Diagram: customer A’s path through the customer states under the current marketing policy vs. the optimized marketing policy]
- 57. And here is how this is possible...
  The MDP enables the use of data for many customers in various stages (states) to determine the potential lifetime value of a particular customer in a particular state
  Reinforcement learning can estimate the lifetime value (function) without explicitly estimating the MDP itself
  – The key lies in the value iteration procedure based on “Bellman’s equation”:
  Q(s, a) = E[U(s, a)] + γ · max_{a'} Q(s', a')
  (LTV of a state = reward now + LTV of the best next state)
  Each rule is, in effect, trained with data corresponding to all subsequent states
- 58. IBM Research: Reinforcement learning methods with function approximation. Value iteration (based on the Bellman equation) provides the basis for classic reinforcement learning methods like Q-learning: Q_0(s, a) = E[U(s, a)]; Q_{k+1}(s, a) = E[U(s, a)] + γ ⋅ max_a' Q_k(s', a'); π(s) = argmax_a Q_∞(s, a). Batch Q-learning (with function approximation) solves value iteration as iterative regression problems: Q_0(s, a) ← U(s, a); Q_{k+1}(s, a) ← (1 - α) Q_k(s, a) + α (U(s, a) + γ ⋅ max_a' Q_k(s', a')), where the right-hand side is estimated using function approximation (regression).
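The value-iteration recursion above can be sketched on a toy tabular MDP. Everything below (the three customer states, the two actions, the rewards and the transition probabilities) is invented for illustration, not taken from the slides:

```python
import numpy as np

# Toy MDP: 3 customer states x 2 actions (e.g. no-mail / mail), gamma = 0.9.
n_states, n_actions, gamma = 3, 2, 0.9
# U[s, a]: expected immediate profit of action a in state s (illustrative).
U = np.array([[0.0, 1.0],
              [2.0, 0.5],
              [5.0, 4.0]])
# P[a][s, s']: transition probabilities under each action (illustrative).
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],   # action 1
])

Q = U.copy()                       # Q_0(s, a) = E[U(s, a)]
for _ in range(200):               # value iteration on Bellman's equation
    V = Q.max(axis=1)              # max_a' Q_k(s', a')
    Q = U + gamma * np.array([P[a] @ V for a in range(n_actions)]).T

policy = Q.argmax(axis=1)          # pi(s) = argmax_a Q(s, a)
ltv = Q.max(axis=1)                # estimated lifetime value of each state
```

In the batch setting of the slide, the exact expectation over P is replaced by a regression fit to observed (s, a, reward, s') records; the backup step is otherwise the same.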
- 59. IBM Research: Lifetime value modeling based on reinforcement learning can achieve greater long-term profits than the traditional approach. The graph plots profits per campaign obtained in monthly campaigns over 2 years, in an empirical evaluation using benchmark data (the KDD Cup 98 data). [Chart annotation: the output policy of the MDP approach (CCOM) "invests" in the initial campaigns to yield greater long-term profits.]
- 60. IBM Research: Tutorial outline. Challenges of marketing analytics: integrating the "marketing" and "data mining" approaches; customizing data mining approaches to the challenges of marketing decision support. Survey of some useful ML methodologies: utility-based classification (cost-sensitive and active learning); reinforcement learning and Markov decision processes; Bayesian network modeling. Detailed analysis and case studies: customer lifetime value modeling; customer wallet estimation.
- 61. IBM Research: Lifetime value modeling and Cross-Channel Optimized Marketing (CCOM). CCOM optimizes targeted marketing across multiple channels for lifetime value maximization, combining scalable data mining and reinforcement learning methods to realize this unique capability. [Figure: revenue flows across the channels Web, Kiosk, Direct Mail, Call Center and Store.]
- 62. IBM Research: CCOM pilot project with Saks Fifth Avenue. Business problem addressed: optimizing direct mailing to maximize lifetime revenue at the store (and other channels). This provided a solution for the "cross-channel challenge": there is no explicit link between marketing actions in one channel and revenue in another. The CCOM mailing policy was shown to achieve a 7-8% increase in expected revenue in the store (in laboratory experiments)! [Figure: CCOM-pilot business problem; direct-mail actions driving store revenue.]
- 63. IBM Research: Some example features, with their correlations to the action and reward variables.

| Feature | Description | action | reward |
|---|---|---|---|
| *Demographic features* | | | |
| FULL_LINE_STORE_OF_RES | A full-line store exists in the area | 0.018 | 0.004 |
| NON_FL_STORE_OF_RES | A non-full-line store exists in the area | 0.012 | -0.004 |
| *Transaction features (divisions relevant to current campaign)* | | | |
| CUR_DIV_PURCHASE_AMT_1M | Purchase amount in last month in current division | 0.065 | 0.090 |
| CUR_DIV_PURCHASE_AMT_2_3M | Purchase amount 2-3 months ago in current division | 0.099 | 0.080 |
| CUR_DIV_PURCHASE_AMT_4_6M | Purchase amount 4-6 months ago in current division | 0.133 | 0.091 |
| CUR_DIV_PURCHASE_AMT_1Y | Purchase amount in last year in current division | 0.162 | 0.128 |
| CUR_DIV_PURCHASE_AMT_TOT | Total purchase amount in current division | 0.153 | 0.147 |
| *Promotion history features (divisions relevant to current campaign)* | | | |
| CUR_DIV_N_CATS_1M | Catalogs sent last month in current division | 0.294 | 0.028 |
| CUR_DIV_N_CATS_2_3M | Catalogs sent 2-3 months ago in current division | 0.260 | 0.025 |
| CUR_DIV_N_CATS_4_6M | Catalogs sent 4-6 months ago in current division | 0.158 | 0.062 |
| CUR_DIV_N_CATS_TOT | Total catalogs sent in current division to date | 0.254 | 0.062 |
| *Control variable* | | | |
| ACTION | To mail or not to mail | 1.000 | 0.008 |
| *Target (response) variable* | | | |
| REWARD | Expected cumulative profits | 0.008 | 1.000 |
- 64. IBM Research: The cross-channel challenge and solution. The challenge: there is no explicit link between actions in one channel (mailing) and rewards in another (revenue). Very low correlation is observed between actions and responses; other factors determining "lifetime value" may dominate the control variable (the marketing action) in the estimation of "expected value"; and the obtained models can be independent of the action, giving rise to useless rules! The cross-channel solution: learn the relative advantage of competing actions. [Figure: with standard function approximation, the estimated values of actions a1 and a2 barely differ within each state; the proposed method models the value gap between actions directly.]
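The "relative advantage" idea can be shown in a few lines: instead of regressing on raw values Q(s, a), regress on A(s, a) = Q(s, a) - max_a' Q(s, a'), so the large action-independent component of lifetime value cancels out. The Q values below are invented; note the values are large while the action gaps are small, which is exactly the situation the slide describes:

```python
import numpy as np

# Invented Q values: rows are states s1, s2; columns are actions a1, a2.
Q = np.array([[10.2, 10.5],
              [23.0, 22.4]])

# A(s, a) = Q(s, a) - max_a' Q(s, a'): the best action in each state
# gets advantage 0, all others get a negative relative value.
A = Q - Q.max(axis=1, keepdims=True)
best_action = A.argmax(axis=1)
```

The regression target A now carries only the action-dependent signal, which is what a targeting rule has to discriminate on.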
- 65. IBM Research: The learning method. Definition of advantage: A(s, a) := (1/Δt)(Q(s, a) - max_a' Q(s, a')). Advantage updating procedure [Baird '94], repeat: 1. Learn: 1.1. A(s, a) := (1 - α) A(s, a) + α (A_max(s) + (R(s, a) + γ^Δt V(s') - V(s)) / Δt); 1.2. use regression to estimate A(s, a); 1.3. V(s) := (1 - β) V(s) + β (V(s) + (A_max-new(s) - A_max-old(s)) / α). 2. Normalize: A(s, a) := (1 - ω) A(s, a) + ω (A(s, a) - A_max(s)). Modifications: 1. initialization with the empirical lifetime value; 2. batch learning with optional function approximation.
- 66. IBM Research: Evaluation results. Significant policy advantage was observed within a small number of learning iterations. The obtained policy had a 7-8% policy advantage, i.e., a 7-8% increase in expected revenue (for the 1.6 million customers considered). The mailing policy was constrained to mail the same number of catalogues in each campaign as last year. CCOM evaluates the sequence of models and outputs the best one. [Charts: policy advantage (%) vs. learning iterations 1-5 for two typical runs.]
- 67. IBM Research: Evaluation method. The challenge in evaluation: the new policy must be evaluated using data collected by the existing (sampling) policy. Solution: use bias-corrected estimation of the "policy advantage" from data collected by the sampling policy. Definitions: the (discrete-time) advantage is A_π(s, a) := Q_π(s, a) - max_a' Q_π(s, a'); the policy advantage is A_{s~π}(π') := E_π[E_{a~π'}[A_π(s, a)]]. Estimating the policy advantage with bias-corrected sampling: A_{s~π}(π') := E_π[(π'(a|s) / π(a|s)) A_π(s, a)].
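The bias-corrected estimator above is ordinary importance weighting. A minimal sketch, using an invented log of four records and a hypothetical deterministic new policy ("always mail"); none of the numbers come from the slides:

```python
# Each logged record: (state, action, advantage A_pi(s, a), pi(a|s)),
# where pi is the sampling policy that generated the data.
logged = [("s1", "mail", 0.40, 0.5),
          ("s1", "hold", 0.00, 0.5),
          ("s2", "mail", -0.20, 0.8),
          ("s2", "hold", 0.00, 0.2)]

def pi_new(action, state):
    """Hypothetical deterministic new policy: always mail."""
    return 1.0 if action == "mail" else 0.0

# A_{s~pi}(pi') estimated as the average of (pi'(a|s) / pi(a|s)) * A(s, a)
# over the logged data: re-weighting corrects for the sampling policy.
est = sum(pi_new(a, s) / p * adv for s, a, adv, p in logged) / len(logged)
```

A positive estimate means the new policy is expected to improve on the sampling policy, without ever deploying it; this is what makes laboratory evaluation of the CCOM policy possible.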
- 68. IBM Research: The combination of reinforcement learning (MDP) with predictive data mining enables automatic generation of trigger-based marketing targeting rules that are optimized with respect to the customer's potential lifetime value; stated in simple "if-then" style, which supports flexibility and compatibility; and refined to reference detailed customer attributes, hence well suited to event- and trigger-based marketing. This is made possible by representing the states in the MDP by customers' attribute vectors, and by combining reinforcement learning with predictive data mining to estimate lifetime value as a function of customer attributes and marketing actions. [Figure: an example marketing targeting rule output by the CCOM system.]
- 69. IBM Research: Some examples of rules output by CCOM. Avoid saturation effects; interpretation: if a customer has spent in the current division but enough catalogues have already been sent, then don't mail. Differentiate between customers who may be near saturation and those who are not; interpretation: if a customer has spent in the current division and has received moderately many relevant catalogues, then mail. Invest in a customer until it is known not to be worthwhile; interpretation: if a customer has spent significantly in the past and yet has not spent much in the current division (product group), then don't mail.
- 70. IBM Research: CCOM is generically applicable by mapping physical data to its logical data model (developed with CBO). Entities include: Customer (customer identifier, first name, last name, age, gender); Customer Profile History (customer identifier, period identifier, product category identifier, channel identifier, aggregated count of events, aggregated revenue, aggregated profit); Period (period identifier, period date, period duration); Transaction (customer identifier, transaction date, product category identifier, event identifier, channel identifier, transaction revenue, transaction profit); Customer Loyalty Level History (customer identifier, loyalty level start/end dates, loyalty level); Customer Marketing Action (event identifier, customer identifier, marketing action date, channel identifier); reference entities Product Category, Channel, Marketing Event and Event Product Category (with weight and fixed cost); and the CCOM output models, the Lifetime Value Model and the Marketing Policy Model (model identifier, model type, model). [Figure: CCOM logical data model as an entity-relationship diagram.]
- 71. IBM Research: Tutorial outline. Challenges of marketing analytics: integrating the "marketing" and "data mining" approaches; customizing data mining approaches to the challenges of marketing decision support. Survey of some useful ML methodologies: Bayesian network modeling; utility-based classification (cost-sensitive learning); reinforcement learning and Markov decision processes. Detailed analysis and case studies: customer lifetime value modeling; customer wallet estimation.
- 72. IBM Research: Wallet estimation case study outline. Introduction: business motivation and different wallet definitions. Modeling approaches for conditional quantile estimation: local and global models; empirical evaluation. A graphical model approach to wallet estimation: a generic algorithm for a class of latent variable modeling problems. MAP (Market Alignment Program): description of the application and its goals; the interview process and the feedback loop; evaluation of wallet model performance in MAP.
- 73. IBM Research: What is wallet (AKA opportunity)? The total amount of money a company can spend on a certain category of products. [Figure: nested circles, IBM Sales inside IT Wallet inside Company Revenue.] IBM sales ≤ IT wallet ≤ company revenue.
- 74. IBM Research: Why are we interested in wallet? Customer targeting (OnTarget): focus on acquiring customers with high wallet; evaluate customers' growth potential by combining wallet estimates and sales history; for existing customers, focus on high-wallet, low share-of-wallet customers. Sales force management (MAP): make resource assignment decisions, concentrating resources where the wallet is untapped; evaluate the success of sales personnel and sales channels by the share of wallet they attain.
- 75. IBM Research: The wallet modeling problem. Given: customer firmographics x (from D&B), such as industry, employee number and company type; customer revenue r; IBM relationship variables z (historical sales by product); and IBM sales s. Goal: model the customer wallet w, then use it to "predict" present/future wallets. There is no direct training data on w, and no information about its distribution!
- 76. IBM Research: Historical approaches within IBM. Top down, the approach used by IBM Market Intelligence in North America (called ITEM): use econometric models to assign total "opportunity" to a segment (e.g., industry × geography), then assign it to companies in the segment proportionally to their size (e.g., D&B employee counts). Bottom up: learn a model for individual companies, getting "true" wallet values through surveys or appropriate data repositories (which exist, e.g., for credit cards). There are many issues with both approaches (we won't go into detail); we would like a predictive approach from raw data.
- 77. IBM Research: Relevant work in the literature. While wallet (or share of wallet) is widely recognized as important, there is not much work on estimating it. Du, Kamakura and Mela (2006) developed a "list augmentation" approach, using survey data to model spending with competitors. Epsilon Data Management, in a 2001 white paper, proposed a survey-based methodology. Zadrozny, Costa and Kamakura (2005) compared bottom-up and top-down approaches on IBM data; their evaluation is based on a survey.
- 78. IBM Research: Traditional approaches to model evaluation. Evaluate models based on surveys: cost and reliability issues. Evaluate models based on high-level performance indicators: do the wallet numbers sum up to numbers that "make sense" at the segment level (e.g., compared to macro-economic models)? Does the distribution of differences between predicted wallet and actual IBM sales and/or company revenue make sense; in particular, are the percentages we expect bigger/smaller? The problem: no observation-level evaluation.
- 79. IBM Research: Proposed hierarchical IT wallet definitions. TOTAL: the customer's total available IT budget; probably not the quantity we want (IBM cannot sell all of it). SERVED: total customer spending on IT products covered by IBM; share of wallet is the portion of this number spent with IBM. REALISTIC: IBM sales to the "best similar customers"; this can be concretely defined as a high percentile of P(IBM revenue | customer attributes), and fits the typical definition of opportunity. REALISTIC ≤ SERVED ≤ TOTAL. [Figure: nested circles, REALISTIC inside SERVED inside TOTAL.]
- 80. IBM Research: An approach to estimating SERVED wallets. [Graphical model: company firmographics → SERVED wallet → {historical relationship with IBM, IT spend with IBM}.] The wallet is unobserved; all other variables are observed. The two families of variables, firmographics and IBM relationship, are conditionally independent given the wallet. We develop inference procedures and demonstrate them. Theoretically attractive, practically questionable (we will come back to this later).
- 81. IBM Research: REALISTIC wallet, a percentile of the conditional distribution of IBM sales to the customer given customer attributes: s | r, x, z ~ f_{θ,r,x,z}. E.g., under the standard linear regression assumption, s | x, r, z ~ N(αx + βr + γz, σ²). What we are looking for is the pth percentile of this distribution, rather than the conditional mean E(s | r, x, z). [Figure: conditional density with E(s | r, x, z) and the REALISTIC percentile marked.]
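Under the Gaussian assumption on this slide, the pth percentile has a closed form: the conditional mean plus z_p · σ, where z_p is the standard normal quantile. A quick check (μ and σ are arbitrary stand-ins for αx + βr + γz and the residual scale); for p = 0.9 the multiplier is the familiar 1.28σ used as a baseline later in the deck:

```python
from statistics import NormalDist

# Illustrative conditional mean and standard deviation for one customer.
mu, sigma, p = 100.0, 20.0, 0.9

z_p = NormalDist().inv_cdf(p)      # standard normal p-quantile
realistic = mu + z_p * sigma       # pth percentile of N(mu, sigma^2)
```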
- 82. IBM Research: Estimating conditional distributions and quantiles. Assume for now that we know which percentile p we are looking for. First observe that modeling the complete conditional distribution P(s | r, x, z) well is sufficient: given a good parametric model and distribution assumptions, it can also be used to estimate quantiles, e.g., linear regression under a linear model with homoskedastic i.i.d. Gaussian errors. Practically, however, it may not be a good idea to count on such assumptions, especially not a Gaussian model, because of statistical robustness considerations.
- 83. IBM Research: Modeling the REALISTIC wallet directly. REALISTIC defines the wallet as the pth percentile of the conditional distribution of spending given customer attributes, which implies that some (1 - p)% of customers are spending their full wallet with IBM. Two obvious ways to get at the pth percentile: estimate the conditional distribution by integrating over a neighborhood of similar customers, and take the pth percentile of spending in that neighborhood; or create a global model for the pth percentile by building global regression models.
- 84. IBM Research: Local models, k-nearest neighbors. Design a distance metric over companies, e.g.: same industry; similar employees/revenue; similar IBM relationship. The neighborhood size k has a significant effect on prediction quality. Prediction: a quantile of the IBM sales of the firms in the neighborhood. [Figure: universe of IBM customers with D&B information, plotted by industry, employees and IBM spend; the wallet estimate is a quantile of the IBM sales histogram within the target company's neighborhood.]
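A minimal Q-kNN sketch along these lines; the distance metric, the companies and the sales figures are all invented for illustration:

```python
import math

def qknn_wallet(target, companies, k=3, p=0.8):
    """Wallet estimate for `target` = (employees, revenue):
    the p-th quantile of IBM sales over its k nearest neighbors.
    `companies` is a list of (employees, revenue, ibm_sales) tuples."""
    def dist(c):
        # toy Euclidean distance on the two size attributes
        return math.hypot(c[0] - target[0], c[1] - target[1])
    nbrs = sorted(companies, key=dist)[:k]
    sales = sorted(c[2] for c in nbrs)
    # simple lower empirical quantile of the neighborhood sales
    idx = min(int(p * k), k - 1)
    return sales[idx]

companies = [(100, 50, 10), (120, 55, 30), (90, 45, 20), (500, 300, 200)]
est = qknn_wallet((110, 52), companies, k=3, p=0.8)
```

In practice the slide's metric also conditions on industry and the IBM relationship, and the choice of k matters considerably; the sketch only shows the quantile-of-neighborhood mechanic.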
- 85. IBM Research: Global estimation, the quantile loss function. Our REALISTIC wallet definition calls for estimating the pth quantile of P(s | x, z). Can we devise a loss function which correctly estimates the quantile on average? Answer: yes, the quantile loss function for quantile p: L_p(y, ŷ) = p ⋅ (y - ŷ) if y > ŷ, and (1 - p) ⋅ (ŷ - y) otherwise. This loss function is optimized in expectation when we correctly predict REALISTIC: argmin_ŷ E(L_p(y, ŷ) | x) = the pth quantile of P(y | x).
- 86. IBM Research: Some quantile loss functions. [Chart: L_p as a function of the residual (observed - predicted), for p = 0.8 (asymmetric "pinball" shape) and p = 0.5 (proportional to absolute loss).]
- 87. IBM Research: Quantile regression. Squared loss regression estimates the conditional expected value by minimizing the sum of squares: min_β Σ_{i=1..n} (s_i - f(z_i, x_i, β))². Quantile regression instead minimizes the quantile loss: min_β Σ_{i=1..n} L_p(s_i, f(z_i, x_i, β)), with L_p(y, ŷ) = p ⋅ (y - ŷ) if y > ŷ and (1 - p) ⋅ (ŷ - y) otherwise. Implementation: assume a linear function in some representation, ŷ = βᵗ f(x, z); the solution can be found using linear programming. A linear quantile regression package is available in R (Koenker, 2001).
- 88. IBM Research: Quantile regression tree, local or global? Motivation: identify a locally optimal definition of neighborhood; inherently nonlinear. Adjustments of M5/CART for quantile prediction: predict the quantile rather than the mean at each leaf; empirically, the splitting/pruning criteria do not require adjustment. [Figure: example tree with splits on Industry = 'Banking', Sales < 100K and IBM Rev 2003 > 10K, and a quantile-based wallet estimate from the IBM sales histogram at each leaf.]
- 89. IBM Research: Aside, log-scale modeling of monetary quantities. Due to the exponential, very long-tailed typical distribution of monetary quantities (like sales and wallet), it is typically impossible to model them on the original scale, because, e.g., the biggest companies dominate modeling and evaluation, and any implicit homoskedasticity assumption in using a fixed loss function is invalid. Log scale is often statistically appropriate, for example if the % change is likely to be "homoskedastic". The major issue: models are ultimately judged in dollars, not log-dollars…
- 90. IBM Research: Empirical evaluation, quantile loss. Setup: four domains with relevant quantile modeling problems (direct mailing, housing prices, income data, IBM sales); performance is measured on a test set in terms of the 0.9th quantile loss. Approaches: linear quantile regression, Q-kNN, quantile trees, bagged quantile trees, and quanting (Langford et al. 2006, which reduces quantile estimation to averaged classification using trees). Baselines: the best constant model, and traditional regression models for expected values, adjusted under a Gaussian assumption (+1.28σ).
- 91. IBM Research: Performance on quantile loss. Conclusions: standard regression models are not competitive; if there is a time-lagged variable, linear quantile regression is best; otherwise, bagged quantile trees (and quanting) perform best; Q-kNN is not competitive.
- 92. IBM Research: Residuals for quantile regression. Total positive holdout residuals: 90.05% (18009/20000), in line with targeting the 0.9th quantile. [Chart: residual distribution on the holdout set.]
- 93. IBM Research: Graphical model for SERVED(?) wallet estimation. [Graphical model: customer's firmographics (X) → customer's IT wallet (W) → customer's spending with IBM (S), with the customer's relationship with IBM (Z) also feeding into S. X (view 1) and Z (view 2) form two conditionally independent views!]
- 94. IBM Research: Generic problem setting. An unsupervised learning scenario: the target variable is unobserved; we have observations on multiple predictor variables; and domain knowledge suggests that the predictors form multiple conditionally independent views. Goal: predict the target variable.
- 95. IBM Research: Summary of results on the generic problem. Analysis of a relevant class of latent variable models: the Markov blanket can be split into conditionally independent views; for exponential linear models, maximum likelihood estimation reduces to a convex optimization problem. Solution approaches for Gaussian likelihoods: reduction to a single linear least squares regression; ANOVA for testing the conditional independence assumptions. Empirical evaluation: comparable to supervised learning with a significant amount of training data; a case study on wallet estimation.
- 96. IBM Research: Discriminative maximum likelihood inference. Given: a directed graphical model and the parametric form of the conditional distributions of nodes given their parents. Goal: predict the target W using the parameter estimates that are most likely given the observed data and the graphical model: Θ* = argmax_Θ log P_{D,Θ}(S | X, Z) = argmax_Θ log ∫ P_{θ0}(w | X) P_{θ1}(S | w, Z) dw, where Θ = (θ0, θ1) is the parameter vector for the parametric conditional likelihoods and D is our data. Solution: the Expectation-Maximization (EM) algorithm, which converges to a local optimum in general. Estimating W: the mean or mode of the "posterior" P_{Θ*}(W | X, Z).
- 97. IBM Research: General theoretical result, exponential models. Theorem: when the conditional distributions p(W | X) and p(S | W, Z) correspond to exponential linear models with matching link functions, the incomplete discriminative log-likelihood L_D(Θ) = log P_{D,Θ}(S | X, Z) is a concave function of the parameters. Therefore maximum likelihood estimation reduces to a convex optimization problem, and the EM algorithm converges to the globally optimal solution.
- 98. IBM Research: Gaussian likelihoods and linear regression. Assume both conditional likelihoods P(W | X) and P(S | W, Z) are linear and Gaussian: w_i - αᵗx_i = ε_i^w ~ N(0, σ_w²) i.i.d., and s_i - w_i - βᵗz_i = ε_i^s ~ N(0, σ_s²) i.i.d. The previous theorem says that EM would give the ML solution Θ_MLE = (α_MLE, β_MLE). But if we add the equations up, we eliminate W: s_i - αᵗx_i - βᵗz_i = (ε_i^s + ε_i^w) ~ N(0, σ_s² + σ_w²) i.i.d. The maximum likelihood solution of this problem is linear regression, giving the solution Θ_LS = (α_LS, β_LS). Are the two solutions the same?
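The elimination trick is easy to check by simulation: generate data from the two Gaussian equations with made-up parameters and verify that ordinary least squares of s on [x, z] recovers (α, β), as the equivalence result on the next slide suggests. All parameter values and noise scales below are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
alpha, beta = 1.5, -0.7                        # illustrative true parameters

x = rng.normal(size=n)                         # firmographics (1-d for simplicity)
z = rng.normal(size=n)                         # IBM relationship (1-d)
w = alpha * x + rng.normal(0, 0.3, n)          # latent wallet: w = alpha*x + eps_w
s = w + beta * z + rng.normal(0, 0.3, n)       # observed sales: s = w + beta*z + eps_s

# Adding the equations eliminates w: s = alpha*x + beta*z + (eps_w + eps_s),
# so OLS on the combined design U = [x, z] should recover (alpha, beta).
U = np.column_stack([x, z])
coef, *_ = np.linalg.lstsq(U, s, rcond=None)
```

The latent wallet itself can then be backed out per customer as ŵ_i = s_i - β̂ᵗz_i (up to the ε_s noise), which is what makes the reduction practically useful.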
- 99. IBM Research: Equivalence and interpretation. Equivalence theorem: when U = [X, Z] is a full column rank matrix, the two estimates are identical: Θ_MLE = Θ_LS. This implies consistency of Θ_LS and unbiasedness of the resulting W estimates, and it means we can make use of linear regression computation and inference tools, in particular ANOVA to test the validity of the assumptions. Some caveats we glossed over: in particular, the full rank requirement implies that we cannot have an intercept in both Gaussian likelihoods!
- 100. IBM Research: ANOVA for testing the independence assumptions. ANOVA: variance-based analysis for determining the goodness of fit of nested linear models. Example of nested models: Model A, a linear model with only the variables in X and Z and no interactions; Model B, allowing interactions only within X and within Z; Model C, allowing interactions between variables in X and Z. Key idea: if Model C is statistically superior to Model B, then the conditional independence and/or parametric assumptions are rejected.
- 101. IBM Research: Some simulation results. [Charts: simulation results.]
- 102. IBM Research: Wallet case study results. Modeling equations (monetary values on log scale): log(w_i) = f_α(x_i) + c_w + ε_i^w, ε_i^w ~ N(0, σ²); log(s_i) - log(w_i) = g_β(z_i) + c_s + ε_i^s, ε_i^s ~ N(0, σ²), where c_w and c_s are intercepts and f_α, g_β are parametric forms. The data is 2000 IBM customers in the finance sector. The ANOVA results are consistent with conditional independence. [Table: ANOVA comparison of the nested models.]
- 103. IBM Research: Market Alignment Project (MAP), background. MAP objective: optimize the allocation of the sales force; focus on customers with growth potential; set evaluation baselines for sales personnel. MAP components: a web interface with customer information; an analytical component producing wallet estimates; workshops with sales personnel to review and correct the wallet predictions; and a shift of resources towards customers with a lower wallet share.
- 104. IBM Research: The MAP tool captures expert feedback from the client-facing teams. The MAP interview process integrates and aligns coverages: transaction data and D&B data are integrated; the wallet models produce predicted opportunity; the MAP interview team and the client-facing unit (CFU) teams review it through the web interface (insight delivery and capture), yielding expert-validated opportunity; post-processing then drives resource assignments. The objective here is to use the expert feedback (i.e., validated revenue opportunity) from last year's workshops to evaluate our latest opportunity models.
- 105. IBM Research: MAP workshops overview. The 2005 opportunity was calculated using a naive Q-kNN approach. In the 2005 MAP workshops, opportunity was displayed by brand, and the expert could accept or alter it. We select 3 brands for evaluation (DB2, Rational, Tivoli), build ~100 models for each brand using different approaches, and compare the expert opportunity to the model predictions. Error measures: absolute, squared; scales: original, log, root.
- 106. IBM Research: Initial Q-kNN model used. Distance metric: identical industry; Euclidean distance on size (revenue or employees). Neighborhood size: 20. Prediction: the median of the non-zero neighbors (alternatives: max, percentile). Post-processing: floor the prediction by the max of the last 3 years' revenue. [Figure: neighborhood of the target company in the universe of IBM customers with D&B information, plotted by industry, employees and revenue.]
- 107. IBM Research: Expert feedback (log scale) vs. the original model (DB2). [Scatter plot: expert feedback vs. model opportunity (MODEL_OPPTY), both on log scale. Experts accept the opportunity for 45% of customers; change it for 40% (a 17% increase, a 23% decrease); and reduce it to 0 for 15%.]
