Predicting Customer Behavior with R: Part 1
An introduction to modeling non-contractual customer purchasing with the BTYD package in R

Transcript

  • 1. Predicting Customer Behavior with R: Part 1 Matthew Baggott, Ph.D. University of Chicago
  • 2. Goal for Today's Workshop
    • Use R and the BTYD package to make a Pareto/Negative Binomial Distribution (Pareto/NBD) model of customer purchasing
    • Understand our assumptions and how we could refine them
  • 3. Goal for Today's Workshop: from this (a raw transaction log, excerpted below) to a fitted model of purchasing (shown as a plot on the original slide)
    cust  date        sales
    1     1997-01-01  29.33
    1     1997-01-18  29.73
    1     1997-08-02  14.96
    1     1997-12-12  26.48
    2     1997-01-01  63.34
    2     1997-01-13  11.77
    3     1997-01-01   6.79
    4     1997-01-01  13.97
    5     1997-01-01  23.94
    6     1997-01-01  35.99
    6     1997-01-11  32.99
    6     1997-06-23  91.92
    6     1997-07-22  47.08
    6     1997-07-26  71.96
    6     1997-10-25  78.47
    6     1997-12-06  83.47
    6     1998-01-18  84.46
  • 4.
    • Tutorial assumes working knowledge of R (but feel free to ask questions)
    • Main R packages used: BTYD, plyr, ggplot2, reshape2, lubridate
    • The BTYD vignette covers some of the same ground
    • The R script to carry out today's analysis is at: gist.github.com/mattbaggott/5113177
  • 5. Why Model?
    • Help separate:
      – active customers,
      – inactive customers who should be re-engaged, and
      – unprofitable customers
    • Forecast future business profits and needs
  • 6. Annual Customer 'Defection' Rates are High
    Industry                          Defection Rate
    Internet service providers        22%
    U.S. long distance (telephone)    30%
    German mobile telephone market    25%
    Clothing catalogs                 25%
    Residential tree and lawn care    32%
    Newspaper subscriptions           66%
    (Griffin and Lowenstein 2001)
  • 7. Why Not Model?
    • Wübben & Wangenheim (2008) found simple rules of thumb often beat basic models
    • Simple calculations can be faster and clearer, for example:
      Long Term Value = (Avg Monthly Revenue per Customer * Gross Margin per Customer) / Monthly Churn Rate
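    (A minimal worked example of that formula in R, a sketch with made-up numbers rather than values from the deck:)
    avg.monthly.revenue <- 50    # hypothetical average monthly revenue per customer ($)
    gross.margin        <- 0.30  # hypothetical gross margin per customer
    monthly.churn       <- 0.05  # hypothetical monthly churn rate
    (ltv <- avg.monthly.revenue * gross.margin / monthly.churn)  # 300: long-term value per customer ($)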
  • 8. But it's All Models
    • Our choice is not model vs. no model
    • Our choice is formal, scalable models vs. informal, manual models
    • We can and should compare, refine, & combine simple rules and complex models
  • 9. RFM Family of Models
    • Models use three variables:
      – Recency of purchases
      – Frequency of purchases
      – Monetary value of purchases
    • Used for non-contractual purchasing
    • Data needed: dates and amounts of purchases for individual customers
  • 10. Simple RFM model of Purchasing
    1. A probabilistic purchasing process for active customers, modeled as a Poisson process with rate λ
    2. A probabilistic dropout process of active customers becoming inactive, modeled as an exponential distribution with dropout rate γ
  • 11. Simple RFM model of Purchasing
    3. Purchasing rates follow a gamma distribution across customers, with shape and scale parameters r and α
    4. Dropout rates follow a gamma distribution across customers, with shape and scale parameters s and β
    5. The transaction rate λ and the dropout rate γ vary independently across customers
    6. Customers are considered in isolation (no indirect value, no influencing each other)
  • 12. Purchasing as a Poisson process
    • Single parameter indicating the constant probability of some event
    • Each event is independent: one does not make another one more or less likely (Are these realistic?)
    • Other Poisson processes: e-mail arrival, radioactive decay, wars per year
    (Figure on slide: frequency of war, from Hayes 2002.)
  • 13. Dropout rates
    • Latent variable: without subscriptions, dropout is not directly observed
    • 'Right censored' (we don't know the future)
    • Fancy survival / hazard models are possible (such as Cox regression)
    • Here, we use a simple exponential distribution with a constant dropout rate γ > 0: f(t) = γe^(-γt)
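    (A quick illustrative calculation, assuming a made-up dropout rate: under f(t) = γe^(-γt), the probability a customer is still active at week t is e^(-γt).)
    gamma.rate <- 0.05                 # hypothetical dropout rate per week
    exp(-gamma.rate * 26)              # P(still active after 26 weeks), about 0.27
    1 - pexp(26, rate = gamma.rate)    # the same survival probability via the exponential CDF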
  • 14. Gamma distributions
    • Family of continuous probability distributions with two parameters, shape and scale/rate
    • Often used as a mixing distribution over rate parameters, as we do here for the Poisson purchase rate and the exponential dropout rate
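    (To make assumptions 1-6 concrete, here is a small generative simulation sketch, not part of the original slides; the shape/rate values roughly echo the parameters estimated later from the CDNOW data, but the simulation itself is only illustrative.)
    set.seed(42)
    n.cust <- 1000
    T.obs  <- 52                                         # weeks of observation
    lambda <- rgamma(n.cust, shape = 0.55, rate = 10.6)  # heterogeneous purchase rates
    mu     <- rgamma(n.cust, shape = 0.63, rate = 12.2)  # heterogeneous dropout rates
    tau    <- rexp(n.cust, rate = mu)                    # unobserved lifetime of each customer
    active <- pmin(tau, T.obs)                           # time active within the observation window
    x      <- rpois(n.cust, lambda * active)             # repeat purchases made while active
    mean(x)                                              # average number of repeat purchases
    mean(tau > T.obs)                                    # share of customers still 'alive' at week 52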
  • 15. Model is of repeat customers
    • Customers are only customers after they make their first purchase
    • Frequency is not defined for the first purchase
    • We will change the purchase event log into a repeat-purchase event log with dc.SplitUpElogForRepeatTrans() or as part of dc.ElogToCbsCbt()
  • 16. CDNOW Data set
    • We will use data from the online retailer CDNOW, included in the BTYD package
    • 10% of the cohort of customers who made their first transactions in the first quarter of 1997
    • 6919 purchases by 2357 customers over a 78-week period
    • Not too big; we won't need to wait long
  • 17. Install/load packages
    InstallCandidates <- c("ggplot2", "BTYD", "reshape2", "plyr", "lubridate")
    # check if pkgs are already present
    toInstall <- InstallCandidates[!InstallCandidates %in% library()$results[,1]]
    if (length(toInstall) != 0) {
      install.packages(toInstall, repos = "http://cran.r-project.org")
    }
    # load pkgs
    lapply(InstallCandidates, library, character.only = TRUE)
  • 18. Load data
    cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD")
    elog <- read.csv(cdnowElog)   # read data
    head(elog)                    # take a look
    elog <- elog[, c(2, 3, 5)]    # we need these columns
    names(elog) <- c("cust", "date", "sales")  # model funcs expect these names
    # format date
    elog$date <- as.Date(as.character(elog$date), format = "%Y%m%d")
  • 19. Aggregate by cust, dates
    • Our model is concerned with inter-purchase intervals
    • We only have dates (without times), and there may be multiple purchases on a day
    • We merge all transactions that occurred on the same day:
      elog <- dc.MergeTransactionsOnSameDate(elog)
  • 20. Plot data
    ggplot(elog, aes(x = date, y = sales, group = cust)) +
      geom_line(alpha = 0.1) +
      scale_x_date() +
      scale_y_log10() +
      ggtitle("Sales for individual customers") +
      ylab("Sales ($, US)") + xlab("") +
      theme_minimal()
    # (Ugly plot, but it could have revealed data issues.)
  • 21. A more useful plot
    purchaseFreq <- ddply(elog, .(cust), summarize,
                          daysBetween = as.numeric(diff(date)))
    windows()  # opens a new plot device on Windows; use quartz()/x11() on other platforms
    ggplot(purchaseFreq, aes(x = daysBetween)) +
      geom_histogram(fill = "orange") +
      xlab("Time between purchases (days)") +
      theme_minimal()
  • 22. Divide data into train and test
    (end.of.cal.period <- min(elog$date) +
       as.numeric((max(elog$date) - min(elog$date)) / 2))
    # split data into train (calibration) and test (holdout) and make matrices
    data <- dc.ElogToCbsCbt(elog, per = "week",
                            T.cal = end.of.cal.period,
                            merge.same.date = TRUE,  # already did this
                            statistic = "freq")      # which CBT to return
    # take a look
    str(data)
  • 23. > str(data)
    List of 3
     $ cal      : List of 2                                  # calibration-period matrices
      ..$ cbs: num [1:2357, 1:3] 2 1 0 0 0 7 1 0 2 0 ...
       .. ..- dimnames: customer ids; cols "x", "t.x", "T.cal"
      ..$ cbt: num [1:2357, 1:266] 0 0 0 0 0 0 0 0 0 0 ...
       .. ..- dimnames: customer ids; daily dates "1997-01-08", "1997-01-09", ...
     $ holdout  : List of 2                                  # holdout-period matrices
      ..$ cbt: num [1:2357, 1:272] 0 0 0 0 0 0 0 0 0 0 ...
       .. ..- dimnames: customer ids; daily dates "1997-10-01", "1997-10-02", ...
      ..$ cbs: num [1:2357, 1:2] 1 0 0 0 0 8 0 2 2 0 ...
       .. ..- dimnames: customer ids; cols "x.star", "T.star"
     $ cust.data: data.frame of 2357 obs. of 5 variables     # customer info
      ..$ cust       : int  1 2 3 4 5 6 7 8 9 10 ...
      ..$ birth.per  : Date, format: "1997-01-01" ...
      ..$ first.sales: num  29.33 63.34 6.79 13.97 23.94 ...
      ..$ last.date  : Date, format: "1997-08-02" ...
      ..$ last.sales : num  14.96 11.77 6.79 13.97 23.94 ...
  • 24. Extract cbs matrix
    • cbs is short for "customer-by-sufficient-statistic" matrix, the sufficient statistics being:
      – frequency (x)
      – recency (t.x, time of last transaction), and
      – total time observed (T.cal)
    cal2.cbs <- as.matrix(data[[1]][[1]])  # first item in the list, first item within it
    str(cal2.cbs)
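    (A rough by-hand check of what those columns mean for one customer; this is not the BTYD code path, it assumes weekly units and the end.of.cal.period defined on slide 22, and it may differ from the package's output in small rounding details.)
    calElog <- elog[elog$date <= end.of.cal.period, ]  # calibration-period transactions
    one     <- calElog[calElog$cust == 1, ]            # one customer's purchases
    first   <- min(one$date)                           # date of first purchase
    x.hand     <- nrow(one) - 1                                 # frequency: number of repeat purchases
    t.x.hand   <- as.numeric(max(one$date) - first) / 7         # recency: weeks from first to last purchase
    T.cal.hand <- as.numeric(end.of.cal.period - first) / 7     # weeks observed during calibration
    c(x = x.hand, t.x = t.x.hand, T.cal = T.cal.hand)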
  • 25. Estimate parameters for model
    • Purchase shape and scale params: r and α
    • Dropout shape and scale params: s and β
    # initial estimate
    (params2 <- pnbd.EstimateParameters(cal2.cbs))
    # 0.5528797 10.5838911 0.6250764 12.2011828
    # look at the log likelihood
    (LL <- pnbd.cbs.LL(params2, cal2.cbs))
    # -9598.711
  • 26. Estimate parameters for model
    # make a series of estimates, see if they converge
    p.matrix <- c(params2, LL)
    for (i in 1:20) {
      params2 <- pnbd.EstimateParameters(cal2.cbs, params2)
      LL <- pnbd.cbs.LL(params2, cal2.cbs)
      p.matrix.row <- c(params2, LL)
      p.matrix <- rbind(p.matrix, p.matrix.row)
    }
    # examine
    p.matrix
    # use the final set of values
    (params2 <- p.matrix[dim(p.matrix)[1], 1:4])
  • 27. Plot iso-likelihood for param pairs
    # parameter names for a more descriptive result
    param.names <- c("r", "alpha", "s", "beta")
    LL <- pnbd.cbs.LL(params2, cal2.cbs)
    dc.PlotLogLikelihoodContours(pnbd.cbs.LL, params2,
                                 cal.cbs = cal2.cbs,
                                 n.divs = 5,
                                 num.contour.lines = 7,
                                 zoom.percent = 0.3,
                                 allow.neg.params = FALSE,
                                 param.names = param.names)
  • 28. Plot iso-likelihood for param pairs
    (Figure: six log-likelihood contour panels, one per parameter pair: r and alpha, r and s, r and beta, alpha and s, alpha and beta, s and beta.)
  • 29. Plot population estimates
    # par to make two plots side by side
    par(mfrow = c(1, 2))
    # plot the estimated distribution of customers' propensities to purchase
    pnbd.PlotTransactionRateHeterogeneity(params2, lim = NULL)  # lim is upper xlim
    # plot the estimated distribution of customers' propensities to drop out
    pnbd.PlotDropoutRateHeterogeneity(params2)
    # set par back to normal
    par(mfrow = c(1, 1))
  • 30. Plot population estimates
    (Figure: two density panels. "Heterogeneity in Transaction Rate" with mean 0.0522 and variance 0.0049, and "Heterogeneity in Dropout Rate" with mean 0.0512 and variance 0.0042.)
  • 31. Examine individual predictions
    # predicted num. of transactions a new customer will make in 52 weeks
    pnbd.Expectation(params2, t = 52)
    # expected characteristics for customer 1516,
    # conditional on their purchasing during calibration
    cal2.cbs["1516", ]
    x     <- cal2.cbs["1516", "x"]      # x is frequency
    t.x   <- cal2.cbs["1516", "t.x"]    # t.x is time of last purchase
    T.cal <- cal2.cbs["1516", "T.cal"]  # T.cal is total time observed
    # estimate their transactions in a T.star duration
    pnbd.ConditionalExpectedTransactions(params2,
                                         T.star = 52,  # weeks
                                         x, t.x, T.cal)
    # [1] 25.24912
  • 32. Probability a customer is 'alive'
    x    # frequency of purchases
    t.x  # week of last purchase
    T.cal <- 39  # week of end of calibration, i.e. the 'present'
    pnbd.PAlive(params2, x, t.x, T.cal)
    # To visualize the distribution of P(alive) across customers:
    params3 <- pnbd.EstimateParameters(cal2.cbs)
    p.alives <- pnbd.PAlive(params3,
                            cal2.cbs[, "x"],
                            cal2.cbs[, "t.x"],
                            cal2.cbs[, "T.cal"])
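    (A small follow-up, not in the original deck: tabulating P(alive) by cutoff is a quick way to size 'probably active' versus 'probably lost' segments; the 0.25/0.5/0.75 cutoffs are arbitrary choices.)
    table(cut(p.alives, breaks = c(0, 0.25, 0.5, 0.75, 1), include.lowest = TRUE))
    # customers in the lowest band are candidates for re-engagement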
  • 33. Plot P(Alive)
    ggplot(as.data.frame(p.alives), aes(x = p.alives)) +
      geom_histogram(colour = "grey", fill = "orange") +
      ylab("Number of Customers") +
      xlab("Probability Customer is Live") +
      theme_minimal()
    (Figure: histogram of the number of customers by probability the customer is live.)
  • 34. Plot Observed, Model Transactions
    # plot actual & expected customers binned by number of repeat transactions
    pnbd.PlotFrequencyInCalibration(params2, cal2.cbs, censor = 10,
                                    title = "Model vs. Reality during Calibration")
    (Figure: "Model vs. Reality during Calibration", actual vs. model customer counts for 0 through 10+ calibration-period transactions.)
  • 35. Compare calibration to holdout
    • Note of caution: potential overfitting
      – Our gamma distributions are based on the specific customers we had during calibration.
      – How would our parameters and predictions change with different customers? (A rough check is sketched below.)
      – We will address this in Part 2.
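    (A rough sketch of one way to probe this, not from the original slides: refit the Pareto/NBD parameters on bootstrap resamples of the calibration customers and see how much they move; Part 2 treats this properly. Refitting 20 times is slow but tolerable at this data size.)
    set.seed(1)
    boot.params <- t(replicate(20, {
      idx <- sample(nrow(cal2.cbs), replace = TRUE)               # resample customers with replacement
      pnbd.EstimateParameters(cal2.cbs[idx, c("x", "t.x", "T.cal")])
    }))
    colnames(boot.params) <- c("r", "alpha", "s", "beta")
    apply(boot.params, 2, quantile, probs = c(0.05, 0.5, 0.95))   # rough stability intervals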
  • 36. Get holdout results, duration
    # get holdout transactions from the data list, add in as x.star
    x.star <- data[[2]][[2]][, 1]
    cal2.cbs <- cbind(cal2.cbs, x.star)
    str(cal2.cbs)
    holdoutdates  <- attributes(data[[2]][[1]])[[2]][[2]]
    holdoutlength <- round(as.numeric(max(as.Date(holdoutdates)) -
                                      min(as.Date(holdoutdates))) / 7)
  • 37. Plot frequency comparison
    # plot predicted vs. observed conditional frequencies
    T.star <- holdoutlength
    censor <- 10  # bin all order numbers here and above
    comp <- pnbd.PlotFreqVsConditionalExpectedFrequency(params2, T.star,
                                                        cal2.cbs, x.star, censor)
    (Figure: "Conditional Expectation", actual vs. model holdout-period transactions for 0 through 10+ calibration-period transactions.)
  • 38. Examine accompanying matrix
    • Bin sizes for that plot can be seen in the comp matrix:
    rownames(comp) <- c("act", "exp", "bin")
    comp
    #        freq.0     freq.1    freq.2   freq.3   freq.4   freq.5
    # act  0.2367116  0.6970387  1.392523 1.560000 2.532258 2.947368
    # exp  0.1367795  0.5921279  1.181825 1.693969 2.372472 2.876888
    # bin  1411       439        214      100      62       38
    #        freq.6    freq.7    freq.8   freq.9   freq.10+
    # act  3.862069  4.913043  3.714286  8.400000  7.793103
    # exp  3.776675  4.167163  5.698026  5.487862  8.369321
    # bin  29        23        7         5         29
  • 39. Compare Weekly transactions
    # get data without the first transaction: removes those who buy only once
    removedFirst.elog <- dc.SplitUpElogForRepeatTrans(elog)$repeat.trans.elog
    removedFirst.cbt  <- dc.CreateFreqCBT(removedFirst.elog)
    # get all data, so we have customers who buy only once
    allCust.cbt <- dc.CreateFreqCBT(elog)
    # add one-time customers into the matrix
    tot.cbt <- dc.MergeCustomers(data.correct = allCust.cbt,
                                 data.to.correct = removedFirst.cbt)
    lengthInDays <- as.numeric(max(as.Date(colnames(tot.cbt))) -
                               min(as.Date(colnames(tot.cbt))))
    origin <- min(as.Date(colnames(tot.cbt)))
  • 40. Compare Weekly transactions
    tot.cbt.df <- melt(tot.cbt, varnames = c("cust", "date"), value.name = "Freq")
    tot.cbt.df$date <- as.Date(tot.cbt.df$date)
    tot.cbt.df$week <- as.numeric(1 + floor((tot.cbt.df$date - origin + 1) / 7))
    transactByDay  <- ddply(tot.cbt.df, .(date), summarize, sum(Freq))
    transactByWeek <- ddply(tot.cbt.df, .(week), summarize, sum(Freq))
    names(transactByWeek) <- c("week", "Transactions")
    names(transactByDay)  <- c("date", "Transactions")
    T.cal <- cal2.cbs[, "T.cal"]
    T.tot <- 78  # end of holdout
    comparisonByWeek <- pnbd.PlotTrackingInc(params2, T.cal, T.tot,
                          actual.inc.tracking.data = transactByWeek$Transactions)
  • 41. Compare Weekly transactions
  • 42. Formal Measures of Accuracy
    # root mean squared error
    rmse <- function(est, act) { return(sqrt(mean((est - act)^2))) }
    # mean squared logarithmic error
    msle <- function(est, act) { return(mean((log1p(est) - log1p(act))^2)) }
    # model-predicted holdout transactions for each customer
    Predict <- pnbd.ConditionalExpectedTransactions(params2,
                 T.star = 38,  # weeks
                 x = cal2.cbs[, "x"],
                 t.x = cal2.cbs[, "t.x"],
                 T.cal = cal2.cbs[, "T.cal"])
    cal2.cbs[, "x.star"]  # actual transactions for each person
    rmse(act = cal2.cbs[, "x.star"], est = Predict)  # object name must match its definition above
    msle(act = cal2.cbs[, "x.star"], est = Predict)
    Measures are not really meaningful without some comparison (see the baseline sketch below).
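    (A minimal baseline sketch, not in the original slides: predict each customer's holdout transactions by scaling their calibration frequency to the same 38-week horizon. Comparing its RMSE/MSLE against the model's gives the numbers above a reference point; it assumes every customer's T.cal is positive, which holds for this cohort.)
    naive <- cal2.cbs[, "x"] * 38 / cal2.cbs[, "T.cal"]   # naive rate-scaling prediction
    rmse(act = cal2.cbs[, "x.star"], est = naive)
    msle(act = cal2.cbs[, "x.star"], est = naive)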
  • 43. Next Week:
    • Compare results to a simple model
    • Estimate expenditure / customer value
    • Use info about clumpiness of purchase patterns (as in Platzer 2008)
    • Use info about seasonality of purchasing, with the forecast package
    • Improve model predictions with machine learning techniques:
      – Cross-validation to avoid over-fitting
      – Combining model predictions
  • 44. References
    • Griffin and Lowenstein (2001). Customer Winback: How to Recapture Lost Customers and Keep Them Loyal. San Francisco: Jossey-Bass.
    • Platzer (2008). Stochastic models of noncontractual consumer relationships. Master of Science in Business Administration thesis, Vienna University of Economics and Business Administration, Austria.
    • Schmittlein, Morrison, and Colombo (1987). Counting your customers: Who are they and what will they do next? Management Science, 33(1), 1-24.
    • Wang, Gao, and Li (2010). Empirical analysis of customer behaviors in Chinese e-commerce. Journal of Networks, 5(10), 1177-1184.
    • Wübben and Wangenheim (2008). Instant customer base analysis: Managerial heuristics often "get it right". Journal of Marketing, 72(3), 82-93.
    • Zhang, Bradlow, and Small (2012). New measures of clumpiness for incidence data.
  • 45. Purchase rate often depends on type of purchase
    (Figure: 1.1 million purchases on 360buy.com, from Wang, Gao, & Li 2010.)