Predicting Lifetime Value with Hadoop
Martin Colaco, Head of Data Science | April 10, 2013

Feature Extraction for Predictive LTV Modeling using Hadoop, Hive, and Cascading - Kontagent


Description:
One of the biggest challenges for people building data products today is developing and refining features for modeling purposes (i.e., feature extraction) given the volume and variability of web-scale data. In this talk, Martin will discuss some of the challenges Kontagent faced, and the solutions it found, as it built out a predictive lifetime value model for its customers. As you will learn, Hadoop is critical to this feature extraction process, and Cascading is quite handy when building out more complex features than can be readily developed in a query framework like Hive.

Speaker:
Martin Colaco, Director of Data Science for Kontagent



  1. Predicting Lifetime Value with Hadoop
     Martin Colaco, Head of Data Science | April 10, 2013
  2. Agenda
     • What is predictive modeling
     • What is Lifetime Value (LTV)
     • What is feature extraction - challenges
     • How can we build a cohort-based predictive LTV model
       o Python
       o Hive with Hadoop
       o Cascalog with Hadoop
  3. Can we predict how many attendees tonight?
     • How to estimate? Door count (after the fact)
     • Is there a way to build a model that we can use to predict attendees?
  4. Predicting how many attendees tonight?
     Attendees = Registrations x % Attendance + Non-registrants
  5. Predicting how many attendees tonight?
     Attendees = Registrations x % Attendance + Non-registrants
     Attendees = 201 x 50% + 25 ≈ 125
     Lots of uncertainty:
     • Location
     • Date & Time
     • Company
     • Speaker
     • Title & Topic
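The attendee estimate on this slide is a one-line model, and writing it as a function makes the inputs explicit. This is a minimal sketch; the function and parameter names are illustrative and do not come from the talk.

```python
def estimate_attendees(registrations: int, attendance_rate: float, walk_ins: int) -> float:
    """Attendees = Registrations x % Attendance + Non-registrants."""
    if not 0.0 <= attendance_rate <= 1.0:
        raise ValueError("attendance_rate must be between 0 and 1")
    return registrations * attendance_rate + walk_ins


# Inputs taken from the slide: 201 registrations, 50% attendance, 25 non-registrants.
estimate = estimate_attendees(registrations=201, attendance_rate=0.50, walk_ins=25)
print(estimate)  # 125.5, which the slide rounds to 125
```

Every input here is itself a guess, which is the slide's point about uncertainty: the attendance rate alone can shift with location, date, speaker, and topic.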
  6. Predictive Modeling
     • Know the question you want to answer
     • Look at historical behavior
     • Apply understanding of those behaviors to new situations -> new groups of users
     [Diagram: Data -> Feature Extraction -> Model Selection -> Model Validation -> Fame, Success, Riches]
  7. Common use cases for predictive modeling
     My chemical engineering roots…
     In – Out = Accumulation
     [Diagram: In -> Δ -> Out]
  8. Users: Maximizing Growth
     In – Out = Accumulation (Δ = Growth)
     • System: App or Network of Apps
     • In: Paid marketing, Organic, X-promotion
     • Out: Frustration? Boredom? Too expensive? Bad UX? No new content?
  9. Money: Maximizing Profit
     In – Out = Accumulation (Δ = Profit)
     • System: App or App Network or Business
     • In: Lifetime Value (LTV)
     • Out: Business expenses: Marketing costs, Operations (servers, etc.), Employee costs
  10. How Do We Estimate LTV
      Business Model                               LTV
      Download                                     Cost per Download
      Subscription                                 Avg. Price x Avg. Customer Lifetime
      Microtransactions (Ads / In-app-purchases)   ???
  11. LTV Modeling – Social / Mobile Games
      LTV = (1 + k) * Retention * ARPU
      Formula annotations: Output, Features, Variable
      [Figure: Daily Retention Curve. % of users retained (0% to 100%) vs. days since install (0 to 8)]
      [Figure: ARPDAU Curve. ARPDAU ($0.00 to $0.10) vs. days since install (0 to 8)]
  12. Predictive LTV Result
      [Figure: Cumulative Spend (0 to 300) vs. Days Since Install (0 to 100)]
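One way to read the formula on slide 11 together with the cumulative curve on slide 12 is as a running sum: on each day since install, expected revenue per installed user is retention times ARPDAU, and LTV is the accumulated total scaled by (1 + k). The sketch below assumes that reading. The talk does not spell out the summation, and the curve values are made up for illustration.

```python
from itertools import accumulate


def cumulative_ltv(retention, arpdau, k=0.0):
    """Cumulative LTV per installed user, by days since install.

    retention[d]: fraction of the cohort still active on day d
    arpdau[d]:    average revenue per daily active user on day d
    k:            viral k-factor (the talk sets it aside, so it defaults to 0)
    """
    if len(retention) != len(arpdau):
        raise ValueError("retention and arpdau must cover the same days")
    daily_revenue = [r * a for r, a in zip(retention, arpdau)]
    return [(1.0 + k) * total for total in accumulate(daily_revenue)]


# Made-up three-day curves, not data from the talk.
retention = [1.00, 0.50, 0.25]
arpdau = [0.10, 0.08, 0.04]
ltv_curve = cumulative_ltv(retention, arpdau)
print(ltv_curve)
```

Because both input curves are estimated per cohort, any error in the retention or ARPDAU features propagates directly into the LTV curve, which is why the rest of the talk concentrates on extracting those features reliably.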
  13. Challenges with this simple LTV model
      • All of these parameters are moving targets
      • k-factor is wildly variable (we’ll ignore k-factor in this presentation)
      • Acquisition costs can change (as can LTV and retention) - Cohort LTV by install date and install source
      [Figure: ARPDAU Curve. ARPDAU ($0.00 to $0.10) vs. days since install (0 to 8)]
      [Figure: Retention Curve. % of users retained (0% to 100%) vs. days since install (0 to 8)]
  14. Challenges with this simple LTV model
      • All of these parameters are moving targets
      • k-factor is wildly variable (we’ll ignore k-factor in this presentation)
      • Acquisition costs can change (as can LTV and retention) - Cohort LTV by install date and install source
      • Retention is computationally difficult to calculate
      • Large games can have millions of users who spend money over many months/years
      How can we build out the features we need to model LTV by cohort?
  15. Kontagent Facts
      • Founded in 2007
      • 130+ employees and growing
      • 100s of Customers
      • 1000s of Apps Instrumented
      • 250+ billion events per month
      • 200MM+ MAUs
      • 1 Trillion Events in 2013
  16. How does Kontagent collect data?
      • Via a REST API
        o APA – Install message
        o EVT – Custom event message (user action)
        o MTU – Spending message
      • Yields a transaction log over time:
  17. Feature Extraction for Predictive LTV
      Need to translate a transaction log into a table
      o Install Date
      o Install Source
      o Activity Date
      o Users Active on Date
      o Users Active on Date or After
      o Spend on Date
      o Cumulative Spend to Date
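The translation from transaction log to feature table can be sketched in single-threaded Python for the per-day columns (Users Active on Date, Spend on Date). The row layout below is a simplified, hypothetical stand-in for the APA/EVT/MTU messages, with spend in cents; install source is left out to keep the example short.

```python
from collections import defaultdict
from datetime import date

# Hypothetical, simplified log rows: (message_type, user_id, day, value_in_cents).
LOG = [
    ("APA", "u1", date(2011, 7, 8), 0),
    ("APA", "u2", date(2011, 7, 8), 0),
    ("EVT", "u1", date(2011, 7, 10), 0),
    ("EVT", "u2", date(2011, 7, 10), 0),
    ("MTU", "u1", date(2011, 7, 10), 7500),
    ("EVT", "u1", date(2011, 7, 11), 0),
]


def daily_features(log):
    """Build one row per (install date, activity date) from a transaction log."""
    # Install date per user comes from APA messages (a later APA overwrites an earlier one).
    install_date = {user: day for kind, user, day, _ in log if kind == "APA"}

    active = defaultdict(set)  # (install_date, activity_date) -> users seen that day
    spend = defaultdict(int)   # (install_date, activity_date) -> cents spent that day
    for kind, user, day, cents in log:
        cohort = install_date.get(user)
        if cohort is None:
            continue  # no install message for this user, so no cohort to assign
        if kind == "EVT":
            active[(cohort, day)].add(user)
        elif kind == "MTU":
            spend[(cohort, day)] += cents

    keys = sorted(set(active) | set(spend))
    return [
        {
            "install_date": cohort,
            "activity_date": day,
            "active_users": len(active[(cohort, day)]),
            "spend": spend[(cohort, day)] / 100,  # cents to currency units
        }
        for cohort, day in keys
    ]


rows = daily_features(LOG)
for row in rows:
    print(row)
```

This holds every user's install date and every daily user set in memory, which is workable for a small app and is the caching concern that pushes larger workloads toward Hadoop.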
  18. How can we compute this table of features?
      • Python – single thread
        o Might work in some cases but need to cache potentially millions of rows of data
      • Hive with Hadoop
        o Data warehouse system that allows SQL-like querying capabilities of distributed data structures
        o Let’s work through this…
  19. Hive query
      • Store data in Hadoop (transaction log: APA, EVT, MTU)
      • Query using Hive Query Language (HiveQL)

        select distinct s
        from demo_apa
        where kt_date(utc_timestamp) = '2011-07-08' and s is not null
          and month=201107
  20. This query gets cumbersome quickly…

        select sub1.gameplay_date as play_date, sub1.returned,
               sub2.spenders, sub2.total_daily_spend
        from
          (select gp.gameplay_date, count(distinct gp.s) as returned
           from
             (select distinct s
              from demo_apa
              where kt_date(utc_timestamp) = '2011-07-08' and s is not null
                and month=201107) base
           left outer join
             (select s, kt_date(utc_timestamp) as gameplay_date
              from demo_evt
              where s is not null and month>=201107) gp on gp.s = base.s
           group by gp.gameplay_date) sub1
        join
          (select sp.spend_date, count(distinct sp.s) as spenders,
                  sum(sp.spend)/100 as total_daily_spend
           from
             (select distinct s
              from demo_apa
              where kt_date(utc_timestamp) = '2011-07-08' and s is not null
                and month=201107) base
           left outer join
             (select s, kt_date(utc_timestamp) as spend_date, v as spend
              from demo_mtu
              where s is not null and v>0 and month>=201107) sp on sp.s = base.s
           group by sp.spend_date) sub2 on sub1.gameplay_date=sub2.spend_date

      Result:

        play_date   returned  spenders  total_daily_spend
        7/10/2011   2         1         75
        7/11/2011   4         2         19
        7/12/2011   1         1         0.2
  21. Feature Extraction with HiveQL
      o Install Date
      o Install Source
      o Activity Date
      o Users Active on Date
      o Spend on Date
      o Users Active on Date or After
      o Cumulative Spend to Date
      Problem - HiveQL doesn’t support non equi-joins
      Options for improving Hive performance
      • Write tables or temp tables
      • Code up some UDFs
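The two features that strain an equi-join-only engine are the ones defined by a date inequality rather than a date match. The Python sketch below shows those conditions for a single install cohort. It assumes "Users Active on Date or After" means users whose last activity falls on or after the given date; the inputs are hypothetical, with spend in cents.

```python
from datetime import date

# Hypothetical inputs for one install cohort.
last_active = {
    "u1": date(2011, 7, 12),
    "u2": date(2011, 7, 10),
    "u3": date(2011, 7, 11),
}
spend_events = [  # (user_id, spend_date, cents)
    ("u1", date(2011, 7, 10), 7500),
    ("u3", date(2011, 7, 11), 1900),
    ("u1", date(2011, 7, 12), 20),
]
activity_dates = [date(2011, 7, 10), date(2011, 7, 11), date(2011, 7, 12)]


def cumulative_features(activity_dates, last_active, spend_events):
    """Return (activity_date, remaining_users, cumulative_cents) rows."""
    rows = []
    for day in activity_dates:
        # Inequality condition 1: last activity on or after this date.
        remaining = sum(1 for last in last_active.values() if last >= day)
        # Inequality condition 2: every spend event on or before this date.
        cumulative = sum(cents for _, spent_on, cents in spend_events if spent_on <= day)
        rows.append((day, remaining, cumulative))
    return rows


cumulative_rows = cumulative_features(activity_dates, last_active, spend_events)
for row in cumulative_rows:
    print(row)
```

Both conditions pair each activity date with a range of other rows (`>=` and `<=`), which is what a non-equi-join expresses in SQL. Without it, the work has to go into intermediate tables or UDFs, the two options the slide lists.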
  22. How can we compute this table of features?
      • Python – single thread
      • Hive with Hadoop
      • Cascalog (Cascading) with Hadoop
        o Cascading is a flow-based computational model for Hadoop
        o Cascalog is a declarative system built on Cascading
        o Let’s work through this…
  23. Cascalog Code

        ;; Map each user to the date of their install (APA) message.
        (defn user-install-dates [api-key]
          (let [apas (tap/apa-tap api-key)]
            (<- [?s ?install-date]
                (apas ?s _ _ ?install-ts)
                (ops/ts-to-date ?install-ts :> ?install-date))))

        ;; Distinct users active on each activity date, per install cohort.
        (defn active-users-by-activity-date [install-dates evts]
          (<- [?install-date ?activity-date ?active-users]
              (install-dates ?s ?install-date)
              (evts ?s _ ?ts)
              (ops/ts-to-date ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?active-users)))

        ;; Paying users and total spend on each activity date, per install cohort.
        (defn spend-by-activity-date [install-dates mtus]
          (<- [?install-date ?activity-date ?paying-users ?day-spending]
              (mtus ?s ?v _ _ _ _ ?ts)
              (install-dates ?s ?install-date)
              (ops/ts-to-date ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?paying-users)
              (c/sum ?v :> ?day-spending)))

        ;; Users active on each date or after, per install cohort.
        (defn cumulative-active-users-by-date [install-dates evts]
          (<- [?install-date ?activity-date ?remaining-users]
              (install-dates ?s ?install-date)
              (evts ?s _ ?ts)
              (ops/project-backward ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?remaining-users)))

        ;; Cumulative spend up to each date, per install cohort.
        (defn cumulative-spend-by-date [install-dates mtus]
          (<- [?install-date ?activity-date ?cumulative-spend]
              (install-dates ?s ?install-date)
              (mtus ?s ?v _ _ _ _ ?ts)
              (ops/project-forward ?ts :> ?activity-date)
              (c/sum ?v :> ?cumulative-spend)))

        ;; Join the four sub-queries on install date and activity date.
        (defn life-table [api-key]
          (let [install-dates    (user-install-dates api-key)
                evts             (tap/evt-tap api-key)
                mtus             (tap/mtu-tap api-key)
                cumulative-spend (cumulative-spend-by-date install-dates mtus)
                activity-spend   (spend-by-activity-date install-dates mtus)
                cumulative-users (cumulative-active-users-by-date install-dates evts)
                active-users     (active-users-by-activity-date install-dates evts)]
            (<- [?install-date ?activity-date ?remaining-users ?active-users
                 ?paying-users ?day-spending ?cumulative-spend]
                (cumulative-spend ?install-date ?activity-date ?cumulative-spend)
                (activity-spend ?install-date ?activity-date ?paying-users ?day-spending)
                (cumulative-users ?install-date ?activity-date ?remaining-users)
                (active-users ?install-date ?activity-date ?active-users))))
  24. Feature Extraction with Cascalog
      o Install Date
      o Install Source
      o Activity Date
      o Users Active on Date
      o Spend on Date
      o Users Active on Date or After
      o Cumulative Spend to Date
      Options for improvement
      • Code not optimized – CPU limited
  25. What have we learned
      • Martin sucks (or is awesome) at predicting number of attendees at Meetups!
      • Predictive modeling (particularly around LTV) can have a huge impact on a business
        o Requires intuition and iteration
        o In the big data world, feature extraction can be quite a huge challenge
      • Feature extraction can be done with Hadoop
        o HiveQL is nice because analysts can use it, but it can be inefficient and not generate all the features we need
        o Cascading can solve most of these problems and generate the clean features we need
  26. Questions?
      Need a job? We’re hiring: http://www.kontagent.com/company/careers/
      Martin Colaco
      Head of Data Science
      martin.colaco@kontagent.com
