Feature Extraction for Predictive LTV Modeling using Hadoop, Hive, and Cascading - Kontagent
 


Description:
One of the biggest challenges for people building data products today is developing and refining features for modeling purposes (i.e. feature extraction) with the volume and variability of web scale data. In this talk, Martin will discuss some of the challenges and solutions faced by Kontagent as it built out a predictive lifetime value model for its customers. As you will learn, Hadoop is critical to this feature extraction process, and Cascading is quite handy when building out more complex features than can be readily developed in a query framework like Hive.

Speaker:
Martin Colaco, Director of Data Science for Kontagent


Presentation Transcript

    • Predicting Lifetime Value with Hadoop. Martin Colaco, Head of Data Science | April 10, 2013
    • Agenda
      o What is predictive modeling
      o What is Lifetime Value (LTV)
      o What is feature extraction, and what are its challenges
      o How can we build a cohort-based predictive LTV model
        - Python
        - Hive with Hadoop
        - Cascalog with Hadoop
    • Can we predict how many attendees tonight?
      o How to estimate? Door count (after the fact)
      o Is there a way to build a model that we can use to predict attendees?
    • Predicting how many attendees tonight?
      Attendees = Registrations x % Attendance + Non-registrants
    • Predicting how many attendees tonight?
      Attendees = Registrations x % Attendance + Non-registrants
      Attendees = 201 x 50% + 25 ≈ 125
      Lots of uncertainty: location, date & time, company, speaker, title & topic
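    The slide's arithmetic can be checked in a couple of lines of Python. The registration count, attendance rate, and walk-in count are the slide's own assumed inputs:

    ```python
    # Worked example of the slide's formula:
    # Attendees = Registrations x % Attendance + Non-registrants
    registrations = 201
    attendance_rate = 0.50   # assumed show-up rate for registrants
    non_registrants = 25     # assumed walk-ins

    attendees = registrations * attendance_rate + non_registrants
    print(int(attendees))    # truncates 125.5 down to the slide's estimate of 125
    ```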
    • Predictive Modeling
      o Know the question you want to answer
      o Look at historical behavior
      o Apply understanding of those behaviors to new situations -> new groups of users
      Pipeline: Data -> Feature Extraction -> Model Selection -> Model Validation -> Fame, Success, Riches
    • Common use cases for predictive modeling
      My chemical engineering roots: In - Out = Accumulation
      [Diagram: flow in, accumulation D, flow out]
    • Users: Maximizing Growth
      In - Out = Accumulation, where the accumulation D = growth of an app or network of apps
      o In: paid marketing, organic, x-promotion
      o Out: frustration? boredom? too expensive? bad UX? no new content?
    • Money: Maximizing Profit
      In - Out = Accumulation, where the accumulation D = profit for an app, app network, or business
      o In: Lifetime Value (LTV)
      o Out: business expenses (marketing costs, operations such as servers, employee costs)
    • How Do We Estimate LTV?
      Business Model                             LTV
      Download                                   Cost per download
      Subscription                               Avg. price x avg. customer lifetime
      Microtransactions (ads / in-app purchases) ???
    • LTV Modeling: Social / Mobile Games
      LTV = (1 + k) * Retention * ARPU
      Output features:
      [Chart: daily retention curve, % of users retained vs. days since install]
      [Chart: ARPDAU curve, $ per daily active user vs. days since install]
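    The formula above can be sketched in a few lines of Python. The retention and ARPDAU values below are made up for illustration (the real curves come from the cohort data), and the k-factor term is included only to mirror the formula even though the talk ignores it:

    ```python
    # Sketch of the cohort LTV formula: LTV = (1 + k) * Retention * ARPU
    k = 0.1  # viral k-factor (hypothetical value)
    retention = [1.00, 0.60, 0.45, 0.35, 0.30]   # fraction of cohort active on day d
    arpdau    = [0.10, 0.08, 0.06, 0.05, 0.04]   # revenue per daily active user on day d

    # Expected spend per install over the window: sum of retention(d) * ARPDAU(d)
    base_ltv = sum(r * a for r, a in zip(retention, arpdau))
    ltv = (1 + k) * base_ltv
    ```

    Each day contributes retention(d) * ARPDAU(d) dollars per installed user, so the sum over the window gives expected spend per install, scaled up by (1 + k) for virally acquired users.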
    • Predictive LTV Result
      [Chart: cumulative spend ($) vs. days since install, 0 to 100 days]
    • Challenges with this simple LTV model
      o All of these parameters are moving targets
      o k-factor is wildly variable (we'll ignore k-factor in this presentation)
      o Acquisition costs can change (as can LTV and retention), so we need cohort LTV by install date and install source
      [Charts: ARPDAU curve and retention curve vs. days since install]
    • Challenges with this simple LTV model (cont.)
      o All of these parameters are moving targets
      o k-factor is wildly variable (we'll ignore k-factor in this presentation)
      o Acquisition costs can change (as can LTV and retention), so we need cohort LTV by install date and install source
      o Retention is computationally difficult to calculate
      o Large games can have millions of users who spend money over many months/years
      How can we build out the features we need to model LTV by cohort?
    • Kontagent Facts
      o Founded in 2007
      o 130+ employees and growing
      o 100s of customers
      o 1000s of apps instrumented
      o 250+ billion events per month
      o 200MM+ MAUs
      o 1 trillion events in 2013
    • How does Kontagent collect data?
      o Via a REST API:
        - APA: install message
        - EVT: custom event message (user action)
        - MTU: spending message
      o Yields a transaction log over time
    • Feature Extraction for Predictive LTV
      We need to translate a transaction log into a table with these columns:
      o Install Date
      o Install Source
      o Activity Date
      o Users Active on Date
      o Users Active on Date or After
      o Spend on Date
      o Cumulative Spend to Date
    • How can we compute this table of features?
      o Python, single-threaded: might work in some cases, but we would need to cache potentially millions of rows of data
      o Hive with Hadoop: a data warehouse system that provides SQL-like querying over distributed data structures. Let's work through this...
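    The single-threaded Python option can be sketched as follows. The tuple layout of the transaction log and all the values in it are hypothetical stand-ins for the APA/EVT/MTU messages described earlier, and everything is held in memory, which is exactly the limitation the slide points out:

    ```python
    from collections import defaultdict
    from datetime import date

    # Hypothetical in-memory transaction log: (user, message_type, date, value)
    log = [
        ("u1", "APA", date(2011, 7, 8), 0),    # install
        ("u1", "EVT", date(2011, 7, 10), 0),   # activity event
        ("u1", "MTU", date(2011, 7, 10), 75),  # spend, in cents
        ("u2", "APA", date(2011, 7, 8), 0),
        ("u2", "EVT", date(2011, 7, 11), 0),
    ]

    # Cohort each user by install date, then bucket activity and spend
    # by (install_date, activity_date).
    install_date = {u: d for u, t, d, v in log if t == "APA"}
    active = defaultdict(set)   # (install_date, activity_date) -> active users
    spend = defaultdict(int)    # (install_date, activity_date) -> spend on date

    for u, t, d, v in log:
        if u not in install_date:
            continue
        key = (install_date[u], d)
        if t == "EVT":
            active[key].add(u)
        elif t == "MTU":
            spend[key] += v

    # The cumulative columns ("active on date or after", "cumulative spend to
    # date") would need a further pass over these dicts.
    ```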
    • Hive query
      o The transaction log (APA, EVT, MTU) is stored in Hadoop
      o Query using Hive Query Language (HiveQL), e.g.:

        select distinct s
        from demo_apa
        where kt_date(utc_timestamp) = '2011-07-08'
          and s is not null
          and month = 201107
    • This query gets cumbersome quickly...

        select sub1.gameplay_date as play_date, sub1.returned,
               sub2.spenders, sub2.total_daily_spend
        from
          (select gp.gameplay_date, count(distinct gp.s) as returned
           from
             (select distinct s
              from demo_apa
              where kt_date(utc_timestamp) = '2011-07-08'
                and s is not null and month = 201107) base
           left outer join
             (select s, kt_date(utc_timestamp) as gameplay_date
              from demo_evt
              where s is not null and month >= 201107) gp
             on gp.s = base.s
           group by gp.gameplay_date) sub1
        join
          (select sp.spend_date, count(distinct sp.s) as spenders,
                  sum(sp.spend) / 100 as total_daily_spend
           from
             (select distinct s
              from demo_apa
              where kt_date(utc_timestamp) = '2011-07-08'
                and s is not null and month = 201107) base
           left outer join
             (select s, kt_date(utc_timestamp) as spend_date, v as spend
              from demo_mtu
              where s is not null and v > 0 and month >= 201107) sp
             on sp.s = base.s
           group by sp.spend_date) sub2
          on sub1.gameplay_date = sub2.spend_date

      Example output:

        play_date   returned  spenders  total_daily_spend
        7/10/2011   2         1         75
        7/11/2011   4         2         19
        7/12/2011   1         1         0.2
    • Feature Extraction with HiveQL
      Columns we can compute: Install Date, Install Source, Activity Date, Users Active on Date, Spend on Date, Cumulative Spend to Date
      Problem: HiveQL doesn't support non-equi-joins
      Options for improving Hive performance:
      o Write tables or temp tables
      o Code up some UDFs
    • How can we compute this table of features?
      o Python, single-threaded
      o Hive with Hadoop
      o Cascalog (Cascading) with Hadoop
        - Cascading is a flow-based computational model for Hadoop
        - Cascalog is a declarative system built on Cascading
        - Let's work through this...
    • Cascalog Code

        (defn user-install-dates [api-key]
          (let [apas (tap/apa-tap api-key)]
            (<- [?s ?install-date]
                (apas ?s _ _ ?install-ts)
                (ops/ts-to-date ?install-ts :> ?install-date))))

        (defn active-users-by-activity-date [install-dates evts]
          (<- [?install-date ?activity-date ?active-users]
              (install-dates ?s ?install-date)
              (evts ?s _ ?ts)
              (ops/ts-to-date ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?active-users)))

        (defn spend-by-activity-date [install-dates mtus]
          (<- [?install-date ?activity-date ?paying-users ?day-spending]
              (mtus ?s ?v _ _ _ _ ?ts)
              (install-dates ?s ?install-date)
              (ops/ts-to-date ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?paying-users)
              (c/sum ?v :> ?day-spending)))

        (defn cumulative-active-users-by-date [install-dates evts]
          (<- [?install-date ?activity-date ?remaining-users]
              (install-dates ?s ?install-date)
              (evts ?s _ ?ts)
              (ops/project-backward ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?remaining-users)))

        (defn cumulative-spend-by-date [install-dates mtus]
          (<- [?install-date ?activity-date ?cumulative-spend]
              (install-dates ?s ?install-date)
              (mtus ?s ?v _ _ _ _ ?ts)
              (ops/project-forward ?ts :> ?activity-date)
              (c/sum ?v :> ?cumulative-spend)))

        (defn life-table [api-key]
          (let [install-dates (user-install-dates api-key)
                evts (tap/evt-tap api-key)
                mtus (tap/mtu-tap api-key)
                cumulative-spend (cumulative-spend-by-date install-dates mtus)
                activity-spend (spend-by-activity-date install-dates mtus)
                cumulative-users (cumulative-active-users-by-date install-dates evts)
                active-users (active-users-by-activity-date install-dates evts)]
            (<- [?install-date ?activity-date ?remaining-users ?active-users
                 ?paying-users ?day-spending ?cumulative-spend]
                (cumulative-spend ?install-date ?activity-date ?cumulative-spend)
                (activity-spend ?install-date ?activity-date ?paying-users ?day-spending)
                (cumulative-users ?install-date ?activity-date ?remaining-users)
                (active-users ?install-date ?activity-date ?active-users))))
    • Feature Extraction with Cascalog
      Columns we can compute: Install Date, Install Source, Activity Date, Users Active on Date, Users Active on Date or After, Spend on Date, Cumulative Spend to Date
      Options for improvement:
      o The code is not yet optimized; it is CPU-limited
    • What have we learned?
      o Martin sucks (or is awesome) at predicting the number of attendees at meetups!
      o Predictive modeling (particularly around LTV) can have a huge impact on a business
        - It requires intuition and iteration
        - In the big-data world, feature extraction can be a huge challenge
      o Feature extraction can be done with Hadoop
        - HiveQL is nice because analysts can use it, but it can be inefficient and cannot generate all the features we need
        - Cascading can solve most of these problems and generate the clean features we need
    • Questions? Need a job? We're hiring: http://www.kontagent.com/company/careers/
      Martin Colaco, Head of Data Science
      martin.colaco@kontagent.com