Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

There is a deeply symbiotic relationship between machine learning/predictive modeling and Big Data. Machine learning theory asserts that the more data the better. Empirical observations suggest that more granular data, a hallmark of Big Data, further improves performance. Predictive modeling is one of the core techniques that measurably delivers value across many industries and demonstrates the value of Big Data.

However, there is a surprising paradox of predictive modeling: when you need models most, even all the data is not enough or just not suitable. Predictive modeling requires enough training data with the respective outcomes, preferably IID. But often this data is not available: there are only so many people buying luxury cars online to inform my targeting models. I can never observe what happens BOTH when I treat you AND when I don’t, which is what I need to make causal claims and measure the impact of strategic decisions. To allocate sales resources I would love to know what a customer’s budget is, but maybe even the customer does not know.

So in this day and age of Big Data there remains an art to machine learning in situations where the right data is scarce. This talk will present a number of cases where enough of the right data is fundamentally not obtainable and how creative data science can still solve them.
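
The counterfactual problem described above (a person is never observed both with and without the ad) is addressed in slide 40 of the transcript by building one model on people who saw the ad and one on people who did not, then taking the difference p(SV|Ad) - p(SV|no ad). The following is a minimal sketch of that idea, not the speaker's implementation. The `segment` feature, the toy rows, and the use of per-segment visit frequencies as the "model" are all illustrative assumptions.

```python
from collections import defaultdict


def fit_rate_model(rows):
    """Fit a trivial 'model': the site-visit rate per segment.

    rows is a list of (segment, visited) pairs with visited in {0, 1}.
    Per-segment frequencies stand in for a real classifier (assumption).
    """
    visits = defaultdict(int)
    counts = defaultdict(int)
    for segment, visited in rows:
        visits[segment] += visited
        counts[segment] += 1
    return {seg: visits[seg] / counts[seg] for seg in counts}


def expected_impact(model_ad, model_no_ad, segment):
    """Expected impact for a segment: p(SV | ad) - p(SV | no ad)."""
    return model_ad[segment] - model_no_ad[segment]


# Hypothetical toy data: (segment, site visit) for each group.
seen_ad = [("a", 1), ("a", 0), ("a", 1), ("a", 1), ("b", 0), ("b", 0)]
not_seen_ad = [("a", 0), ("a", 0), ("a", 1), ("a", 0), ("b", 0), ("b", 0)]

model_ad = fit_rate_model(seen_ad)
model_no_ad = fit_rate_model(not_seen_ad)

impact_a = expected_impact(model_ad, model_no_ad, "a")
impact_b = expected_impact(model_ad, model_no_ad, "b")
print(impact_a, impact_b)
```

The difference only has a causal reading if the two groups are comparable, for example because the ad was assigned at random; otherwise it mixes the ad's effect with differences between the groups.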

Published in: Technology

  1. All the data and still not enough! Claudia Perlich, Chief Scientist, @claudia_perlich
  2. Predictive Modeling: Algorithms that Learn Functions
  3. Data for Predictive Modeling. Features: Income, Age. Target: Buy. Examples (Income, Age, Buy): (123,000, 30, yes); (51,100, 40, yes); (68,000, 55, no); (74,000, 46, no); (23,000, 47, yes); (100,000, 49, no)
  4. Rules for Predictive Modeling. Target: ? (yes, yes, no, no, yes, no); Examples; Features. Data should be: large enough; independently and identically distributed
  5. Paradox of Big Data: “You never have the data you want.” Art of making do with second best
  6. IBM: Sales Force Optimization
  7. Wallet is NEVER observed. We observe this in the data: IBM Sales to this Company; Company Revenue (D&B). But we do not observe this: Wallet/Opportunity. How can we make this a predictive modeling problem?
  8. Data for Wallet Estimation? Target: Wallet (10, 5, 31, 17, 39, 4); Examples; Features
  9. REALISTIC Wallets as quantiles. Motivation: imagine 100 identical firms with identical IT needs; consider the distribution of the IBM sales to these firms; bottom firms should spend as much as the top; define wallet as a high percentile of spending conditional on the customer attributes. [Figure: Frequency vs. IBM Sales; Wallet Estimate]
  10. Data for Wallet Estimation. Target: Revenue (10, 5, 31, 17, 39, 4); Examples; Features
  11. Quantile Regression optimizing weighted absolute loss. [Figure: IBM Revenue vs. Company Sales (Firm Sales) for firms C1 and C2; Opportunity for C1; Opportunity for C2]
  12. Medical Diagnosis: Breast Cancer
  13. Siemens: Computer-Aided Detection of Breast Cancer in Mammograms. 1712 patients; 6816 images; 105,000 candidates. Image feature vector [x1, x2, …, x117]. Malignant? Views: MLO, CC. (© IBM Corporation 2008)
  14. Siemens Medical: fMRI breast cancer data. 245 patients: 36% cancer; 414 patients: 1% cancer; 1027 patients: 0% cancer; 18 patients: 85% cancer. [Figure: Model score vs. log of Patient ID; every point is a candidate] In essence, the most predictive variable is the patient ID
  15. Data for Diagnosis from Multiple Sources. Target: Cancer (yes, no, no, no, no, no); Examples; Features
  16. Modeling the Sources… Target; Examples; Features. Examples (Source, Cancer): (1, yes); (2, no); (1, no); (1, no); (4, no); (3, no)
  17. Digital Advertising
  18. Online Display Advertising: Do people buy stuff after seeing an ad?
  19. Data collection for post-view purchase conversion. [Figure: Time; Cohort of random prospects; ?]
  20. Data for Advertising. Target: PV Buy (no, no, no, no, yes, yes); Examples; Features
  21. Multi-Armed Bandit: Exploration vs. exploitation. Show some random ads to learn a good model; tradeoff between learning and using
  22. Size of the Training Sample? Target: Buy (no, no, no, no, yes, yes); Examples; Features
  23. Very few luxury cars are bought online. Maserati $128,0000
  24. Reality of Online Purchases. Target: Buy (no, no, no, no, no, yes); Examples; Features
  25. Online Display Advertising: Proxy for purchase? How about click?
  26. Optimizing Clicks in Advertising? Click? (yes, yes, no, no, yes, no)
  27. Click Optimization: Fumbling in the Dark. Top 10 Apps by CTR
  28. How Big Data and Optimization are killing Metrics. 90% of clicks are ‘accidental/non-intentional’; 10% are meaningful, and changes can be measured; optimization can find structure in the other 90%; you will end up with only non-intentional …
  29. Online Display Advertising: Who cares about the ad anyway?
  30. Predict other indicators: search or brand site visit / schedule test drive. Target: Site Visit (no, no, no, yes, yes, yes); Examples; Features
  31. Advertising Fraud
  32. Is there really a person on the other end wanting to see the site?
  33. Data for Fraud Detection. Target: Human? (yes, no, no, yes, yes, no); Examples; Features
  34. Telling the difference between an algorithm and a human. Turing test; CAPTCHA
  35. Bot traffic networks
  36. Online Display Advertising: Who should you really advertise to???
  37. Data for Advertising Impact. Target: Impact (1, 0.3, 0.5, 0, 0, 0.1); Examples; Features
  38. Alternative Histories (Counterfactual)
  39. Fundamentally Impossible! Target: Impact (1, 0.3, 0.5, 0, 0, 0.1); Examples; Features
  40. Build two separate models and calculate impact as the difference. Examples 1 (seen ad): Site Visit (yes, no, no, yes, no, no). Examples 2 (not seen ad): Site Visit (yes, no, no, yes, no, no). Expected impact: p(SV|Ad) - p(SV|no ad)
  41. Use predictive models to measure impact. Negative test: wrong ad. Positive test: A/B comparison
  42. Relationship of organic conversion rate and causal impact. [Figure: Additive causal impact (-0.001 to 0.006) vs. organic conversion propensity (0.40% to 1.40%)]
  43. Audiences in Video Advertising
  44. Pleasing the advertising oracle… Audience reports from matched populations in Facebook; 68% of the ads were shown to females; Makeup for 32% of ads. The Oracle
  45. Data for Audience Optimization. Target: Gender (male, female, female, male, male, female); Examples; Features
  46. Weighted Logistic Regression on aggregated. Target; Examples; Features. Examples (Weight, Gender): (0.32, male); (0.68, female); (0.32, male); (0.68, female); (0.73, male); (0.27, female)
  47. Hyperlocal Targeting? Foursquare locations: very noisy…
  48. Data for Location Reliability in Auction. Target: Reliable? (yes, no, no, yes, yes, no); Examples; Features
  49. 30% of smartphone users travel faster than the speed of sound…
  50. Catalan traditions pop up everywhere….
  51. Data for Location Reliability in Auction. Target: Reliable? (maybe, no, no, maybe, maybe, no); Examples; Features
  52. Paradox of Big Data: “You never have the data you want.” Art of making do with second best
  53. All a matter of how creative you are at cheating….
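
The wallet slides (9 to 11) define wallet as a high percentile of spending and name quantile regression with a weighted absolute loss as the way to estimate it. As a rough illustration of that loss only, the sketch below fits a single constant quantile with no features. It assumes the six revenue values from slide 10 come from comparable firms and picks the 80th percentile; both choices are assumptions made for the example, and the function names are hypothetical.

```python
def pinball_loss(y_values, q, tau):
    """Weighted absolute loss for quantile level tau.

    Under-predictions (y above q) cost tau per unit,
    over-predictions (y below q) cost (1 - tau) per unit.
    """
    total = 0.0
    for y in y_values:
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total


def best_constant_quantile(y_values, tau):
    """Return the observed value that minimizes the pinball loss.

    For a constant predictor the minimum is always attained at one
    of the data points, so searching over them is sufficient.
    """
    candidates = sorted(set(y_values))
    return min(candidates, key=lambda q: pinball_loss(y_values, q, tau))


# Revenue values from slide 10, treated here as comparable firms (assumption).
revenue = [10, 5, 31, 17, 39, 4]
tau = 0.8  # assumed "high percentile"; the slides do not state a value

wallet = best_constant_quantile(revenue, tau)
print(wallet, pinball_loss(revenue, wallet, tau))
```

A real wallet model would condition on customer attributes, for example with a linear quantile regression, rather than using one constant for all firms.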
