Infovision 2011 Data to Decisions Shailesh Kumar, Google
Presentation Transcript

  • 1. From Data to Decisions: Learnings from Real-World Data Mining. Dr. Shailesh Kumar, Google, Inc. InfoVision 2011
  • 3. This data explosion is enabled by…
    - Better “Sensors”: higher resolution, more spectral bands, quick experimental turnaround, crowd sourcing…
    - Higher Bandwidth Communication: faster networks and routers, better compression technologies…
    - Larger Warehouses: cheaper storage, multi-level caching, scalable database/data-warehousing technologies…
    - Massive Crunching Power: faster multi-core processors, parallel distributed computing, MapReduce paradigms…
    - Advances in Machine Learning and Data Mining: sophisticated learning frameworks, distributed data mining…
  • 4. From “Data” to “Decision” (pipeline diagram: Data → Features → Models → Predictions → Decision, shaped by Domain Knowledge, Insights, Business Objectives, and Business Constraints, with a Feedback loop)
  • 5. Observation → Prediction → Decision
    - Credit Card Fraud. Input: past card usage behavior. Predict: fraudulent transaction? Decide: approve transaction?
    - Credit Scoring. Input: past payment behavior. Predict: probability of default. Decide: approve loan?
    - Retail Cross-Sell. Input: past purchase behavior. Predict: response to a coupon. Decide: send coupon?
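The observation → prediction → decision step on this slide can be sketched as an expected-value rule: a predicted probability only becomes an action once business numbers are attached. The response probability, margin, and coupon cost below are illustrative assumptions, not figures from the talk.

```python
# Sketch: turning a prediction into a decision with an expected-value rule.

def send_coupon(p_response, margin_if_redeemed, coupon_cost):
    """Send the coupon only if its expected payoff exceeds its cost."""
    expected_gain = p_response * margin_if_redeemed
    return expected_gain > coupon_cost

# A 10% response chance on a 500-rupee margin beats a 20-rupee coupon cost.
print(send_coupon(0.10, 500.0, 20.0))   # True
# A 2% response chance does not.
print(send_coupon(0.02, 500.0, 20.0))   # False
```

The same prediction model can thus drive opposite decisions as the costs change, which is the point of separating prediction from decision.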
  • 6. Building Machine Learning Models: The Process, the Art, and the Science
    - Collect raw (input) data and target (output) labels (“ground truth”); labels can be costly!
    - Engineer and select “predictive” features: use domain knowledge, keep the variability that matters, remove redundancy.
    - “Train” a model using the feature–label training data set; choose the “model type” and “model complexity”. Too simple: under-learn; too complex: over-learn (the bias–variance tradeoff).
    - “Evaluate” the trained model on “validation” data and iterate until satisfied.
    - “Deploy” the model: predict the class label of all the “un-labeled” data.
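The loop above can be sketched end to end with a toy model. The one-dimensional data, labels, and nearest-centroid classifier here are illustrative assumptions chosen only to show the collect → train → evaluate → deploy steps.

```python
# Minimal pass through the slide's loop: train on labeled data,
# evaluate on a held-out split, then "deploy" on an unlabeled point.

def train_centroids(X, y):
    """'Train': compute one centroid per class label."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {label: sum(v) / len(v) for label, v in groups.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Labeled data (feature: spend per month; label: segment) plus a validation split.
X_train, y_train = [1.0, 2.0, 8.0, 9.0], ["low", "low", "high", "high"]
X_val, y_val = [1.5, 8.5], ["low", "high"]

model = train_centroids(X_train, y_train)
accuracy = sum(predict(model, x) == y for x, y in zip(X_val, y_val)) / len(X_val)
print(accuracy)              # 1.0 on this toy validation set
print(predict(model, 7.0))   # "deploy": score an unlabeled point
```

In practice each step (feature engineering, model choice, evaluation metric) is where the "art" the slide mentions comes in.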
  • 7. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  • 8. Looking for a Needle in a Haystack?
    - What is the nature of my haystack (data)? What process generated the data? What assumptions am I making about the data?
    - Is it the right needle (insight) to look for? Is it “actionable”? Is it “useful”? Is it “novel”? Does it tell me something I didn’t know?
    Insight Discovery ≠ Hypothesis Testing
  • 9. The Traditional Market Basket Analysis: wrong needle in a mysterious haystack! (Apriori diagram: frequent item-sets of size 1 → candidate item-sets of size 2 → frequent item-sets of size 2 → candidate item-sets of size 3 → frequent item-sets of size 3)
  • 10. Lesson: Know your data (Haystack). What process generated the data? Few customers buy a complete “logical” product group in the same basket; they may:
    - already have the other products
    - buy them from another retailer
    - buy them at a different time
    - have got them as gifts
    - …
    Baskets are mixtures of, and projections of, latent intentions.
  • 11. Lesson: Extract the essence, let go of data: pair-wise co-occurrence statistics.
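One way to read this lesson: a raw basket log can be compressed into pair-wise co-occurrence counts, after which the individual baskets can be discarded. The baskets below are made-up examples, not data from the talk.

```python
# Compress a basket log into pair-wise co-occurrence statistics.
from collections import Counter
from itertools import combinations

baskets = [
    {"tent", "airbed", "lantern"},
    {"tent", "airbed"},
    {"tv", "speakers"},
    {"airbed", "lantern"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so each unordered pair is counted under one canonical key.
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("airbed", "tent")])     # 2
print(pair_counts[("airbed", "lantern")])  # 2
print(pair_counts[("speakers", "tv")])     # 1
```

The pair counts grow with the square of the catalogue size, not with the number of baskets, which is what lets you "let go of" the data.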
  • 12. Lesson: Look for the right insight: “frequent” vs. “logical” itemsets (diagram: example “logical” product groups drawn from items such as Lighting, Furniture, Airbeds, Folding Furniture, Projection TV, Flat Panel TV, Camping Accessories, Inflatables, Speakers, Home Theatre Services, Grill Accessories, Water Sports Lighting, Digital Cable TV, Home Components, Patio Accessories)
    - Novel: not obvious from the data (support = 0)
    - Useful: product bundling, recommendations, layout
    - Exhaustive: “No insight left behind!”, however “rare”
  • 13. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  • 14. Two Mindsets to Modeling
    - Model-centric: throw all features in; have enough data; build complex models (simple features, complex model).
    - Feature-centric: carefully craft features; use domain knowledge; build simpler models (complex features, simple model).
    The Law of Conservation of Complexity.
  • 15. Lesson: Distribute complexity well: simplify models with complex features (simple features + complex model vs. complex features + simple model).
  • 16. Lesson: Overcome model limitations (diagram: an axis-parallel decision tree needs many splits on raw Age, Income, and Education, e.g. Age < 60, Income < Rs. 32, Education < 20, while a single split on the engineered feature log(Income) − B × Age < 12 captures the oblique class boundary)
  • 17. Lessons from Real-world Data Mining: Insights, Text Features, Decisions, Labels, Models
  • 18. Lesson: Things are not what they appear. What is a word in “Bag-of-Words”?
    - Segmentation: What is a word? “New York Stock Exchange”: 4 words? 2 phrases (“New York” + “Stock Exchange”)? 1 phrase?
    - Disambiguation: What does a word mean? ‘rock band’, ‘rock climbing’, ‘rocking chair’, ‘the rock’
    - Equivalencing: How “similar” are two terms? Comparing apples to oranges: Orange Juice, Orange Flag, Orange Blog; Apple store, Apple pie, The Big Apple
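The segmentation question can be sketched with a pointwise-mutual-information style test: treat "new york" as one phrase when the bigram occurs far more often than its unigram frequencies would predict under independence. The tiny corpus below is an illustrative assumption, not material from the talk.

```python
# Phrase detection sketch: a PMI-style score flags "new york" as one unit.
from collections import Counter
import math

tokens = ("new york stock exchange opened . new york is big . "
          "a new stock was listed on the exchange .").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def pmi(w1, w2):
    """log2 of observed bigram probability over the independence baseline."""
    p_joint = bigrams[(w1, w2)] / (total - 1)
    p_indep = (unigrams[w1] / total) * (unigrams[w2] / total)
    return math.log2(p_joint / p_indep)

# "new york" sticks together more strongly than "new stock" does.
print(pmi("new", "york") > pmi("new", "stock"))  # True
```

Real segmenters use far larger corpora and smoothing, but the underlying signal is the same.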
  • 19. Equivalencing (SIMILARITY = 0.995):
    - “we filed a suit charging dell of illegal behavior”
    - “they submitted a case accusing apple of unauthorized conduct”
    Disambiguation (SIMILARITY = 0.171):
    - “i was right to avoid a suit against apple”
    - “on my right was a man in a suit drinking apple juice”
    “You shall know a word by the company it keeps.” -- Firth, J. R. 1957:11
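Firth's dictum can be sketched directly: represent each word by counts of its neighbouring words and compare words by cosine similarity of those context vectors. The mini-corpus and window size below are illustrative assumptions; the similarity scores are not the slide's.

```python
# Distributional similarity: words with similar contexts get similar vectors.
from collections import Counter
import math

sentences = [
    "we filed a suit charging dell of illegal behavior",
    "they submitted a case accusing apple of unauthorized conduct",
    "i wore a suit to the meeting",
]

def context_vector(word, window=2):
    """Count the words appearing within `window` positions of `word`."""
    ctx = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        ctx[toks[j]] += 1
    return ctx

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv)

# "suit" and "case" share some contexts, so their similarity is positive.
print(cosine(context_vector("suit"), context_vector("case")))
```

Scaled up to web-sized corpora, this is the kind of statistic behind the equivalencing and disambiguation scores on the slide.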
  • 20. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  • 21. Labels are precious – use them well
    - Labeled vs. unlabeled data: lots of input data (e.g. web pages), but only a small fraction is labeled (e.g. spam/not spam).
    - Labels can be costly (human judgments, costly experiments, rare events) and noisy (web clicks, crowd-sourced, …).
    - How do we use unlabeled data together with labeled data? Semi-supervised learning.
    - Which unlabeled data point should be labeled next? Active learning.
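The active-learning question on this slide is often answered by uncertainty sampling: ask a human to label the unlabeled point the current model is least sure about. The one-dimensional data and the toy linear-probability model below are illustrative assumptions.

```python
# Active learning sketch: query the point nearest the decision boundary.

labeled = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
unlabeled = [0.15, 0.48, 0.85]

# Toy model: P(class 1) rises linearly between the two class means.
mean0 = sum(x for x, y in labeled if y == 0) / 2
mean1 = sum(x for x, y in labeled if y == 1) / 2

def p_class1(x):
    t = (x - mean0) / (mean1 - mean0)
    return min(1.0, max(0.0, t))

# Uncertainty sampling: pick the point whose probability is closest to 0.5.
query = min(unlabeled, key=lambda x: abs(p_class1(x) - 0.5))
print(query)  # 0.48: the borderline case is the most informative label to buy
```

Points the model already classifies confidently (0.15 and 0.85 here) would waste the labeling budget.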
  • 22. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  • 23. Lesson: Don’t beat data into submission. Use no more model complexity than necessary.
    - How many hidden units in a neural network?
    - How deep a decision tree?
    - How much cost for “misclassification elasticity” in an SVM?
    - How many clusters, or modes in a mixture density?
    Too simple: under-learn. Too complex: memorize. Just right: generalize.
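The under-learn/memorize/generalize trade-off can be sketched with nearest neighbours, where k is the complexity knob: a 1-NN model memorizes every training label, noise included, while a smoother 3-NN model averages the noise out. The data below, including the single mislabeled training point, is an illustrative assumption.

```python
# k controls complexity: k=1 memorizes label noise, k=3 generalizes past it.

train = [(0.1, 0), (0.2, 0), (0.25, 1), (0.3, 0), (0.4, 0),  # 0.25 is label noise
         (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
val = [(0.15, 0), (0.24, 0), (0.65, 1), (0.85, 1)]

def knn_predict(x, k):
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def val_accuracy(k):
    return sum(knn_predict(x, k) == y for x, y in val) / len(val)

print(val_accuracy(1), val_accuracy(3))  # the smoother model wins on held-out data
```

Note that 1-NN scores perfectly on its own training set; only the validation split reveals that it memorized rather than generalized.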
  • 24. Lesson: Divide and Conquer. Many simple models > a single complex model:
    - better “localized features”
    - simpler “local models”
    - more interpretable features and models
    - higher accuracy
    - faster modeling time
    - lower resource requirements
    (diagram: scattered letters grouped into local clusters)
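A minimal illustration of the divide-and-conquer idea, under assumed piecewise data: splitting the input space into two segments and fitting a trivial local model (a per-segment mean) beats one global model of the same form.

```python
# Two simple local models outperform one global model on piecewise data.

data = [(0.1, 1.0), (0.2, 1.1), (0.3, 0.9),   # segment A: y around 1
        (0.7, 5.0), (0.8, 5.2), (0.9, 4.8)]   # segment B: y around 5

def global_model(xs_ys):
    """Squared error of a single global mean."""
    ys = [y for _, y in xs_ys]
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def local_models(xs_ys, split=0.5):
    """Squared error of one mean per segment of the input space."""
    err = 0.0
    for part in ([p for p in xs_ys if p[0] < split],
                 [p for p in xs_ys if p[0] >= split]):
        mean = sum(y for _, y in part) / len(part)
        err += sum((y - mean) ** 2 for _, y in part)
    return err

print(local_models(data) < global_model(data))  # True: local models fit far better
```

Each local model is simple and interpretable on its own, which is the slide's argument for many simple models over one complex one.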
  • 25. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  • 26. Lesson: Interpret Predictions. What is the score, and why is the score that way? (diagram: prediction score overlaid on a concept space) *This is not what we mean by the “art of data mining”
  • 27. Lesson: Learn Globally, Decide Locally (diagram: density overlay on accident descriptions)
    “The Ford-Firestone dispute blew up in August 2000 and is still going strong. In response to claims that their 15-inch Wilderness AT, radial ATX and ATX II tire treads were separating from the tire core, leading to grisly, spectacular crashes, Bridgestone/Firestone recalled 6.5 million tires….” -- Forbes
  • 28. Lesson: Prediction is not enough! Different reasons, different decisions (diagram: collection notes alongside the probability of defaulting).
  • 29. Summary
    - Decisions are driven more by data than by “gut feeling”.
    - Converting data to decisions is Art + Science + Engineering.
    - Insights: the right needles in a well-understood haystack.
    - Features: garbage in, garbage out.
    - Models: generalize, don’t memorize.
    - Labels: explore thoroughly, exploit efficiently.
    - Decisions: the right decision for the right reason.
    - Feedback: adapt features, models, scores, and decisions.
  • 30. Questions?
    “In theory, theory and practice are the same. In practice, they are not.” -- Lawrence Peter Berra