
Infovision 2011 Data to Decisions Shailesh Kumar, Google


http://informationexcellence.wordpress.com/category/knowledge-share-sessions/

http://informationexcellence.wordpress.com/2011/10/28/infovision2011-presentations/

Published in: Technology, Education

  1. From Data to Decisions: Learnings from Real-World Data Mining
     Dr. Shailesh Kumar, Google, Inc. InfoVision 2011
  2. Welcome to the Information Age …
     … drowning in data and starving for Knowledge
     ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTATTACGCGACCGTAAGCTAC…
  3. This data explosion is enabled by…
     • Better “Sensors” – Higher Resolution, More Spectral Bands, Quick Experimental Turnaround, Crowd Sourcing…
     • Higher Bandwidth Communication – Faster Networks and Routers, Better Compression technologies…
     • Larger Warehouses – Cheaper Storage, Multi-Level Caching, Scalable Database/Data warehousing technologies…
     • Massive Crunching Power – Faster Multi-core processors, Parallel Distributed Computing, MapReduce paradigms…
     • Advances in Machine Learning and Data Mining – Sophisticated Learning frameworks, Distributed Data Mining…
  4. From “Data” to “Decision”
     [Diagram labels: Domain, Insights, Features, Knowledge, Data, Models, Feedback, Predictions, Business Objectives, Decision, Business Constraints]
  5. Observation → Prediction → Decision
     • Credit Card Fraud – Input: Past card usage behavior. Predict: Fraudulent transaction? Decision: Approve Transaction?
     • Credit Scoring – Input: Past payment behavior. Predict: Probability of Default. Decision: Approve Loan?
     • Retail Cross Sell – Input: Past purchase behavior. Predict: Response to a coupon. Decision: Send Coupon?
  6. Building Machine Learning Models: The Process, the Art, and the Science
     • Collect Raw (Input) Data
     • Collect Target (Output) Labels (“ground truth”) – Can be Costly!!
     • Engineer and Select “Predictive” features
         • Use Domain Knowledge
         • Keep variability that matters
         • Remove Redundancy
     • “Train” a model using Feature-Label training data set
         • Choose: “Model Type” & “Model Complexity”
         • Too Simple: Under-Learn. Too Complex: Over-Learn. (Bias Variance Tradeoff)
     • “Evaluate” the trained model on “validation” data and iterate until satisfied
     • “Deploy” the model: Predict class label of all the “un-labeled” data
  7. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  8. Looking for a Needle in a Haystack?
     • What is the nature of my haystack (data)?
         • What process generated the data?
         • What assumptions am I making about the data?
     • Is it the right needle (insight) to look for?
         • Is it “actionable”? Is it “useful”? Is it “novel”?
         • Does it tell me something I didn’t know?
     Insight Discovery ≠ Hypothesis Testing
  9. The Traditional Market Basket Analysis: Wrong needle in a mysterious haystack!
     [Diagram: FREQUENT ITEM-SETS (Size = 1) → CANDIDATE ITEM-SETS (Size = 2) → FREQUENT ITEM-SETS (Size = 2) → CANDIDATE ITEM-SETS (Size = 3) → FREQUENT ITEM-SETS (Size = 3)]
  10. Lesson: Know your data (Haystack). What process generated the data?
     Few buy a complete “logical” product group in the same basket. They:
     • already have other products
     • buy them from another retailer
     • buy them at a different time
     • got them as gifts
     • ….
     → mixture of, projections of, latent intentions
  11. Lesson: Extract the essence, let go of data
     Pair-wise Co-occurrence Statistics
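
One way to read the pair-wise co-occurrence lesson is sketched below: reduce the baskets to item and item-pair counts, then look at what those counts reveal. The baskets, the item names, and the helper names `pairwise_cooccurrence` and `support` are all made up for illustration and are not from the talk. The point of the toy data is that a three-item “logical” group can have zero support as a whole while every pair inside it is observed.

```python
from collections import Counter
from itertools import combinations


def pairwise_cooccurrence(baskets):
    """Count, over all baskets, how often each item and each item pair occurs."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # drop repeats and fix the ordering of pairs
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


def support(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= set(basket))


# Hypothetical baskets: nobody buys the whole camping group in one visit.
baskets = [
    ["tent", "airbed"],
    ["airbed", "lantern"],
    ["lantern", "tent"],
    ["tv", "speakers"],
]

item_counts, pair_counts = pairwise_cooccurrence(baskets)
camping = ("airbed", "lantern", "tent")

print("support of full group:", support(camping, baskets))
for pair in combinations(camping, 2):
    print(pair, "co-occurs in", pair_counts[pair], "basket(s)")
```

A frequent-itemset search with any positive support threshold would never report the full camping group here, because it never appears in a single basket; the pair counts still connect all three items.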
  12. Lesson: Look for the right Insight. “Frequent” vs. “Logical” Itemset
     [Figure labels: Lighting, Furniture, Airbeds, Folding Furniture, Projection TV, Flat Panel TV, Camping Accessories, Inflatables, Speakers, Home Theatre Services, Grill Accessories, Water Sports, Lighting, Digital Cable TV, Home Components, Patio Accessories]
     • Novel – Not obvious from the data (support = 0)
     • Useful – product bundling, recommendations, layout
     • Exhaustive – “No insight left behind!” – however “rare”
  13. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  14. Two Mindsets to Modeling
     Model-Centric (Simple Features, Complex Model)
     • Throw all features in!
     • Have enough data
     • Build Complex models
     Feature-centric (Complex Features, Simple Model)
     • Carefully craft features
     • Use Domain Knowledge
     • Build Simpler Models
     The Law of Conservation of Complexity
  15. Lesson: Distribute Complexity well. Simplify Models with complex features
     [Diagram: Simple Features, Complex Model | Complex Features, Simple Model]
  16. Lesson: Overcome model limitations
     [Figure: one decision tree with splits Age < 60, Income < Rs. 32, Education < 20; another with splits log (Income) - B x Age < 12, Education < 20; plots of Income vs. Age and log (Income) vs. Age]
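
The slide's engineered feature, log(Income) - B x Age, can be illustrated with a toy comparison of single-threshold rules. This is only a sketch under loud assumptions: the value B = 0.05, the four customers, and the helper `best_single_split_accuracy` are hypothetical, and the target is defined by the slide's rule itself, so the example shows what one split can represent, not how well such a feature would work on real data.

```python
import math

B = 0.05  # assumed coefficient; the slide leaves B unspecified


def best_single_split_accuracy(values, labels):
    """Best accuracy of any rule 'value < t' (or its negation) on one feature."""
    n = len(values)
    best = 0.0
    for t in sorted(set(values)) + [math.inf]:
        hits = sum((v < t) == y for v, y in zip(values, labels))
        best = max(best, hits / n, (n - hits) / n)
    return best


# Hypothetical customers as (age, income) pairs.
customers = [
    (20, math.exp(12.5)),
    (20, math.exp(13.5)),
    (60, math.exp(14.5)),
    (60, math.exp(15.5)),
]
ages = [age for age, _ in customers]
incomes = [income for _, income in customers]
engineered = [math.log(income) - B * age for age, income in customers]
labels = [f < 12 for f in engineered]  # toy target defined by the slide's rule

print("age alone:   ", best_single_split_accuracy(ages, labels))
print("income alone:", best_single_split_accuracy(incomes, labels))
print("engineered:  ", best_single_split_accuracy(engineered, labels))
```

On this toy data no single axis-aligned threshold on a raw feature separates the classes, while one threshold on the combined feature does, which is the sense in which a crafted feature lets a simple model overcome its limitation.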
  17. Lessons from Real-world Data Mining: Insights, Text Features, Decisions, Labels, Models
  18. Lesson: Things are not what they appear. What is a word in “Bag-of-Words”?
     • Segmentation: What is a word?
         • New York Stock Exchange → 4 words?
         • “New York” “Stock Exchange” → 2 phrases?
         • “New York Stock Exchange” → 1 phrase?
     • Disambiguation: What does a word mean?
         • ‘rock band’, ‘rock climbing’, ‘rocking chair’, ‘the rock’
     • Equivalencing: How “similar” are two terms?
         • Comparing Apples to Oranges…
         • Orange Juice, Orange Flag, Orange Blog, Apple store, Apple pie, The Big Apple
  19. Equivalencing: SIMILARITY = 0.995
     • we filed a suit charging dell of illegal behavior
     • they submitted a case accusing apple of unauthorized conduct
     Disambiguation: SIMILARITY = 0.171
     • i was right to avoid a suit against apple
     • on my right was a man in a suit drinking apple juice
     You shall know a word by the company it keeps -- Firth, J. R. 1957:11
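
For contrast with the scores on slide 19, here is a minimal sketch of what a plain bag-of-words comparison does with the same four sentences. The slide does not say how its scores were computed, so this is not the speaker's method; `bow_cosine` is a hypothetical helper that splits on whitespace and takes the cosine of raw word counts.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between raw word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


# Same meaning, almost no shared words (only "a" and "of").
same_meaning = bow_cosine(
    "we filed a suit charging dell of illegal behavior",
    "they submitted a case accusing apple of unauthorized conduct",
)

# Different meaning, many shared words ("right", "suit", "apple", ...).
different_meaning = bow_cosine(
    "i was right to avoid a suit against apple",
    "on my right was a man in a suit drinking apple juice",
)

print(round(same_meaning, 3))       # about 0.222
print(round(different_meaning, 3))  # about 0.535
```

Surface word overlap ranks the two pairs in the opposite order from the slide: the paraphrases look dissimilar and the unrelated sentences look similar. That gap is what segmentation, disambiguation, and equivalencing are meant to close.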
  20. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  21. Labels are precious – use them well
     • Labeled data vs. Unlabeled data
         • Lots of input data! (e.g. web pages)
         • Small fraction is labeled! (e.g. spam/not)
     • Labels can be
         • Costly – human judgments, costly experiments, rare events
         • Noisy – web clicks, crowd sourced,…
     • How do we use unlabeled data with labeled data?
         • Semi-supervised Learning
     • Which unlabeled data point to get labeled next?
         • Active Learning
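
The question “which unlabeled data point to get labeled next?” has many answers; one common active-learning heuristic is uncertainty sampling, sketched below. The slide names active learning only in general, so the choice of heuristic, the function `most_uncertain`, and the example probabilities are assumptions made for illustration.

```python
def most_uncertain(probabilities):
    """Index of the unlabeled point whose predicted P(positive) is nearest 0.5."""
    return min(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))


# Hypothetical model scores for four unlabeled web pages (P(spam)).
predicted_spam_probability = [0.97, 0.55, 0.08, 0.40]

next_to_label = most_uncertain(predicted_spam_probability)
print(next_to_label)  # index of the page to send to a human judge
```

The page scored 0.55 is chosen because the current model is least sure about it, so a human label there is expected to be more informative than one for a page scored 0.97.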
  22. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  23. Lesson: Don’t beat data into submission. Model Complexity no more than necessary
     • How many hidden units in a neural network?
     • How deep a decision tree?
     • How much cost for “misclassification elasticity” in SVM?
     • How many clusters? or modes in mixture of density?
     Model is too simple → under-learn
     Model is too complex → memorize
     Model is just right → generalize
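
The three outcomes on slide 23 can be seen on a toy problem by choosing a complexity setting with held-out data. The sketch below swaps in k-nearest-neighbour regression, which the slide does not mention, because its single knob is easy to read: k = 1 memorizes the training set, k equal to the training-set size predicts one global average. The data (a linear trend with alternating ±0.3 offsets standing in for noise) and the helper names are assumptions for illustration.

```python
def knn_predict(train, query_x, k):
    """Average the targets of the k training points nearest to query_x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query_x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model built on `train`, measured on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


# Toy data (assumed): y = 0.1 * x, with alternating +/-0.3 offsets on training points.
train = [(i, 0.1 * i + (0.3 if i % 2 == 0 else -0.3)) for i in range(20)]
validation = [(i + 0.5, 0.1 * (i + 0.5)) for i in range(19)]

candidates = [1, 2, 20]  # most complex, intermediate, simplest
val_error = {k: mse(train, validation, k) for k in candidates}
best_k = min(candidates, key=val_error.get)

print("training error at k=1:", mse(train, train, 1))  # zero: pure memorization
for k in candidates:
    print("k =", k, "validation error =", round(val_error[k], 4))
print("chosen k:", best_k)
```

With k = 1 the training error is zero yet the validation error is not, with k = 20 the model ignores the trend entirely, and the intermediate setting wins on held-out data. The same select-on-validation loop applies to the knobs the slide lists, such as hidden units, tree depth, or number of clusters.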
  24. Lesson: Divide and Conquer. Many simple models > Single complex model
     • Better “localized features”
     • Simpler “local models”
     • More interpretable features and models
     • Higher Accuracy
     • Faster Modeling Time
     • Lower Resource Requirements
     [Figure: the letters of the alphabet, in the order M W N U F P T A V Y S Z B E I J K R X C D L H O G Q]
  25. Lessons from Real-world Data Mining: Insights, Features, Decisions, Labels, Models
  26. Lesson: Interpret Predictions. What is the score? → Why is score that way?
     [Figure labels: Concept Space, Prediction Score Overlay]
     *This is not what we mean by the “art of data mining”
  27. Lesson: Learn Globally, Decide Locally
     [Figure labels: Accidents description, Density Overlay]
     “The Ford-Firestone dispute blew up in August 2000 and is still going strong. In response to claims that their 15-inch Wilderness AT, radial ATX and ATX II tire treads were separating from the tire core leading to grisly, spectacular crashes. Bridgestone/Firestone recalled 6.5 million tires….” -- Forbes
  28. Lesson: Prediction is not enough! Different Reasons, Different Decisions
     [Figure labels: Collection Notes, Probability of defaulting]
  29. Summary
     • Decisions driven more by data than by “gut feeling”
     • Converting data to decisions is Art + Science + Engineering
     • Insights: Right needles in a well understood Haystack
     • Features: Garbage In, Garbage Out
     • Models: Generalize, don’t Memorize
     • Labels: Explore thoroughly, Exploit efficiently
     • Decisions: Right decision for the right reason
     • Feedback: Adapt features, models, scores, decisions
  30. Questions?
     In theory, theory and practice are same. In practice, they are not. -- Lawrence Peter Berra
