Data-driven modeling: Lecture 01

12,830 views
12,667 views

Published on

Published in: Education, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
12,830
On SlideShare
0
From Embeds
0
Number of Embeds
3,820
Actions
Shares
0
Downloads
69
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Data-driven modeling: Lecture 01

  1. 1. Data-driven modeling APAM E4990 Jake Hofman Columbia University January 23, 2012Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 1 / 19
  2. 2. Data-dependent products • Effective/practical systems that learn from experience impact our daily lives, e.g.: • Recommendation systems • Spam detection • Optical character recognition • Face recognition • Fraud detection • Machine translation • ...Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 2 / 19
  3. 3. Learning by exampleJake Hofman (Columbia University) Data-driven modeling January 23, 2012 3 / 19
  4. 4. Learning by example • How did you solve this problem? • Can you make this process explicit (e.g. write code to do so)?Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 3 / 19
  5. 5. Learning by example • We learn quickly from few, relatively unstructured examples ... but we don’t understand how we accomplish this • We’d like to develop algorithms that enable machines to learn by example from large data setsJake Hofman (Columbia University) Data-driven modeling January 23, 2012 4 / 19
  6. 6. Got data? • Web service APIs expose lots of dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 5 / 19
  7. 7. Got data? • Many free, public data sets available onlineJake Hofman (Columbia University) Data-driven modeling January 23, 2012 6 / 19
  8. 8. Black-boxified?Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 7 / 19
  9. 9. Black-boxified?Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 7 / 19
  10. 10. Roadmap? Step 1: Have data Step 2: ??? Step 3: ProfitJake Hofman (Columbia University) Data-driven modeling January 23, 2012 8 / 19
  11. 11. Roadmap, take two 1 Get dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  12. 12. Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  13. 13. Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss functionJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  14. 14. Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss function 7 Develop algorithm to minimize lossJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  15. 15. Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss function 7 Develop algorithm to minimize loss 8 Choose performance measure 9 “Train” to minimize loss 10 “Test” to evaluate generalizationJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  16. 16. Topics • Supervised • Unsupervised • k-nearest neighbors • K-means • Naive Bayes • Mixture models • Linear regression • Principal components • Logistic regression analysis • Support vector machines • Topic models • Collaborative filtering • Matrix factorization • Data representation: feature space, selection, normalization • Model assessment: complexity control, cross-validation, ROC curve, Bayesian Occam’s razor • Large-scale learningJake Hofman (Columbia University) Data-driven modeling January 23, 2012 10 / 19
  17. 17. Everything old is new again1 • Many fields ... • Statistics • Pattern recognition • Data mining • Machine learning • ... similar goals • Extract and recognize patterns in data • Interpret or explain observations • Test validity of hypotheses • Efficiently search the space of hypotheses • Design efficient algorithms enabling machines to learn from data 1 http://cbcl.mit.edu/publications/theses/thesis-rifkin.pdfJake Hofman (Columbia University) Data-driven modeling January 23, 2012 11 / 19
  18. 18. Statistics vs. machine learning2 Glossary Machine learning Statistics network, graphs model weights parameters learning fitting generalization test set performance supervised learning regression/classification unsupervised learning density estimation, clustering large grant = $1,000,000 large grant= $50,000 nice place to have a meeting: nice place to have a meeting: Snowbird, Utah, French Alps Las Vegas in August 1 2 http: //anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 12 / 19
  19. 19. Philosophy • We would like models that: • Provide predictive and explanatory power • Are complex enough to describe observed phenomena • Are simple enough to generalize to future observationsJake Hofman (Columbia University) Data-driven modeling January 23, 2012 13 / 19
  20. 20. Example: Netflix PrizeJake Hofman (Columbia University) Data-driven modeling January 23, 2012 14 / 19
  21. 21. Example: Netflix PrizeJake Hofman (Columbia University) Data-driven modeling January 23, 2012 15 / 19
  22. 22. Shipping = FeatureJake Hofman (Columbia University) Data-driven modeling January 23, 2012 16 / 19
  23. 23. ReferencesJake Hofman (Columbia University) Data-driven modeling January 23, 2012 17 / 19
  24. 24. Disclaimer You may be bored if you already know how to ... • Acquire data from APIs • Clean/explore/visualize data • Classify and cluster various types of data (e.g., images, text) • Code in Python, R, SciPy/NumPy, etc. • Scale solutions to large data sets (e.g. Hadoop, SGD) • Script with unix tools on the command line, e.g. $ sed -e ’s/<[^>]*>//g’ < page.html > page.txtJake Hofman (Columbia University) Data-driven modeling January 23, 2012 18 / 19
  25. 25. Themes Data jeopardy Regardless of scale, it’s difficult to find the right questions to ask of the dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  26. 26. Themes Data hacking Cleaning and normalizing data is a substantial amount of the work (and likely impacts results)Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  27. 27. Themes Data hacking The ability to iterate quickly, asking and answering many questions, is crucialJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  28. 28. Themes Data hacking Hacks happen: sed/awk/grep are useful, and scaleJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  29. 29. Themes “Data science” Simple methods (e.g., linear models) work surprisingly well, especially with lots of dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  30. 30. Themes “Data science” It’s easy to cover your tracks—things are often much more complicated than they appearJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  31. 31. Predicting consumer activity with Web searchwith Sharad Goel, S´bastien Lahaie, David Pennock, Duncan Watts e "Right Round" c 10 20 Rank 30 Billboard 40 Search Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 WeekJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 6 / 71
  32. 32. Search predictionsMotivation "Right Round" c 10 Does collective search activity 20 Rank provide useful predictive signal about real-world outcomes? 30 Billboard 40 Search Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 WeekJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 7 / 71
  33. 33. Search predictionsMotivation Past work mainly focuses on predicting the present1 and ignores baseline models trained on publicly available data 8 Actual 7 Search Flu Level (Percent) 6 Autoregressive 5 4 3 2 1 2004 2005 2006 2007 2008 2009 2010 Date 1 Varian, 2009Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 8 / 71
  34. 34. Search predictionsMotivation We predict future sales for movies, video games, and music "Transformers 2" "Tom Clancys HAWX" "Right Round" a b c 10 Search Volume Search Volume 20 Rank 30 Billboard 40 Search −30 −20 −10 0 10 20 30 −30 −20 −10 0 10 20 30 Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 Time to Release (Days) Time to Release (Days) WeekJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 9 / 71
  35. 35. Search predictionsSearch models For movies and video games, predict opening weekend box office and first month sales, respectively: log(revenue) = β0 + β1 log(search) + For music, predict following week’s Billboard Hot 100 rank: billboardt+1 = β0 + β1 searcht + β2 searcht−1 + Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 10 / 71
  36. 36. Search predictionsSearch volumeJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 11 / 71

×