Data-driven modeling: Lecture 02

Published in: Education, Technology

  1. Data-driven modeling, APAM E4990. Jake Hofman, Columbia University. January 30, 2012.
  2. Outline: (1) Digit recognition, (2) Image classification, (3) Acquiring image data
  3. Digit recognition: Classification is a supervised learning task in which we aim to predict the correct label for an example given its features, e.g. determine which digit {0, 1, ..., 9} is depicted in each image (example images labeled 0, 5, 4, 1, 4, 9).
  4. Digit recognition: Determine which digit {0, 1, ..., 9} is depicted in each image.
  5. Images as arrays: Grayscale images ↔ 2-d arrays of M × N pixel intensities.
  6. Images as arrays: Grayscale images ↔ 2-d arrays of M × N pixel intensities. Represent each image as a “vector of pixels” by flattening the 2-d array of pixels to a 1-d vector.
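The flattening step can be sketched with NumPy. This is only an illustration: the tiny 3 × 4 array stands in for a real grayscale image, and the variable names are hypothetical rather than taken from the lecture code.

```python
import numpy as np

# Hypothetical 3 x 4 grayscale "image": a 2-d array of pixel intensities.
image = np.arange(12, dtype=np.uint8).reshape(3, 4)

# Flatten row by row into a 1-d "vector of pixels" of length M * N.
pixels = image.reshape(-1)

print(image.shape)   # (3, 4)
print(pixels.shape)  # (12,)
```

Row-major flattening means the pixel at row i, column j ends up at position i * N + j in the vector.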
  7. k-nearest neighbors classification: Memorize the training examples, then predict the label of a new point from the labels of its k closest training points. Intuition: nearby points have similar labels.
  8. k-nearest neighbors classification: Small k gives a complex decision boundary; large k results in coarse averaging.
  9. k-nearest neighbors classification: Evaluate performance on a held-out test set to assess generalization error.
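A minimal sketch of k-nearest neighbors with a held-out test set might look like the following. It is not the lecture's classify_digits.py: the function name, the toy data, and the choice of Euclidean distance with majority voting are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_test, k=1):
    """Label each test row by majority vote among its k nearest training rows."""
    # Cast to float so that unsigned pixel values do not wrap on subtraction.
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    predictions = []
    for x in X_test:
        # Euclidean distance (an assumed metric) to every memorized example.
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Toy training set: two well-separated clusters with labels 0 and 1.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Held-out test set, never used when "memorizing" the training data.
X_test = np.array([[0.5, 0.5], [5.5, 5.5], [1, 1]])
y_test = np.array([0, 1, 0])

predictions = knn_predict(X_train, y_train, X_test, k=3)
accuracy = (predictions == y_test).mean()
print(predictions, accuracy)
```

For digit images, each row of X_train and X_test would be one flattened pixel vector.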
  10. Digit recognition: Simple digit classifier with k = 1 nearest neighbor: ./classify_digits.py
  11. Outline: (1) Digit recognition, (2) Image classification, (3) Acquiring image data
  12. Image classification: Determine whether an image is a landscape or a headshot (example images labeled 'landscape' and 'headshot'). Represent each image with a binned RGB intensity histogram.
  13. Images as arrays: Color images ↔ 3-d arrays of M × N × 3 RGB pixel intensities.
        import matplotlib.image as mpimg
        I = mpimg.imread('chairs.jpg')
  15. Intensity histograms: Disregard all spatial information and simply count pixels by intensity (e.g. lots of pixels with bright green and dark blue).
  16. Intensity histograms: How many bins for pixel intensities? Too many bins give a noisy, overly complex representation of the data, while too few bins give an overly simple one.
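One way to build a binned RGB intensity histogram is sketched below. The slides do not say whether the lecture code bins the three channels jointly or separately, so the joint 3-d histogram here is an assumption, as are the function name, the default of 4 bins per channel, and the requirement of 8-bit values with exactly three channels.

```python
import numpy as np


def rgb_histogram(image, bins=4):
    """Return a normalized joint RGB histogram as a 1-d feature vector.

    Assumes `image` is an M x N x 3 array of 8-bit intensities (0-255).
    """
    # Drop spatial information: one row per pixel, one column per channel.
    pixels = np.asarray(image, dtype=float).reshape(-1, 3)
    # Count pixels in a bins x bins x bins grid over the RGB cube.
    hist, _ = np.histogramdd(pixels, bins=bins, range=[(0, 256)] * 3)
    # Flatten and normalize so images of different sizes are comparable.
    return hist.reshape(-1) / len(pixels)


# Tiny synthetic image: top row bright green, bottom row dark blue.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, :, 1] = 255
image[1, :, 2] = 64

features = rgb_histogram(image, bins=4)
print(features.shape)       # (64,)
print(features.nonzero())   # the two occupied bins
```

With b bins per channel the feature vector has b ** 3 entries, which is why a large bin count quickly yields a sparse, noisy representation.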
  17. Image classification: Classify with ./classify_flickr.py 16 9 flickr_headshot flickr_landscape. Change in performance on the test set with the number of neighbors:
        k = 1, accuracy = 0.7125
        k = 3, accuracy = 0.7425
        k = 5, accuracy = 0.7725
        k = 7, accuracy = 0.7650
        k = 9, accuracy = 0.7500
  18. Outline: (1) Digit recognition, (2) Image classification, (3) Acquiring image data
  19. Simple screen scraping: One-liner to download ESL digit data:
        wget -Nr --level=1 --no-parent http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/zip.digits
  20. Simple screen scraping: One-liner to scrape images from a webpage:
        wget -O - http://bit.ly/zxy0jN | tr '"=' '\n' | egrep '^http.*(png|jpg|gif)' | xargs wget
  21. Simple screen scraping: One-liner to scrape images from a webpage:
        wget -O - http://bit.ly/zxy0jN | tr '"=' '\n' | egrep '^http.*(png|jpg|gif)' | xargs wget
        • get page source
        • translate quotes and = to newlines
        • match urls with image extensions
        • download qualifying images
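The middle two steps of the one-liner (translating quotes and = to newlines, then matching URLs with image extensions) can be mirrored in Python roughly as follows. The function name and the sample HTML are hypothetical, and the fetch and download steps are left out so the sketch runs without network access.

```python
import re


def extract_image_urls(page_source):
    """Return candidate image URLs found in raw page source."""
    # Translate double quotes and '=' to newlines, like the tr step.
    lines = re.sub(r'["=]', "\n", page_source).split("\n")
    # Keep lines that start with http and contain an image extension,
    # like the egrep step.
    pattern = re.compile(r"^http.*(png|jpg|gif)")
    return [line for line in lines if pattern.search(line)]


# Hypothetical page source used only for illustration.
html = (
    '<img src="http://example.com/a.png"> '
    '<a href="http://example.com/page.html">x</a> '
    '<img src="http://example.com/b.jpg">'
)

urls = extract_image_urls(html)
print(urls)
```

Like the shell version, this is deliberately crude: it misses relative URLs and would also accept any URL that merely contains "png", "jpg", or "gif" somewhere after the scheme.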
  22. “cat flickr xargs wget”?
  23. Flickr API
  24. YQL: SELECT * FROM Internet [1]. http://developer.yahoo.com/yql
        [1] http://oreillynet.com/pub/e/1369
  25. YQL: Console. http://developer.yahoo.com/yql/console
  26. YQL + Python: Python function for public YQL queries:
        import json
        from urllib.parse import urlencode
        from urllib.request import urlopen

        # YQL_PUBLIC (the base URL of the public YQL endpoint) is defined elsewhere in the script

        def yql_public(query, env=False):
            # build dictionary of GET parameters
            params = {'q': query, 'format': 'json'}
            if env:
                params['env'] = env

            # escape query
            query_str = urlencode(params)

            # fetch results
            url = '%s?%s' % (YQL_PUBLIC, query_str)
            result = urlopen(url)

            # parse json and return
            return json.load(result)['query']['results']
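To make the "escape query" step concrete, here is a small offline sketch of how urlencode turns a parameter dictionary into a URL query string. The query text is only an example and no request is sent; parse_qs is used simply to show that the escaping is reversible.

```python
from urllib.parse import parse_qs, urlencode

# Example query text; any YQL statement would be escaped the same way.
query = "select * from flickr.photos.interestingness(20)"

# Same parameter dictionary shape as in the slide's function.
params = {"q": query, "format": "json"}

# urlencode percent-escapes reserved characters and joins pairs with '&'.
query_str = urlencode(params)
print(query_str)

# Decoding recovers the original values.
decoded = parse_qs(query_str)
print(decoded["q"][0] == query)
```

Escaping matters here because the query contains spaces, an asterisk, and parentheses, none of which can be placed in a URL verbatim.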
  27. YQL + Python + Flickr: Fetch info for “interestingness” photos:
        ./simpleyql.py 'select * from flickr.photos.interestingness(20) where api_key = "..."'
      Download thumbnails for photos tagged with “vivid”:
        ./download_flickr.py vivid 500 <api_key>
