Data-driven modeling: Lecture 02

Transcript

  • 1. Data-driven modeling, APAM E4990. Jake Hofman, Columbia University. January 30, 2012
  • 2. Outline: 1. Digit recognition 2. Image classification 3. Acquiring image data
  • 3. Digit recognition: Classification is a supervised learning task in which we aim to predict the correct label for an example given its features, e.g. determine which digit {0, 1, ..., 9} is depicted in each image. [Figure: digit images ↓ labels 0 5 4 1 4 9]
  • 4. Digit recognition: Determine which digit {0, 1, ..., 9} is depicted in each image
  • 6. Images as arrays: Grayscale images ↔ 2-d arrays of M × N pixel intensities. Represent each image as a “vector of pixels”, flattening the 2-d array of pixels to a 1-d vector
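A minimal sketch of the flattening step described in the slide above, using plain Python lists so it runs without extra dependencies. The function name `flatten_image` and the 2 × 3 example image are assumptions made for illustration; they are not taken from the lecture code.

```python
def flatten_image(image):
    """Flatten an M x N grid of pixel intensities into a 1-d list, row by row."""
    return [pixel for row in image for pixel in row]


# Hypothetical 2 x 3 grayscale image (M = 2 rows, N = 3 columns)
image = [
    [0, 128, 255],
    [64, 32, 16],
]

vector = flatten_image(image)
print(len(vector))  # M * N entries
print(vector)
```

Row-major order is one possible choice here; any fixed ordering works as long as every image is flattened the same way.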
  • 7. k-nearest neighbors classification: memorize the training examples, then predict labels using the labels of the k closest training points. Intuition: nearby points have similar labels
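The idea above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the lecture's `classify_digits.py`: the names `euclidean` and `knn_predict`, the choice of Euclidean distance, and the toy data are all assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, x, k=1):
    """Predict the label of x by majority vote among its k nearest training points."""
    # "Training" is just memorizing train_X and train_y; all work happens here.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of 2-d points
train_X = [[0, 0], [0, 1], [5, 5], [6, 5]]
train_y = ["a", "a", "b", "b"]

print(knn_predict(train_X, train_y, [0.2, 0.1], k=3))
print(knn_predict(train_X, train_y, [5.8, 5.0], k=1))
```

For images, each row of `train_X` would be a flattened pixel vector and each label a digit.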
  • 8. k-nearest neighbors classification: Small k gives a complex decision boundary; large k results in coarse averaging
  • 9. k-nearest neighbors classification: Evaluate performance on a held-out test set to assess generalization error
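A rough sketch of held-out evaluation for several values of k. Everything here is assumed for illustration: the synthetic two-class data, the 80/20 split, and the helper names `knn_predict` and `accuracy`. The accuracies it prints have no relation to the numbers reported in the lecture.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """train is a list of (features, label) pairs; vote among the k nearest."""
    nearest = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of held-out examples whose predicted label matches the true one."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


rng = random.Random(0)

# Synthetic data (an assumption): 50 points per class around two centers
data = [
    ([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label)
    for center, label in ((0.0, "a"), (8.0, "b"))
    for _ in range(50)
]

# Hold out 20% of the shuffled examples; they are never used for prediction votes
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

results = {k: accuracy(train, test, k) for k in (1, 3, 5)}
for k, acc in results.items():
    print("k = %d, accuracy = %.4f" % (k, acc))
```

The key design point is that the test examples play no part in the neighbor search, so the reported accuracy estimates performance on unseen data.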
  • 10. Digit recognition: Simple digit classifier with k=1 nearest neighbors: ./classify_digits.py
  • 11. Outline: 1. Digit recognition 2. Image classification 3. Acquiring image data
  • 12. Image classification: Determine if an image is a landscape or a headshot (image ↓ ’landscape’, image ↓ ’headshot’). Represent each image with a binned RGB intensity histogram
  • 14. Images as arrays: Color images ↔ 3-d arrays of M × N × 3 RGB pixel intensities
        import matplotlib.image as mpimg
        I = mpimg.imread('chairs.jpg')
  • 15. Intensity histograms: Disregard all spatial information and simply count pixels by intensity (e.g. lots of pixels with bright green and dark blue)
  • 16. Intensity histograms: How many bins for pixel intensities? Too many bins give a noisy, overly complex representation of the data, while too few bins give an overly simple one
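One possible reading of a "binned RGB intensity histogram" is a separate histogram per color channel, concatenated into a single feature vector. The sketch below takes that reading as an assumption; the lecture code may bin differently (for example, jointly over all three channels). The function name `rgb_histogram`, the 0..255 intensity range, and the tiny example image are also assumed.

```python
def rgb_histogram(image, bins=16):
    """Count pixels per intensity bin, separately for R, G and B.

    image: M x N grid of (r, g, b) tuples with integer intensities in 0..255.
    Returns a list of 3 * bins counts: R bins, then G bins, then B bins.
    """
    width = 256 / bins  # intensity range covered by each bin
    hist = [0] * (3 * bins)
    for row in image:
        for pixel in row:
            for channel, value in enumerate(pixel):
                index = min(int(value / width), bins - 1)
                hist[channel * bins + index] += 1
    return hist


# Hypothetical 1 x 2 image: little red, strong green, mid-range blue
image = [[(0, 255, 128), (10, 250, 130)]]

features = rgb_histogram(image, bins=4)
print(features)
```

Spatial layout is discarded entirely: shuffling the pixels of `image` leaves `features` unchanged. The `bins` argument controls the trade-off described above between a noisy, detailed representation and a coarse one.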
  • 17. Image classification: Classify: ./classify_flickr.py 16 9 flickr_headshot flickr_landscape
        Change in performance on the test set with the number of neighbors:
        k = 1, accuracy = 0.7125
        k = 3, accuracy = 0.7425
        k = 5, accuracy = 0.7725
        k = 7, accuracy = 0.7650
        k = 9, accuracy = 0.7500
  • 18. Outline: 1. Digit recognition 2. Image classification 3. Acquiring image data
  • 19. Simple screen scraping: One-liner to download ESL digit data
        wget -Nr --level=1 --no-parent http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/zip.digits
  • 21. Simple screen scraping: One-liner to scrape images from a webpage
        wget -O - http://bit.ly/zxy0jN | tr '"=' '\n' | egrep '^http.*(png|jpg|gif)' | xargs wget
        • get page source
        • translate quotes and = to newlines
        • match urls with image extensions
        • download qualifying images
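The middle two stages of the pipeline above (the `tr` and `egrep` steps) can be mirrored in Python roughly as follows. This is a sketch under assumptions: the function name `extract_image_urls` and the example page source are made up, and the fetching and downloading stages are left out so the example runs offline.

```python
import re

# Same pattern as the egrep stage: a line starting with http that
# contains png, jpg or gif somewhere after it
IMAGE_URL = re.compile(r"^http.*(png|jpg|gif)")


def extract_image_urls(page_source):
    """Split page source on quotes and '=' and keep lines that look like image URLs."""
    # Equivalent of the tr stage: turn every double quote and '=' into a newline
    translated = page_source.replace('"', "\n").replace("=", "\n")
    # Equivalent of the egrep stage: keep only matching lines
    return [line for line in translated.split("\n") if IMAGE_URL.search(line)]


# Hypothetical page source
page = (
    '<img src="http://example.com/a.png"> '
    '<a href="http://example.com/page.html">link</a> '
    '<img src="http://example.com/b.jpg">'
)

print(extract_image_urls(page))
```

Like the shell version, this is deliberately crude: it also splits URLs that contain `=` in a query string and misses relative or single-quoted URLs.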
  • 22. “cat flickr xargs wget”?
  • 23. Flickr API
  • 24. YQL: SELECT * FROM Internet [1]. http://developer.yahoo.com/yql
        [1] http://oreillynet.com/pub/e/1369
  • 25. YQL: Console. http://developer.yahoo.com/yql/console
  • 26. YQL + Python: Python function for public YQL queries
        # imports are not shown on the slide; Python 3 standard library assumed
        import json
        from urllib.parse import urlencode
        from urllib.request import urlopen
        # YQL_PUBLIC (the public YQL endpoint URL) is assumed to be defined elsewhere
        def yql_public(query, env=False):
            # build dictionary of GET parameters
            params = {'q': query, 'format': 'json'}
            if env:
                params['env'] = env
            # escape query
            query_str = urlencode(params)
            # fetch results
            url = '%s?%s' % (YQL_PUBLIC, query_str)
            result = urlopen(url)
            # parse json and return
            return json.load(result)['query']['results']
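The URL-building step of the function above can be tried without any network access. In this sketch the helper name `build_query_url`, the placeholder endpoint `http://example.com/yql`, and the sample query are assumptions; only `urlencode`, `urlsplit` and `parse_qs` are real standard-library calls.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint (an assumption), standing in for YQL_PUBLIC
ENDPOINT = "http://example.com/yql"


def build_query_url(query, env=False):
    """Build the GET URL the same way yql_public does, without fetching it."""
    params = {"q": query, "format": "json"}
    if env:
        params["env"] = env
    # urlencode escapes spaces, '*' and other reserved characters
    return "%s?%s" % (ENDPOINT, urlencode(params))


url = build_query_url("select * from some.table where tag = 'vivid'")
print(url)

# Decoding the query string recovers the original parameters
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Escaping matters here because the query contains spaces, quotes and `=`, none of which can appear raw in a URL query string.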
  • 27. YQL + Python + Flickr
        Fetch info for “interestingness” photos:
        ./simpleyql.py 'select * from flickr.photos.interestingness(20) where api_key="..."'
        Download thumbnails for photos tagged with “vivid”:
        ./download_flickr.py vivid 500 <api_key>