# Data-driven modeling: Lecture 02

Published in: Education, Technology

1. Data-driven modeling, APAM E4990. Jake Hofman, Columbia University. January 30, 2012.
2. Outline: (1) Digit recognition, (2) Image classification, (3) Acquiring image data.
3. Digit recognition: Classification is a supervised learning task in which we aim to predict the correct label for an example given its features, e.g. determine which digit {0, 1, ..., 9} is depicted in each image (the slide's sample images carry the labels 0, 5, 4, 1, 4, 9).
4. Digit recognition: Determine which digit {0, 1, ..., 9} is depicted in each image.
5. Images as arrays: Grayscale images ↔ 2-d arrays of M × N pixel intensities.
6. Images as arrays (continued): Represent each image as a “vector of pixels”, flattening the 2-d array of pixels to a 1-d vector.
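The flattening step can be sketched as follows. This is a minimal illustration, assuming a tiny made-up image stored as a nested list; the function name `flatten_image` is hypothetical, and with NumPy arrays (which the lecture's scripts presumably use) `reshape(-1)` would do the same job.

```python
def flatten_image(image):
    """Flatten an M x N grid of pixel intensities into a length M*N vector, row by row."""
    return [pixel for row in image for pixel in row]

# A made-up 2 x 3 grayscale image (illustrative values, not lecture data).
image = [
    [0, 128, 255],
    [64, 32, 16],
]
vector = flatten_image(image)
print(vector)
```

Every image of the same size maps to a vector of the same length, which is what lets a distance between two images be computed pixel by pixel.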
7. k-nearest neighbors classification: k-nearest neighbors memorizes the training examples and predicts labels using the labels of the k closest training points. Intuition: nearby points have similar labels.
8. k-nearest neighbors classification: Small k gives a complex decision boundary, while large k results in coarse averaging.
9. k-nearest neighbors classification: Evaluate performance on a held-out test set to assess generalization error.
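The k-nearest neighbors idea and its held-out evaluation can be sketched in a few lines. This is not the lecture's `classify_digits.py`; it is a minimal sketch assuming Euclidean distance, majority voting, and made-up 2-d toy points, and the names `squared_distance`, `knn_predict`, and `accuracy` are hypothetical.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_predict(train_X, train_y, x, k=1):
    """Predict a label for x by majority vote among the k closest training points."""
    # Rank the memorized training examples by their distance to x.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: squared_distance(pair[0], x))
    # Vote among the labels of the k nearest examples.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k=1):
    """Fraction of held-out test examples whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# Made-up toy data: two well-separated clusters (illustrative only).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
test_X = [(0.5, 0.5), (5.5, 5.5), (1, 1), (4, 4)]
test_y = ["a", "b", "a", "b"]

for k in (1, 3, 5):
    print(k, accuracy(train_X, train_y, test_X, test_y, k))
```

The test points never appear in the training set, so the reported accuracy estimates how well the classifier generalizes rather than how well it memorizes.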
10. Digit recognition: Simple digit classifier with k=1 nearest neighbors: `./classify_digits.py`
11. Outline: (1) Digit recognition, (2) Image classification, (3) Acquiring image data.
12. Image classification: Determine if an image is a landscape or a headshot (the slide shows one example of each, labeled 'landscape' and 'headshot'). Represent each image with a binned RGB intensity histogram.
13. Images as arrays: Color images ↔ 3-d arrays of M × N × 3 RGB pixel intensities:

        import matplotlib.image as mpimg

        # read the JPEG into an M x N x 3 array of RGB intensities
        I = mpimg.imread('chairs.jpg')
15. Intensity histograms: Disregard all spatial information and simply count pixels by intensity (e.g. lots of pixels with bright green and dark blue).
16. Intensity histograms: How many bins for pixel intensities? Too many bins give a noisy, overly complex representation of the data, while too few bins result in an overly simple one.
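A binned intensity histogram can be sketched as below. The slides do not say whether the lecture's script bins the three channels separately or jointly; this sketch assumes separate per-channel bins and 8-bit intensities (0 to 255), and the function name `rgb_histogram` is hypothetical.

```python
def rgb_histogram(pixels, bins=16):
    """Count pixels per intensity bin, separately for R, G, and B.

    Returns a flat list of 3 * bins counts: all red bins, then green, then blue.
    Spatial information is discarded; only the intensity counts remain.
    """
    width = 256 / bins  # assumes 8-bit intensities in 0..255
    hist = [0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp to the last bin so that 255 never falls outside the range.
            b = min(int(value // width), bins - 1)
            hist[channel * bins + b] += 1
    return hist

# Three made-up RGB pixels (illustrative only), binned coarsely into 4 bins per channel.
pixels = [(0, 0, 0), (255, 255, 255), (10, 200, 100)]
features = rgb_histogram(pixels, bins=4)
print(features)
```

The `bins` argument is the knob the slide asks about: raising it makes the feature vector longer and noisier, lowering it makes the representation coarser.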
17. Image classification: Classify with `./classify_flickr.py 16 9 flickr_headshot flickr_landscape`. Change in performance on the test set with the number of neighbors: k = 1, accuracy = 0.7125; k = 3, accuracy = 0.7425; k = 5, accuracy = 0.7725; k = 7, accuracy = 0.7650; k = 9, accuracy = 0.7500. Accuracy peaks at k = 5.
18. Outline: (1) Digit recognition, (2) Image classification, (3) Acquiring image data.
19. Simple screen scraping: One-liner to download the ESL digit data: `wget -Nr --level=1 --no-parent http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/zip.digits`
20. Simple screen scraping: One-liner to scrape images from a webpage: `wget -O - http://bit.ly/zxy0jN | tr '"=' '\n' | egrep '^http.*(png|jpg|gif)' | xargs wget`
21. Simple screen scraping: The one-liner to scrape images from a webpage works in four stages:
    • `wget -O -`: get page source
    • `tr`: translate quotes and = to newlines
    • `egrep`: match urls with image extensions
    • `xargs wget`: download qualifying images
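The middle two stages of the pipeline (split on quotes and `=`, then keep image URLs) can be mirrored in Python. This is a rough sketch rather than a translation of any lecture script: the function name `extract_image_urls` and the sample HTML are made up, and fetching the page and downloading the matches are left out so the example runs offline.

```python
import re

# Same pattern as the egrep stage: starts with "http" and contains an image extension.
IMAGE_URL = re.compile(r"^http.*(png|jpg|gif)")

def extract_image_urls(page_source):
    """Return the tokens of page_source that look like absolute image URLs."""
    # Mirror the tr stage: turn every double quote and equals sign into a newline.
    translated = page_source.translate(str.maketrans({'"': "\n", "=": "\n"}))
    # Mirror the egrep stage: keep only the lines that match the pattern.
    return [line for line in translated.split("\n") if IMAGE_URL.search(line)]

# Made-up page source (illustrative only).
html = (
    '<img src="http://example.com/a.png">'
    '<a href="http://example.com/page.html">link</a>'
    '<img src="/relative/b.jpg">'
)
urls = extract_image_urls(html)
print(urls)
```

Like the shell version, this is a heuristic: relative URLs are skipped, and any URL containing `=` is split before it can match.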
22. “cat flickr xargs wget”?
23. Flickr API
24. YQL: SELECT * FROM Internet (footnote 1: http://oreillynet.com/pub/e/1369). http://developer.yahoo.com/yql
25. YQL: Console: http://developer.yahoo.com/yql/console
26. YQL + Python: Python function for public YQL queries:

        # Python 3 imports (the slide does not show its imports)
        import json
        from urllib.parse import urlencode
        from urllib.request import urlopen

        # YQL_PUBLIC, the public YQL endpoint URL, is defined elsewhere in the script
        def yql_public(query, env=False):
            # build dictionary of GET parameters
            params = {'q': query, 'format': 'json'}
            if env:
                params['env'] = env
            # escape query
            query_str = urlencode(params)
            # fetch results
            url = '%s?%s' % (YQL_PUBLIC, query_str)
            result = urlopen(url)
            # parse json and return
            return json.load(result)['query']['results']
27. YQL + Python + Flickr: Fetch info for “interestingness” photos: `./simpleyql.py 'select * from flickr.photos.interestingness(20) where api_key="..."'`. Download thumbnails for photos tagged with “vivid”: `./download_flickr.py vivid 500 <api_key>`