T shirts, feminism, parenting, and data science

A talk I gave at the Seattle DAML Meetup, 10/24/2013


  1. Joel Grus, Chief Scientist, VoloMetrix (@joelgrus)
  2. About Me
     • Chief Scientist at VoloMetrix
     • Have a 2-year-old daughter
     • It did not take me long to discover that “boys” clothing is fun, “girls” clothing kind of sucks
  3. Typical “Toddler Boys” Shirt / Typical “Toddler Girls” Shirt
  4. Obvious to us, but can a computer figure it out?
  5. The Data
     • Downloaded the image of every “toddler boys” and “toddler girls” t-shirt from Carters, Children’s Place, Crazy 8, Gap Kids, Gymboree, Old Navy, and Target
     • 616 images of boys shirts and 446 images of girls shirts
     • The goal: to build a model that predicts “boy shirt” or “girl shirt” based on the images alone!
  6. Attempt #1: Colors
     • Each image is a collection of RGB pixels
     • There are 256 * 256 * 256 ~ 17 million possible colors (too many)
     • Bucket each of R, G, B into [0,85), [85,170), or [170,255]
     • This gives 3 * 3 * 3 = 27 possible colors
     • Use features “does image contain at least one pixel of color j?”
     • Train logistic regression model on 80% of shirts, test on other 20%
  7. Color Model Performance
     • P(girl shirt | “girl shirt”) = 75%
     • P(boy shirt | “boy shirt”) = 77%
     • P(“girl shirt” | girl shirt) = 63%
     • P(“boy shirt” | boy shirt) = 86%
     • [Histogram: # of shirts (boys, girls) by “Confidence Score”; > 0 means “boy shirt”, < 0 means “girl shirt”]
  8. “girlier” / “boyier”
  9. “girlier”: less colorful / “boyier”: more colorful
  10. Attempt #2: Eigenshirts
     • To compare images, rescale all of them to 138 x 138
     • Chose this size because many were 138 x 138 already; the others were mostly bigger
     • Using R, G, B as coordinates for each pixel, think of each image as a point in 138 * 138 * 3 = 57,132-dimensional space
     • Obviously, with 57k features and only about 1,000 shirts, this will overfit
     • Use dimensionality reduction to find the 10 most “interesting” dimensions, project shirts into that 10-d subspace, build the model there
     • Each subspace dimension determines a (Platonic ideal) “eigenshirt”
  11. What does projection look like?
  12. Almost all miscategorized shirts have weak predictions (overall 93% accuracy)
  13. “girlier” / “boyier”
  14. Future Directions
     • Look at text on shirt (but too lazy to transcribe it)
     • Try to make images same size / background color
     • Build model to predict how “fun” a shirt is (but will require tedious hand-labeling)
     • ??
  15. More info
     • Code (but not data) is on https://github.com/joelgrus/shirts
     • Two blog posts on joelgrus.com, both linked from the GitHub README (or Google them, they have the same title as this talk)
     • Follow me on Twitter: @joelgrus
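
The color features from Attempt #1 can be sketched in a few lines of plain Python. This is only an illustration of the bucketing idea, not the code from the talk's repository: the function names `bucket`, `color_index`, and `color_features` and the three-pixel sample image are hypothetical, and the model-training step is left out.

```python
def bucket(value, width=85):
    """Map a 0-255 channel value to bucket 0, 1, or 2."""
    # min() keeps 255 in the top bucket (255 // 85 would otherwise be 3).
    return min(value // width, 2)


def color_index(r, g, b):
    """Combine the three channel buckets into one color id in 0..26."""
    return 9 * bucket(r) + 3 * bucket(g) + bucket(b)


def color_features(pixels):
    """Return 27 binary features: does any pixel fall in color j?"""
    features = [0] * 27
    for r, g, b in pixels:
        features[color_index(r, g, b)] = 1
    return features


# Hypothetical three-pixel "image": pure red, white, and dark gray.
pixels = [(255, 0, 0), (255, 255, 255), (40, 40, 40)]
features = color_features(pixels)
```

Each shirt then becomes a 27-element vector of zeros and ones, which is the kind of input a logistic regression model could be trained on.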

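
The projection step from Attempt #2 might look roughly like the sketch below. The slides only say "dimensionality reduction"; this example assumes principal component analysis computed with NumPy's SVD, which is one common way to get "eigen"-style basis images. The data is random stand-in noise at a much smaller size than the talk's 138 x 138 images, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumption): 50 random "images" of 8 x 8 RGB pixels.
# The talk used roughly 1,000 shirts rescaled to 138 x 138.
n_images, side, k = 50, 8, 10
images = rng.integers(0, 256, size=(n_images, side, side, 3))

# Flatten each image into one row of side * side * 3 coordinates.
X = images.reshape(n_images, -1).astype(float)

# Center the data so components describe variation around the mean shirt.
mean_shirt = X.mean(axis=0)
X_centered = X - mean_shirt

# Rows of Vt are orthonormal directions ordered by explained variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
eigenshirts = Vt[:k]                     # shape (k, side * side * 3)

# Project every shirt onto the k eigenshirts to get its k-d coordinates.
projected = X_centered @ eigenshirts.T   # shape (n_images, k)

# Each eigenshirt can be reshaped back into an image for viewing.
first_eigenshirt_image = eigenshirts[0].reshape(side, side, 3)
```

A classifier would then be fit on `projected` (10 numbers per shirt) instead of on the raw pixel coordinates, which is what keeps the feature count far below the number of shirts.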