Joel Grus

Chief Scientist, VoloMetrix
@joelgrus
About Me
• Chief Scientist at VoloMetrix
• Have a 2-year-old daughter
• It did not take me long to discover that “boys” clothing is fun and “girls”
clothing kind of sucks
[Image: typical “Toddler Boys” shirt]
[Image: typical “Toddler Girls” shirt]
Obvious to us, but can a computer figure it
out?
The Data
• Downloaded an image of every “toddler boys” and “toddler girls” t-shirt from:
  • Carters
  • Children’s Place
  • Crazy 8
  • Gap Kids
  • Gymboree
  • Old Navy
  • Target

• 616 images of boys’ shirts and 446 images of girls’ shirts
• The goal: to build a model that predicts “boy shirt” or “girl shirt” just based
on the images!
Attempt #1: Colors
• Each image is a collection of RGB pixels
• There are 256 * 256 * 256 ~ 17 million possible colors (too many)
• Bucket each of R, G, B into [0,85), [85,170), or [170,256), so that 255 falls in the top bucket
• This gives 3 * 3 * 3 = 27 possible colors
• Use features “does image contain at least one pixel of color j?”
• Train logistic regression model on 80% of shirts, test on other 20%
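The bucketing and the “contains at least one pixel of color j” features can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: the function names are hypothetical, and it assumes each image is already available as an iterable of (R, G, B) tuples with channel values 0 to 255.

```python
def bucket(value):
    """Map a 0..255 channel value to bucket 0, 1, or 2 (width 85; 255 goes to the top bucket)."""
    return min(value // 85, 2)


def color_index(r, g, b):
    """Combine the three channel buckets into a single color index 0..26."""
    return bucket(r) * 9 + bucket(g) * 3 + bucket(b)


def color_features(pixels):
    """Return 27 binary features: feature j is 1 if any pixel falls in color j."""
    features = [0] * 27
    for r, g, b in pixels:
        features[color_index(r, g, b)] = 1
    return features


# Hypothetical three-pixel "image": black, white, and a reddish-purple pixel
example_pixels = [(0, 0, 0), (255, 255, 255), (200, 10, 100)]
features = color_features(example_pixels)
```

One 27-element vector per shirt would then be the input to the logistic regression, with a random 80/20 split for training and testing.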
Color Model Performance

P(girl shirt | “girl shirt”) = 75%
P(boy shirt | “boy shirt”) = 77%
P(“girl shirt” | girl shirt) = 63%
P(“boy shirt” | boy shirt) = 86%
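
In these formulas the quoted label is the model's prediction and the unquoted label is the truth, so the first two numbers are precision and the last two are recall. A small sketch of how such numbers could be computed from test-set predictions; the function name and the five-shirt example are made up for illustration and are not the talk's data.

```python
def precision_recall(actual, predicted, label):
    """Return (precision, recall) for one label.

    precision = P(actually label | predicted label)
    recall    = P(predicted label | actually label)
    """
    predicted_as = sum(1 for p in predicted if p == label)
    actually = sum(1 for a in actual if a == label)
    correct = sum(1 for a, p in zip(actual, predicted) if a == p == label)
    precision = correct / predicted_as if predicted_as else float("nan")
    recall = correct / actually if actually else float("nan")
    return precision, recall


# Made-up labels for five shirts (illustration only)
actual = ["boy", "boy", "boy", "girl", "girl"]
predicted = ["boy", "boy", "girl", "girl", "boy"]

boy_precision, boy_recall = precision_recall(actual, predicted, "boy")
girl_precision, girl_recall = precision_recall(actual, predicted, "girl")
```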

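[Figure: histogram of # of shirts (boys, girls) by “Confidence Score” (> 0 “boy shirt”, < 0 “girl shirt”)]
[Figure: shirt images along a “girlier” to “boyier” axis]
[Figure: shirt images along a “girlier” to “boyier” axis and a less colorful to more colorful axis]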
Attempt #2: Eigenshirts
• To compare images, rescale all of them to 138 x 138
• Chose this size because many were 138 x 138 already
• Others mostly bigger

• Using R, G, B as coordinates for each pixel, think of each image as a
point in 138 * 138 * 3 = 57,132-dimensional space
• Obviously, with 57k features and only about 1,000 shirts (616 + 446 = 1,062), a model fit directly in this space will overfit
• Use dimensionality reduction to find the 10 most “interesting”
dimensions, project shirts into 10-d subspace, build model there
• Each subspace dimension determines a (Platonic ideal) “eigenshirt”
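The slide does not name the dimensionality reduction technique. The sketch below assumes principal component analysis (which the “eigenshirt” name suggests, by analogy with eigenfaces), computed with NumPy's SVD. The function names and the tiny random stand-in images are hypothetical; real input would be one flattened 138 * 138 * 3 row per shirt.

```python
import numpy as np


def top_components(images, k):
    """Return the mean image and the top-k principal directions.

    images: array of shape (n_images, n_features), one flattened image per row.
    """
    mean = images.mean(axis=0)
    centered = images - mean
    # Rows of vt are orthonormal directions, sorted by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]


def project(images, mean, components):
    """Coordinates of each image in the subspace spanned by the components."""
    return (images - mean) @ components.T


rng = np.random.default_rng(0)
# Stand-in data: 20 fake "images" of 4 x 4 RGB pixels, flattened to 48 features
fake_images = rng.random((20, 4 * 4 * 3))

mean, eigenshirts = top_components(fake_images, k=10)
coords = project(fake_images, mean, eigenshirts)
```

Each row of `eigenshirts` can be reshaped back to image dimensions and viewed as a picture, and `coords` holds the 10 numbers per shirt that a classifier would be trained on.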
What does projection look like?
Almost all miscategorized shirts have
weak predictions (overall 93% accuracy)
[Figure: shirt images along a “girlier” to “boyier” axis]
Future Directions
• Look at text on shirt (but too lazy to transcribe it)
• Try to make images same size / background color
• Build model to predict how “fun” a shirt is (but will require tedious
hand-labeling)
• ??
More info
• Code (but not data) is on https://github.com/joelgrus/shirts
• Two blog posts on joelgrus.com, both linked from the GitHub README
(or Google them; they have the same title as this talk)
• Follow me on Twitter: @joelgrus

T shirts, feminism, parenting, and data science
