Presented by: Huy Nguyen @huyng
Computer vision at scale with
Hadoop and Storm
Copyright © 2014 Yahoo, Inc.
Outline
▪ Computer vision at 10,000 feet
▪ Hadoop training system
▪ Storm continuous stream processing
▪ Hadoop batch proc...
How can we help Flickr users better
organize, search, and browse
their photo collection?
photo credit: Quinn Dombrowski
What is AutoTagging?
Any photo you upload to Flickr is now automatically
classified using computer vision software
Class P...
Demo
Auto Tags Feeding Search
▪
▪
▪
▪
▪
▪
Our Goals
▪ Train thousands of classifiers
▪ Classify millions of photos per day
▪ Compute auto tags on backlog of billion...
Our system for training classifiers
▪ High level overview of image classification
▪ Hadoop workflow
Classification in a nutshell
How do we define a classifier?
A classifier is simply the line “z = wx -
b” which divides pos...
Image classification in a nutshell
To classify images we need to convert
raw pixels into a “feature” to plug into x
comput...
How are image classifiers made?
flower Class Name
Negative
Examples
(approx. 100K)
Positive Examples
(approx. 10K)
Repeat ...
Hadoop Classifier Training
Hadoop Classifier Training
HDFS
Pixel Data
+
Label Data
1 <base64>
2 <base64>
3 <base64>
4 <base64>
.
.
n <base64>
dog 1 T...
Hadoop Classifier Training - JOIN
map
map
map
reduce
Pixels
Labels
1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64>...
Hadoop Classifier Training - TRAIN
1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat,TRUE), (dog, FALSE)]
3 <p...
Hadoop Classifier Training
▪ Workflow coordinated with OOZIE
Hadoop Classifier Training
▪ 10,000 mappers
▪ 1,000 reducers
▪ A few thousand classifiers in less than 6 hours
▪ Replaced ...
Area of Improvements
▪ Reducers have all training data in memory
▪ A potential bottleneck if we want to expand training se...
Our system for tagging the flickr corpus
▪ Batch processing with Hadoop
▪ Stream processing with Storm
Batch processing in Hadoop
www HDFS Autotags
Mapper
Search
Engine
Indexer
HDFS
Indexing
Reducer
Autotags
MapperAutotags
Ma...
Hadoop Auto tagging
HDFS
Pixel Data
Autotag Results
1 <base64>
2 <base64>
3 <base64>
4 <base64>
.
.
n <base64>
1 { “cat”: ...
Hadoop classification performance
▪ 10,000 Mappers
▪ Processed in about 1 week entire corpus
Main challenge: packaging our code
▪ Heterogenous Python & C++ codebase
▪ How do we get all of the shared library dependen...
Main challenge: packaging our code
▪ tried: pyinstaller, didn’t work
▪ rolled our own solution using strace
▪ /autotags_pa...
Stream processing (our first attempt)
www Redis
PHP
WorkersPHP
WorkersPHP
Workers
fork autotags a.jpg
search
db
Why we started looking into Storm
Why we started looking into Storm
Feature
Computation
Classification
300 ms 10 ms
Classifier
Load
300 ms
Classifier loadin...
Solutions we evaluated at the time
▪ Mem-mapping model
▪ Batching up operations
▪ Daemonize our autotagging code
Enter
Storm
Storm computational framework
▪ Spouts
▪ The source of your data
stream. Spouts “emit”
tuples
▪ Tuples
▪ An atomic unit of...
Defining a topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("images", new ImageSpout(), 10);
bui...
Defining a bolt
class Autotag(storm.BasicBolt):
def __init__(self):
self.classifier.load()
def process(self, tuple):
<proc...
Our final solution
www Redis Autotags
Spout
Autotags
Bolt
Search
Engine
Indexer
Bolt
DB
Writer
Bolt
Topology
Storm architecture
Supervisors
(10 nodes)
Nimbus
(1 node)
Zookeepers
(3 nodes)
Spouts and bolts live here
Fault tolerant against node failures
Nimbus server dies
▪ no new jobs can be submitted
▪ current jobs stay running
▪ simpl...
storm commands
storm jar topology-jar-path class
storm commands
storm list
storm commands
storm kill topology-name
storm commands
storm deactivate topology-jar-path class
storm commands
storm activate topology-name
Is Storm just another queuing system?
Yes, and more
▪ A framework for separating application design from
scaling concerns
▪ Has a great deployment story
▪ atomi...
Try storm out!
https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
https://github.com/nathanmarz/...
Thank you
Presentation copyright © 2014 Yahoo, Inc.
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Upcoming SlideShare
Loading in...5
×

Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)

13,066

Published on

This presentation focuses on how Flickr recently brought scalable computer vision and deep learning to Flickr's multi-billion collection of photos using Apache Hadoop and Storm.

Video of the presentation is available here: https://www.youtube.com/watch?v=OrhAbZGkW8k

From the May 14, 2014 San Francisco Hadoop User Group meeting. For more information on the SFHUG see http://www.meetup.com/hadoopsf/

If you'd like to work on interesting challenges like this at Flickr in San Francisco, please see http://www.flickr.com/jobs

Published in: Technology, Lifestyle
1 Comment
27 Likes
Statistics
Notes
No Downloads
Views
Total Views
13,066
On Slideshare
0
From Embeds
0
Number of Embeds
16
Actions
Shares
0
Downloads
111
Comments
1
Likes
27
Embeds 0
No embeds

No notes for slide

Transcript of "Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)"

  1. 1. Presented by: Huy Nguyen @huyng Computer vision at scale with Hadoop and Storm Copyright © 2014 Yahoo, Inc.
  2. 2. Outline ▪ Computer vision at 10,000 feet ▪ Hadoop training system ▪ Storm continuous stream processing ▪ Hadoop batch processing ▪ Q&A
  3. 3. How can we help Flickr users better organize, search, and browse their photo collection? photo credit: Quinn Dombrowski
  4. 4. What is AutoTagging? Any photo you upload to Flickr is now automatically classified using computer vision software Class Probability Flowers 0.98 Outdoors 0.95 Cat 0.001 Grass 0.6 Classifiers Classifiers Classifiers
  5. 5. Demo
  6. 6. Auto Tags Feeding Search ▪ ▪ ▪ ▪ ▪ ▪
  7. 7. Our Goals ▪ Train thousands of classifiers ▪ Classify millions of photos per day ▪ Compute auto tags on backlog of billions of photos ▪ Enable reprocessing bi-weekly and every quarter ▪ Do it all within 90 days of being acquired!
  8. 8. Our system for training classifiers ▪ High level overview of image classification ▪ Hadoop workflow
  9. 9. Classification in a nutshell How do we define a classifier? A classifier is simply the line “z = wx - b” which divides positive and negative examples with the largest margin. Entirely defined by w and b Training is simply adjusting w and b to find the line with the largest margin dividing negative and positive example data Classification is done by computing z, given some data point x
  10. 10. Image classification in a nutshell To classify images we need to convert raw pixels into a “feature” to plug into x compute feature
  11. 11. How are image classifiers made? flower Class Name Negative Examples (approx. 100K) Positive Examples (approx. 10K) Repeat a few thousand times for each class Compute Features Train Classifier w,b
  12. 12. Hadoop Classifier Training
  13. 13. Hadoop Classifier Training HDFS Pixel Data + Label Data 1 <base64> 2 <base64> 3 <base64> 4 <base64> . . n <base64> dog 1 TRUE dog 2 FALSE dog 3 TRUE cat 1 FALSE cat 2 TRUE cat 3 FALSE . . Pixel Data Label Data photo_id, pixel_data class, photo_id, label
  14. 14. Hadoop Classifier Training - JOIN map map map reduce Pixels Labels 1 <pixelsb64> [(cat, FALSE), (dog, TRUE)] 2 <pixelsb64> [(cat,TRUE), (dog, FALSE)] 3 <pixelsb64> [(cat, FALSE), (dog, TRUE)] . . ▪ mappers emit (photo_id, pixels) OR (photo_id, class, label) ▪ reducers, takes advantage of sorting to emit (photo_id, pixels, json(label data))
  15. 15. Hadoop Classifier Training - TRAIN 1 <pixelsb64> [(cat, FALSE), (dog, TRUE)] 2 <pixelsb64> [(cat,TRUE), (dog, FALSE)] 3 <pixelsb64> [(cat, FALSE), (dog, TRUE)] . . map map map reduce compute feature training classifier dog clf. cat clf. ▪ mappers compute features, emits a (class, feat64, label) for each class image associated with image ▪ reducers, takes advantage of sorting to train. Emits (class, w, b)
  16. 16. Hadoop Classifier Training ▪ Workflow coordinated with OOZIE
  17. 17. Hadoop Classifier Training ▪ 10,000 mappers ▪ 1,000 reducers ▪ A few thousand classifiers in less than 6 hours ▪ Replaced OpenMPI based system ▪ Easier deployment ▪ Access to more hardware ▪ Much simpler code
  18. 18. Area of Improvements ▪ Reducers have all training data in memory ▪ A potential bottleneck if we want to expand training set for any given set of classifiers ▪ Known path to avoiding this bottleneck ▪ Incrementally update models as new data arrives
  19. 19. Our system for tagging the flickr corpus ▪ Batch processing with Hadoop ▪ Stream processing with Storm
  20. 20. Batch processing in Hadoop www HDFS Autotags Mapper Search Engine Indexer HDFS Indexing Reducer Autotags MapperAutotags Mapper
  21. 21. Hadoop Auto tagging HDFS Pixel Data Autotag Results 1 <base64> 2 <base64> 3 <base64> 4 <base64> . . n <base64> 1 { “cat”: .98, “dog”: 0.3 } 2 { “cat”: .97, “dog”: 0.2 } 3 { “cat”: .2, “dog”: 0.3 } 4 { “cat”: .5, “dog”: 0.9 } . . n { “cat”: .90, “dog”: 0.23 } Pixel Data Autotag Results photo_id, pixel_data photo_id, results
  22. 22. Hadoop classification performance ▪ 10,000 Mappers ▪ Processed in about 1 week entire corpus
  23. 23. Main challenge: packaging our code ▪ Heterogenous Python & C++ codebase ▪ How do we get all of the shared library dependencies into a heterogenous system distributed across many machines?
  24. 24. Main challenge: packaging our code ▪ tried: pyinstaller, didn’t work ▪ rolled our own solution using strace ▪ /autotags_package ▪ /lib ▪ /lib/python2.7/site-packages ▪ /bin ▪ /app
  25. 25. Stream processing (our first attempt) www Redis PHP WorkersPHP WorkersPHP Workers fork autotags a.jpg search db
  26. 26. Why we started looking into Storm
  27. 27. Why we started looking into Storm Feature Computation Classification 300 ms 10 ms Classifier Load 300 ms Classifier loading accounted for 49% of total processing time
  28. 28. Solutions we evaluated at the time ▪ Mem-mapping model ▪ Batching up operations ▪ Daemonize our autotagging code
  29. 29. Enter Storm
  30. 30. Storm computational framework ▪ Spouts ▪ The source of your data stream. Spouts “emit” tuples ▪ Tuples ▪ An atomic unit of work ▪ Bolts ▪ Small programs that processes your tuple stream ▪ Topology ▪ A graph of bolts and spouts
  31. 31. Defining a topology TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("images", new ImageSpout(), 10); builder.setBolt("autotag", new AutoTagBolt(), 3) .shuffleGrouping("images") builder.setBolt("dbwriter", new DBWriterBolt(), 2) .shuffleGrouping("autotag")
  32. 32. Defining a bolt class Autotag(storm.BasicBolt): def __init__(self): self.classifier.load() def process(self, tuple): <process tuple here>
  33. 33. Our final solution www Redis Autotags Spout Autotags Bolt Search Engine Indexer Bolt DB Writer Bolt Topology
  34. 34. Storm architecture Supervisors (10 nodes) Nimbus (1 node) Zookeepers (3 nodes) Spouts and bolts live here
  35. 35. Fault tolerant against node failures Nimbus server dies ▪ no new jobs can be submitted ▪ current jobs stay running ▪ simply restart Supervisor node dies ▪ All unprocessed jobs for that node get reassigned ▪ Supervisor restarted Zookeeper node dies ▪ Restart node, no data loss as long as not too many die at once
  36. 36. storm commands storm jar topology-jar-path class
  37. 37. storm commands storm list
  38. 38. storm commands storm kill topology-name
  39. 39. storm commands storm deactivate topology-jar-path class
  40. 40. storm commands storm activate topology-name
  41. 41. Is Storm just another queuing system?
  42. 42. Yes, and more ▪ A framework for separating application design from scaling concerns ▪ Has a great deployment story ▪ atomic start, kill, deactivate for free ▪ no more rsync ▪ No single point of failure ▪ Wide industry adoption ▪ Is programming language agnostic
  43. 43. Try storm out! https://github.com/apache/incubator-storm/tree/master/examples/storm-starter https://github.com/nathanmarz/storm-deploy
  44. 44. Thank you Presentation copyright © 2014 Yahoo, Inc.
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×