Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)

This presentation focuses on how Flickr recently brought scalable computer vision and deep learning to Flickr's multi-billion collection of photos using Apache Hadoop and Storm.

Video of the presentation is available here: https://www.youtube.com/watch?v=OrhAbZGkW8k

From the May 14, 2014 San Francisco Hadoop User Group meeting. For more information on the SFHUG see http://www.meetup.com/hadoopsf/

If you'd like to work on interesting challenges like this at Flickr in San Francisco, please see http://www.flickr.com/jobs


Presentation Transcript

  • Computer vision at scale with Hadoop and Storm. Presented by: Huy Nguyen (@huyng). Copyright © 2014 Yahoo, Inc.
  • Outline ▪ Computer vision at 10,000 feet ▪ Hadoop training system ▪ Storm continuous stream processing ▪ Hadoop batch processing ▪ Q&A
  • How can we help Flickr users better organize, search, and browse their photo collection? photo credit: Quinn Dombrowski
  • What is AutoTagging? Any photo you upload to Flickr is now automatically classified by computer vision software. Example output (class: probability): Flowers 0.98, Outdoors 0.95, Grass 0.6, Cat 0.001.
  • Demo
  • Auto Tags Feeding Search
  • Our Goals ▪ Train thousands of classifiers ▪ Classify millions of photos per day ▪ Compute auto tags on backlog of billions of photos ▪ Enable reprocessing bi-weekly and every quarter ▪ Do it all within 90 days of being acquired!
  • Our system for training classifiers ▪ High level overview of image classification ▪ Hadoop workflow
  • Classification in a nutshell. How do we define a classifier? A classifier is simply the line z = w·x - b that divides positive and negative examples with the largest margin; it is entirely defined by w and b. Training adjusts w and b to find the line with the largest margin between the negative and positive example data. Classification computes z (and takes its sign) for a given data point x.
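    To make the (w, b) picture concrete, here is a minimal sketch of the classification step, with made-up numbers (illustrative only, not Flickr's production code): the classifier is just the pair (w, b), and classifying a point x means computing z = w·x - b and taking its sign.

        import numpy as np

        def classify(w, b, x):
            # z = w.x - b; a positive z means x falls on the positive side of the line
            z = np.dot(w, x) - b
            return z > 0

        # hypothetical trained parameters and feature vector
        w = np.array([0.4, -1.2, 0.7])
        b = 0.1
        x = np.array([0.9, 0.2, 0.5])
        print(classify(w, b, x))   # True: 0.36 - 0.24 + 0.35 - 0.1 = 0.37 > 0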
  • Image classification in a nutshell. To classify images, we first convert raw pixels into a "feature" vector, which becomes the x we plug into the classifier.
  • How are image classifiers made? For each class (e.g. "flower"), gather positive examples (approx. 10K) and negative examples (approx. 100K), compute features, and train a classifier to obtain its (w, b). Repeat a few thousand times, once for each class.
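    A hedged sketch of that per-class training loop: compute a feature per image, then fit a linear classifier to recover (w, b). The compute_feature placeholder and the scikit-learn LinearSVC trainer are assumptions; the deck does not say which feature extractor or training library Flickr used.

        import numpy as np
        from sklearn.svm import LinearSVC

        def compute_feature(pixels):
            # placeholder: the real system uses a learned image descriptor
            arr = np.asarray(pixels, dtype=float).ravel()
            return np.array([arr.mean(), arr.std()])

        def train_one_class(positive_images, negative_images):
            # roughly 10K positives and 100K negatives per class, per the slide
            X = np.array([compute_feature(img) for img in positive_images + negative_images])
            y = np.array([1] * len(positive_images) + [0] * len(negative_images))
            clf = LinearSVC().fit(X, y)
            # sklearn's decision function is w.x + intercept, so b = -intercept
            return clf.coef_[0], -clf.intercept_[0]

        # tiny fake "images" (2x2 pixel grids) just to show the call shape
        positives = [[[200, 210], [190, 205]], [[180, 220], [200, 195]]]
        negatives = [[[10, 20], [15, 5]], [[30, 25], [20, 10]]]
        w, b = train_one_class(positives, negatives)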
  • Hadoop Classifier Training
  • Hadoop Classifier Training: inputs on HDFS. Pixel Data: one record per photo, (photo_id, pixel_data), with base64-encoded pixels (1 <base64>, 2 <base64>, ..., n <base64>). Label Data: one record per (class, photo_id, label), e.g. (dog, 1, TRUE), (dog, 2, FALSE), (cat, 1, FALSE), (cat, 2, TRUE).
  • Hadoop Classifier Training - JOIN ▪ mappers emit either (photo_id, pixels) or (photo_id, class, label) ▪ reducers take advantage of the sort on photo_id to emit joined records (photo_id, pixels, json(label data)), e.g. 1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
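    The deck does not show the join code itself; the sketch below assumes a Hadoop Streaming reducer whose input arrives sorted by photo_id, with pixel records tagged PIXELS and label records tagged LABEL (those tags and the tab-separated layout are assumptions, not Flickr's actual format).

        import sys
        import json
        from itertools import groupby

        def keyed(stream):
            for line in stream:
                fields = line.rstrip("\n").split("\t")
                yield fields[0], fields[1:]            # photo_id, rest of the record

        for photo_id, records in groupby(keyed(sys.stdin), key=lambda kv: kv[0]):
            pixels, labels = None, []
            for _, rest in records:
                if rest[0] == "PIXELS":                # photo_id <tab> PIXELS <tab> <base64>
                    pixels = rest[1]
                elif rest[0] == "LABEL":               # photo_id <tab> LABEL <tab> class <tab> TRUE|FALSE
                    labels.append((rest[1], rest[2] == "TRUE"))
            if pixels is not None and labels:
                print("\t".join([photo_id, pixels, json.dumps(labels)]))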
  • Hadoop Classifier Training - TRAIN ▪ mappers compute features and emit a (class, feat64, label) record for each class associated with the image ▪ reducers take advantage of the sort on class to train one classifier per class (e.g. a dog classifier, a cat classifier) and emit (class, w, b)
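    A hedged sketch of the TRAIN reducer under the same streaming assumptions: rows arrive sorted by class as (class, base64 feature, label), and one linear classifier is trained per class. The float32 feature encoding and the LinearSVC trainer are assumptions.

        import sys
        import base64
        import numpy as np
        from itertools import groupby
        from sklearn.svm import LinearSVC

        def rows(stream):
            for line in stream:
                cls, feat_b64, label = line.rstrip("\n").split("\t")
                feat = np.frombuffer(base64.b64decode(feat_b64), dtype=np.float32)
                yield cls, feat, label == "TRUE"

        # each class is expected to have both positive and negative examples
        for cls, group in groupby(rows(sys.stdin), key=lambda r: r[0]):
            feats, labels = zip(*[(f, y) for _, f, y in group])
            clf = LinearSVC().fit(np.vstack(feats), np.array(labels))
            # emit (class, w, b); sklearn's intercept has the opposite sign to z = w.x - b
            w_b64 = base64.b64encode(clf.coef_[0].astype(np.float32).tobytes()).decode()
            print("\t".join([cls, w_b64, str(float(clf.intercept_[0]))]))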
  • Hadoop Classifier Training ▪ Workflow coordinated with OOZIE
  • Hadoop Classifier Training ▪ 10,000 mappers ▪ 1,000 reducers ▪ A few thousand classifiers in less than 6 hours ▪ Replaced OpenMPI based system ▪ Easier deployment ▪ Access to more hardware ▪ Much simpler code
  • Areas of improvement ▪ Reducers hold all training data in memory ▪ A potential bottleneck if we want to expand the training set for any given set of classifiers ▪ Known path to avoiding this bottleneck: incrementally update models as new data arrives
  • Our system for tagging the flickr corpus ▪ Batch processing with Hadoop ▪ Stream processing with Storm
  • Batch processing in Hadoop: photo data flows from www into HDFS, through the Autotags mappers and an indexing reducer, and on to the search engine indexer.
  • Hadoop Auto tagging: input on HDFS is Pixel Data, (photo_id, pixel_data) with base64-encoded pixels; output is Autotag Results, (photo_id, results), e.g. (1, { "cat": 0.98, "dog": 0.3 }), (2, { "cat": 0.97, "dog": 0.2 }).
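    A hedged sketch of this step as a Hadoop Streaming mapper. The CLASSIFIERS dictionary and the feature extractor are stand-ins for the trained models (not from the deck), and the sigmoid is used here only to turn raw scores into 0-1 values like those shown above.

        import sys
        import json
        import base64
        import numpy as np

        # hypothetical: {class_name: (w, b)} loaded from the training job's output
        CLASSIFIERS = {"cat": (np.array([0.8, -0.4]), 0.1),
                       "dog": (np.array([-0.2, 0.6]), 0.3)}

        def compute_feature(pixels):
            # placeholder: the real system uses a learned image descriptor
            arr = np.frombuffer(pixels, dtype=np.uint8).astype(float)
            return np.array([arr.mean() / 255.0, arr.std() / 255.0])

        for line in sys.stdin:
            photo_id, pixels_b64 = line.rstrip("\n").split("\t")
            x = compute_feature(base64.b64decode(pixels_b64))
            scores = {cls: float(1.0 / (1.0 + np.exp(-(np.dot(w, x) - b))))
                      for cls, (w, b) in CLASSIFIERS.items()}
            print("\t".join([photo_id, json.dumps(scores)]))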
  • Hadoop classification performance ▪ 10,000 mappers ▪ Processed the entire corpus in about 1 week
  • Main challenge: packaging our code ▪ Heterogeneous Python & C++ codebase ▪ How do we get all of the shared library dependencies into a heterogeneous system distributed across many machines?
  • Main challenge: packaging our code ▪ Tried PyInstaller; it didn't work ▪ Rolled our own solution using strace ▪ Resulting package layout: /autotags_package ▪ /lib ▪ /lib/python2.7/site-packages ▪ /bin ▪ /app
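    The slides only name the approach; the sketch below shows one way the strace trick could work: run the application under strace during a dry run, record every shared library it actually opens, and copy those into the self-contained package. The "app.py" entry point, the --dry-run flag, and the output paths are hypothetical.

        import re
        import shutil
        import subprocess
        from pathlib import Path

        package_lib = Path("autotags_package/lib")
        package_lib.mkdir(parents=True, exist_ok=True)

        # trace file opens made by the app (and its children) during a dry run
        subprocess.run(["strace", "-f", "-e", "trace=open,openat", "-o", "trace.log",
                        "python", "app.py", "--dry-run"], check=False)

        libs = set()
        for line in open("trace.log"):
            m = re.search(r'"([^"]+\.so[^"]*)"', line)
            if m and Path(m.group(1)).exists():
                libs.add(m.group(1))

        for lib in sorted(libs):
            shutil.copy(lib, package_lib)   # bundle every shared library that was loaded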
  • Stream processing (our first attempt): uploads flow from www into Redis; PHP workers pull each photo (a.jpg), fork the autotags binary on it, and write the results to the search db.
  • Why we started looking into Storm
  • Why we started looking into Storm: per-photo cost was classifier load 300 ms, feature computation 300 ms, classification 10 ms. Classifier loading alone accounted for 49% of total processing time (300 ms of 610 ms).
  • Solutions we evaluated at the time ▪ Memory-mapping the model ▪ Batching up operations ▪ Daemonizing our autotagging code
  • Enter Storm
  • Storm computational framework ▪ Spouts: the source of your data stream; spouts "emit" tuples ▪ Tuples: an atomic unit of work ▪ Bolts: small programs that process your tuple stream ▪ Topology: a graph of bolts and spouts
  • Defining a topology
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("images", new ImageSpout(), 10);
    builder.setBolt("autotag", new AutoTagBolt(), 3)
           .shuffleGrouping("images");
    builder.setBolt("dbwriter", new DBWriterBolt(), 2)
           .shuffleGrouping("autotag");
  • Defining a bolt
    class Autotag(storm.BasicBolt):
        def __init__(self):
            # load the classifier once, when the bolt starts
            self.classifier.load()

        def process(self, tuple):
            # process tuple here
            pass
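    For context, a fuller hedged sketch of what such a bolt might look like using Storm's multilang storm.py helper: the classifiers are loaded once when the bolt starts, so per-photo work is only feature computation plus classification. The load_classifiers and compute_feature helpers below are stubs, not Flickr's code.

        import base64
        import numpy as np
        import storm   # storm.py multilang helper shipped with Apache Storm

        def load_classifiers():
            # stub: production code would load thousands of trained (w, b) pairs
            return {"cat": (np.array([0.5, -0.2]), 0.1)}

        def compute_feature(pixels):
            # stub: production code computes a learned image descriptor
            arr = np.frombuffer(pixels, dtype=np.uint8).astype(float)
            return np.array([arr.mean() / 255.0, arr.std() / 255.0])

        class AutotagBolt(storm.BasicBolt):
            def initialize(self, conf, context):
                self.classifiers = load_classifiers()   # pay the loading cost once

            def process(self, tup):
                photo_id, pixels_b64 = tup.values
                x = compute_feature(base64.b64decode(pixels_b64))
                scores = {cls: float(np.dot(w, x) - b)
                          for cls, (w, b) in self.classifiers.items()}
                storm.emit([photo_id, scores])

        AutotagBolt().run()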
  • Our final solution: a Storm topology in which uploads flow from www into Redis, the Autotags spout pulls them, the Autotags bolt tags them, and the search engine indexer bolt and DB writer bolt write the results out.
  • Storm architecture ▪ Nimbus (1 node) ▪ Zookeeper (3 nodes) ▪ Supervisors (10 nodes), where the spouts and bolts live
  • Fault tolerant against node failures ▪ Nimbus server dies: no new jobs can be submitted, current jobs stay running; simply restart it ▪ Supervisor node dies: all unprocessed jobs for that node get reassigned; the supervisor is restarted ▪ Zookeeper node dies: restart the node; no data is lost as long as not too many die at once
  • storm commands storm jar topology-jar-path class
  • storm commands storm list
  • storm commands storm kill topology-name
  • storm commands storm deactivate topology-name
  • storm commands storm activate topology-name
  • Is Storm just another queuing system?
  • Yes, and more ▪ A framework for separating application design from scaling concerns ▪ Has a great deployment story ▪ atomic start, kill, deactivate for free ▪ no more rsync ▪ No single point of failure ▪ Wide industry adoption ▪ Is programming language agnostic
  • Try Storm out! https://github.com/apache/incubator-storm/tree/master/examples/storm-starter https://github.com/nathanmarz/storm-deploy
  • Thank you Presentation copyright © 2014 Yahoo, Inc.