
Terabyte-scale image similarity search with Hadoop

Talk given at Hadoop Summit Europe 2014, Amsterdam, Netherlands on 02.04.2014

Talk abstract: In this talk I focus on a specific Hadoop application, image similarity search, and present our experience in designing, building and testing a Hadoop-based image similarity search system that scales to terabyte-sized image collections. I start with an overview of how to adapt image retrieval techniques to the MapReduce model. Second, I describe the image indexing and searching workloads and show how these workflows are rather atypical for Hadoop: for example, I explain how to tune Hadoop for such computational tasks and specify the parameters and values that deliver the best performance. Next I present the Hadoop cluster heterogeneity problem and describe a solution to it based on a platform-aware Hadoop configuration. Then I introduce the tools, provided by the standard Apache Hadoop framework, that are useful for a large class of workloads similar to ours, where a large auxiliary data structure is required for processing the dataset. Finally, I give an overview of a series of experiments conducted on a four-terabyte image dataset (the biggest reported in the academic literature). The findings are shared as best practices and recommendations for practitioners working with huge multimedia collections.

Speaker: Dr. Denis Shestakov is an experienced researcher in the area of big data engineering and, more recently, a practitioner working as a Hadoop/MapReduce consultant. Denis has been involved in various big data projects in web analytics and search, multimedia search and bioinformatics. See his profile at LinkedIn: http://fi.linkedin.com/in/dshestakov/

  • Check for more material at http://www.slideshare.net/denshe/bigdata13-071013 & http://www.slideshare.net/denshe/cbmi13-scalable-190613
    For careful study, check the works on slide 33. Feel free to request full texts by emailing me.


  1. Terabyte-scale image similarity search with Hadoop Denis Shestakov Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014
  2. About me ● Big Data researcher/engineer ○ recent projects: large-scale image retrieval ○ before: web crawling ● Hadoop/MapReduce contractor ○ design/development/tuning of Hadoop applications Denis Shestakov denshe at gmail.com linkedin: linkedin.com/in/dshestakov
  3. Talk Outline ● Intro to image search ● Image retrieval with MapReduce ● Image indexing/searching workloads ● Hadoop tools for large joins ● Smart Hadoop configuration ● Misc & conclusions
  4. Intro to Image Search ● Finding images given a text ○ dog →
  5. Intro to Image Search ● Finding images given an image ○ By content similarity
  6. Image Search Applications ● Regular image search ○ Google Images, Bing Images, TinEye, etc. ● Product search (by image) ● Object recognition ○ Face, logo, vehicle, etc. ● Computer vision ● Augmented reality ● Medical imaging ● Astrophysics
  7. Intro to Image Search
  8. Intro to Image Search How does it work? ● Images resized to a smaller size ● Then transformed into the chosen feature representation ○ image → set of feature descriptors (= high-dimensional vectors) ○ Many transformations exist ■ SIFT (Scale-Invariant Feature Transform) used by us
  9. Intro to Image Search How does it work? image_id | SIFT descriptor: 10011 | 21, 143, 5, …, 201, 186; 10011 | 121, 14, 75, …, 20, 109; 10011 | 37, 40, 0, …, 213, 96; …; 10011 | 81, 235, 67, …, 102, 63. Typical: several hundred feature descriptors per image
  10. Intro to Image Search How does it work? ● Compare (e.g., by calculating Euclidean distance) feature descriptors of a query image with descriptors of images in the collection ● Images with ‘closest’ descriptors are similar to the query image
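The comparison step above can be sketched in a few lines of plain Python. This is illustrative only: toy 5-dimensional descriptors stand in for 128-dimensional SIFT vectors, and the helper names (`sq_dist`, `nearest_image`) are made up for this sketch, not from the talk.

```python
# Sketch: brute-force matching of a query descriptor against a collection,
# using squared Euclidean distance (monotone in the true distance, so the
# nearest neighbor is the same and the square root can be skipped).

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_image(query_desc, collection):
    """Return the image_id whose descriptor is closest to query_desc.

    collection: iterable of (image_id, descriptor) pairs.
    """
    return min(collection, key=lambda item: sq_dist(query_desc, item[1]))[0]

collection = [
    (10011, [21, 143, 5, 201, 186]),
    (10012, [121, 14, 75, 20, 109]),
    (10013, [37, 40, 0, 213, 96]),
]
print(nearest_image([36, 41, 2, 210, 99], collection))  # → 10013
```

As the next slide notes, this brute-force comparison is exactly what becomes too costly at scale, which motivates indexing.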
  11. Intro to Image Search Why MapReduce? ● Direct comparison of descriptors is costly even for very small collections ● Lots of approaches to ‘organize’ feature descriptors for fast search ○ Build an index ○ Index all the descriptors ○ At search, check query descriptors only against certain groups of descriptors
  12. Image Retrieval with MapReduce Why MapReduce? ● Existing single-machine approaches are poorly scalable ○ up to ~10-20 mln images ● But multimedia grows exponentially ● Scaling is required …
  13. Image Retrieval with MapReduce Use case: ● Copyright violation detection in a large image databank ○ >100 mln images ● Searching for batches of images ○ Thousands of images in one query ○ Focus on throughput, not on response time for an individual image ● SIFT features
  14. Image Retrieval with MapReduce Indexing images ● Generating the index tree ● Clustering images into a large set of clusters (max cluster size = 5000) ○ Mapper input: ■ unsorted SIFT descriptors ■ index tree (loaded by every mapper) ○ Mapper output: ■ (cluster_id, SIFT) ○ Reducer output: ■ SIFTs sorted by cluster_id
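The indexing round above can be simulated outside Hadoop in plain Python. A minimal sketch under stated assumptions: a flat list of centroids stands in for the index tree, and `map_phase`/`reduce_phase` are hypothetical stand-ins for the real Hadoop mapper and reducer, not the talk's actual code.

```python
# Sketch of the indexing MapReduce round, simulated in-process.
# map: assign each SIFT to its nearest centroid, emit (cluster_id, sift).
# reduce: group SIFTs by cluster_id, sorted by key (Hadoop's shuffle/sort).

def map_phase(sifts, centroids):
    """Emit (cluster_id, sift) pairs via nearest-centroid assignment."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    for sift in sifts:
        cluster_id = min(range(len(centroids)),
                         key=lambda c: sq_dist(sift, centroids[c]))
        yield cluster_id, sift

def reduce_phase(pairs):
    """Collect SIFTs per cluster, ordered by cluster_id."""
    clusters = {}
    for cid, sift in pairs:
        clusters.setdefault(cid, []).append(sift)
    return dict(sorted(clusters.items()))

centroids = [[0, 0], [100, 100]]          # stand-in for the index tree
sifts = [[3, 4], [98, 97], [1, 1]]
print(reduce_phase(map_phase(sifts, centroids)))
# → {0: [[3, 4], [1, 1]], 1: [[98, 97]]}
```

In the real job the "index tree" is a hierarchical structure loaded by every mapper (see slides 20 and 24 on why its size matters), but the emit-and-group pattern is the same.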
  15. Image Retrieval with MapReduce Searching ● Generating the lookup table (MapReduce) ○ indexing query SIFTs ● Finding best matches for query SIFTs (MapReduce) ○ Mapper input: ■ sorted SIFT descriptors ■ lookup table (loaded by every mapper) ○ Mapper output: ■ (query-sift-id, kNN of image-ids) ○ Reducer output: ■ Best votes (image-ids) for query-image-id
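The voting step of the search round can likewise be sketched in plain Python. Assumptions to note: the function names are invented for this sketch, `k` is a free parameter, and a tiny 2-dimensional toy cluster replaces real clustered SIFT data.

```python
# Sketch of the search round: each query SIFT votes for the image-ids of its
# k nearest collection descriptors; the reducer tallies votes per query image.
from collections import Counter

def knn_image_ids(query_sift, cluster_sifts, k=2):
    """image_ids of the k nearest (image_id, sift) entries to query_sift."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(cluster_sifts, key=lambda item: sq_dist(query_sift, item[1]))
    return [image_id for image_id, _ in ranked[:k]]

def best_votes(votes_per_query_sift, top=1):
    """Reducer side: tally image-id votes across all SIFTs of one query image."""
    tally = Counter()
    for votes in votes_per_query_sift:
        tally.update(votes)
    return [image_id for image_id, _ in tally.most_common(top)]

cluster = [(1, [0, 0]), (2, [1, 1]), (3, [50, 50])]
queries = [[0, 0], [1, 1], [2, 2]]        # SIFTs of one query image
print(best_votes([knn_image_ids(q, cluster, k=1) for q in queries]))  # → [2]
```

The map-side trick (next slide) is that each mapper only needs the part of the lookup table matching the clusters in its input split.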
  16. Image Retrieval with MapReduce In a nutshell: ● Indexing phase ○ Clustering SIFTs with one-pass k-means ● Searching phase ○ Map-side join of clustered SIFTs and lookup table (query SIFTs)
  17. Image search workloads Time to discuss Hadoop specifics: ● Standard Apache Hadoop distribution, v. 1.0.1 ○ (!) No changes in Hadoop internals ■ Easy to migrate ● Around 100 nodes from Grid5000 ○ 8/24 cores, 24/32/48 GB RAM per node ○ capacity/performance varied
  18. Image search workloads Dataset: ● 110 mln images ○ ~30 billion SIFT descriptors ○ 4 TB ○ Largest reported in the literature ○ Images resized to 150 px on the largest side ○ Also worked with a 1 TB subset ○ Used as a distracting dataset
  19. Image search workloads Queries: ● Query batches ○ Up to 250k query images in one batch ○ A batch includes original images and their distorted variants ■ Some variants are very hard to find ● e.g., print-crumple-scan ● Check if original images are returned as top votes ○ (out of scope) state-of-the-art search quality
  20. Image search workloads Indexing workload characteristics ● computationally intensive (map phase) ● data intensive (at map & reduce phases) ● large auxiliary data structure (i.e., the index tree) ○ grows as the dataset grows ○ e.g., 1.8 GB for 110M images (4 TB) ● map input < map output ● network is heavily utilized during shuffling
  21. Image search workloads Indexing workload
  22. Image search workloads Searching workload ● large aux. data structure (e.g., the lookup table)
  23. Image search workloads ● Basic settings: ○ 512 MB HDFS block size ○ 3 replicas ○ 8 map slots ○ 2 reduce slots ● 4 TB dataset: ○ 4 map slots
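The basic settings above map onto standard Hadoop 1.x configuration parameters. A minimal sketch, assuming stock Hadoop 1.0 parameter names (the values come from the slide; file layout is illustrative):

```xml
<!-- hdfs-site.xml -->
<property><name>dfs.block.size</name><value>536870912</value></property> <!-- 512 MB -->
<property><name>dfs.replication</name><value>3</value></property>

<!-- mapred-site.xml -->
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>8</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>
```

For the 4 TB runs the map-slot maximum would be lowered to 4, since each mapper must also hold the 1.8 GB index tree in memory.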
  24. Hadoop tools for large joins ● Some workloads require all mappers to load a large data structure ○ Like the image indexing/searching workloads ● Spreading a data file across all nodes ○ Hadoop DistributedCache ● Not efficient if the structure is gigabytes in size ○ Partial solution: increase HDFS block size → decrease #mappers ● Another approach: multithreaded mappers ○ Not well documented
  25. Hadoop tools for large joins ● A multithreaded mapper spawns a configured number of threads; each thread executes a map task ● Mapper threads share the RAM ● Downsides: ○ synchronization when reading input ○ synchronization when writing output
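The shared-RAM idea behind multithreaded mappers can be illustrated outside Hadoop. The sketch below is a plain-Python analogue, not Hadoop's `MultithreadedMapper` itself: worker threads share one copy of the large structure instead of each task loading its own, and both input and output are synchronized, which is exactly the downside the slide mentions.

```python
# Illustration: threads share one in-memory structure (the "index tree");
# reading input and writing output are the synchronized choke points.
import queue
import threading

def run_threaded_map(records, shared_tree, map_fn, num_threads=2):
    inputs = queue.Queue()
    for rec in records:
        inputs.put(rec)
    out, out_lock = [], threading.Lock()

    def worker():
        while True:
            try:
                rec = inputs.get_nowait()   # synchronized input read
            except queue.Empty:
                return
            result = map_fn(rec, shared_tree)
            with out_lock:                  # synchronized output write
                out.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

shared_tree = {"offset": 10}  # stand-in for the 1.8 GB index tree
result = run_threaded_map([1, 2, 3], shared_tree, lambda r, t: r + t["offset"])
print(sorted(result))  # → [11, 12, 13]
```

With 4 mapper slots of 2 threads each, a node runs 8 map threads but holds only 4 copies of the index tree, which is what buys the speedup on the next slide.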
  26. Hadoop tools for large joins Indexing 4 TB with 4 mapper slots, each running two threads ● index tree size: 1.8 GB ● Indexing time on 100 nodes: 8h27min → 6h8min
  27. Hadoop tools for large joins ● In some workloads mappers require only a part of the auxiliary data structure ○ I.e., the part relevant to the data block being processed ○ E.g., the image searching workflow ● Approach: Hadoop MapFile ○ Very efficient for big batches, >10000 query images ■ ~2 times faster on batches of around 25000 images
  28. Smart Hadoop configuration Here is the problem: ● Apache Hadoop, v. 1.0.1 ● Capacity/performance of nodes varied ○ 8/24 cores, 24-48 GB RAM, etc. ● One config file (#mappers, #reducers, max map/reduce memory, ...) for all nodes ● An issue for memory-intensive workloads!
  29. Smart Hadoop configuration Solution (hack): ● deploy Hadoop on all nodes with settings addressing the least equipped nodes ● create sub-cluster configuration files adjusted to better equipped nodes ○ substitute the original config file with the new one on better equipped nodes ● restart tasktrackers with the new configuration files on better equipped nodes Call it smart deployment ● Or is it known under another name?
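The smart-deployment steps above can be sketched as a per-node script. A hedged sketch only: the config directory names and RAM thresholds are invented for illustration; the `hadoop-daemon.sh --config` restart is the standard Hadoop 1.x mechanism but is commented out here so the sketch stands alone.

```shell
#!/bin/sh
# Sketch of "smart deployment": pick a per-node config dir by the node's RAM,
# then restart the tasktracker with it. Paths and thresholds are illustrative.

choose_conf() {
  mem_gb=$1
  if [ "$mem_gb" -ge 48 ]; then
    echo /etc/hadoop/conf.big     # e.g., more map slots, larger task heaps
  elif [ "$mem_gb" -ge 32 ]; then
    echo /etc/hadoop/conf.mid
  else
    echo /etc/hadoop/conf.base    # settings for the least equipped nodes
  fi
}

# On each better-equipped node (illustrative, commented out):
# CONF=$(choose_conf "$(free -g | awk '/Mem:/ {print $2}')")
# hadoop-daemon.sh --config "$CONF" stop  tasktracker
# hadoop-daemon.sh --config "$CONF" start tasktracker

choose_conf 24
```

The key point of the hack is that the jobtracker never needs to know: only the per-node tasktracker settings differ, so no Hadoop internals change.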
  30. Smart Hadoop configuration Indexing 1 TB on 106 nodes: 75 min → 65 min
  31. Conclusions ● Several directions for further optimization ● Presented techniques applicable to video and audio datasets ○ Given a transformation into feature vectors ○ Only small changes expected (e.g., new Writable) ● Hadoop smart deployment trick ● (Wanted) Best practices for Hadoop job history log analysis
  32. Things to share Hadoop job history logs available on request: ● Describe indexing/searching the 4 TB dataset ● Insights on better analysis/visualization are welcome ● Get the cbmi13 example set at http://goo.gl/e06wE
  33. Supporting Materials Check full texts of our publications: ● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and searching 100M images with Map-Reduce. In Proc. ACM ICMR'13, 2013. ● D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013. ● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale image similarity search: experience and best practice. In Proc. IEEE BigData'13, 2013.
  34. Acknowledgements ● My colleagues at INRIA Rennes ● Aalto University ● Grid5000 infrastructure
  35. That’s it! Thanks!
