Scalable high-dimensional indexing with Hadoop

Talk given at CBMI 2013 (Veszprém, Hungary) on 19.06.2013

  1. Scalable high-dimensional indexing with Hadoop
     TEXMEX team, INRIA Rennes, France
     Denis Shestakov, PhD
     denis.shestakov at {aalto.fi,inria.fr}
     linkedin: linkedin.com/in/dshestakov
     mendeley: mendeley.com/profiles/denis-shestakov
     Denis Shestakov, Diana Moise, Gylfi Gudmundsson, Laurent Amsaleg
  2. Outline
     ● Motivation
     ● Approach overview: scaling indexing & searching using Hadoop
     ● Experimental setup: datasets, resources, configuration
     ● Results
     ● Observations & implications
     ● Things to share
     ● Future directions
  3. Motivation
     ● Big data is here
       ○ Lots of multimedia content
       ○ Even setting aside the 'big' companies, 1TB/day of multimedia is now common for many parties
     ● Solution: apply more computational power
       ○ Luckily, such power is now easier to access via grid/cloud resources
     ● Applications:
       ○ Large-scale image retrieval: e.g., detecting copyright violations in huge image repositories
       ○ Google Goggles-like systems: annotating the scene
  4. Our approach
     ● Index & search a huge image collection using the MapReduce-based eCP algorithm
       ○ See our work at ICMR'13: Indexing and searching 100M images with MapReduce [7]
       ○ See Section II for a quick overview
     ● Use the Grid5000 platform
       ○ Distributed infrastructure available to French researchers & their partners
     ● Use the Hadoop framework
       ○ Most popular open-source implementation of the MapReduce model
       ○ Data stored in HDFS, which splits it into chunks (64MB or often bigger) and distributes them across nodes
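  As a rough illustration of the HDFS side of this setup (not the actual TEXMEX code), the sketch below stores a descriptor file in HDFS with a 1024MB block size, the figure quoted on the scalability slide later in the deck; the class name and path arguments are hypothetical.

  ```java
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Minimal sketch (Hadoop 1.x): write the descriptor collection into HDFS
  // with large blocks, so each block later becomes one big input split
  // (and hence one map task) for indexing/searching.
  public class StoreDescriptors {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.setLong("dfs.block.size", 1024L * 1024 * 1024); // 1024MB blocks
      FileSystem fs = FileSystem.get(conf);
      // args[0]: local descriptor file, args[1]: HDFS destination
      fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
      fs.close();
    }
  }
  ```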
  5. Our approach
     ● Hadoop is used for both indexing and searching
     ● Our search scenario (sketched in the code after this slide):
       ■ Searching for a batch of images
         ● Thousands of images in one run
         ● Focus on throughput, not on response time for an individual image
       ■ Use case: copyright violation detection
     ● Note: the indexed dataset can be searched on a single machine with adequate disk capacity if necessary
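  To make the batch idea concrete, here is a deliberately simplified mapper sketch, not the actual eCP search code: every map task holds the whole query batch in memory and streams its share of indexed descriptors past it, emitting votes, so per-task costs are amortized over thousands of queries. The key/value layout, the empty placeholder batch, and the brute-force distance loop are assumptions for illustration only; the real search uses the index tree and a lookup table.

  ```java
  import java.io.IOException;
  import java.nio.ByteBuffer;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapreduce.Mapper;

  // Simplified stand-in for a batch search mapper: input key = cluster id,
  // input value = one indexed SIFT descriptor; output key = query image id,
  // output value = a vote.
  public class BatchSearchMapper
      extends Mapper<IntWritable, BytesWritable, IntWritable, IntWritable> {

    private static final int DIM = 128;          // SIFT dimensionality
    // Query batch: one row per query descriptor. In a real run it would be
    // loaded once in setup() from a file shipped to every node; an empty
    // placeholder keeps this sketch self-contained.
    private float[][] queryBatch = new float[0][DIM];

    @Override
    protected void map(IntWritable clusterId, BytesWritable packed, Context ctx)
        throws IOException, InterruptedException {
      float[] indexed = decode(packed);
      int best = -1;
      float bestDist = Float.MAX_VALUE;
      for (int q = 0; q < queryBatch.length; q++) {   // scan the whole batch
        float d = 0f;
        for (int i = 0; i < DIM; i++) {
          float diff = indexed[i] - queryBatch[q][i];
          d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = q; }
      }
      if (best >= 0) {
        ctx.write(new IntWritable(best), new IntWritable(1)); // one vote
      }
    }

    // Placeholder decoding: assumes descriptors are stored as raw floats.
    private static float[] decode(BytesWritable w) {
      float[] v = new float[DIM];
      ByteBuffer.wrap(w.getBytes(), 0, w.getLength()).asFloatBuffer().get(v);
      return v;
    }
  }
  ```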
  6. Experimental setup
     ● Used the Grid5000 platform:
       ○ Nodes in the rennes site of Grid5000
         ■ Up to 110 nodes available
         ■ Node capacity/performance varied
           ● Heterogeneous, coming from three clusters
           ● From 8 to 24 cores per node
           ● From 24GB to 48GB RAM per node
     ● Hadoop ver. 1.0.1
       ○ (!) No changes in Hadoop internals
         ■ Pros: easy for others to migrate, try and compare
         ■ Cons: not top performance
  7. Experimental setup
     ● Over 100 million images (~30 billion SIFT descriptors)
       ○ Collected from the Web and provided by one of the partners in the Quaero project
         ■ One of the largest reported in the literature
       ○ Images resized to 150px on the largest side
       ○ Worked with
         ■ The whole set (~4TB)
         ■ A subset of 20 million images (~1TB)
       ○ Used as the distracting dataset
  8. Experimental setup
     ● For evaluation of indexing quality:
       ○ Added to the distracting datasets:
         ■ INRIA Copydays (127 images)
       ○ Queried for
         ■ Copydays batch (3055 images = 127 original images and their associated variants, incl. strong distortions, e.g. print-crumple-scan)
         ■ 12k batch (12081 images = 245 random images from the dataset and their variants)
       ○ Checked whether the original images were returned as the top-voted search results
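  The quality criterion is simple to state in code; the sketch below is a hypothetical helper (not the authors' evaluation scripts) that computes the fraction of query variants whose top-voted result is their original image.

  ```java
  import java.util.Map;

  // Hypothetical helper illustrating the quality check on this slide:
  // for each query variant, did the original image come back as the
  // top-voted search result?
  public class TopVotedCheck {
    public static double fractionCorrect(Map<String, String> topVotedByVariant,
                                         Map<String, String> originalByVariant) {
      if (topVotedByVariant.isEmpty()) return 0.0;
      int correct = 0;
      for (Map.Entry<String, String> e : topVotedByVariant.entrySet()) {
        if (e.getValue().equals(originalByVariant.get(e.getKey()))) {
          correct++;                              // original ranked first
        }
      }
      return (double) correct / topVotedByVariant.size();
    }
  }
  ```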
  9. Results: workflow overview
     ● Experiment on indexing & searching 1TB took 5-6 hours
  10. Results: indexing 1TB
  11. Results: indexing 4TB
     ● 4TB
     ● 100 nodes
     ● Used tuned parameters
       ○ Except a change in #mappers/#reducers per node
         ■ To fit the bigger index tree (for 4TB) into RAM
         ■ 4 mappers / 2 reducers (see the config snippet after this slide)
     ● Time: 507 min
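  For reference, the per-node mapper/reducer slot counts mentioned here are TaskTracker-side settings in Hadoop 1.x; a configuration along the following lines (standard Hadoop 1.x property names, values from this slide) would give the 4/2 split:

  ```xml
  <!-- mapred-site.xml on each worker node (Hadoop 1.x) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>    <!-- up to 4 concurrent map tasks per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>    <!-- up to 2 concurrent reduce tasks per node -->
  </property>
  ```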
  12. Results: search quality
  13. Results: search scalability
  14. Results: search execution
      Search 12k batch over 1TB using 100 nodes
  15. Results: searching 4TB
     ● 4TB
     ● 87 nodes
     ● Copydays query batch (3k images)
       ○ Throughput: 460ms per image
     ● 12k query batch
       ○ Throughput: 210ms per image
     ● Bigger batches improve throughput only marginally
       ○ bigger batch -> bigger lookup table -> more RAM per mapper required -> fewer mappers per node
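  Reading these throughput figures as batch wall-clock time divided by batch size (one plausible interpretation; the slide does not spell it out), a quick back-of-envelope conversion gives the implied end-to-end batch times:

  ```java
  // Back-of-envelope conversion of the 4TB search figures, assuming
  // "throughput" = batch wall-clock time / batch size.
  public class ThroughputCheck {
    public static void main(String[] args) {
      report("Copydays batch", 3055, 0.460);  // 460 ms/image (slide)
      report("12k batch", 12081, 0.210);      // 210 ms/image (slide)
    }
    static void report(String name, int images, double secondsPerImage) {
      double minutes = images * secondsPerImage / 60.0;
      System.out.printf("%s: %d images -> ~%.0f min wall-clock%n",
          name, images, minutes);
    }
  }
  ```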
  16. Observations & implications
     ● HDFS block size limits scalability
       ○ 1TB dataset => 1186 blocks of 1024MB size
       ○ Assuming 8-core nodes and the reported searching method: no scaling after 149 nodes (i.e. 8x149=1192)
       ○ Solutions:
         ■ Smaller HDFS blocks, e.g., scaling up to 280 nodes for 512MB blocks
         ■ Re-visit the search process: e.g., partial loading of the lookup table
     ● Big data is here, but not the resources to process it
       ○ E.g., indexing & searching >10TB was not possible given the resources we had
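  The 149-node ceiling follows directly from one map task per HDFS block and one map slot per core; a tiny check of that arithmetic (numbers taken from this slide; smaller blocks raise the ceiling, as the 512MB case on the slide indicates):

  ```java
  // Back-of-envelope version of the scalability argument: with one map task
  // per HDFS block and 8 map slots per 8-core node, nodes beyond
  // ceil(blocks / 8) add no speed-up to the map phase.
  public class BlockScalingCheck {
    public static void main(String[] args) {
      long blocks = 1186;        // 1TB dataset stored as 1024MB blocks (slide)
      int mapSlotsPerNode = 8;   // 8-core nodes, one map slot per core
      long maxUsefulNodes = (blocks + mapSlotsPerNode - 1) / mapSlotsPerNode;
      System.out.println("No scaling beyond ~" + maxUsefulNodes + " nodes");
      // Prints 149, matching the 8 x 149 = 1192 slots quoted on the slide.
    }
  }
  ```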
  17. Things to share
     ● Our methods/system can be applied to audio datasets
       ○ No major changes expected
       ○ Contact me if interested
     ● Code for the MapReduce-eCP algorithm is available on request
       ○ Should run smoothly on your Hadoop cluster
       ○ Interested in comparisons
     ● Hadoop job history logs behind our experiments (not only those reported at CBMI) are available on request
       ○ They describe the indexing/searching of our dataset by giving details on map/reduce task execution
       ○ Insights on better analysis/visualization are welcome
       ○ Job logs for the CBMI'13 experiments: http://goo.gl/e06wE
  18. Future directions
     ● Deal with big batches of query images
       ○ ~200k query images
     ● Share auxiliary data (index tree, lookup table) across mappers
       ○ Multithreaded map tasks (see the sketch after this slide)
     ● (environment-specific) Test scalability on more nodes
       ○ Use several sites of the Grid5000 infrastructure
         ■ rennes + nancy sites (up to 300 nodes) -- in progress
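  One stock Hadoop route to the multithreaded-map-task idea is MultithreadedMapper, which runs several map threads inside a single task JVM so one in-memory index tree / lookup table can serve them all. The sketch below only illustrates the wiring; SearchMapper is a hypothetical stand-in for the actual search mapper and must be thread-safe.

  ```java
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

  public class MultithreadedSearchSetup {

    /** Hypothetical stand-in for the search mapper; must be thread-safe. */
    public static class SearchMapper
        extends Mapper<Text, Text, Text, IntWritable> {
      @Override
      protected void map(Text key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // Real logic would probe the shared index tree / lookup table here.
        ctx.write(key, new IntWritable(1));
      }
    }

    public static void configure(Job job) {
      // Several map threads per task JVM: auxiliary data is loaded once per
      // task instead of once per single-threaded mapper.
      job.setMapperClass(MultithreadedMapper.class);
      MultithreadedMapper.setMapperClass(job, SearchMapper.class);
      MultithreadedMapper.setNumberOfThreads(job, 8); // e.g. one per core
    }
  }
  ```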
  19. Acknowledgements
     ● TEXMEX team, INRIA Rennes, http://www.irisa.fr/texmex/index_en.php
     ● Quaero project, http://www.quaero.org/
     ● Grid5000 infrastructure & its Rennes maintenance team, https://www.grid5000.fr
  20. Thank you! Questions?
