Terabyte-scale image similarity
search: experience and best practice
Diana Moise (2), Denis Shestakov (1,2),
Gylfi Gudmundsson (2), Laurent Amsaleg (3)
(1) Department of Media Technology, School of Science, Aalto University, Finland
(2) Inria Rennes – Bretagne Atlantique, France
(3) IRISA - CNRS, France
Denis Shestakov
denis.shestakov at aalto.fi
linkedin: linkedin.com/in/dshestakov
mendeley: mendeley.com/profiles/denis-shestakov
Terabyte-scale image search in
Europe?
Overview
1. Background: image retrieval, our focus,
environment, etc.
2. Applying Hadoop to multimedia retrieval
tasks
3. Addressing Hadoop cluster heterogeneity
issue
4. Studying workloads with large auxiliary data
structure required for processing
5. Experimenting with very large image dataset
Image search?
Content-based image search:
● Find matches with similar content
Image search applications?
● regular image search
● object recognition
○ face, logo, etc.
● for systems like Google Goggles
● augmented reality applications
● medical imaging
● analysis of astrophysics data
Our use case
● Copyright violation detection
● Our scenario:
○ Searching for batch of images
■ Querying for thousands of images in one run
■ Focus on throughput, not on response time for
individual image
● Note: indexed dataset can be searched on single
machine with adequate disk capacity if necessary
Image search with Hadoop
● Index & search huge image collection using
MapReduce-based eCP algorithm
○ See our work at ICMR'13: Indexing and
searching 100M images with MapReduce [18]
○ See Section III for quick overview
● Use the Grid5000 platform
○ Distributed infrastructure available to French
researchers & their partners
● Use the Hadoop framework
Experimental setup: cluster
● Grid5000 platform:
○ Nodes at the Rennes site of Grid5000
■ Up to 110 nodes available
■ Node capacity/performance varied
● Heterogeneous, drawn from three clusters
● From 8 to 24 cores per node
● From 24GB to 48GB RAM per node
Experimental setup: framework
● Standard Apache Hadoop distribution, ver.1.0.1
○ (!) No changes in Hadoop internals
■ Pros: easy to migrate, try and compare by others
■ Cons: not top performance

○ Tools provided by Hadoop framework
■ Hadoop SequenceFiles
■ DistributedCache
■ multithreaded mappers
■ MapFiles
Experimental setup: dataset
● 110 million images (~30 billion SIFT descriptors)
○ Collected from the Web and provided by one
of the partners in Quaero project
■ Largest reported in literature
○ Images resized to 150px on largest side
○ Worked with
■ The whole set (~4TB)
■ The subset, 20 million images (~1TB)
○ Used as distracting dataset
Experimental setup: querying
● For evaluation of indexing quality:
○ Added to distracting datasets:
■ INRIA Copydays (127 images)
○ Queried for
■ Copydays batch (~3000 images = 127 original
images and their associated variants, incl. strong
distortions, e.g. print-crumple-scan)
■ 12k batch (~12000 images = 245 random images
from dataset and their variants)
■ 25k batch
○ Checked if original images returned as top voted
search results
Image search with Hadoop
Distributed index creation
● Clustering images into a large set of clusters (max
cluster size = 5000)
● Mapper input:
○ unsorted SIFT descriptors
○ index tree (loaded by every mapper)
● Mapper output:
○ (cluster_id, SIFT)
● Reducer output:
○ SIFTs sorted by cluster_id
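The map and reduce steps above can be sketched in plain Python (this is not Hadoop code). A flat dictionary of cluster centroids stands in for the index tree, and the names `nearest_cluster`, `map_descriptors` and `reduce_by_cluster` are illustrative assumptions, not part of the authors' implementation. Real SIFT descriptors have 128 dimensions; the toy ones here have 2.

```python
from collections import defaultdict

def nearest_cluster(descriptor, centroids):
    """Return the id of the centroid closest to the descriptor (squared L2)."""
    best_id, best_dist = None, float("inf")
    for cluster_id, centroid in centroids.items():
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id

def map_descriptors(descriptors, centroids):
    """Map phase: emit (cluster_id, descriptor) for every input descriptor."""
    for d in descriptors:
        yield nearest_cluster(d, centroids), d

def reduce_by_cluster(pairs):
    """Reduce phase: group descriptors and return them sorted by cluster_id."""
    groups = defaultdict(list)
    for cluster_id, d in pairs:
        groups[cluster_id].append(d)
    return sorted(groups.items())

# Toy data: two centroids (the "index tree") and three unsorted descriptors.
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
descriptors = [(9.0, 9.5), (0.5, 1.0), (11.0, 10.0)]
index = reduce_by_cluster(map_descriptors(descriptors, centroids))
```

In the sketch every call to `map_descriptors` needs the full `centroids` structure, which mirrors why each mapper has to load the index tree.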
Image search with Hadoop
Indexing workload characteristics
● computationally-intensive (map phase)
● data-intensive (at map&reduce phases)
● large auxiliary data structure (i.e., index tree)
○ grows as dataset grows
○ e.g., 1.8GB for 110M images (4TB)

● map input < map output
● network is heavily utilized during shuffling
Image search with Hadoop
Image search with Hadoop
Searching workflow
● large aux.data structure (e.g., lookup table)
Index search with Hadoop: results
● Basic settings:
○ 512MB chunk size
○ 3 replicas
○ 8 map slots
○ 2 reduce slots
● 4TB dataset:
○ 4 map slots
Hadoop on heterogeneous clusters
● Capacity/performance of nodes in our cluster varied
○ Nodes come from three clusters
○ From 8 cores to 24 cores per node
○ From 24GB to 48GB RAM per node
○ Different CPU speeds

● Hadoop assumes one configuration (#mappers,
#reducers, max. map/reduce memory, ...) for
all nodes
● Not good for Hadoop clusters like ours
Hadoop on heterogeneous clusters
● Our solution (hack):
○ deploy Hadoop on all nodes with settings addressing the
least equipped nodes
○ create sub-cluster configuration files adjusted to better
equipped nodes
○ restart tasktrackers with new configuration files on better
equipped nodes

● We call it ‘smart deployment’
● Considerations:
○ Perhaps rack-awareness feature of Hadoop should be
complemented with smart deployment functionality
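As an illustration of the sub-cluster configuration step, the Python sketch below derives per-node slot settings from node hardware. The sizing rule (one map slot per core, capped by RAM at an assumed 2GB per slot) and the function name `slots_for_node` are assumptions made for this example only, not the rule used in the experiments. The two keys are the Hadoop 1.x TaskTracker properties that such per-node configuration files would override.

```python
def slots_for_node(cores, ram_gb, gb_per_slot=2, reduce_slots=2):
    """Pick TaskTracker slot counts for one node class (hypothetical rule)."""
    # Slots that fit in RAM, minus the ones reserved for reducers.
    by_ram = (ram_gb // gb_per_slot) - reduce_slots
    map_slots = max(1, min(cores, by_ram))
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
    }

# Baseline for the least equipped nodes, then an override for better ones.
baseline = slots_for_node(cores=8, ram_gb=24)
better = slots_for_node(cores=24, ram_gb=48)
```

The tasktrackers on better equipped nodes would then be restarted with the `better` values, while all other nodes keep `baseline`.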
Hadoop on heterogeneous clusters
● Results

○ indexing 1TB on 106 nodes: 75min → 65min
Large auxiliary data structure
● Some workloads require all mappers to load a large data structure
○ E.g., both in image indexing and searching workloads

● Spreading data file across all nodes:
○ Hadoop DistributedCache

● Not efficient if the structure is gigabytes in size
● Partial solution: increase HDFS block sizes →
decrease #mappers
● Another solution: multithreaded mappers provided by
Hadoop
○ Poorly documented feature!
Large auxiliary data structure
● Multithreaded mapper spawns a configured number
of threads, each thread executes a map task
● Mapper threads share the RAM
● Downsides:
○ synchronization when reading input
○ synchronization when writing output
Large auxiliary data structure
● Let’s test it!

● Indexing 4TB with 4 mapper slots, each running 2
threads
○ index tree size: 1.8GB
● Indexing time: 8h27min → 6h8min
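A back-of-the-envelope calculation shows why sharing the index tree between threads helps. The sketch below assumes each map slot (one JVM) holds exactly one copy of the 1.8GB tree and ignores all other memory use; the helper name `tree_memory_gb` is illustrative.

```python
def tree_memory_gb(map_slots, tree_gb=1.8):
    """RAM per node spent on index-tree copies: one copy per map slot (JVM)."""
    return map_slots * tree_gb

# Two ways to get 8 concurrent map workers on one node.
single_threaded = tree_memory_gb(map_slots=8)  # 8 slots x 1 thread each
multithreaded = tree_memory_gb(map_slots=4)    # 4 slots x 2 threads each
saved = single_threaded - multithreaded
```

Under these assumptions the multithreaded layout runs the same number of workers with half as many copies of the tree in memory.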
Large auxiliary data structure
● In some applications, mappers need only a part of
the auxiliary data structure (the one relevant to the
data block processed)
● Solution: Hadoop MapFile
● See Section 5.C.2
○ Searching for 3-25k image batches
○ Though it is rather inconclusive
● Stay tuned!
○ A proper study of MapFile is now in progress
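The idea behind a MapFile, a sorted key/value file with a sparse index that lets a reader seek close to a key instead of loading everything, can be sketched in Python. This is a toy in-memory model and not the Hadoop API; the class name `SparseIndexedTable` and the index interval are assumptions for illustration.

```python
import bisect

class SparseIndexedTable:
    """Sorted (key, value) records plus an index of every `interval`-th key."""

    def __init__(self, records, interval=2):
        self.records = sorted(records)  # stands in for the on-disk data file
        self.index_keys = [k for k, _ in self.records[::interval]]
        self.interval = interval

    def get(self, key):
        """Find the indexed key at or before `key`, then scan one interval."""
        pos = bisect.bisect_right(self.index_keys, key) - 1
        if pos < 0:
            return None  # key is smaller than every stored key
        start = pos * self.interval
        for k, v in self.records[start:start + self.interval]:
            if k == key:
                return v
        return None

table = SparseIndexedTable([(5, "e"), (1, "a"), (3, "c"), (7, "g"), (9, "i")])
value = table.get(7)
```

Only the small `index_keys` list has to stay in memory; a lookup touches a single interval of records, which is the property that makes partial loading of a large auxiliary structure possible.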
Open questions
● Practical one:
○ What are best practices for analysis of
Hadoop job execution logs?
● Analysis of Hadoop job logs proved very
useful in our project
○ Done with our own Python/Perl scripts
● It is extremely useful for understanding and then
tuning Hadoop jobs on large Hadoop clusters
● Any good existing libraries/tools?
○ E.g., Starfish Hadoop Log analyzer (Duke Univ.)
Open questions
E.g., search (12k batch over 1TB) job execution on 100 nodes
Observations & implications
● HDFS block size limits scalability
○ 1TB dataset => 1186 blocks of 1024MB size
○ Assuming 8-core nodes and reported searching
method: no scaling after 149 nodes (i.e. 8x149=1192)
○ Solutions:
■ Smaller HDFS blocks, e.g., scaling up to 280 nodes for
512MB blocks
■ Re-visit search process: e.g., partial-loading of lookup
table

● Big data is here, but not the resources to process it
○ E.g., indexing & searching >10TB not possible given
the resources we had
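The scaling limit quoted for 1024MB blocks follows from simple arithmetic: once every block has its own map slot, extra nodes have nothing to do. A minimal sketch, assuming one map task per HDFS block and 8 map slots per node (the function name `max_useful_nodes` is illustrative):

```python
import math

def max_useful_nodes(num_blocks, slots_per_node=8):
    """Smallest node count whose map slots cover all blocks in one wave."""
    return math.ceil(num_blocks / slots_per_node)

# 1TB dataset stored as 1186 blocks of 1024MB.
nodes_1024mb = max_useful_nodes(1186)
```

Halving the block size roughly doubles the number of blocks, which is why smaller blocks raise this limit.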
Things to share
● Our methods/system can be applied to audio datasets
○ No major changes expected
○ Contact me/Diana if interested

● Code for MapReduce-eCP algorithm available on request
○ Should run smoothly on your Hadoop cluster
○ Interested in comparisons

● Hadoop job history logs behind our experiments available
on request
○ Describe indexing/searching our dataset by giving details on
map/reduce tasks execution
○ Insights on better analysis/visualization are welcome
○ E.g., job logs supporting our CBMI'13 work: http://goo.gl/e06wE
Acknowledgements
● Aalto University http://www.aalto.fi
● Quaero project http://www.quaero.org
● Grid5000 infrastructure & its
Rennes maintenance team
http://www.grid5000.fr
Supporting publications
[18] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and
searching 100M images with Map-Reduce. In Proc. ACM ICMR '13, 2013.
[20] D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable
high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013.
[this-bigdata13] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg.
Terabyte-scale image similarity search: experience and best practice. In Proc.
IEEE BigData'13, 2013.
[submitted] D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg.
Scalable high-dimensional indexing and searching with Hadoop.
Thank you!

 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

Terabyte-scale image similarity search: experience and best practice

  • 1. Terabyte-scale image similarity search: experience and best practice Diana Moise2, Denis Shestakov1,2, Gylfi Gudmundsson2, Laurent Amsaleg3 1 Department of Media Technology, School of Science, Aalto University, Finland 2 Inria Rennes – Bretagne Atlantique, France 3 IRISA - CNRS, France Denis Shestakov denis.shestakov at aalto.fi linkedin: linkedin.com/in/dshestakov mendeley: mendeley.com/profiles/denis-shestakov
  • 3. Overview 1. Background: image retrieval, our focus, environment, etc. 2. Applying Hadoop to multimedia retrieval tasks 3. Addressing Hadoop cluster heterogeneity issue 4. Studying workloads with large auxiliary data structure required for processing 5. Experimenting with very large image dataset
  • 4. Image search? Content-based image search: ● Find matches with similar content
  • 5. Image search applications? ● regular image search ● object recognition ○ face, logo, etc. ● for systems like Google Goggles ● augmented reality applications ● medical imaging ● analysis of astrophysics data
  • 6. Our use case ● Copyright violation detection ● Our scenario: ○ Searching for batch of images ■ Querying for thousands of images in one run ■ Focus on throughput, not on response time for individual image ● Note: indexed dataset can be searched on single machine with adequate disk capacity if necessary
  • 7. Image search with Hadoop ● Index & search huge image collection using MapReduce-based eCP algorithm ○ See our work at ICMR'13: Indexing and searching 100M images with MapReduce [18] ○ See Section III for quick overview ● Use the Grid5000 platform ○ Distributed infrastructure available to French researchers & their partners ● Use the Hadoop framework
  • 8. Experimental setup: cluster ● Grid5000 platform: ○ Nodes in Rennes site of Grid5000 ■ Up to 110 nodes available ■ Nodes capacity/performance varied ● Heterogeneous, come from three clusters ● From 8 cores to 24 cores per node ● From 24GB to 48GB RAM per node
  • 9. Experimental setup: framework ● Standard Apache Hadoop distribution, ver. 1.0.1 ○ (!) No changes in Hadoop internals ■ Pros: easy to migrate, try and compare by others ■ Cons: not top performance ○ Tools provided by Hadoop framework ■ Hadoop SequenceFiles ■ DistributedCache ■ multithreaded mappers ■ MapFiles
  • 10. Experimental setup: dataset ● 110 million images (~30 billion SIFT descriptors) ○ Collected from the Web and provided by one of the partners in Quaero project ■ Largest reported in literature ○ Images resized to 150px on largest side ○ Worked with ■ The whole set (~4TB) ■ The subset, 20 million images (~1TB) ○ Used as distracting dataset
  • 11. Experimental setup: querying ● For evaluation of indexing quality: ○ Added to distracting datasets: ■ INRIA Copydays (127 images) ○ Queried for ■ Copydays batch (~3000 images = 127 original images and their associated variants incl. strong distortions, e.g. print-crumple-scan) ■ 12k batch (~12000 images = 245 random images from dataset and their variants) ■ 25k batch ○ Checked if original images returned as top voted search results
  • 12. Image search with Hadoop Distributed index creation ● Clustering images into a large set of clusters (max cluster size = 5000) ● Mapper input: ○ unsorted SIFT descriptors ○ index tree (loaded by every mapper) ● Mapper output: ○ (cluster_id, SIFT) ● Reducer output: ○ SIFTs sorted by cluster_id
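The map and reduce roles on this slide can be illustrated with a small single-process sketch. It is not the authors' eCP implementation: the index tree is replaced by a flat scan over cluster leaders (a simplification made here), the 2-dimensional tuples stand in for 128-dimensional SIFT descriptors, and the names `nearest_cluster`, `map_phase` and `reduce_phase` are hypothetical.

```python
import math
from collections import defaultdict

def nearest_cluster(descriptor, leaders):
    """Return the id of the cluster leader closest to the descriptor.

    Assumption: a flat scan over all leaders stands in for the
    index-tree traversal used by the real algorithm.
    """
    best_id, best_dist = None, math.inf
    for cluster_id, leader in leaders.items():
        dist = math.dist(descriptor, leader)  # Euclidean distance
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id

def map_phase(descriptors, leaders):
    """Mapper: emit (cluster_id, descriptor) for every unsorted descriptor."""
    for descriptor in descriptors:
        yield nearest_cluster(descriptor, leaders), descriptor

def reduce_phase(pairs):
    """Reducer: group descriptors and return them ordered by cluster_id."""
    grouped = defaultdict(list)
    for cluster_id, descriptor in pairs:
        grouped[cluster_id].append(descriptor)
    return [(cluster_id, grouped[cluster_id]) for cluster_id in sorted(grouped)]

# Toy data: two leaders and three descriptors.
leaders = {0: (0.0, 0.0), 1: (10.0, 10.0)}
descriptors = [(9.0, 9.0), (1.0, 0.0), (0.0, 1.0)]
index = reduce_phase(map_phase(descriptors, leaders))
```

In this toy run the two descriptors near the origin end up under cluster 0 and the remaining one under cluster 1, mirroring the "SIFTs sorted by cluster_id" reducer output.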
  • 13. Image search with Hadoop Indexing workload characteristics ● computationally-intensive (map phase) ● data-intensive (at map&reduce phases) ● large auxiliary data structure (i.e., index tree) ○ grows as dataset grows ○ e.g., 1.8GB for 110M images (4TB) ● map input < map output ● network is heavily utilized during shuffling
  • 15. Image search with Hadoop Searching workflow ● large aux.data structure (e.g., lookup table)
  • 16. Index search with Hadoop: results ● Basic settings: ○ 512MB chunk size ○ 3 replicas ○ 8 map slots ○ 2 reduce slots ● 4TB dataset: ○ 4 map slots
  • 17. Hadoop on heterogeneous clusters ● Capacity/performance of nodes in our cluster varied ○ Nodes come from three clusters ○ From 8 cores to 24 cores per node ○ From 24GB to 48GB RAM per node ○ Different CPU speeds ● Hadoop assumes one configuration (#mappers, #reducers, maximum map/reduce memory, ...) for all nodes ● Not good for Hadoop clusters like ours
  • 18. Hadoop on heterogeneous clusters ● Our solution (hack): ○ deploy Hadoop on all nodes with settings addressing the least equipped nodes ○ create sub-cluster configuration files adjusted to better equipped nodes ○ restart tasktrackers with new configuration files on better equipped nodes ● We call it ‘smart deployment’ ● Considerations: ○ Perhaps rack-awareness feature of Hadoop should be complemented with smart deployment functionality
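A minimal sketch of how such a deployment plan could be computed is given below. The sizing policy (3GB of RAM per task, a quarter of the slots for reducers) is an assumption made for illustration and is not taken from the talk; the function names are hypothetical. The two property keys are the Hadoop 1.x TaskTracker slot settings.

```python
MAP_KEY = "mapred.tasktracker.map.tasks.maximum"
REDUCE_KEY = "mapred.tasktracker.reduce.tasks.maximum"

def slots_for_node(cores, ram_gb, gb_per_task=3, reduce_share=0.25):
    """Pick slot counts for one node.

    Hypothetical policy: total slots are bounded by the core count and
    by RAM (gb_per_task per slot); a fixed share goes to reducers.
    """
    total = max(1, min(cores, ram_gb // gb_per_task))
    reduce_slots = max(1, int(total * reduce_share))
    map_slots = max(1, total - reduce_slots)
    return {MAP_KEY: map_slots, REDUCE_KEY: reduce_slots}

def plan_deployment(nodes):
    """Return (baseline config, per-node overrides).

    The baseline matches the least equipped nodes and is deployed
    everywhere; nodes listed in the overrides would have their
    tasktrackers restarted with the larger settings.
    """
    configs = {name: slots_for_node(cores, ram) for name, (cores, ram) in nodes.items()}
    baseline = min(configs.values(), key=lambda cfg: cfg[MAP_KEY])
    overrides = {name: cfg for name, cfg in configs.items() if cfg != baseline}
    return baseline, overrides

# Node specs as (cores, RAM in GB), using the extremes quoted for the cluster.
nodes = {"small-1": (8, 24), "small-2": (8, 24), "big-1": (24, 48)}
baseline, overrides = plan_deployment(nodes)
```

With these toy inputs only `big-1` appears in the overrides, so only its tasktracker would need a restart.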
  • 19. Hadoop on heterogeneous clusters ● Results ○ indexing 1T on 106 nodes: 75min → 65min
  • 20. Large auxiliary data structure ● Some workloads require all mappers to load a large data structure ○ E.g., both in image indexing and searching workloads ● Spreading data file across all nodes: ○ Hadoop DistributedCache ● Not efficient if the structure is gigabytes in size ● Partial solution: increase HDFS block sizes → decrease #mappers ● Another solution: multithreaded mappers provided by Hadoop ○ Poorly documented feature!
  • 21. Large auxiliary data structure ● Multithreaded mapper spawns a configured number of threads, each thread executes a map task ● Mapper threads share the RAM ● Downsides: ○ synchronization when reading input ○ synchronization when writing output
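The sharing pattern and its two synchronization points can be sketched in a few lines. This is an illustration of the idea only, not Hadoop code: `run_multithreaded_map` and `map_fn` are hypothetical names, and Python threads here demonstrate the locking structure rather than real CPU parallelism.

```python
import threading

def run_multithreaded_map(records, shared_index, map_fn, num_threads=2):
    """Run map_fn over records with several threads in one process.

    All threads read the same in-memory shared_index, so the large
    auxiliary structure is loaded once instead of once per map task.
    """
    record_iter = iter(records)
    input_lock = threading.Lock()   # synchronization when reading input
    output_lock = threading.Lock()  # synchronization when writing output
    output = []

    def worker():
        while True:
            with input_lock:
                try:
                    record = next(record_iter)
                except StopIteration:
                    return
            result = map_fn(record, shared_index)  # runs without any lock held
            with output_lock:
                output.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output

# Toy usage: the dictionary plays the role of the index tree or lookup table.
shared_index = {"a": 1, "b": 2}
results = run_multithreaded_map(
    ["a", "b", "a"], shared_index, lambda rec, idx: (idx[rec], rec)
)
```

The order of `results` depends on thread scheduling, so callers that need a stable order should sort the output.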
  • 22. Large auxiliary data structure ● Let’s test it! ● Indexing 4T with 4 mapper slots, each running 2 threads ○ index tree size: 1.8GB ● Indexing time: 8h27min → 6h8min
  • 23. Large auxiliary data structure ● In some applications, mappers need only a part of the auxiliary data structure (the one relevant to the data block processed) ● Solution: Hadoop MapFile ● See Section 5.C.2 ○ Searching for 3-25k image batches ○ Though it is rather inconclusive ● Stay tuned! ○ A proper study of MapFile is now in progress
  • 24. Open questions ● Practical one: ○ What are best practices for analysis of Hadoop job execution logs? ● Analysis of Hadoop job logs proved very useful in our project ○ Done with our own Python/Perl scripts ● It is extremely useful for understanding and then tuning Hadoop jobs on large Hadoop clusters ● Any good existing libraries/tools? ○ E.g., Starfish Hadoop Log analyzer (Duke Univ.)
  • 25. Open questions E.g., search (12k batch over 1TB) job execution on 100 nodes
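As a rough idea of what such log analysis scripts might compute, the sketch below summarizes task durations per task type and flags slow tasks. It assumes the task records have already been extracted from the job history logs into dictionaries; the record layout, the function names and the 1.5x-median straggler threshold are all hypothetical and not taken from the authors' scripts.

```python
from statistics import mean, median

def summarize_tasks(tasks):
    """Return per-type task count, mean duration and max duration (seconds)."""
    durations = {}
    for task in tasks:
        durations.setdefault(task["type"], []).append(task["finish"] - task["start"])
    return {
        task_type: {"count": len(d), "mean_s": mean(d), "max_s": max(d)}
        for task_type, d in durations.items()
    }

def find_stragglers(tasks, task_type, factor=1.5):
    """Return ids of tasks running longer than factor x the median duration."""
    selected = [t for t in tasks if t["type"] == task_type]
    if not selected:
        return []
    typical = median(t["finish"] - t["start"] for t in selected)
    return [t["id"] for t in selected if t["finish"] - t["start"] > factor * typical]

# Hypothetical, already-parsed records with timestamps in seconds.
tasks = [
    {"id": "m_0", "type": "map", "start": 0, "finish": 10},
    {"id": "m_1", "type": "map", "start": 0, "finish": 12},
    {"id": "m_2", "type": "map", "start": 0, "finish": 40},
    {"id": "r_0", "type": "reduce", "start": 40, "finish": 60},
]
summary = summarize_tasks(tasks)
slow_maps = find_stragglers(tasks, "map")
```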
  • 26. Observations & implications ● HDFS block size limits scalability ○ 1TB dataset => 1186 blocks of 1024MB size ○ Assuming 8-core nodes and reported searching method: no scaling after 149 nodes (i.e. 8x149=1192) ○ Solutions: ■ Smaller HDFS blocks, e.g., scaling up to 280 nodes for 512MB blocks ■ Re-visit search process: e.g., partial-loading of lookup table ● Big data is here, but not the resources to process it ○ E.g., indexing & searching >10TB not possible given the resources we had
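The 149-node bound follows from simple arithmetic, sketched below under two assumptions stated here rather than in the slide: each HDFS block is processed by exactly one map task, and each node runs one map task per core. The function names are hypothetical.

```python
import math

def max_useful_nodes(num_blocks, map_slots_per_node):
    """Smallest node count at which every block gets its own map slot.

    Adding nodes beyond this number leaves slots idle, so the job
    stops scaling.
    """
    return math.ceil(num_blocks / map_slots_per_node)

def map_waves(num_blocks, nodes, map_slots_per_node):
    """Number of rounds of map tasks needed on a cluster of the given size."""
    return math.ceil(num_blocks / (nodes * map_slots_per_node))

# Figures from the slide: 1186 blocks of 1024MB, 8 map slots per node.
limit = max_useful_nodes(1186, 8)   # 1186 / 8 = 148.25, rounded up
waves_at_100 = map_waves(1186, 100, 8)
waves_at_limit = map_waves(1186, limit, 8)
```

At 100 nodes there are 800 slots, so the 1186 map tasks need two waves; at the limit of 149 nodes (1192 slots) they fit into a single wave and extra nodes no longer help.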
  • 27. Things to share ● Our methods/system can be applied to audio datasets ○ No major changes expected ○ Contact me/Diana if interested ● Code for MapReduce-eCP algorithm available on request ○ Should run smoothly on your Hadoop cluster ○ Interested in comparisons ● Hadoop job history logs behind our experiments available on request ○ Describe indexing/searching our dataset by giving details on map/reduce tasks execution ○ Insights on better analysis/visualization are welcome ○ E.g., job logs supporting our CBMI'13 work: http://goo.gl/e06wE
  • 28. Acknowledgements ● Aalto University http://www.aalto.fi ● Quaero project http://www.quaero.org ● Grid5000 infrastructure & its Rennes maintenance team http://www.grid5000.fr
  • 29. Supporting publications [18] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and searching 100M images with Map-Reduce. In Proc. ACM ICMR '13, 2013. [20] D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013. [this-bigdata13] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale image similarity search: experience and best practice. In Proc. IEEE BigData'13, 2013. [submitted] D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing and searching with Hadoop.