Talk given at Hadoop Summit Europe 2014 in Amsterdam, Netherlands, on 2 April 2014
Talk abstract: In this talk I focus on a specific Hadoop application, image similarity search, and present our experience in designing, building and testing a Hadoop-based image similarity search system that scales to terabyte-sized image collections. First, I give an overview of how to adapt image retrieval techniques to the MapReduce model. Second, I describe the image indexing and searching workloads and show how these workloads are rather atypical for Hadoop. I then explain how to tune Hadoop for such computational tasks, specifying the parameters and values that deliver the best performance. Next, I present the Hadoop cluster heterogeneity problem and describe a solution to it: a platform-aware Hadoop configuration. Then I introduce the tools provided by the standard Apache Hadoop framework that are useful for a large class of application workloads similar to ours, where a large auxiliary data structure is required for processing the dataset. Finally, I give an overview of a series of experiments conducted on a four-terabyte image dataset (the biggest reported in the academic literature). The findings will be shared as best practices and recommendations for practitioners working with huge multimedia collections.
Speaker: Dr. Denis Shestakov is an experienced researcher in the area of big data engineering and, more recently, a practicing Hadoop/MapReduce consultant. Denis has been involved in various big data projects in web analytics and search, multimedia search, and bioinformatics. See his profile on LinkedIn: http://fi.linkedin.com/in/dshestakov/