1. Hadoop Cluster on Docker Containers
“What Works and What Doesn't”
By:
Pranav Joshi
ME-HPC
GTU PG School
2. Content
● Introduction to Hadoop and Docker
● Why Hadoop on Docker?
● Job Configuration
● OpenStack Sahara
● Handling Hadoop Single Point of Failure
● Validating the Prototype
● Performance Test
● Conclusion
● References
3. Introduction to Hadoop
● Apache Hadoop is an open-source software framework written in
Java for distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware.
● Major Components of Apache Hadoop are,
– Hadoop Common: The common utilities that support the other Hadoop
modules.
– Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
– Hadoop YARN: A framework for job scheduling and cluster resource
management.
– Hadoop MapReduce: A YARN-based system for parallel processing of
large data sets (a small usage example follows below).
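● As a quick, hedged sketch of how these components fit together (paths and the examples jar name vary by Hadoop version, so treat this as illustrative): data is copied into HDFS, then a MapReduce job is submitted to YARN.
  hdfs dfs -mkdir -p /user/demo/input            # create an HDFS directory
  hdfs dfs -put books/*.txt /user/demo/input     # copy local files into HDFS
  yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      wordcount /user/demo/input /user/demo/output   # run the bundled WordCount job on YARN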
4. Introduction to Docker Container
● Docker allows you to package an application with all of
its dependencies into a standardized unit for software
development.
● It is an open-source program that enables a Linux
application and its dependencies to be packaged as a
container.
● Containers include the application and all of its
dependencies, but share the kernel with other
containers (a minimal example follows below).
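● A minimal, hedged example (the image name and commands are placeholders, not from the slides): a container runs an isolated user space on top of the host's kernel.
  docker run --rm -it ubuntu:14.04 /bin/bash   # start a throwaway container with an interactive shell
  uname -r                                     # inside the container: reports the host's kernel version
  ps aux                                       # only the container's own processes are visible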
5. Why Docker?
● Lightweight, Portable
● Build once, Run anywhere
● VM – without the overhead of a VM
● Isolated containers
● Automated and scripted
8. Job Configuration
● YARN’s ApplicationMaster asks the NodeManagers to launch
containers through a pluggable executor such as the LinuxContainerExecutor
(a configuration sketch follows below)
● Docker can be used not only for fine-grained performance
isolation, but also for delivering software packages
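● A rough configuration sketch, assuming the experimental DockerContainerExecutor shipped with Hadoop 2.6 (the slide names the LinuxContainerExecutor; the values below are illustrative, not from the original): the NodeManager is pointed at the docker binary through yarn-site.xml, and each job can then name the Docker image its containers should run in.
  <!-- yarn-site.xml (illustrative values) -->
  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.docker-container-executor.exec-name</name>
    <value>/usr/bin/docker</value>
  </property>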
10. Design and Implementation
● Implementation:
– Using a Dockerfile, our solution creates an image with Java, ssh
and some basic packages installed, and sets up the image to use a
Hadoop build placed in a folder shared with the host (a Dockerfile
sketch follows below).
– When an instance is created from the image, it starts the ssh
daemon by default in order to allow further runtime configuration
through this channel.
● Management:
– The cluster-management library offers an even more abstract API,
allowing the client to list and create clusters; start, stop and get
details of a container; and start a service in a specific container.
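● A minimal Dockerfile sketch along these lines (base image, package names and paths are assumptions for illustration, not the actual implementation): the image carries only Java and ssh, while the Hadoop build itself is mounted from the host at run time.
  # Image with Java, ssh and basic tools; Hadoop is shared from the host
  FROM ubuntu:14.04
  RUN apt-get update && apt-get install -y openjdk-7-jdk openssh-server rsync && \
      mkdir -p /var/run/sshd
  ENV JAVA_HOME /usr/lib/jvm/java-7-openjdk-amd64
  EXPOSE 22
  # Keep sshd in the foreground so the container stays alive and can be configured remotely
  CMD ["/usr/sbin/sshd", "-D"]
A container would then be started with the shared Hadoop build mounted, for example: docker run -d -v /path/to/hadoop:/opt/hadoop <image>.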
11. Hadoop and Fault Tolerance
● HDFS allows the replication of the NameNode (through
passive replication), but a failure at the level of the
JobTracker forces a job to be restarted.
● On Hadoop 2.x, part of the job management responsibility
is transferred to the ApplicationMaster, which becomes a
task manager.
● The loss of the ResourceManager does not block the
execution of a running job; it only prevents new jobs from
being submitted (a sketch of the recovery settings follows below).
However, the loss of an ApplicationMaster forces the
restart of the job, just like on Hadoop 1.x.
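● For context, Hadoop 2.x can soften the ResourceManager issue by persisting its state in ZooKeeper, so a restarted ResourceManager picks up running applications; a sketch of the relevant yarn-site.xml properties (hostnames are placeholders):
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1:2181,zk2:2181,zk3:2181</value>
  </property>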
12. Handling Hadoop Single Points of Failure
● Fast recovery in the case of a failure
● Small impact on performance
● Adapt to the capacity and context of the nodes
13. Validating the Prototype
● Using the Docker-Hadoop dashboard allowed us to analyze different failure
scenarios, including:
– Crash of the JobTracker node: we kill the JobTracker to force a new
node to take over the JobTracker role (a sketch of how this is driven
through Docker follows below).
– Restart of an old JobTracker: we investigate the impacts of the return of
an old JobTracker node. Two possibilities are investigated:
● The returning node was simply disconnected from the network and
still thinks it is the JobTracker.
● The returning node has restarted and has lost all its status, but is still
at the top of ZooKeeper's list.
– Heartbeat tuning: a heartbeat that is too infrequent slows down the reaction
to failures and may lead to some of the situations described in the previous
item, while a heartbeat that is too frequent may negatively impact overall performance.
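● A hedged sketch of how these failure scenarios can be driven from the Docker side (container names are hypothetical):
  docker kill jobtracker1      # crash scenario: the JobTracker disappears abruptly
  docker pause jobtracker1     # "disconnected" scenario: the node freezes but keeps its state
  docker unpause jobtracker1   # the old JobTracker returns, still believing it holds the role
  docker restart jobtracker1   # restart scenario: the node comes back with its runtime state lost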
16. Conclusion
● This presentation explored the use of container-based
virtualization to build a prototyping environment for
MapReduce applications.
● The use of Docker-Hadoop allowed us to improve the
development speed of our Hadoop solution, as the
developers could test their code directly on their own
computers.
17. References
● IEEE Paper 1
– Title: Efficient Prototyping of Fault Tolerant Map-Reduce Applications with
Docker-Hadoop
– Authors: Luiz Angelo Steffenel, Javier Rey, Matias Cogorno and Sergio Nesmachnow
– Publication: 2015 IEEE International Conference on Cloud Engineering
● IEEE Paper 2
– Title: Finding the Big Data Sweet Spot: Towards Automatically
Recommending Configurations for Hadoop Clusters on Docker Containers
– Authors: Rui Zhang, Min Li* and Dean Hildebrand, IBM Research - Almaden; *IBM T.J. Watson Research Center
– Publication: 2015 IEEE International Conference on Cloud Engineering