Big Data Processing Using Hadoop
Hadoop: Cloud versus Commodity Hardware
Presenter: Amrut Patil Advisor: Dr. Rajendra K. Raj
Rochester Institute of Technology
Contact
Amrut Patil
Rochester Institute of Technology
Email: axp7911@rit.edu
References
1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI '04), pages 10-10, Berkeley, CA, USA, 2004. USENIX Association.
2. C. Lam. Hadoop in Action. Manning Publications Co., Stamford, CT, 2011.
3. Hadoop 1.1.2 Documentation: http://hadoop.apache.org/docs/stable/cluster_setup.html#Purpose
Overview
• Big Data is becoming more commonplace, both in scientific research and in industrial settings.
• Hadoop, an open-source framework for distributed storage and parallel processing, is gaining popularity for processing vast amounts of data.
• This project investigates the use of Hadoop for Big Data processing.
• We compare the design and implementation of Hadoop infrastructure in a cloud setting and on commodity hardware.
Hadoop on the Cloud
• Set up an AWS account and obtain the AWS authentication credentials, namely the Access Key ID, Secret Access Key, X.509 certificate file, X.509 private key file, and AWS account ID.
• Set up command line tools to start and stop EC2 instances.
• Prepare an SSH key pair: the public key is embedded in the EC2 instance and the private key is kept on the local machine; together they establish a secure communication channel.
• Set up Hadoop on EC2 by configuring the security parameters (AWS Account ID, AWS Access Key ID, and AWS Secret Access Key) in the single initialization script at src/contrib/ec2/bin/hadoop-ec2-env.sh.
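A minimal sketch of those settings in hadoop-ec2-env.sh, assuming the variable names used by the stock contrib script (the values are placeholders; substitute your own credentials):
AWS_ACCOUNT_ID=123456789012                      # placeholder account ID
AWS_ACCESS_KEY_ID=<your-access-key-id>           # placeholder
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>   # placeholder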
• To launch a Hadoop cluster on EC2, use:
bin/hadoop-ec2 launch-cluster <cluster-name> <number-of-slaves>
• To log in to the master node of the cluster, use:
bin/hadoop-ec2 login <cluster-name>
• To test the functionality of the Hadoop cluster, run the bundled pi example:
bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
• To shut down a cluster:
bin/hadoop-ec2 terminate-cluster <cluster-name>
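For example, a complete session for a hypothetical cluster named test-cluster with two slave nodes (the name and slave count are illustrative):
bin/hadoop-ec2 launch-cluster test-cluster 2
bin/hadoop-ec2 login test-cluster
bin/hadoop jar hadoop-*-examples.jar pi 10 10000000   # run the pi example on the master
exit
bin/hadoop-ec2 terminate-cluster test-cluster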
Conclusions
• Verified the functionality of the Hadoop cluster by installing and running Hive, a data warehousing package (a minimal smoke test is sketched after this list).
• Accessible: The infrastructure can be set up on commodity hardware or in a cloud setting.
• Scalable: Cluster capacity can easily be increased by adding more machines.
• Fault Tolerant: Failed tasks are automatically restarted in case of node failure.
• Low Cost: A cluster can be built quickly and cheaply from a set of machines.
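A minimal smoke test of the kind used for the Hive verification above (the table name is illustrative and assumes the Hive CLI is installed on the master node):
hive -e 'CREATE TABLE smoke_test (id INT); SHOW TABLES; DROP TABLE smoke_test;'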
Hadoop Background
• Hadoop employs a master/slave architecture for distributed storage and computation.
• The distributed storage system is called the Hadoop Distributed File System (HDFS).
Building blocks of Hadoop for data processing:
• NameNode: Master of HDFS. Tracks how files are broken into blocks and which nodes store those blocks, and directs the slave DataNodes to perform the low-level I/O tasks.
• DataNode: Reads and writes HDFS blocks to actual files on the local file system.
• Secondary NameNode: Takes snapshots of the HDFS metadata at predefined intervals, which helps with fault tolerance.
• JobTracker: Determines the execution plan for submitted jobs, assigns tasks to nodes, and monitors tasks while they are running.
• TaskTracker: Manages the execution of individual tasks on each slave node.
• Hadoop uses the MapReduce framework to scale data processing easily across multiple computing nodes, as the example run below illustrates.
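As a sketch of MapReduce in action, the bundled examples jar can run a word count over data in HDFS (the file and directory names here are illustrative):
bin/hadoop fs -put input.txt input             # copy a local file into HDFS
bin/hadoop jar hadoop-*-examples.jar wordcount input output
bin/hadoop fs -cat output/part-r-00000         # inspect the result (output file name may vary)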
Approaches for Implementing Hadoop
• In a Cloud Setting: Utilized Amazon Web Services (AWS), namely Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• Using Commodity Hardware: Utilized several retired PCs running Ubuntu 12.04 LTS.
Hadoop on Commodity Hardware
• Choose one node to host the NameNode and JobTracker daemons. This master machine also starts the DataNode and TaskTracker daemons on all slave nodes.
• Set up passphraseless SSH so the master can remotely access every node in the cluster: the master's public key is copied to every node, while the private key remains on the master alone.
• User accounts should have the same name on all nodes.
• Generate an RSA keypair on the master node using:
ssh-keygen -t rsa
• Copy public key to every slave node as well as master node using:
scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key
• Log in to the target node from the master (this succeeds without a password only after the key has been authorized; see the sketch below):
ssh target
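One step is implicit above: on each target node, the copied public key must be appended to that node's authorized keys before passphraseless login works. A minimal sketch, assuming the file locations used above:
cat ~/master_key >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys   # SSH requires restrictive permissions on this file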
• Hadoop configuration settings are contained in three XML files:
core-site.xml, hdfs-site.xml, and mapred-site.xml.
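A minimal sketch of these three files for a small cluster, assuming Hadoop 1.x property names and a master host named master (host names, ports, and the replication factor are illustrative):
cat > conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- URI of the NameNode, the entry point to HDFS -->
  <property><name>fs.default.name</name><value>hdfs://master:9000</value></property>
</configuration>
EOF
cat > conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- host and port of the JobTracker -->
  <property><name>mapred.job.tracker</name><value>master:9001</value></property>
</configuration>
EOF
cat > conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- number of replicas kept for each HDFS block -->
  <property><name>dfs.replication</name><value>3</value></property>
</configuration>
EOF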
• Hadoop can be run in three operational modes:
• Local (Standalone) Mode: Hadoop runs completely on the local machine. HDFS is not used and no Hadoop daemons are launched.
• Pseudo-distributed Mode: All daemons run on a single machine; mainly used for development work.
• Fully Distributed Mode: An actual multi-machine Hadoop cluster runs in this mode.
• To start Hadoop Daemons: bin/start-all.sh
• To stop Hadoop Daemons: bin/stop-all.sh
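After starting the daemons, the JDK's jps tool can confirm which are running on a node; typically the master lists the NameNode, Secondary NameNode, and JobTracker, and each slave a DataNode and TaskTracker:
bin/start-all.sh
jps    # lists the Hadoop daemons running as Java processes on this node
bin/stop-all.sh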
Common Architecture of Hadoop Cluster
[Figure 1: Typical Hadoop cluster in a master/slave configuration, with the NameNode and JobTracker as masters and a DataNode and TaskTracker on each of the slave nodes (Slave 1 through Slave N). The NameNode, JobTracker, and Secondary NameNode each appear only once per cluster.]