Hadoop : Cloud versus Commodity Hardware
Presenter: Amrut Patil Advisor: Dr. Rajendra K. Raj
Rochester Institute of Technology
Amrut Patil
Rochester Institute of Technology
Email: axp7911@rit.edu
Contact
• Big Data is becoming more commonplace, both in scientific research
and industrial settings.
• Hadoop, an open-source framework for parallel, distributed storage
and processing, is increasingly popular for processing vast amounts
of data.
• This project investigates the use of Hadoop for Big Data processing.
• We compare the design and implementation of Hadoop
infrastructure in a cloud setting and on commodity hardware.
Overview
• Set up an AWS account and obtain the AWS authentication credentials:
Access Key ID, Secret Access Key, X.509 certificate file,
X.509 private key file, and AWS account ID.
• Set up command line tools to start and stop EC2 instances.
• Prepare an SSH key pair: the public key is embedded in the EC2
instance and the private key stays on the local machine. Together
they establish a secure communication channel.
• Set up Hadoop on EC2 by configuring the security parameters (AWS
Account ID, AWS Access Key ID and AWS Secret Access Key) in the
single initialization script at src/contrib/ec2/bin/hadoop-ec2-env.sh.
• To launch a Hadoop cluster on EC2, use:
hadoop-ec2 launch-cluster <cluster-name> <number-of-slaves>
• To log in to the master node of the cluster, use:
hadoop-ec2 login <cluster-name>
• To test the functionality of the Hadoop cluster, use:
bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
• To shut down a cluster, use:
bin/hadoop-ec2 terminate-cluster <cluster-name>
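The test command above runs the bundled pi example with 10 map tasks and 10,000,000 samples per map. As a rough illustration of how such a job divides the work, here is a minimal single-process Python sketch. It is not the Hadoop implementation: it assumes plain pseudo-random sampling, and the function names (map_task, reduce_task, estimate_pi) are hypothetical.

```python
import random


def map_task(num_samples, seed):
    """One 'map': count random points in the unit square that fall
    inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # separate seed per map (assumption for this sketch)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, num_samples


def reduce_task(partials):
    """The 'reduce': combine the per-map counts into one estimate of pi."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


def estimate_pi(num_maps, samples_per_map):
    """Run the maps one after another; on a cluster they would run in parallel."""
    partials = [map_task(samples_per_map, seed=i) for i in range(num_maps)]
    return reduce_task(partials)


if __name__ == "__main__":
    # Far fewer samples than the cluster test, to keep the sketch quick.
    print(estimate_pi(num_maps=10, samples_per_map=100_000))
```

Each map works only on its own samples, so the maps are independent and could be spread over as many nodes as the cluster has.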
Hadoop Background
• Verified the functionality of the Hadoop cluster by installing and
running Hive, a data warehousing package.
• Accessible: The infrastructure can be set up on commodity hardware
or in a cloud setting.
• Scalable: Cluster capacity can be increased easily by adding more
machines.
• Fault Tolerant: In case of failure, Hadoop automatically restarts
failed tasks.
• Low Cost: One can quickly and cheaply create a cluster from a set of
machines.
Conclusions
• Hadoop employs a master/slave architecture for distributed storage
and computation.
• The distributed storage system is called the Hadoop Distributed File
System (HDFS).
• Building blocks of Hadoop for data processing:
• NameNode: Master of HDFS. Tracks how files are broken down into
blocks and which nodes store those blocks, and directs the slave
DataNodes to perform I/O tasks.
• DataNode: Reads and writes HDFS blocks to and from files on the
local file system.
• Secondary NameNode: Takes snapshots of HDFS metadata at predefined
intervals. Useful for fault tolerance.
• Job Tracker: Determines which tasks to process, monitors tasks
while they are running and assigns nodes to tasks.
• Task Tracker: Manages the execution of individual tasks on each
slave node.
• Hadoop uses the MapReduce framework for easily scaling data
processing over multiple computing nodes.
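To make the MapReduce model concrete, here is a minimal in-memory word count in Python. It is only a sketch of the programming model, assuming everything fits on one machine; the function names are hypothetical and no Hadoop API is used.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each call to map_phase sees one line and each call to reduce_phase sees one key, both phases can be spread across many nodes without changing the logic.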
Approaches for Implementing Hadoop
• In a Cloud Setting: Used Amazon Web Services (AWS), namely Amazon
Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• Using Commodity Hardware: Used several old PCs that were being
retired, running Ubuntu 12.04 LTS.
• Choose one node to host the NameNode and Job Tracker daemons. This
machine also activates the DataNode and Task Tracker daemons on all
slave nodes.
• Set up passphraseless SSH so the master can remotely access every
node in the cluster. The public key is stored locally on every node,
while the private key stays on the master node, which uses it to
authenticate.
• User accounts should have the same name on all nodes.
• Generate an RSA key pair on the master node using:
ssh-keygen -t rsa
• Copy the public key to every slave node, as well as the master node,
using:
scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key
• On each target node, add the copied key to the authorized keys, for
example:
cat ~/master_key >> ~/.ssh/authorized_keys
• Log in to the target node from the master:
ssh target
• Hadoop configuration settings are contained in three XML files:
core-site.xml, hdfs-site.xml, and mapred-site.xml.
• Hadoop can be run in three operational modes:
• Local (Standalone) mode: Hadoop runs completely on the local
machine. HDFS is not used and no Hadoop daemons are
launched.
• Pseudo-distributed mode: All daemons run on a single
machine. Mainly used for development work.
• Fully distributed mode: An actual Hadoop cluster runs in this mode.
• To start Hadoop Daemons: bin/start-all.sh
• To stop Hadoop Daemons: bin/stop-all.sh
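As an illustration of what the three configuration files hold, the following Python sketch generates minimal contents for pseudo-distributed mode. The property names (fs.default.name, dfs.replication, mapred.job.tracker) are the Hadoop 1.x ones; the host, the ports 9000 and 9001, and the helper names are assumptions chosen for the example, not settings taken from this project.

```python
import xml.etree.ElementTree as ET


def build_config(properties):
    """Render a dict of settings as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def pseudo_distributed_configs(host="localhost"):
    """Return file name -> XML text for a single-machine setup.
    Host and ports are illustrative assumptions."""
    return {
        # Where clients find the NameNode.
        "core-site.xml": build_config({"fs.default.name": f"hdfs://{host}:9000"}),
        # One machine can hold only one copy of each block.
        "hdfs-site.xml": build_config({"dfs.replication": 1}),
        # Where clients find the Job Tracker.
        "mapred-site.xml": build_config({"mapred.job.tracker": f"{host}:9001"}),
    }


if __name__ == "__main__":
    for filename, xml_text in pseudo_distributed_configs().items():
        print(filename)
        print(xml_text)
```

For a fully distributed cluster, the same properties would point at the master node's host name instead of localhost, and the replication factor would typically be higher.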
Hadoop on the Cloud
Common Architecture of Hadoop Cluster
[Figure 1: Typical Hadoop Cluster. Master/Slave configuration with
NameNode and JobTracker as masters and DataNode and TaskTracker as
slaves. The diagram shows a Secondary NameNode, a Master running the
NameNode and Job Tracker, and Slave 1 through Slave N each running a
DataNode and Task Tracker, with "Only 1 Per Cluster" annotations.]
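The master/slave layout in Figure 1 can be illustrated with a toy model of the bookkeeping a NameNode performs: splitting a file into blocks and recording which DataNodes hold each replica. This is a simplified Python sketch under stated assumptions. The round-robin placement and the function names are hypothetical, and real HDFS placement takes more factors into account.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, datanodes, replication):
    """Toy NameNode metadata: map each block index to the DataNodes that
    hold a replica, assigned round-robin (an assumption of this sketch)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=150, block_size=64)
    print(blocks)
    print(place_blocks(len(blocks), ["slave1", "slave2", "slave3"], replication=2))
```

Keeping more than one replica of each block is what lets the cluster keep serving a file when a single slave fails.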
Hadoop on Commodity Hardware
