Cloudera Hadoop (CDH 4)
Installation on Ubuntu 12.04 LTS
Department of Computer Engineering
MIT College of Engineering
Kothrud, Pune 411038
● Introduction to Hadoop
● Various components of Hadoop
● Installation steps for Cloudera Hadoop
Introduction to Hadoop
● The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across clusters
of computers using simple programming models.
● It is designed to scale up from single servers
to thousands of machines, each offering local
computation and storage.
● The library itself is designed to detect and
handle failures at the application layer.
The project includes these modules:
Hadoop Common: The common utilities that
support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A
distributed file system that provides high-throughput
access to application data.
Hadoop YARN: A framework for job scheduling and
cluster resource management.
Hadoop MapReduce: A YARN-based system for
parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters which includes support for
Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase,
ZooKeeper, Oozie, Pig and Sqoop.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single
points of failure.
Chukwa™: A data collection system for managing large
distributed systems.
HBase™: A scalable, distributed database that supports
structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data
summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining
library.
● Pig™: A high-level data-flow language and
execution framework for parallel computation.
● Spark™: A fast and general compute engine for
Hadoop data. Spark provides a simple and
expressive programming model that supports a
wide range of applications, including ETL,
machine learning, stream processing, and graph
computation.
● Tez™: A generalized data-flow programming
framework, built on Hadoop YARN.
● ZooKeeper™: A high-performance coordination
service for distributed applications.
Cloudera Hadoop Installation
● What is Cloudera Hadoop?
● What is Cloudera Manager?
● Prerequisites for installation
● Installation steps with screenshots
What is Cloudera Hadoop
● CDH is the world’s most complete, tested, and
popular distribution of Apache Hadoop.
● CDH is 100% Apache-licensed open source.
● CDH bundles all Hadoop-related projects in one
place.
What is Cloudera Manager
● Cloudera Manager automates the installation
and configuration of CDH on an entire cluster.
Update your Ubuntu
Passwordless SSH
Passwordless sudo
Edit hosts file
Install JDBC connector for above databases.
Update Your Ubuntu Machine
● Run sudo apt-get update
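● If the update fails, rebuild the APT package lists
(stored under /var/lib/apt) and update again
cd /var/lib/apt
sudo mv lists lists.old
sudo mkdir -p lists/partial
sudo apt-get update
● If the problem persists, contact your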
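The recovery steps above can be sketched as a small shell function. This is a minimal sketch, assuming the commands are meant to run inside /var/lib/apt, where APT keeps its package lists on Ubuntu. The function name rebuild_apt_lists and its optional directory argument are hypothetical; they exist only so the steps can be tried on a scratch directory before touching the real one.

```shell
#!/bin/sh
# rebuild_apt_lists [APT_DIR]
# Moves the existing package lists aside and recreates the empty
# directory layout that apt-get expects.
# APT_DIR defaults to /var/lib/apt (assumed location, see above).
rebuild_apt_lists() {
  apt_dir="${1:-/var/lib/apt}"
  (
    cd "$apt_dir" &&
    rm -rf lists.old &&        # drop a backup left by an earlier run
    mv lists lists.old &&      # keep the current lists as a backup
    mkdir -p lists/partial     # fresh, empty list directories
  )
}

# On a real machine the function needs root, and a new update afterwards:
#   sudo apt-get update
```

Running it against a copy first (for example a directory under /tmp) shows the resulting layout without any risk to the package database.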
Passwordless SSH
● Secure Shell (SSH) is a cryptographic network protocol
for secure data communication, remote command-line
login, remote command execution, and other secure
network services between two networked computers.
● Install OpenSSH
sudo apt-get install openssh-server openssh-client
● Edit the sshd_config file in /etc/ssh/
sudo gedit /etc/ssh/sshd_config
and set
PubkeyAuthentication yes
● Reload the SSH service
sudo /etc/init.d/ssh reload
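For machines without a desktop editor, the same change can be made from the command line. The following is a minimal sketch, not the procedure from the slides: the function name enable_pubkey_auth is hypothetical, and it is meant to be run on a copy of sshd_config that is then copied back with sudo.

```shell
#!/bin/sh
# enable_pubkey_auth FILE
# Ensures FILE contains the line "PubkeyAuthentication yes".
# An existing (possibly commented-out) directive is rewritten in place;
# if there is none, the line is appended.
enable_pubkey_auth() {
  conf="$1"
  if grep -q '^[#[:space:]]*PubkeyAuthentication[[:space:]]' "$conf"; then
    sed 's/^[#[:space:]]*PubkeyAuthentication[[:space:]].*/PubkeyAuthentication yes/' \
      "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
  else
    printf 'PubkeyAuthentication yes\n' >> "$conf"
  fi
}

# Example on a scratch copy:
#   cp /etc/ssh/sshd_config /tmp/sshd_config
#   enable_pubkey_auth /tmp/sshd_config
#   sudo cp /tmp/sshd_config /etc/ssh/sshd_config
#   sudo /etc/init.d/ssh reload
```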
Passwordless SSH
● Run the following commands for passwordless SSH
3 ssh-copy-id -i firstname.lastname@example.org
4 ssh email@example.com
Commands 3 and 4 are for a cluster implementation: run them
from the master machine with a specific user_name@hostname or
user_name@ip_address, so that the master machine can connect
to the client machines.
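The sequence for a whole cluster can be sketched as a loop over the client machines. This is a minimal sketch under stated assumptions: the key-generation step with ssh-keygen is assumed (the earlier commands are not shown on the slide), and the user name hduser, the host names, and the helper names run and setup_passwordless_ssh are hypothetical. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Hypothetical helper: print a command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# setup_passwordless_ssh USER HOST...
# Run on the master machine; HOST may be a host name or an IP address.
setup_passwordless_ssh() {
  user="$1"
  shift
  key="${SSH_KEY:-$HOME/.ssh/id_rsa}"
  # Assumed first step: create a key pair with an empty passphrase
  # if the master does not have one yet.
  [ -f "$key" ] || run ssh-keygen -t rsa -N "" -f "$key"
  for host in "$@"; do
    # Copy the public key to the client (asks for the password once).
    run ssh-copy-id -i "$key.pub" "$user@$host"
    # Verify that login now works without a password.
    run ssh "$user@$host" true
  done
}

# Example with hypothetical names:
#   setup_passwordless_ssh hduser client1 client2
```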
Passwordless sudo
● Make sudo passwordless
● Make changes in the sudoers file
sudo gedit /etc/sudoers
%sudo ALL=(ALL:ALL) NOPASSWD:ALL
Save the file
● For a cluster implementation, change the
sudoers file on every client machine
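A syntax error in the sudoers file can lock sudo out entirely, so it may help to prepare the rule in a separate file and check it before installing it. The sketch below swaps in a different technique from the slides: instead of editing /etc/sudoers directly, it writes the rule to a drop-in file for /etc/sudoers.d. The function name write_nopasswd_rule and the file name cdh-nopasswd are hypothetical.

```shell
#!/bin/sh
# write_nopasswd_rule FILE
# Writes the passwordless-sudo rule for members of the sudo group to FILE.
write_nopasswd_rule() {
  printf '%s\n' '%sudo ALL=(ALL:ALL) NOPASSWD:ALL' > "$1"
}

# Example (hypothetical file name); check the syntax, then install:
#   write_nopasswd_rule /tmp/cdh-nopasswd
#   visudo -c -f /tmp/cdh-nopasswd
#   sudo install -o root -g root -m 0440 /tmp/cdh-nopasswd /etc/sudoers.d/cdh-nopasswd
```

The same file can then be copied to each client machine instead of editing every sudoers file by hand.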
Edit hosts file
● In /etc/hosts, list the IP address and host name
of each machine
● For a cluster implementation, add the IP address
and host name of every client to the master's hosts
file, and the master's IP address and host name to
each client's hosts file
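Adding the entries can be scripted so that the same list is applied on every machine. This is a minimal sketch: the function name add_host_entry is hypothetical, and the IP addresses and host names in the example are placeholders, not values from the slides. An entry is appended only if the host name is not already listed.

```shell
#!/bin/sh
# add_host_entry FILE IP NAME
# Appends "IP<TAB>NAME" to FILE unless NAME already appears as a
# host name on a non-comment line.
add_host_entry() {
  file="$1"
  ip="$2"
  name="$3"
  if awk -v n="$name" '
       $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
       END { exit !found }' "$file"; then
    return 0                      # already present, nothing to do
  fi
  printf '%s\t%s\n' "$ip" "$name" >> "$file"
}

# Example with placeholder addresses, tried on a copy first:
#   cp /etc/hosts /tmp/hosts
#   add_host_entry /tmp/hosts 192.168.1.10 master
#   add_host_entry /tmp/hosts 192.168.1.11 client1
#   sudo cp /tmp/hosts /etc/hosts
```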
Install JDBC connector and
configure for secure installation
sudo apt-get install libmysql-java
Run sudo mysql_secure_installation and answer the prompts:
Enter current password for root (enter for none): password
Change the root password? [Y/n] n
Remove anonymous users? [Y/n] y
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] y
Reload privilege tables now? [Y/n] y
Restart the MySQL server
sudo service mysql restart
Log in to MySQL and create the databases
mysql -u root -p (enter the password when prompted)
create database sttpdatabase;
create database hive;
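The two statements above can also be generated and piped into the client in one step. This is a minimal sketch: the function name create_db_sql is hypothetical, and IF NOT EXISTS is an addition that lets the script be re-run without an error when a database already exists.

```shell
#!/bin/sh
# create_db_sql NAME...
# Prints one CREATE DATABASE statement per database name.
create_db_sql() {
  for db in "$@"; do
    printf 'CREATE DATABASE IF NOT EXISTS %s;\n' "$db"
  done
}

# Example, using the database names from this slide:
#   create_db_sql sttpdatabase hive | mysql -u root -p
```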
We need separate databases for the following activities
● Supported Operating Systems
Ubuntu 10.04 (Lucid Lynx), 64-bit
Ubuntu 12.04 (Precise Pangolin), 64-bit
● Supported Browsers
Firefox 11 or later
Internet Explorer 9
Safari 5 or later
● Cloudera Manager Server:
5 GB on the partition hosting /var.
500 MB on the partition hosting /usr
RAM - 4 GB is appropriate for most cases, and is
required when using Oracle databases
Python - Cloudera Manager uses Python.
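The resource requirements listed above can be checked before starting the installer. This is a rough sketch under stated assumptions: the disk figures are interpreted as free space, RAM is read from /proc/meminfo (Linux only), and the interpreter is assumed to be available as the python command. All function names are hypothetical.

```shell
#!/bin/sh
# Free space, in KB, on the partition that holds the given path.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Total RAM in KB. MemTotal is usually a little below the installed
# amount, so treat a near miss on the 4 GB check as a pass.
ram_kb() {
  awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# check LABEL ACTUAL_KB REQUIRED_KB
# Prints "OK" or "LOW" with the label; returns 1 when below the limit.
check() {
  if [ "$2" -ge "$3" ]; then
    echo "OK   $1"
  else
    echo "LOW  $1"
    return 1
  fi
}

check_cm_requirements() {
  status=0
  check "/var free space (5 GB)"   "$(free_kb /var)" $((5 * 1024 * 1024)) || status=1
  check "/usr free space (500 MB)" "$(free_kb /usr)" $((500 * 1024))      || status=1
  check "RAM (4 GB)"               "$(ram_kb)"       $((4 * 1024 * 1024)) || status=1
  if command -v python >/dev/null 2>&1; then
    echo "OK   python"
  else
    echo "LOW  python not found"
    status=1
  fi
  return $status
}

# Usage: check_cm_requirements
```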
● Installation Path
Path A: Automated Path
Path B: Your Own Method
PATH A Installation
● Step 1: Download and Run the Cloudera Manager Installer
● Download cloudera-manager-installer.bin
● Install Cloudera Manager on a single host.
● Change it to have executable permission
chmod u+x cloudera-manager-installer.bin
● Run the installer
● After the installer completes, open a browser with
● Login : admin
● Password : admin
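The permission and run steps of Path A can be sketched as follows. This is a minimal sketch: the function name prepare_installer is hypothetical, and running the installer with sudo is an assumption based on the installer needing root privileges to install packages.

```shell
#!/bin/sh
# prepare_installer FILE
# Checks that the downloaded installer exists and makes it executable
# for the current user.
prepare_installer() {
  installer="$1"
  if [ ! -f "$installer" ]; then
    echo "installer not found: $installer" >&2
    return 1
  fi
  chmod u+x "$installer"
}

# Example, run from the download directory:
#   prepare_installer ./cloudera-manager-installer.bin &&
#     sudo ./cloudera-manager-installer.bin
```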