Hadoop single-node installation

Transcript

  • 18. Hadoop Single Node Cluster Installation www.kannandreams.wordpress.com www.facebook.com/groups/huge360/
  • 19. Prerequisites
    Operating System: Ubuntu 10.04 / 12.04 - This tutorial was tested on Ubuntu 12.04; Linux is the recommended OS for running Hadoop.
    Java SDK (Java 1.6, i.e. Java 6) - Hadoop requires a working Java 1.5+ installation; this tutorial was tested with 1.6.
    OpenSSH - This package is required to configure SSH access to localhost.
  • 20. Java 6
    Hadoop is written in Java, so a working Java installation is required to run it. Hadoop works with Java 1.5+, but Java 1.6 (aka Java 6) is recommended, and Oracle Java 6 is preferred for Hadoop.
    Steps:
    1. To check the installed Java version:
       $ java -version
    2. To install Java 1.6:
       $ sudo add-apt-repository ppa:webupd8team/java
       $ sudo apt-get update
       $ sudo apt-get install oracle-java6-installer
    3. To check the installed package, look under /usr/lib/jvm/java-6-oracle
    Note: Step 2 is only required when java -version does not show the recommended version.
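    As a quick sanity check, a small sketch (assuming the Oracle Java 6 installer finished successfully) to confirm both the reported version and the installation directory before moving on:
      # Print the Java version string; it should mention 1.6.x for Java 6.
      java -version 2>&1 | head -n 1
      # Confirm the expected installation directory exists.
      ls -d /usr/lib/jvm/java-6-oracle && echo "Java 6 install directory found"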
  • 21. Adding a dedicated user
    We will use a dedicated Hadoop user account for running Hadoop. It is not required, but it is recommended because it helps separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).
    1. Add a new group:
       $ sudo addgroup hadoop
    2. Add a user to the group:
       $ sudo adduser --ingroup hadoop hduser
    Note: To give hduser sudo access, modify the sudoers file and grant it the ALL option, as sketched below.
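    The sudoers note above is terse, so here is a sketch of what it means in practice. The exact line follows standard Ubuntu sudoers syntax (an assumption, since the slide does not show it); always edit the file through visudo so the syntax is validated before saving:
      # Open the sudoers file safely; visudo checks the syntax on save.
      sudo visudo
      # Then add a line like the following to grant hduser full sudo rights:
      # hduser ALL=(ALL:ALL) ALL
    On Ubuntu you could alternatively add hduser to the sudo group instead (sudo adduser hduser sudo); either approach gives the user sudo rights.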
  • 22. Configuring SSH
    To know more about OpenSSH: https://help.ubuntu.com/10.04/serverguide/openssh-server.html
    Hadoop uses SSH to manage and access its nodes. For our single-node installation we configure SSH access to localhost on the local machine.
    1. Install OpenSSH:
       $ sudo apt-get install openssh-server
    2. Switch to the Hadoop user:
       $ su - hduser
    3. Create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed so Hadoop can unlock the key without your interaction:
       $ ssh-keygen -t rsa -P ""
    4. Enable SSH access to your local machine with this newly created key:
       $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    5. The final step is to test the SSH setup by connecting to your local machine as the hduser user. This step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file:
       $ ssh localhost
    Sample output:
    kannan@kannandreams:~$ su - hduser
    Password:
    hduser@kannandreams:~$ ssh-keygen -t rsa -P ""
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hduser/.ssh/id_rsa): /home/hduser/.ssh/id_rsa already exists.
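    If ssh localhost still prompts for a password, it is often a permissions issue on the key files. A minimal sketch (assuming you are logged in as hduser and OpenSSH uses the default ~/.ssh layout):
      # SSH refuses keys whose files are group- or world-writable.
      chmod 700 $HOME/.ssh
      chmod 600 $HOME/.ssh/authorized_keys
      # Non-interactive test: prints "ok" only if key-based login works,
      # instead of falling back to a password prompt.
      ssh -o BatchMode=yes localhost echo ok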
  • 23. Disable IPv6
    Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. Please refer to the JIRA issues:
    https://issues.apache.org/jira/browse/HADOOP-3437
    https://issues.apache.org/jira/browse/HADOOP-6056
    1. To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in an editor.
    2. Add the following lines to the end of the file:
       # disable ipv6
       net.ipv6.conf.all.disable_ipv6 = 1
       net.ipv6.conf.default.disable_ipv6 = 1
       net.ipv6.conf.lo.disable_ipv6 = 1
    You have to reboot your machine for the changes to take effect.
    You can check whether IPv6 is enabled on your machine with the following command:
    $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
    A return value of 0 means IPv6 is enabled; a value of 1 means it is disabled.
    Note: You can also disable IPv6 only for Hadoop, as documented in HADOOP-3437; see the sketch below.
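    Two small optional sketches related to the note above. The first reloads the sysctl settings so you can verify the flag without waiting for the reboot the slide recommends; the second is the Hadoop-only alternative from HADOOP-3437, which tells the Hadoop JVMs to prefer IPv4 (the line goes into conf/hadoop-env.sh):
      # Reload /etc/sysctl.conf and re-check the flag (1 = IPv6 disabled).
      sudo sysctl -p
      cat /proc/sys/net/ipv6/conf/all/disable_ipv6

      # Hadoop-only alternative: prefer IPv4 inside the Hadoop JVMs.
      # Add this line to /usr/local/hadoop/conf/hadoop-env.sh.
      export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true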
  • 24. Download & Install Hadoop
    Download Hadoop from the Apache download mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and the hadoop group (see the sketch below).
    http://apache.mirrors.pair.com/hadoop/core/hadoop-1.2.1/
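    A sketch of the commands this slide describes. The exact archive name is an assumption based on the mirror URL above (hadoop-1.2.1.tar.gz); adjust it if you downloaded a different release:
      cd /usr/local
      # Download and unpack the release (archive name assumed from the mirror listing).
      sudo wget http://apache.mirrors.pair.com/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
      sudo tar xzf hadoop-1.2.1.tar.gz
      sudo mv hadoop-1.2.1 hadoop
      # Give ownership of the whole tree to the dedicated user and group.
      sudo chown -R hduser:hadoop hadoop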
  • 25. I have captured some common mistakes here to help understand what we are doing.
  • 26. Profile File Configuration
    1. Environment variables:
       export HADOOP_HOME=/usr/local/hadoop
       export JAVA_HOME=/usr/lib/jvm/java-6-oracle
       export PATH=$PATH:$HADOOP_HOME/bin
    2. Convenient aliases and functions for running Hadoop-related commands:
       unalias fs &> /dev/null
       alias fs="hadoop fs"
       unalias hls &> /dev/null
       alias hls="fs -ls"
    Note: First we unalias anything already associated with fs (&> /dev/null discards any output from unalias), then alias fs as a shortcut for the hadoop fs command. We do the same for fs -ls. We will see more on HDFS commands in the next tutorial.
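    The slide does not say which profile file to edit; a common choice (an assumption here, following the usual single-node setup) is the hduser's ~/.bashrc. A minimal sketch:
      # Run these as hduser (e.g. after su - hduser): append the environment
      # variables to ~/.bashrc so they survive new shells, then reload the
      # file for the current session.
      cat >> $HOME/.bashrc <<'EOF'
      export HADOOP_HOME=/usr/local/hadoop
      export JAVA_HOME=/usr/lib/jvm/java-6-oracle
      export PATH=$PATH:$HADOOP_HOME/bin
      EOF
      source $HOME/.bashrc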
  • 27. Continuing Profile Configuration
    lzop is a file compressor very similar to gzip; lzop favors speed over compression ratio. If you have LZO compression enabled in your Hadoop cluster and compress job outputs with LZOP, the following helper function is convenient:
    # Conveniently inspect an LZOP compressed file from the command
    # line; run via:
    #
    # $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
    #
    # Requires installed 'lzop' command.
    #
    lzohead () {
        hadoop fs -cat $1 | lzop -dc | head -1000 | less
    }
  • 28. Configurations
    We are going to modify the properties in the following configuration files to set up the single-node cluster and its directory structure:
    hadoop-env.sh
    conf/*-site.xml:
      core-site.xml
      mapred-site.xml
      hdfs-site.xml
  • 29. hadoop-env.sh
    In /usr/local/hadoop/conf/hadoop-env.sh, find the commented line below and change it as shown.
    From:
    # The java implementation to use. Required.
    # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
    To:
    # The java implementation to use. Required.
    export JAVA_HOME=/usr/lib/jvm/java-6-oracle
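    If you prefer to make this change from the command line rather than an editor, a sketch (assuming the file still contains the stock commented JAVA_HOME line shown above):
      # Replace the commented stock JAVA_HOME line with the Oracle Java 6 path.
      sudo sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-6-oracle|' \
        /usr/local/hadoop/conf/hadoop-env.sh
      # Confirm the change took effect.
      grep JAVA_HOME /usr/local/hadoop/conf/hadoop-env.sh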
  • 30. Directory creation
    Configure the directory where Hadoop will store its data files, the network ports it listens on, etc. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop's default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don't be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
    $ sudo mkdir -p /app/hadoop/tmp
    $ sudo chown hduser:hadoop /app/hadoop/tmp
    # ...and if you want to tighten up security, chmod from 755 to 750...
    $ sudo chmod 750 /app/hadoop/tmp
    If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you format the NameNode later on.
  • 31. conf/core-site.xml
  • 32. Add the following lines between the <configuration> tags of conf/core-site.xml:
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
    </property>
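    If you are unsure what the finished file should look like, here is a minimal sketch of a complete conf/core-site.xml with the standard wrapper around those two properties (descriptions omitted for brevity; note this heredoc overwrites the file rather than editing it, and assumes Hadoop lives in /usr/local/hadoop):
      cat > /usr/local/hadoop/conf/core-site.xml <<'EOF'
      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
        <property>
          <name>hadoop.tmp.dir</name>
          <value>/app/hadoop/tmp</value>
        </property>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:54310</value>
        </property>
      </configuration>
      EOF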
  • 33. Add the following between the <configuration> tags of conf/mapred-site.xml:
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    </property>
  • 34. Add the following between the <configuration> tags of conf/hdfs-site.xml:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
    </property>
  • 35. Hadoop NameNode formatting
    Before we start the Hadoop services, we need to format the Hadoop filesystem, which is deployed on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
    /usr/local/hadoop/bin/hadoop namenode -format
    Why do we need formatting?
    The Hadoop NameNode is the centralized place of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. In other words, it maintains the metadata store related to the DataNodes. When we format the NameNode, it formats that metadata; all the information on the DataNodes is lost, and they can be reused for new data.
    Note: Formatting has to be done at initial setup only, not later. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS)!
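    A small sketch of running the format step and checking its effect. The directory check assumes the defaults from this tutorial, where the NameNode's storage directory falls under hadoop.tmp.dir (/app/hadoop/tmp):
      # Run the format as the dedicated user (first-time setup only).
      $ su - hduser
      $ /usr/local/hadoop/bin/hadoop namenode -format
      # The NameNode metadata directory should now exist under hadoop.tmp.dir.
      $ ls /app/hadoop/tmp/dfs/name/current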
  • 36. Note: The warning “$HADOOP_HOME is deprecated” can be ignored. This environment variable is no longer used in the latest versions.
  • 37. Service Control
    To start the services (this will start a NameNode, DataNode, JobTracker and a TaskTracker on your machine):
    /usr/local/hadoop/bin/start-all.sh
    To stop the services (this stops all Hadoop-related processes running on your machine):
    /usr/local/hadoop/bin/stop-all.sh
  • 38. Warning: “$HADOOP_HOME is deprecated” - this warning can be safely ignored, or the variable omitted. The environment variable is no longer referenced in higher versions of the Hadoop package.
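    If the warning bothers you, the Hadoop 1.x launcher scripts also provide a switch to silence it without removing the variable; a sketch (the variable name comes from the 1.x scripts, so treat it as version-specific):
      # Suppress the "$HADOOP_HOME is deprecated" warning in Hadoop 1.x.
      export HADOOP_HOME_WARN_SUPPRESS="TRUE"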
  • 39. If the jps command is not working, please don't panic; it has nothing to do with the Hadoop installation. jps is a Java process monitoring utility that shows the running Java processes. Install Oracle Java 6 and set the path, and it should work.
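    Once jps works and the services have been started, it is a convenient way to confirm that the single-node daemons are all up. A quick sketch of the check, with the processes you would expect to see on this setup:
      # Run as hduser after start-all.sh; expect to see (PIDs will differ):
      # NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker, Jps
      jps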
  • 40. Hadoop Web Portals
    NameNode: http://localhost:50070/
    The NameNode web UI shows you a cluster summary including information about total/remaining capacity and live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine's Hadoop log files.
    JobTracker: http://localhost:50030/
    The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running).
    TaskTracker: http://localhost:50060/
    The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.
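    A quick command-line sketch to confirm the web UIs are reachable (assuming the daemons are running on the local machine and curl is installed):
      # Each should return an HTTP status line (200 or a redirect) if the daemon is up.
      curl -sI http://localhost:50070/ | head -n 1   # NameNode
      curl -sI http://localhost:50030/ | head -n 1   # JobTracker
      curl -sI http://localhost:50060/ | head -n 1   # TaskTracker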