Implementing Hadoop on a single cluster


Published in: Technology

  1. Implementing Hadoop on a Single Cluster - Salil Navgire
  2. Basic Setup
     1. Install Ubuntu
     2. Install Java and Python, and update the system
     3. Add the group ‘hadoop’ and the user ‘hduser’ (for security and backup)
     4. Configure SSH
        a) Install OpenSSH Server
        b) Configure it by editing the file ssh_config, saving a backup first
        c) Generate an SSH key for hduser
        d) Enable SSH access to your local machine with the newly created RSA key
        e) Test the connection: hduser@Ubuntu:~$ ssh localhost
     5. Disable IPv6 by editing the sysctl.conf file in an editor
  3. Installing Hadoop
     1. Download Hadoop from one of the Apache Download Mirrors and unpack it
        • salil@ubuntu:/usr/local$ sudo tar xzf hadoop-2.0.6-alpha-src.tar.gz
     2. Make sure to change the owner to hduser in the hadoop group
        • $ sudo chown -R hduser:hadoop hadoop (changes the ownership)
     3. Update $HOME/.bashrc with the Hadoop-related environment variables
  4. Configuration
     1. Edit the environment variables in conf/
     2. Change the settings in conf/*site.xml
     3. Create the directory /app/hadoop/tmp and set the required ownership and permissions
        • $ sudo mkdir -p /app/hadoop/tmp
        • $ sudo chown hduser:hadoop /app/hadoop/tmp
        • $ sudo chmod 750 /app/hadoop/tmp
     4. Add the configuration snippets between the <configuration> ... </configuration> tags in core-site.xml, mapred-site.xml and hdfs-site.xml
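
The configuration snippets themselves are not shown on the slide. As a rough sketch of what goes between the <configuration> tags, the Python below builds the three files from name/value pairs. The property names are the usual Hadoop 1.x ones; the port numbers 54310 and 54311 and the replication factor of 1 are assumptions for a single-node setup, not values taken from the slides. Only hadoop.tmp.dir reuses the directory created in step 3.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Return a Hadoop *-site.xml document (as a string) for the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# core-site.xml: temp directory from step 3; the HDFS URI and port are assumed
core_site = build_site_xml({
    "hadoop.tmp.dir": "/app/hadoop/tmp",
    "fs.default.name": "hdfs://localhost:54310",
})

# mapred-site.xml: JobTracker address (assumed host and port)
mapred_site = build_site_xml({"mapred.job.tracker": "localhost:54311"})

# hdfs-site.xml: one replica is enough on a single node (assumed value)
hdfs_site = build_site_xml({"dfs.replication": "1"})

print(core_site)
```

Each string can be written over the matching file in conf/ once the values have been checked against your own installation.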
  5. Starting your single-node cluster
     • First format the NameNode
     • hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
     • Then start your single-node cluster
  6. Running a MapReduce job
     • Download the data and copy it from the local file system to HDFS
     • hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/project.txt /user/new
     • hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/hadoop/project.txt /user/lol
  7. • hduser@ubuntu:~$ hadoop dfs -ls /user/lol
       Found 2 items
       drwxr-xr-x - hduser supergroup 0 2013-10-10 06:30 /user/lol/output
       -rw-r--r-- 1 hduser supergroup 969039 2013-10-05 20:20 /user/lol/project.txt
     • hduser@ubuntu:~$ hadoop jar /home/hduser/hadoop/hadoop-examples-1.0.3.jar wordcount /user/lol/project.txt /user/lol/output/
     • Hadoop Web interfaces
       • http://localhost:50070/ – web UI of the NameNode daemon
       • http://localhost:50030/ – web UI of the JobTracker daemon
       • http://localhost:50060/ – web UI of the TaskTracker daemon
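
To check from a script that the input file really landed in HDFS, the directory listing can be split into fields. The sketch below assumes the eight-column layout seen in the listing above (permissions, replication, owner, group, size, date, time, path) and reads the second date as 2013-10-05. The function name parse_ls_line is made up for illustration, and the sample text stands in for the real command output.

```python
def parse_ls_line(line):
    """Split one `hadoop dfs -ls` entry into named fields (assumed 8-column layout)."""
    perms, replication, owner, group, size, date, time, path = line.split(None, 7)
    return {
        "is_dir": perms.startswith("d"),
        "owner": owner,
        "group": group,
        "size": int(size),
        "modified": f"{date} {time}",
        "path": path,
    }


# Sample text standing in for the output of `hadoop dfs -ls /user/lol`
listing = """Found 2 items
drwxr-xr-x   - hduser supergroup          0 2013-10-10 06:30 /user/lol/output
-rw-r--r--   1 hduser supergroup     969039 2013-10-05 20:20 /user/lol/project.txt"""

entries = [
    parse_ls_line(line)
    for line in listing.splitlines()
    if line and not line.startswith("Found")  # skip the summary header
]

for entry in entries:
    kind = "dir " if entry["is_dir"] else "file"
    print(kind, entry["size"], entry["path"])
```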
  8. • The NameNode web interface gives a cluster summary: total and remaining capacity, and live and dead nodes.
     • Additionally, we can browse HDFS to view the contents of files and logs.
  9. • The JobTracker web interface provides general job statistics about the Hadoop cluster, running/completed/failed jobs, and a job history log file.
     • The TaskTracker web interface provides information about running and non-running tasks.
  10. Writing MapReduce programs
      • The Hadoop framework is written in Java, which can be hard to code in for people without a CS background
      • Programs can be written in Python and converted to a .jar file using Jython to run on a Hadoop cluster
      • But Jython has an incomplete standard library, because some Python features are not provided in Jython
      • The alternative is to use Hadoop Streaming
      • Hadoop Streaming is a utility that comes with the Hadoop distribution; it can run any executable or script as the mapper and reducer
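
Hadoop Streaming passes data to the mapper and reducer on standard input, one record per line, and reads their results from standard output. A minimal word-count sketch under that assumption follows. The function names and the single-script layout with a map/reduce argument are illustrative choices, not something the slides prescribe.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive lines that share a word.

    Streaming sorts the mapper output by key before the reduce phase,
    so equal words arrive next to each other.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Run as `script map` or `script reduce`; anything else does nothing.
    steps = {"map": map_lines, "reduce": reduce_lines}
    step = steps.get(sys.argv[1]) if len(sys.argv) == 2 else None
    if step is not None:
        for out in step(sys.stdin):
            print(out)
```

Saved under a hypothetical name such as wordcount.py, the script can be tried without a cluster by piping a text file through the map step, then sort, then the reduce step, which mimics the order in which Streaming calls them.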
  11. • Write the mapper and reducer in Python
      • Download the data and copy it to HDFS
      • Run the job the same way as the previous Java implementation
      • There are other third-party Python MapReduce solutions that are similar to Streaming/Jython but can be used easily as a Python library
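
Running the Python job differs from the Java example mainly in the jar that is launched and the options that name the scripts. The sketch below only assembles the command line and prints it. The streaming jar location, the script names mapper.py and reducer.py, and the output path are assumptions chosen for illustration; -file, -mapper, -reducer, -input and -output are the standard Streaming options.

```python
import shlex


def streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Build the argument list for a Hadoop Streaming job."""
    return [
        "hadoop", "jar", streaming_jar,
        "-file", mapper, "-mapper", mapper,      # ship the script, then use it as mapper
        "-file", reducer, "-reducer", reducer,   # same for the reducer
        "-input", input_path,
        "-output", output_path,
    ]


cmd = streaming_command(
    # Assumed location of the streaming jar in a Hadoop 1.0.3 install
    streaming_jar="/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar",
    input_path="/user/lol/project.txt",
    output_path="/user/lol/output-python",  # hypothetical; must not exist yet
    mapper="mapper.py",                     # hypothetical script names
    reducer="reducer.py",
)

# Print a copy-and-paste form of the command instead of running it
print(" ".join(shlex.quote(part) for part in cmd))
```

The scripts need a Python shebang line and the executable bit for Streaming to start them directly.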
  12. Python implementation strategies
      • Streaming
      • mrjob
      • dumbo
      • Hadoopy
      • Non-Hadoop
        • disco
      • Prefer Hadoop Streaming if possible, because it is easy and has the lowest overhead
      • Prefer mrjob where you need higher abstraction and integration with AWS
  13. Future Work
      • Python implementation in Hadoop
      • Running Hadoop in a multi-node cluster
      • Pig and its implementation on Linux
      • Apache Mahout, Hive, Solr