Configuring Your First Hadoop Cluster On EC2

Benjamin Wootton
www.benjaminwootton.co.uk

What Is This Document?

- A high level tutorial for setting up a small 3 node Hadoop cluster on Amazon's EC2 cloud;
- It aims to be a complete, simple, and fast guide for people new to Hadoop. Documentation on the Internet is spotty and text dense, so it can be slower to get started than it needs to be;
- It aims to cover the little stumbling blocks around SSH keys and a few small bug workarounds that can trip you up on the first pass through;
- It does not aim to cover MapReduce and writing jobs. A small job will be provided at the end to validate the cluster.

Recap - What Is Hadoop?

- An open source framework for ‘reliable, scalable, distributed computing’;
- It gives you the ability to process and work with large datasets that are distributed across clusters of commodity hardware;
- It allows you to parallelize computation and ‘move processing to the data’ using the MapReduce framework.

Recap - What Is Amazon EC2?

- A ‘cloud’ web host that allows you to dynamically add and remove compute server resources as you need them, allowing you to pay for only the capacity that you need;
- It is well suited to Hadoop computation: we can bring up enormous clusters within minutes and then spin them down when we’ve finished to reduce costs;
- EC2 is quick and cost effective for experimental and learning purposes, as well as being proven as a production Hadoop host.

Assumptions & Notes

- The document assumes basic familiarity with Linux, Java, and SSH;
- The cluster will be set up manually to demonstrate concepts of Hadoop. In real life, we would typically use a configuration management tool such as Puppet or Chef to manage and automate larger clusters;
- The configuration shown is not production ready. Real Hadoop clusters need much more bootstrap configuration and security;
- It assumes that you are running your cluster on Ubuntu Linux, but are accessing the cluster from a Windows host. This is possibly not a sensible assumption, but it’s what I had at the time of writing!

Part 1: EC2 CONFIGURATION

1. Start EC2 Servers

- Sign up to Amazon Web Services @ http://aws.amazon.com/;
- Login and navigate to Amazon EC2. Using the ‘classic wizard’, create three micro instances running the latest 64 bit Ubuntu Server;
- If you do not already have a key pair .pem file, you will need to create one during the process. We will later use this to connect to the servers and to navigate around within the cluster, so keep it in a safe place.

2. Name EC2 Servers

- For reference, name the instances Master, Slave 1, and Slave 2 within the EC2 console once they are running;
- Note down the host names for each of the 3 instances in the bottom part of the management console. We will use these to access the servers.

3. Prepare Keys from PEM file

- We need to break down the Amazon supplied .pem file into private and public keys in order to access the servers from our local machine;
- To do this, download PuttyGen @ http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html;
- Using PuttyGen, import your PEM file (Conversions > Import Key) and export the public and private keys to a safe place.

4. Configure Putty

- We now need to SSH into our EC2 servers from our local machine. To do this, we will use the Putty SSH client for Windows;
- Begin by configuring Putty sessions for each of the three servers and saving them for future convenience;
- Under Connection > SSH > Auth in the Putty tree, point towards the private key that you generated in the previous step.

5. Optional - mRemote

- I use a tool called mRemote which allows you to embed Putty instances into a tabbed browser. Try it @ http://www.mremote.org. I recommend this, as navigating around all of the hosts in your Hadoop cluster can be fiddly for larger manually managed clusters;
- If you do this, be sure to select the corresponding Putty session for each mRemote connection so that the private key is carried through and you can connect.

6. Install Java & Hadoop

- We need to install Java on the cluster machines in order to run Hadoop. OpenJDK 7 will suffice for this tutorial. Connect to all three machines using Putty or mRemote, and on each of the three machines run the following:

    sudo apt-get install openjdk-7-jdk

- When that’s complete, configure the JAVA_HOME variable by adding the following line at the top of ~/.bashrc:

    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

- We can now download and unpack Hadoop. On each of the three machines run the following:

    cd ~
    wget http://apache.mirror.rbftpnetworks.com/hadoop/common/hadoop-1.0.3/hadoop-1.0.3-bin.tar.gz
    gzip -d hadoop-1.0.3-bin.tar.gz
    tar -xf hadoop-1.0.3-bin.tar
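
Before moving on, it is worth a quick sanity check that the JDK installed and that JAVA_HOME resolves; the exact directory name under /usr/lib/jvm can differ between Ubuntu releases, so adjust the export above if yours does. For example, on each machine:

    # Confirm the JDK is installed and JAVA_HOME points at a real directory
    java -version
    ls /usr/lib/jvm/
    source ~/.bashrc && echo $JAVA_HOME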

7. Configure SSH Keypairs

- Hadoop needs to SSH from the master to the slave servers to start and stop processes.
- All of the Amazon servers will have our generated public key installed in their ~ubuntu/.ssh/authorized_keys file automatically by Amazon. However, we need to put the corresponding private key into ~ubuntu/.ssh/id_rsa on the master server so that the master can connect to the slaves.
- To upload the file, use the file transfer software WinSCP @ http://winscp.net to push the file into your .ssh folder. Be sure to upload your OpenSSH private key generated from PuttyGen (Conversions > Export OpenSSH Key)!
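
A quick way to confirm the key is in place before Hadoop depends on it is to SSH from the master to one of the slaves by hand (using the example slave hostname from this walkthrough; substitute your own):

    # Tighten permissions on the key and test a connection to a slave
    chmod 600 ~/.ssh/id_rsa
    ssh ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com hostname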

8. Passwordless SSH

- It is better if Hadoop can move between boxes without requiring the passphrase to your key file;
- To do this, we can load the key into the SSH agent by using the following commands on the master server. This will avoid the need to specify the passphrase repeatedly when stopping and starting the cluster:

    ssh-agent bash
    ssh-add

9. Open Firewall Ports

- We need to open a number of ports to allow the Hadoop cluster to communicate and to expose various web interfaces to us.
- Do this by adding inbound rules to the default security group on the AWS EC2 management console. Open ports 9000, 9001, and 50000-50100. (If you use port 54311 for the job tracker, as in the MapReduce configuration later in this guide, open that port as well.)
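
If you prefer the command line to the console, the same inbound rules can be added with the AWS CLI. This is a hedged alternative, not something the original walkthrough uses; the security group name and the wide-open source range below are assumptions, and in practice you should restrict the source to your own IP:

    # Open the Hadoop ports on the default security group
    # (adjust --group-name and --cidr to match your own setup)
    aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 9000 --cidr 0.0.0.0/0
    aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 9001 --cidr 0.0.0.0/0
    aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 50000-50100 --cidr 0.0.0.0/0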

Part 2: HADOOP CONFIGURATION

1. Decide On Cluster Layout

- There are four components of Hadoop which we would like to spread out across the cluster:
  ◦ Data nodes – actually store and manage data;
  ◦ Naming node – acts as a catalogue service, showing what data is stored where;
  ◦ Job tracker – tracks and manages submitted MapReduce tasks;
  ◦ Task tracker – low level worker that is issued jobs from the job tracker.
- Let’s go with the following setup. This is a fairly typical layout: data nodes and task trackers spread across the cluster, with a single instance of the naming node and job tracker:

    Node      Hostname            Components
    Master    ec2-23-22-133-70    Naming Node, Job Tracker
    Slave 1   ec2-23-20-53-36     Data Node, Task Tracker
    Slave 2   ec2-184-73-42-163   Data Node, Task Tracker

2a. Configure Server Names

- Log out of all of the machines and log back into the master server;
- The Hadoop configuration is located here on the server:

    cd /home/ubuntu/hadoop-1.0.3/conf

- Open the file ‘masters’ and replace the word ‘localhost’ with the hostname of the server that you have allocated as master:

    vi masters

- Open the file ‘slaves’ and replace the word ‘localhost’ with the hostnames of the two slave servers, on two separate lines:

    vi slaves

2b. Configure Server Names

- The two files should end up containing your own allocated hostnames; see the sketch below.
- Do not use ‘localhost’ in the masters/slaves files as this can lead to non-descriptive errors!
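
As a concrete sketch of the end state, using the example hostnames from the cluster layout table (the fully qualified .compute-1.amazonaws.com form is an assumption; use whatever your console reports):

    # On the master: write the master and slave hostnames into the two files
    cd /home/ubuntu/hadoop-1.0.3/conf
    printf '%s\n' ec2-23-22-133-70.compute-1.amazonaws.com > masters
    printf '%s\n' ec2-23-20-53-36.compute-1.amazonaws.com \
                  ec2-184-73-42-163.compute-1.amazonaws.com > slaves
    cat masters slaves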

3a. Configure HDFS

- HDFS is the distributed file system that sits behind Hadoop instances, syncing data so that it’s close to the processing and providing redundancy. We should therefore set this up first;
- We need to specify some mandatory parameters in various XML configuration files to get HDFS up and running;
- Still on the master server, the first thing we need to do is to set the name of the default file system so that it always points back at the master, again using your own fully qualified hostname:

    /home/ubuntu/hadoop-1.0.3/conf/core-site.xml

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://ec2-107-20-118-109.compute-1.amazonaws.com:9000</value>
      </property>
    </configuration>

3b. Configure HDFS

- Still on the master server, we also need to set the dfs.replication parameter, which says how many nodes each block of data should be replicated to for failover and redundancy purposes:

    /home/ubuntu/hadoop-1.0.3/conf/hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

4. Configure MapReduce

- As well as the underlying HDFS file system, we have to set one mandatory parameter that will be used by the Hadoop MapReduce framework;
- Still on the master server, we need to set the job tracker location, which Hadoop will use. As discussed earlier, we will put the job tracker on the master in this instance, again being careful to substitute in your own master server hostname:

    /home/ubuntu/hadoop-1.0.3/conf/mapred-site.xml

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>ec2-107-22-78-136.compute-1.amazonaws.com:54311</value>
      </property>
    </configuration>

5a. Push Configuration To Slaves

- We need to push the small amount of mandatory configuration that we have done out to all of the slaves. Typically this configuration could be mounted on a shared drive, but we will do it manually this time using SCP:

    cd /home/ubuntu/hadoop-1.0.3/conf
    scp * ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com:/home/ubuntu/hadoop-1.0.3/conf
    scp * ubuntu@ec2-184-73-42-163.compute-1.amazonaws.com:/home/ubuntu/hadoop-1.0.3/conf

- By virtue of pushing out the masters and slaves files, the various nodes in this cluster should all be correctly configured and referencing each other at this stage.
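
A simple way to double-check that every node now has identical configuration is to compare checksums from the master (again using the example slave hostnames):

    # The checksums printed for each host should match
    md5sum /home/ubuntu/hadoop-1.0.3/conf/*-site.xml
    ssh ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com 'md5sum /home/ubuntu/hadoop-1.0.3/conf/*-site.xml'
    ssh ubuntu@ec2-184-73-42-163.compute-1.amazonaws.com 'md5sum /home/ubuntu/hadoop-1.0.3/conf/*-site.xml'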

6a. Format HDFS

- Before we can start Hadoop, we need to format and initialise the underlying distributed file system;
- To do this, on the master server, execute the following command:

    cd /home/ubuntu/hadoop-1.0.3/bin
    ./hadoop namenode -format

- It should only take a minute. The expected output of the format operation is on the next page;
- The file system will be built and formatted. It will exist in /tmp/hadoop-ubuntu if you would like to browse around. Hadoop will manage this file system, distributing data across the nodes and handling access to it.

6b. Format HDFS

(Screenshot in the original deck: the expected output of the namenode format command.)
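
If you would rather verify the format from the shell than by eyeballing the output, the name node's freshly created image directory should now exist under Hadoop's default temporary directory (the default location, assuming hadoop.tmp.dir has not been overridden):

    # The formatted name node metadata lives here by default
    ls -R /tmp/hadoop-ubuntu/dfs/name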

Part 3: START HADOOP

1a. Start HDFS

- We will begin by starting the HDFS file system from the master server.
- There is a script for this which will run the name node on the master and the data nodes on the slaves:

    cd /home/ubuntu/hadoop-1.0.3
    ./bin/start-dfs.sh

1b. Start HDFS

- At this point, monitor the log files on the master and the slaves. You should see that HDFS cleanly starts on the slaves when we start it on the master:

    cd /home/ubuntu/hadoop-1.0.3
    tail -f logs/hadoop-ubuntu-datanode-ip-10-245-114-186.log

- If anything appears to have gone wrong, double check that the configuration files above are correct, that the firewall ports are open, and that everything has been accurately pushed to all slaves.
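
The jps tool that ships with the JDK is another quick way to see which Hadoop daemons are actually running on each box; with the layout used here you would expect something like the following:

    # On the master you should see a NameNode (and a SecondaryNameNode);
    # on each slave you should see a DataNode
    jps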

2. Start MapReduce

- Once we’ve confirmed that HDFS is up, it is now time to start the MapReduce component of Hadoop;
- There is a script for this which will run the job tracker on the master and the task trackers on the slaves. Run the following on the master server:

    cd /home/ubuntu/hadoop-1.0.3
    ./bin/start-mapred.sh

- Again, double check the log files on all servers to check that everything is communicating cleanly before moving further. Double check configuration and firewall ports in the case of issues.
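
Once both start scripts have completed, running jps again should show the full set of daemons for this layout:

    # Expected on the master: NameNode, SecondaryNameNode, JobTracker
    # Expected on each slave: DataNode, TaskTracker
    jps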

Part 4: EXPLORE HADOOP

2a. Web Interfaces

- Now we’re up and running, Hadoop has started a number of web interfaces that give information about the cluster and HDFS. Take a look at these to familiarise yourself with them:

    NameNode      master:50070                  Information about the name node and the health of the distributed file system
    DataNode      slave1:50075, slave2:50075    TBC
    JobTracker    master:50030                  Information about submitted and queued jobs
    TaskTracker   slave1:50060, slave2:50060    Information about tasks that are submitted and queued
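
Port 50070 falls inside the 50000-50100 range opened earlier, so you can also confirm the NameNode interface responds, for example with curl from any machine that has it installed (substituting your own master hostname):

    # A successful HTTP response here means the NameNode web interface is up
    curl -I http://ec2-23-22-133-70.compute-1.amazonaws.com:50070/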

2b-2d. Web Interfaces

(Screenshots in the original deck of the Hadoop web interfaces listed above.)

Part 5: YOUR FIRST MAPREDUCE JOB

1. Source Dataset

- It’s now time to submit a processing job to the Hadoop cluster.
- Though we won’t go into much detail here, for the exercise I used a dataset of UK government spending, which you can bring down onto your master server like so:

    wget http://www.dwp.gov.uk/docs/dwp-payments-april10.csv
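
It can be worth a quick look at what was downloaded before loading it into the cluster:

    # Peek at the first few rows and the size of the CSV
    head -5 dwp-payments-april10.csv
    wc -l dwp-payments-april10.csv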

2. Push Dataset Into HDFS

- We need to push the dataset into HDFS so that it’s available to be shared across the nodes for subsequent processing:

    /home/ubuntu/hadoop-1.0.3/bin/hadoop dfs -put dwp-payments-april10.csv dwp-payments-april10.csv

- After pushing the file, note how the NameNode health page shows the file count and space used increasing.
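
The upload can also be confirmed from the command line:

    # List the file as HDFS now sees it
    /home/ubuntu/hadoop-1.0.3/bin/hadoop dfs -ls /user/ubuntu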

3a. Write A MapReduce Job

- It’s now time to open our IDE and write the Java code that will represent our MapReduce job. For the purposes of this presentation, we are just looking to validate the Hadoop cluster, so we will not go into detail with regards to MapReduce;
- My Java project is available @ http://www.benjaminwootton.co.uk/wordcount.zip. As per the sample project, I recommend that you use Maven in order to easily bring in the Hadoop dependencies and build the required JAR. You can manage this part differently if you prefer.

4. Package The MapReduce Job

- Hadoop jobs are typically packaged as Java JAR files. By virtue of using Maven, we can get the packaged JAR simply by running mvn clean package against the sample project.
- Upload the JAR onto your master server using WinSCP.
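
If you happen to be building on a Linux or Mac machine rather than Windows, plain scp with your .pem key works just as well as WinSCP for this step (the key path and the firsthadoop.jar artifact name below are assumptions, matching the JAR used on the next slide):

    # Build the job JAR and copy it to the master server's home directory
    mvn clean package
    scp -i ~/my-ec2-key.pem target/firsthadoop.jar ubuntu@ec2-23-22-133-70.compute-1.amazonaws.com:~/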

5a. Execute The MapReduce Job

- We are finally ready to run the job! To do that, we’ll run the Hadoop script and pass in a reference to the JAR file, the name of the class containing the main method, and the input file and output directory that will be used by our Java code:

    cd /home/ubuntu/hadoop-1.0.3
    ./bin/hadoop jar ~/firsthadoop.jar benjaminwootton.WordCount /user/ubuntu/dwp-payments-april10.csv /user/ubuntu/RESULTS

- If all goes well, we’ll see the job run with no errors:

    ubuntu@ip-10-212-121-253:~/hadoop-1.0.3$ ./bin/hadoop jar ~/firsthadoop.jar benjaminwootton.WordCount /user/ubuntu/dwp-payments-april10.csv /user/ubuntu/RESULTS
    12/06/03 16:06:12 INFO mapred.JobClient: Running job: job_201206031421_0004
    12/06/03 16:06:13 INFO mapred.JobClient:  map 0% reduce 0%
    12/06/03 16:06:29 INFO mapred.JobClient:  map 50% reduce 0%
    12/06/03 16:06:31 INFO mapred.JobClient:  map 100% reduce 0%
    12/06/03 16:06:41 INFO mapred.JobClient:  map 100% reduce 100%
    12/06/03 16:06:50 INFO mapred.JobClient: Job complete: job_201206031421_0004

5b. Execute The MapReduce Job

- You will also be able to monitor the progress of the job on the job tracker web application that is running on the master server.

6. Confirm Results!

- And the final step is to confirm our results by interrogating HDFS:

    ubuntu@ip-10-212-121-253:~/hadoop-1.0.3$ ./bin/hadoop dfs -cat /user/ubuntu/RESULTS/part-00000
    Pension 2150
    Corporate 5681
    Employment Programmes 491
    Entity 2
    Housing Benefits 388
    Jobcentre Plus 14774
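
If you would like the output on the local file system rather than reading it through dfs -cat, you can copy it out of HDFS:

    # Pull the reducer output onto the master's local disk
    ./bin/hadoop dfs -get /user/ubuntu/RESULTS/part-00000 results.txt
    head results.txt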

Part 6: WRAPPING UP!

1. What We Have Done!

- Set up EC2, requested machines, configured firewalls and passwordless SSH;
- Downloaded Java and Hadoop;
- Configured HDFS and MapReduce and pushed configuration around the cluster;
- Started HDFS and MapReduce;
- Compiled a MapReduce job using Maven;
- Submitted the job, ran it successfully, and viewed the output.

2. In Summary….

- Hopefully you can see how this model of computation would be useful for very large datasets that we wish to perform processing on;
- Hopefully you are also sold on EC2 as a distributed, fast, cost effective platform for using Hadoop for big-data work.
- Please get in touch with any questions, comments, or corrections!

@benjaminwootton
www.benjaminwootton.co.uk