Hadoop on osx


Set of instructions to help you install and run an Apache Hadoop single node cluster on OSX

Published in: Technology

  1. 1. Apache Hadoop cluster on Macintosh OSX
  2. 2. The Trigger #DIY
  3. 3. The Kitchen Setup: The Network, Master Chef a.k.a. Namenode, Helpers a.k.a. Datanode(s)
  4. 4. The Base Ingredients: Hive 0.13.0, OSX 10.7.5, Homebrew 0.9.5, network 200 MB/s, Hadoop 2.4.0, MySQL 5.6.17
  5. 5. Basics
     • Ensure that all the namenode and datanode machines run the same OSX version.
     • For this POC I selected OSX 10.7.5. All sample commands are specific to this version; you may need to tweak them for your OS version.
     • I am a Homebrew fan, so I used that old-and-gold Ruby-based platform to download all the software needed to run the POC. You may instead download the installers individually and adjust the process.
     • You will need a fair understanding of OSX and Hadoop to follow along. If not, no worries: most of it can be looked up with a simple Google search.
     • The namenode machine needs more RAM than the datanode machines. Configure the namenode machine with at least 8 GB of RAM.
  6. 6. The Cooking
     • Ensure that ALL datanode and namenode machines run the same OSX version, preferably with a regulated software update strategy (i.e. automatic software updates disabled).
     • Disable the automatic sleep options (from System Preferences) so the machines do not go into hibernation.
     • Download and install "Xcode command line tools for Lion" (skip if Xcode is present).
     • As of today, Hadoop is not IPv6 friendly, so disable IPv6 on all machines:
       – "networksetup -listallnetworkservices" lists the names of all network services your machine uses to connect to your network (e.g. Ethernet, Wi-Fi).
       – "networksetup -setv6off Ethernet" disables IPv6 over Ethernet (change the service name if yours is different).
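The two networksetup commands on this slide can be combined into a loop so that no service is missed. The following is only a sketch: the function name disable_ipv6_everywhere is made up for illustration, and it assumes the first line of the service list is the explanatory note about asterisks that networksetup prints.

```shell
# Sketch: turn IPv6 off for every network service that networksetup reports.
# disable_ipv6_everywhere is a hypothetical helper name, not part of OSX.
disable_ipv6_everywhere() {
  # The first output line is a note about asterisks, not a service name.
  networksetup -listallnetworkservices | tail -n +2 | while IFS= read -r service; do
    # A leading asterisk marks a disabled service; strip it to get the name.
    service="${service#\*}"
    [ -n "$service" ] || continue
    echo "Disabling IPv6 on: $service"
    # Keep sudo away from the loop's stdin so it cannot swallow service names.
    sudo networksetup -setv6off "$service" </dev/null
  done
}
```

Run disable_ipv6_everywhere once on each machine, then re-check the result in System Preferences -> Network if in doubt.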
  7. 7. The Cooking..
     • Give logical names to ALL machines, e.g. namenode.local, datanode01.local, datanode02.local and so on (from System Preferences -> Sharing -> Computer Name).
     • Enable the following services from the Sharing panel of System Preferences:
       – File Sharing
       – Remote Login
       – Remote Management
     • Create one universal username (with Administrator privileges) on all machines, e.g. hadoopuser, preferably with the same password.
     • For the rest of the steps, log in as this user and execute the commands.
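If you prefer the terminal to System Preferences, the naming and Remote Login steps might be scripted roughly as below. This is a sketch under assumptions: name_machine is a hypothetical helper, it takes the short name without the .local suffix (Bonjour adds that), and it covers only the computer name and Remote Login, not File Sharing or Remote Management.

```shell
# Sketch: set the machine name and enable Remote Login (SSH) from the terminal.
# name_machine is a hypothetical helper; pass the short name, e.g. "namenode".
name_machine() {
  local short_name="$1"
  case "$short_name" in
    ""|*[!A-Za-z0-9-]*)
      echo "usage: name_machine <letters, digits and hyphens only>" >&2
      return 1
      ;;
  esac
  # ComputerName is the friendly name shown in the Sharing panel.
  sudo scutil --set ComputerName "$short_name"
  # LocalHostName is the Bonjour name, reachable as <short_name>.local.
  sudo scutil --set LocalHostName "$short_name"
  # Equivalent to ticking Remote Login in the Sharing panel.
  sudo systemsetup -setremotelogin on
}
```

For example, name_machine datanode01 on the first datanode should make it reachable as datanode01.local.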
  8. 8. The Cooking
     • On the namenode, run the command: sudo vi /etc/hosts
     • Add all datanode hostnames, one host per line (IP address followed by the hostname).
     • On each of the datanodes, run the command: sudo vi /etc/hosts
     • Add the namenode hostname.
     • Run the command: sudo visudo
     • Add the following entry as the last line of the file:
         hadoopuser ALL=(ALL) NOPASSWD: ALL
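Editing /etc/hosts by hand on every machine is easy to get wrong, so a small helper that appends an entry only when it is missing may be useful. This is a sketch: add_host_entry is a made-up name and the 192.168.1.x addresses in the usage note are placeholders for your own network.

```shell
# Sketch: append "IP<TAB>hostname" to a hosts file unless the name is already there.
# add_host_entry is a hypothetical helper; the third argument defaults to /etc/hosts.
add_host_entry() {
  local ip="$1" name="$2" hosts_file="${3:-/etc/hosts}"
  # Look for the hostname as an exact field on any non-comment line.
  if awk -v n="$name" '
       !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
       END { exit !found }' "$hosts_file" 2>/dev/null; then
    echo "already present: $name"
    return 0
  fi
  if [ -w "$hosts_file" ]; then
    printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
  else
    # /etc/hosts is owned by root, so fall back to sudo for the append.
    printf '%s\t%s\n' "$ip" "$name" | sudo tee -a "$hosts_file" >/dev/null
  fi
}
```

On the namenode you might run add_host_entry 192.168.1.11 datanode01.local and add_host_entry 192.168.1.12 datanode02.local; on each datanode, add_host_entry 192.168.1.10 namenode.local.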
  9. 9. Coffee Time
     • Install the Java JDK and JRE on all the machines from the Oracle site.
     • Set $JAVA_HOME on ALL machines. It is usually best to configure it in your .profile file. Run the following command to open your .profile:
         vi ~/.profile
     • Paste the following line into the file and save it:
         export JAVA_HOME="`/System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java_home`"
     • You may additionally paste the following lines into the same file; they are helpful for housekeeping activities:
         export PATH=$PATH:/usr/local/sbin
         PS1="\H : \d \t: \w :"
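After saving .profile and opening a new terminal, it is worth confirming that JAVA_HOME actually points at a Java installation before moving on. A minimal check could look like this; check_java_home is a hypothetical helper name, and it assumes the usual layout with the java binary under bin/.

```shell
# Sketch: verify that JAVA_HOME is set and contains an executable bin/java.
# check_java_home is a hypothetical helper name.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no java executable under $JAVA_HOME" >&2
    return 1
  fi
  echo "JAVA_HOME looks usable: $JAVA_HOME"
}
```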
  10. 10. The Brewing
     • Install "brew" and other components from it:
       – Run on terminal: ruby -e "$(curl -fsSL" [the quotes need to be there]
       – Run the following command on terminal to ensure that it has been installed properly:
           brew doctor
       – Run the following commands in the same order on terminal:
           brew install makedepend
           brew install wget
           brew install ssh-copy-id
           brew install hadoop
       – Run the following commands on the namenode machine:
           brew install hive
           brew install mysql
     [The assumption is that the namenode will host the resourcemanager, jobtracker, hive metastore and hiveserver. brew installs the software under /usr/local/Cellar.]
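Since the same installs are repeated on every machine, the brew commands above could be wrapped in a loop that keeps the order and skips anything already installed. This is a sketch only; install_formulae is a made-up helper name.

```shell
# Sketch: install Homebrew formulae in the given order, skipping installed ones.
# install_formulae is a hypothetical helper name.
install_formulae() {
  local formula
  for formula in "$@"; do
    # "brew list <formula>" exits non-zero when the formula is not installed.
    if brew list "$formula" >/dev/null 2>&1; then
      echo "already installed: $formula"
    else
      brew install "$formula" || return 1
    fi
  done
}
```

On all machines: install_formulae makedepend wget ssh-copy-id hadoop. On the namenode only: install_formulae hive mysql.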
  11. 11. The Saute
     • Set up keyless login from the namenode to ALL datanodes. Run the command on the namenode:
         ssh-keygen
       [press Enter at each prompt to accept the default RSA key and an empty passphrase]
     • Repeat the following command for each datanode hostname. Run the command on the namenode:
         ssh-copy-id hadoopuser@datanode01.local
       Provide the password when prompted. The command is verbose and tells you whether the key was installed properly. You may validate this by executing the command: ssh hadoopuser@datanode01.local. It should NOT ask you for a password anymore.
     After the requisite software has been installed, the next step is to configure the different components in a stepwise manner. Hadoop works in a distributed mode with the namenode as the central hub of the cluster. That is reason enough to create the common configuration files on the namenode first, and then copy them to all the datanodes in an automated manner. Let's start with the .profile changes on the namenode machine first.
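With more than a couple of datanodes, the ssh-copy-id step can be looped. The sketch below also re-tests each host with BatchMode, which makes ssh fail instead of prompting when the key was not accepted. push_key_to_datanodes is a hypothetical helper name, and it assumes ssh-keygen has already been run on the namenode.

```shell
# Sketch: copy the namenode's public key to each datanode, then verify the login.
# push_key_to_datanodes is a hypothetical helper: first argument is the user,
# the remaining arguments are datanode hostnames.
push_key_to_datanodes() {
  local user="$1" host
  shift
  for host in "$@"; do
    # Prompts once for the user's password on that host.
    ssh-copy-id "$user@$host" || return 1
    # BatchMode disables password prompts, so this only succeeds with the key.
    if ssh -o BatchMode=yes "$user@$host" true; then
      echo "keyless login OK: $host"
    else
      echo "keyless login FAILED: $host" >&2
      return 1
    fi
  done
}
```

Example: push_key_to_datanodes hadoopuser datanode01.local datanode02.local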
  12. 12. The Slow cooking
     • We are going to configure Hive to use MySQL as the metastore for this POC. All we need is to create a DB user "hiveuser" with a valid password in the MySQL DB installed and running on the namenode, AND copy the MySQL driver jar into the Hive lib directory.
     • On the namenode, run the following command to go to your HADOOP_CONF_DIR location:
         cd /usr/local/Cellar/hadoop/2.4.0/libexec/etc/hadoop
       Here we need to create/modify the following set of files:
       – slaves
       – core-site.xml
       – hdfs-site.xml
       – mapred-site.xml
       – yarn-site.xml
     • On the namenode, run the following command to go to your HIVE_CONF_DIR location:
         cd /usr/local/Cellar/hive/0.13.0/libexec/conf
       Here we need to create/modify the following file:
       – hive-site.xml
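The slides do not include the contents of these files, so the following is not the author's configuration, only a minimal sketch of what the two simplest files might contain. It writes a slaves file (one datanode hostname per line) and a core-site.xml that sets fs.defaultFS. The helper name write_cluster_config and port 9000 are assumptions; hdfs-site.xml, mapred-site.xml, yarn-site.xml and hive-site.xml are left out.

```shell
# Sketch: generate a slaves file and a minimal core-site.xml in a config directory.
# write_cluster_config is a hypothetical helper; port 9000 is an assumed choice.
# Arguments: <conf_dir> <namenode_host> <datanode_host>...
write_cluster_config() {
  local conf_dir="$1" namenode="$2"
  shift 2
  # slaves: one datanode hostname per line.
  printf '%s\n' "$@" > "$conf_dir/slaves"
  # core-site.xml: tell every node where the namenode's filesystem lives.
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$namenode:9000</value>
  </property>
</configuration>
EOF
}
```

Example, run from any directory on the namenode: write_cluster_config /usr/local/Cellar/hadoop/2.4.0/libexec/etc/hadoop namenode.local datanode01.local datanode02.local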
  13. 13. The Plating
     • Please find attached a simple script that, if installed on the namenode, can help you copy your config files to ALL datanodes (I call it the config-push).
     • Please find attached another simple script that I use for rebooting all the datanodes.
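The attached scripts are not part of this transcript. As a rough idea of what such helpers could look like (these are not the author's scripts), the sketch below reads hostnames from the slaves file, copies the XML configs and the slaves file with scp, and reboots each datanode over ssh. The names config_push and reboot_datanodes are illustrative, the same config path is assumed on every machine, and the reboot relies on the passwordless sudo entry added earlier.

```shell
# Sketch only; not the scripts attached to the original slides.
# HADOOP_USER defaults to the universal user created earlier.
HADOOP_USER="${HADOOP_USER:-hadoopuser}"

# Copy *.xml and slaves from <conf_dir> to the same path on every datanode.
config_push() {
  local conf_dir="$1" slaves_file="${2:-$1/slaves}" host
  # Read hosts on fd 3 so scp cannot consume the list from stdin.
  while IFS= read -r host <&3; do
    case "$host" in ""|\#*) continue ;; esac   # skip blank and comment lines
    scp -q "$conf_dir"/*.xml "$conf_dir/slaves" "$HADOOP_USER@$host:$conf_dir/" \
      || echo "push failed: $host" >&2
  done 3< "$slaves_file"
}

# Reboot every datanode listed in <slaves_file>.
reboot_datanodes() {
  local slaves_file="$1" host
  while IFS= read -r host <&3; do
    case "$host" in ""|\#*) continue ;; esac
    ssh "$HADOOP_USER@$host" sudo shutdown -r now \
      || echo "reboot failed: $host" >&2
  done 3< "$slaves_file"
}
```

Example: config_push /usr/local/Cellar/hadoop/2.4.0/libexec/etc/hadoop, followed by reboot_datanodes /usr/local/Cellar/hadoop/2.4.0/libexec/etc/hadoop/slaves when a restart is needed.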
  14. 14. The Garnishing
     • You may wish to take these next steps if desired:
       – Install ZooKeeper
       – Configure and run journalnodes
       – Go for a High Availability cluster implementation with multiple namenodes
     • Leave feedback if you would like to see the Hadoop configuration samples.
  15. 15. Disclaimer: Don’t sue me for any damage/infringement, I am not rich :)