Apache Hadoop cluster
on Macintosh OSX
The Trigger #DIY
The Kitchen Setup
The Network
Master Chef a.k.a Namenode
Helpers a.k.a Datanode(s)
The Base Ingredients
• OSX 10.7.5
• Hadoop 2.4.0
• Hive 0.13.0
• Java 1.7.0.55
• MySQL 5.6.17
• Homebrew 0.9.5
• Network: 200 MB/s
Basics
• Ensure that all the namenode and datanode machines are running
on the same OSX version
• For the purpose of this POC, I have selected OSX 10.7.5. All sample
commands are specific to this OS; you may need to tweak them to
suit your OS version
• I am a Homebrew fan, so I have used the good old Ruby-based
platform for downloading all the software needed to run the POC. You
may very well opt to download the installers individually and
tweak the process if you wish
• You will need a fair bit of understanding of OSX and Hadoop to
follow along. If not, no worries – most of this can
be looked up online with a simple Google search
• The “Namenode” machine needs more RAM than the “Datanode”
machines. Please configure the namenode machine with at least 8
GB of RAM
The Cooking
• Ensure that ALL datanode and namenode machines are running the
same OSX version and preferably have a regulated software-update strategy
(i.e. automatic software updates disabled)
• Disable the automatic “sleep” options on the machines (from System
Preferences) so that no machine goes into hibernation
• Download and install the “Xcode command line tools for Lion” (skip if Xcode
is already present)
• As of today, Hadoop is not IPv6-friendly, so please disable IPv6 on all
machines:
 “networksetup -listallnetworkservices” will display all the network service
names that your machine uses to connect to your network (e.g. Ethernet, Wi-Fi etc.)
 “networksetup -setv6off Ethernet” will disable IPv6 over Ethernet (you may
need to change the network name if yours is different)
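 To switch IPv6 off for every active service in one pass, a minimal loop (a sketch, not from the original deck; tail skips the explanatory header line that -listallnetworkservices prints, and grep drops already-disabled services, which are marked with a leading asterisk):
networksetup -listallnetworkservices | tail -n +2 | grep -v "^\*" | while read svc; do
sudo networksetup -setv6off "$svc"
done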
The Cooking..
• Give logical names to ALL machines, e.g. namenode.local, datanode01.local,
datanode02.local et al. (from System Preferences -> Sharing -> Computer
Name)
• Enable the following services from the Sharing panel of System
Preferences
– File Sharing
– Remote Login
– Remote Management
• Create one universal username (with Administrator privileges) on all
machines, e.g. hadoopuser. Preferably use the same password everywhere
• For the rest of the steps, please log in as this user and execute the commands
The Cooking
• On the namenode, run the command:
vi /etc/hosts
• Add all datanode hostnames, one host per line (sample entries follow after this slide)
• On each of the datanodes, run the command:
vi /etc/hosts
• Add the namenode hostname
• On every machine, run:
sudo visudo
• Add an entry on the last line of the file as under:
hadoopuser ALL=(ALL) NOPASSWD: ALL
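For reference, each /etc/hosts entry maps an IP address to a hostname, one per line. The addresses below are made up, so substitute your own:
192.168.1.10 namenode.local
192.168.1.11 datanode01.local
192.168.1.12 datanode02.local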
Coffee Time
• Install the Java JDK and JRE on all the machines from the Oracle site
(http://bit.ly/1s2i7VC). Configure $JAVA_HOME (instructions follow
below)
• Set $JAVA_HOME on ALL machines. Usually, it is best to configure it
in your .profile file. Run the following command to open your .profile:
• vi ~/.profile
• # Paste the following lines in the file and save it:
export JAVA_HOME="`/System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java_home`"
• You may additionally paste the following lines in the same file:
export PATH=$PATH:/usr/local/sbin
PS1="\H : \d \t: \w :"
These are helpful for housekeeping activities
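 To verify the setup took effect, reload the profile and check (a quick sanity check, not in the original deck):
source ~/.profile
echo $JAVA_HOME
java -version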
The Brewing
• Install “brew” and other components via it
 Run on terminal:
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
[the quotes need to be there]
 Run the following command on terminal to ensure that it has been installed properly:
brew doctor
 Run the following commands, in this order, on terminal:
brew install makedepend
brew install wget
brew install ssh-copy-id
brew install hadoop
 Run the following commands on the “namenode” machine:
brew install hive
brew install mysql
[the assumption is that the namenode will host the resourcemanager, jobtracker, hive metastore, and hiveserver.
brew installs the software in the “/usr/local/Cellar” location]
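 To confirm what landed and where, you may run (brew list --versions is a standard Homebrew command):
brew list --versions hadoop hive mysql
ls /usr/local/Cellar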
 Run the following command to set up keyless login from the namenode to ALL
datanodes. Run the command on the namenode:
ssh-keygen
[press the Enter key twice to accept the default RSA key and no passphrase]
 Run the following command once for EACH datanode hostname. Run the command
on the namenode:
ssh-copy-id hadoopuser@datanode01.local
Provide the password when prompted. The command is verbose and tells you whether the key was
installed properly. You may validate it by executing the command:
ssh hadoopuser@datanode01.local. It should NOT ask you to supply a password anymore.
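With more than a couple of datanodes, a small loop saves retyping (hostnames here are assumed to follow the datanodeNN.local pattern from the earlier slides):
for host in datanode01.local datanode02.local; do
ssh-copy-id hadoopuser@$host
done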
After the requisite software has been installed, the next step is to configure the different
components in a stepwise manner. Hadoop works in a distributed mode, with the “namenode”
being the central hub of the cluster. That is reason enough to create the common
configuration files on the namenode first and then copy them into all the datanodes in an
automated manner. Let’s start with the .profile changes on the namenode machine first.
The Sauté
 We are going to configure Hive to use MySQL as the metastore for this POC. All we need
to do is create a DB user “hiveuser” with a valid password in the MySQL instance installed and
running on the namenode, AND copy the MySQL JDBC driver jar into the Hive lib directory
(sample commands follow after the file lists below)
 On the namenode, please fire this command to go to your HADOOP_CONF_DIR
location:
cd /usr/local/Cellar/hadoop/2.4.0/libexec/etc/hadoop
Here, we need to create/modify the following set of files:
slaves
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
log4j.properties
 On the namenode, please fire this command to go to your HIVE_CONF_DIR location:
cd /usr/local/Cellar/hive/0.13.0/libexec/conf
Here, we need to create/modify the following set of files:
hive-site.xml
hive-log4j.properties
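Two hedged sketches for the metastore steps above. First, creating the Hive metastore user in MySQL (the database name “metastore” and the password are placeholders, not from the deck):
mysql -u root -e "CREATE DATABASE metastore; CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword'; GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost'; FLUSH PRIVILEGES;"
Second, the metastore-related properties that typically go into hive-site.xml (these are standard Hive property names; the JDBC URL assumes the database created above):
<property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:mysql://namenode.local:3306/metastore</value></property>
<property><name>javax.jdo.option.ConnectionDriverName</name><value>com.mysql.jdbc.Driver</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name><value>hiveuser</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name><value>hivepassword</value></property>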
The Slow Cooking
 Please find attached a simple script that, if installed on the namenode, can help you
copy your config files to ALL datanodes (I call it the config-push)
 Please find attached another simple script that I use for rebooting all the datanodes.
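A minimal sketch of what such scripts might look like, assuming the passwordless SSH login, hostnames, and install paths from the earlier slides (the actual attached scripts may differ):
#!/bin/bash
# config-push: copy Hadoop config files from the namenode to every datanode
CONF_DIR=/usr/local/Cellar/hadoop/2.4.0/libexec/etc/hadoop
for host in datanode01.local datanode02.local; do
scp "$CONF_DIR"/*.xml "$CONF_DIR"/slaves hadoopuser@"$host":"$CONF_DIR"/
done
And for rebooting all the datanodes:
for host in datanode01.local datanode02.local; do
ssh hadoopuser@"$host" "sudo shutdown -r now"
done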
The Plating
 You may wish to take the following next steps if desired:
 Install zookeeper
 Configure and run journalnodes
 Go for a High Availability cluster implementation with multiple Namenodes
 Leave feedback if you wish to see Hadoop configuration samples
The Garnishing
Disclaimer: Don’t sue me for any damage/infringement, I am not rich :)