EWT Portal Practice Team 2013

Hadoop Cluster Configuration

Table of contents
1. Introduction
2. Prerequisite Software
3. Cluster Configuration on Server1 and Server2
4. Hadoop
5. Flume
6. Hive
7. HBase
8. Organizations using Hadoop
9. References

1. Introduction: Hadoop - A Software Framework for Data Intensive Computing Applications.
a. What is Hadoop?
Hadoop is a software platform that lets one easily write and run applications that process vast
amounts of data. It includes:
– HDFS – Hadoop Distributed File System
– HBase – online data access
– MapReduce – offline computing engine
Yahoo! is the biggest contributor to the project.
Here's what makes it especially useful:
• Scalable: It can reliably store and process petabytes of data.
• Economical: It distributes the data and processing across clusters of commonly available
computers (in the thousands).
• Efficient: By distributing the data, it can process it in parallel on the nodes where the data
is located.
• Reliable: It automatically maintains multiple copies of data and automatically redeploys
computing tasks when failures occur.

b. What does it do?
• Hadoop implements Google's MapReduce, using HDFS.
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them on compute
nodes around the cluster.
• MapReduce can then process the data where it is located.
• Hadoop's target is to run on clusters on the order of 10,000 nodes.
• Written in Java, but works with other languages as well.
• Runs on Linux, Windows and more.

c. HDFS?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run
on commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS:
– is highly fault-tolerant and is designed to be deployed on low-cost hardware.
– provides high throughput access to application data and is suitable for applications that
have large data sets.
– relaxes a few POSIX requirements to enable streaming access to file system data.
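As a quick illustration of how applications interact with HDFS, files are read and written through
the hadoop fs shell. The commands below are only a sketch: the paths are examples, and they assume
the hadoop bin directory is on the PATH and that the cluster configured later in this document is
running.

hadoop fs -mkdir /user/hduser/input                   # create a directory in HDFS
hadoop fs -put /var/log/syslog /user/hduser/input/    # copy a local file into HDFS
hadoop fs -ls /user/hduser/input                      # list the directory
hadoop fs -cat /user/hduser/input/syslog              # read the file back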

d. MapReduce?
• Programming model developed at Google.
• Sort/merge based distributed computing.
• Initially intended for Google's internal search/indexing application, but now used
extensively by many other organizations (e.g., Yahoo, Amazon.com, IBM, etc.).
• It is functional-style programming (as in LISP, for example) that is naturally parallelizable
across a large cluster of workstations or PCs.
• The underlying system takes care of partitioning the input data, scheduling the
program's execution across several machines, handling machine failures, and managing the
required inter-machine communication. (This is the key to Hadoop's success.)

e. How does MapReduce work?
• The run time partitions the input and provides it to different Map instances.
• Map: (key, value) -> (key', value')
• The run time collects the (key', value') pairs and distributes them to several Reduce
functions so that each Reduce function gets the pairs with the same key'.
• Each Reduce produces a single (or zero) file output.
• Map and Reduce are user-written functions.
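To see the Map/Reduce flow end to end, the example jar shipped with Hadoop can be used once the
cluster from section 4 is running. This is only a sketch: it assumes the Hadoop 1.1.2 tarball
layout, where the examples jar sits in the installation root, and the input/output paths are
illustrative.

cd /home/hduser/hadoop
bin/hadoop fs -mkdir /user/hduser/wc-in
bin/hadoop fs -put conf/*.xml /user/hduser/wc-in      # some sample text to count
bin/hadoop jar hadoop-examples-1.1.2.jar wordcount /user/hduser/wc-in /user/hduser/wc-out
bin/hadoop fs -cat /user/hduser/wc-out/part-*         # each line: word <tab> count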

f. Flume?
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms. The system is
centrally managed and allows for intelligent dynamic management. It uses a simple
extensible data model that allows for online analytic applications.
(Figure: Flume Architecture)

g. Hive?

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc
queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
Hive provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL. At the same time this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
• Hive – SQL on top of Hadoop.
• Rich data types (structs, lists and maps).
• Efficient implementations of SQL filters, joins and group-bys on top of MapReduce.
• Allows users to access Hive data without using Hive.
Hive optimizations: efficient execution of SQL on top of MapReduce.
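As a flavour of HiveQL, the sketch below creates a table over a tab-separated log file already
loaded into HDFS and runs a group-by that Hive compiles into MapReduce jobs. The table, column
names and file path are only examples, and it assumes the Hive setup described in section 6.

bin/hive -e "CREATE TABLE page_views (ts STRING, url STRING, user_id STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"
bin/hive -e "LOAD DATA INPATH '/user/hduser/page_views.tsv' INTO TABLE page_views;"
bin/hive -e "SELECT url, COUNT(*) FROM page_views GROUP BY url;"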

h. HBase?

HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the
database isn't an RDBMS which supports SQL as its primary access language, but there
are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL
database, whereas HBase is very much a distributed database. Technically speaking,
HBase is really more a "Data Store" than "Data Base" because it lacks many of the
features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and
advanced query languages, etc.

i. When Should I Use HBase?
• HBase isn't suitable for every problem.
• First, make sure you have enough data. If you have hundreds of millions or billions of
rows, then HBase is a good candidate. If you only have a few thousand/million rows, then
using a traditional RDBMS might be a better choice, because all of your data
might wind up on a single node (or two) while the rest of the cluster sits idle.
• Second, make sure you can live without all the extra features that an RDBMS provides
(e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An
application built against an RDBMS cannot be "ported" to HBase by simply changing a
JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete
redesign as opposed to a port.
• Third, make sure you have enough hardware. Even HDFS doesn't do well with anything
less than 5 DataNodes (due to things such as HDFS block replication, which has a default
of 3), plus a NameNode.
• HBase can run quite well stand-alone on a laptop - but this should be considered a
development configuration only.

j. What Is The Difference Between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited for the storage of large files. Its
documentation states that it is not, however, a general purpose file system, and does not
provide fast individual record lookups in files. HBase, on the other hand, is built on top of
HDFS and provides fast record lookups (and updates) for large tables. This can
sometimes be a point of conceptual confusion. HBase internally puts your data in indexed
"StoreFiles" that exist on HDFS for high-speed lookups.

2. Prerequisite Software
To get the Hadoop, Flume, HBase and Hive distributions, download recent stable tar files from one of the
Apache Download Mirrors.
Note: The configuration below is set up on two Linux servers (Server1 and Server2).

2.1 Download the prerequisite software from the URLs below (Server1 machine)
a. Hadoop : http://download.nextag.com/apache/hadoop/common/stable/
b. Flume : http://archive.apache.org/dist/flume/stable/
c. Hive : http://download.nextag.com/apache/hive/stable/
d. Hbase : http://www.eng.lsu.edu/mirrors/apache/hbase/stable/
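For example, the archives can be fetched directly on Server1 with wget. The exact archive names
vary by mirror and release, so adjust them to whatever the stable directories above currently
contain:

cd /home/hduser
wget http://download.nextag.com/apache/hadoop/common/stable/hadoop-1.1.2.tar.gz
wget http://archive.apache.org/dist/flume/stable/apache-flume-1.4.0-bin.tar.gz
wget http://download.nextag.com/apache/hive/stable/hive-0.10.0.tar.gz
wget http://www.eng.lsu.edu/mirrors/apache/hbase/stable/hbase-0.94.9.tar.gz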

2.2 Download Java 1.6/1.7:
http://www.oracle.com/technetwork/java/javase/downloads/index.html

2.3 Stable versions of the Hadoop components as of August 2013

• Hadoop 1.1.2
• Flume 1.4.0
• HBase 0.94.9
• Hive 0.10.0

3. Cluster configuration on Server1 and Server2

Create a user and password with the required admin permissions (on both Server1 and Server2).

3.1 Task: Add a group and a user belonging to it
groupadd hadoop
useradd -G hadoop hduser

3.2 Task: Add a password to hduser
passwd hduser

3.3 Open the hosts file on both systems and edit /etc/hosts so that Server1 and Server2 can
resolve each other by name. For example:

# Server1
102.54.94.97    master
# Server2
102.54.94.98    slaves

3.4 Create authentication SSH keys on Server1 and Server2
3.4.1 First log into Server1 as user hduser and generate a public/private key pair
using the following command. (Note: repeat the same step on Server2.)
ssh-keygen -t rsa -P ""

3.4.2 Upload the generated public key – Server1 to Server2

Use SSH from Server1 and upload the newly generated public key (id_rsa.pub) to
Server2 under hduser's .ssh directory as a file named authorized_keys. (Note:
repeat the same step from Server2 to Server1.)
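A minimal way to do this from Server1, assuming the hostnames from the /etc/hosts entries above
(ssh-copy-id, where available, does the same thing in one step):

cat ~/.ssh/id_rsa.pub | ssh hduser@slaves "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
# or simply:
ssh-copy-id hduser@slaves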

3.4.3 Log in from Server1 to Server2 without a password

ssh server1
ssh server2

4. Hadoop

Create a directory called hadoop under /home/hduser:
mkdir hadoop
chmod -R 777 hadoop

a. Extract hadoop-1.1.2.tar.gz into the hadoop directory
tar -xzvf hadoop-1.1.2.tar.gz
Check that the files were extracted into /home/hduser/hadoop, then set the ownership:
sudo chown -R hduser:hadoop hadoop

b. Create the Hadoop temporary directory and set the required ownership and permissions:
$ sudo mkdir -p /hduser/hadoop/tmp
$ sudo chown hduser:hadoop /hduser/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /hduser/hadoop/tmp

Add the following snippets between the <configuration> ... </configuration> tags in the
respective configuration XML files.

core-site.xml

In file hadoop/conf/core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/hduser/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system. A URI whose scheme and authority determine
the FileSystem implementation. The uri's scheme determines the config property
(fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>


mapred-site.xml

In file conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are
run in-process as a single map and reduce task. </description>
</property>

hdfs-site.xml

In file conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when
the file is created. The default is used if replication is not specified in create time. </description>
</property>

Also edit conf/hadoop-env.sh, conf/masters and conf/slaves:

hadoop-env.sh
Set JAVA_HOME to the directory of the Java 1.6/1.7 installation, for example (adjust the path to
your own Java install):
export JAVA_HOME=/usr/lib/jvm/java-6-sun

masters
List the host that runs the secondary NameNode, one per line (here: master).

slaves
List the hosts that run the DataNode/TaskTracker daemons, one per line (here: master and slaves).

c. Formatting the HDFS filesystem via the NameNode
hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop namenode -format

d. Starting your single-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh
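After start-all.sh, the Java daemons can be verified with jps (part of the JDK); on a single-node
setup the output should list process names like the following (PIDs will differ):

jps
# NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker, Jps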

e. Stopping your single-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/stop-all.sh

For clustering, open the Server2 (slave) system:

1. Log in as hduser.
2. Make the directory /home/hduser.
3. Move the hadoop directory onto the Server2 (slave) system.
4. Copy it over with: scp -r hduser@master:/home/hduser/hadoop /home/hduser/hadoop
5. Start your multi-node cluster.
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh
6. Check that the processes have started on both machines (master and slave), as shown below.
7. Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
8. ps -e | grep java
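One way to confirm that both nodes joined the cluster (a sketch; run these on the master as hduser,
and note that jps must be on hduser's PATH on the slave):

/home/hduser/hadoop/bin/hadoop dfsadmin -report   # "Datanodes available" should show 2 live nodes
ssh slaves jps                                    # the slave should list DataNode and TaskTracker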

5. Flume
Apache Flume Configuration

1. Extract apache-flume-1.4.0-bin.tar.gz into the flume directory
tar -xzvf apache-flume-1.4.0-bin.tar.gz
Check that the files were extracted into /home/hduser/flume, then set the ownership:
sudo chown -R hduser:hadoop flume
2. Open the flume directory and run the commands below
a. sudo cp conf/flume-conf.properties.template conf/flume.conf
b. sudo cp conf/flume-env.sh.template conf/flume-env.sh
c. Open the conf directory and check 5 files are available

flume.conf

3. In file flume/conf/flume.conf overwrite the flume.conf file
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Here exec1 is source name.
agent1.sources.exec1.channels = ch1
agent1.sources.exec1.type = exec
agent1.sources.exec1.command = tail -F /var/log/anaconda.log
# any local text or log file path can be used here as the source

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
# Here HDFS is sink name.
agent1.sinks.HDFS.channel = ch1
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://master:9000/user/root/flumeout.log
agent1.sinks.HDFS.hdfs.fileType = DataStream

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
#source name can be of anything.(here i have chosen exec1)
agent1.sources = exec1
#sinkname can be of anything.(here i have chosen HDFS)
agent1.sinks = HDFS

4. Run the command
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

5. Check that the file is written into HDFS.
6. Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
7. hadoop fs -cat /user/root/*

6. Hive
Apache Hive Configuration

1. Extract hive-0.10.0.tar.gz into the hive directory
tar -xzvf hive-0.10.0.tar.gz
Check that the files were extracted into /home/hduser/hive, then set the ownership:
sudo chown -R hduser:hadoop hive

2. Overwrite hive/conf/hive-site.xml with the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive_metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
</configuration>
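The metastore settings above point Hive at a MySQL database, which is not covered elsewhere in this
document. A minimal sketch of the extra steps, assuming MySQL is already installed on the same host
and that the MySQL JDBC driver jar has been downloaded separately, would be:

mysql -u root -p -e "CREATE DATABASE hive_metastore;"
cp mysql-connector-java-*.jar /home/hduser/hive/lib/    # driver for com.mysql.jdbc.Driver
cd /home/hduser/hive
bin/hive -e "CREATE TABLE smoke_test (id INT, name STRING); SHOW TABLES;"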

7. HBase
Apache HBase Configuration

1. Extract hbase-0.94.9.tar.gz into the hbase directory
tar -xzvf hbase-0.94.9.tar.gz
Check that the files were extracted into /home/hduser/hbase, then set the ownership:
sudo chown -R hduser:hadoop hbase

2. Overwrite hbase/conf/hbase-site.xml with the following:

hbase-site.xml

<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/hadoop/data/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

regionservers

3. Overwrite hbase/conf/regionservers with the hostnames of the region servers, one per line:
master
slaves

4. Open the hbase directory and run the command below
hbase/bin/start-hbase.sh
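Once HBase is up, a quick smoke test can be run from the HBase shell (the table and column family
names below are only examples):

cd /home/hduser/hbase
bin/hbase shell
# then, inside the HBase shell:
create 'testtable', 'cf'
put 'testtable', 'row1', 'cf:greeting', 'hello'
scan 'testtable'
exit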

8. Example Applications and Organizations using Hadoop
• A9.com – Amazon: To build Amazon's product search indices; process millions of
sessions daily for analytics, using both the Java and streaming APIs; clusters vary
from 1 to 100 nodes.
• Yahoo!: More than 100,000 CPUs in ~20,000 computers running Hadoop;
biggest cluster: 2000 nodes (2*4 CPU boxes with 4 TB disk each); used to support
research for Ad Systems and Web Search.
• AOL: Used for a variety of things ranging from statistics generation to running
advanced algorithms for doing behavioral analysis and targeting; cluster size is
50 machines, Intel Xeon, dual processors, dual core, each with 16 GB RAM and
an 800 GB hard disk, giving a total of 37 TB HDFS capacity.
• Facebook: To store copies of internal log and dimension data sources and use them
as a source for reporting/analytics and machine learning; 320 machine cluster
with 2,560 cores and about 1.3 PB raw storage.

9. References:

1. http://download.nextag.com/apache/hadoop/common/stable/
2. http://archive.apache.org/dist/flume/stable/
3. http://www.eng.lsu.edu/mirrors/apache/hbase/stable/
4. http://download.nextag.com/apache/hive/stable/
5. http://www.oracle.com/technetwork/java/javase/downloads/index.html
6. http://myadventuresincoding.wordpress.com/2011/12/22/linux-how-to-ssh-between-two-linux-computers-without-needing-a-password/
7. http://hadoop.apache.org/docs/stable/single_node_setup.html
8. http://hadoop.apache.org/docs/stable/cluster_setup.html

