Hadoop Cluster Configuration
EWT Portal Practice Team, 2013
Table of Contents
1. Introduction
2. Prerequisite Software
3. Cluster Configuration on Server1 and Server2
4. Hadoop
5. Flume
6. Hive
7. HBase
8. Organizations Using Hadoop
9. References
1. Introduction: Hadoop - A Software Framework for Data-Intensive Computing Applications
a. Hadoop?
Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. It includes:
– HDFS – the Hadoop Distributed File System
– HBase (pre-alpha) – online data access
– MapReduce – an offline computing engine
• Yahoo! is the biggest contributor.
• Here's what makes it especially useful:
– Scalable: it can reliably store and process petabytes of data.
– Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
– Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
– Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.
b. What does it do?
• Hadoop implements Google's MapReduce, using HDFS.
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• Hadoop's target is to run on clusters on the order of 10,000 nodes.
• Written in Java, but it also works with other languages.
• Runs on Linux, Windows, and more.
c. HDFS?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS:
– is highly fault-tolerant and is designed to be deployed on low-cost hardware.
– provides high-throughput access to application data and is suitable for applications that have large data sets.
– relaxes a few POSIX requirements to enable streaming access to file system data.
d. MapReduce?
• A programming model developed at Google.
• Sort/merge-based distributed computing.
• Initially intended for Google's internal search/indexing application, but now used extensively by many more organizations (e.g., Yahoo!, Amazon.com, IBM).
• A functional style of programming (as in LISP) that is naturally parallelizable across a large cluster of workstations or PCs.
• The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)
e. How does MapReduce work?
• The runtime partitions the input and provides it to different Map instances.
• Map: (key, value) → (key', value')
• The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key'.
• Each Reduce produces a single (or zero) file output.
• Map and Reduce are user-written functions.
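The data flow above can be sketched with ordinary shell pipes, using sort as a stand-in for the shuffle phase. The word-count example and all names below are illustrative, not part of Hadoop itself:

```shell
#!/bin/sh
# Map emits (key, value) pairs, sort plays the role of the shuffle
# (grouping equal keys together), and Reduce aggregates each key's values.
map()     { awk '{for (i = 1; i <= NF; i++) print $i "\t" 1}'; }   # word -> (word, 1)
shuffle() { sort; }                                                 # group by key
reduce()  { awk -F'\t' '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}'; }

echo "to be or not to be" | map | shuffle | reduce | sort
# prints: be 2 / not 1 / or 1 / to 2 (one pair per line)
```

In real MapReduce the same three stages run distributed: many map tasks emit pairs in parallel, the framework sorts and routes them so one reducer sees all values for a given key, and each reducer writes its own output file.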
f. Flume?
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms. The system is
centrally managed and allows for intelligent dynamic management. It uses a simple
extensible data model that allows for online analytic applications.
[Figure: Flume architecture]
g. Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
• Hive - SQL on top of Hadoop
• Rich data types (structs, lists, and maps)
• Efficient implementations of SQL filters, joins, and group-bys on top of map/reduce
• Allows users to access Hive data without using Hive
Hive Optimizations
• Efficient execution of SQL on top of map/reduce
h. HBase?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS that supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "data store" than a "database" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
i. When Should I Use HBase?
• HBase isn't suitable for every problem.
• First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
• Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign rather than a port.
• Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
• HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only.
j. What Is the Difference Between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited to the storage of large files. Its documentation states that it is not, however, a general-purpose file system, and it does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
2. Prerequisite Software
To get the Hadoop, Flume, HBase, and Hive distributions, download the recent stable tar files from one of the Apache Download Mirrors.
Note: The configuration is set up on two Linux servers (Server1 and Server2).
2.1 Download the prerequisite software from the URLs below (on the Server1 machine):
a. Hadoop : http://download.nextag.com/apache/hadoop/common/stable/
b. Flume : http://archive.apache.org/dist/flume/stable/
c. Hive : http://download.nextag.com/apache/hive/stable/
d. Hbase : http://www.eng.lsu.edu/mirrors/apache/hbase/stable/
2.2 Download Java 1.6/1.7 from:
http://www.oracle.com/technetwork/java/javase/downloads/index.html
2.3 Stable versions of the Hadoop components as of August 2013:
• Hadoop 1.1.2
• Flume 1.4.0
• HBase 0.94.9
• Hive 0.10.0
3. Cluster configuration on Server1 and Server2
Create a user and password with admin permissions (on Server1 and Server2).
3.1 Task: Add a user to the system, in the hadoop group (the group must exist first)
groupadd hadoop
useradd -G hadoop hduser
3.2 Task: Add a password for hduser
passwd hduser
3.3 Open the hosts file on each system and edit /etc/hosts, adding the IP address and hostname of each server. For example:
# Server1 system
102.54.94.97    master
# Server2 system
102.54.94.98    slaves
3.4 Create SSH authentication keys (ssh-keygen) on Server1 and Server2
3.4.1 First, log in to Server1 as user hduser and generate a pair of public keys using the following command. (Note: same steps on Server2.)
ssh-keygen -t rsa -P ""
3.4.2 Upload the generated public key from Server1 to Server2
Using SSH from Server1, upload the newly generated public key (id_rsa.pub) to Server2 under hduser's .ssh directory as a file named authorized_keys. (Note: same steps from Server2 to Server1.)
3.4.3 Log in from Server1 to Server2 without a password
ssh server1
ssh server2
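Steps 3.4.1-3.4.2 can be rehearsed locally before touching the servers; the demo_ssh directory below is an illustrative stand-in for the real ~/.ssh directories on the two machines:

```shell
#!/bin/sh
# 3.4.1: generate a passphrase-less RSA keypair without prompting
# (-q quiet, -N "" empty passphrase, -f output file path)
mkdir -p demo_ssh
ssh-keygen -q -t rsa -N "" -f demo_ssh/id_rsa

# 3.4.2: on the remote server the public key is appended to hduser's
# ~/.ssh/authorized_keys; simulated here with a local append
cat demo_ssh/id_rsa.pub >> demo_ssh/authorized_keys
chmod 600 demo_ssh/authorized_keys
```

On the real servers the append happens remotely, e.g. by copying id_rsa.pub over with scp and appending it to ~hduser/.ssh/authorized_keys on the other machine; authorized_keys must not be group- or world-writable or sshd will ignore it.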
4. Hadoop
Create a directory called hadoop under /home/hduser:
mkdir hadoop
chmod -R 777 hadoop
a. Extract hadoop-1.1.2.tar.gz into the hadoop directory
tar -xzvf hadoop-1.1.2.tar.gz
Check the extracted files under /home/hduser/hadoop, and set the ownership:
sudo chown -R hduser:hadoop hadoop
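The extract step can be rehearsed with a stub tarball before working with the real download (every file name below is a stand-in, not the actual release):

```shell
#!/bin/sh
# Build a stub tarball with the same layout as a release tarball,
# then extract it with the same flags used above
# (-x extract, -z gunzip, -f archive file; -C changes directory first).
mkdir -p stage/hadoop-1.1.2/conf
echo '<configuration/>' > stage/hadoop-1.1.2/conf/core-site.xml
tar -czf stage/hadoop-1.1.2.tar.gz -C stage hadoop-1.1.2

mkdir -p hadoop
tar -xzf stage/hadoop-1.1.2.tar.gz -C hadoop
ls hadoop/hadoop-1.1.2/conf
```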
b. Create the directory and set the required ownerships and permissions:
$ sudo mkdir -p /hduser/hadoop/tmp
$ sudo chown hduser:hadoop /hduser/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /hduser/hadoop/tmp
Add the following snippets between the <configuration> ... </configuration> tags in the
respective configuration XML file.
core-site.xml
In file hadoop/conf/core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/hduser/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system. A URI whose scheme and authority determine
the FileSystem implementation. The uri's scheme determines the config property
(fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
mapred-site.xml
In file conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are
run in-process as a single map and reduce task. </description>
</property>
hdfs-site.xml
In file conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when
the file is created. The default is used if replication is not specified in create time. </description>
</property>
Also edit conf/hadoop-env.sh, conf/masters, and conf/slaves.
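The contents of these three files are not shown above; a minimal sketch for this two-server setup might be as follows. The JAVA_HOME path is an example and must match the local Java install, and the hostnames follow the /etc/hosts entries from section 3.3:

```shell
# conf/hadoop-env.sh -- point Hadoop at the JDK (example path):
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

# conf/masters -- host that runs the SecondaryNameNode:
#   master
#
# conf/slaves -- hosts that run a DataNode/TaskTracker, one per line:
#   master
#   slaves
```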
c. Formatting the HDFS filesystem via the NameNode
hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop namenode -format
d. Starting your single-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh
e. Stopping your single-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/stop-all.sh
For clustering, open the Server2 (slave) system:
1. Log in as hduser.
2. Make the directory /home/hduser.
3. Move the hadoop directory onto the Server2 (slave) system.
4. Connect with: scp -r hduser@master:/home/hduser/hadoop /home/hduser/hadoop
5. Start your multi-node cluster.
   Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh
6. Check that the Java processes have started on both machines (master and slave):
   ps -e | grep java
7. Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
5. Flume
Apache Flume Configuration
1. Extract apache-flume-1.4.0-bin.tar.gz into the flume directory
tar -xzvf apache-flume-1.4.0-bin.tar.gz
Check the extracted files under /home/hduser/flume, and set the ownership:
sudo chown -R hduser:hadoop flume
2. Open the flume directory and run the commands below:
a. sudo cp conf/flume-conf.properties.template conf/flume.conf
b. sudo cp conf/flume-env.sh.template conf/flume-env.sh
c. Open the conf directory and check that 5 files are available.
flume.conf
3. Overwrite the file flume/conf/flume.conf with the following:
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Here exec1 is the source name.
agent1.sources.exec1.channels = ch1
agent1.sources.exec1.type = exec
agent1.sources.exec1.command = tail -F /var/log/anaconda.log
# Define an HDFS sink and connect it to the other end of the
# same channel. Here HDFS is the sink name.
agent1.sinks.HDFS.channel = ch1
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://namenode:9000/user/root/flumeout.log
agent1.sinks.HDFS.hdfs.fileType = DataStream
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
# The source name can be anything (here exec1 was chosen).
agent1.sources = exec1
# The sink name can be anything (here HDFS was chosen).
agent1.sinks = HDFS
4. Run the command:
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
5. Check that the file has been written to HDFS:
6. Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
7. hadoop fs -cat /user/root/*
6. Hive
Apache Hive Configuration
1. Extract hive-0.10.0.tar.gz into the hive directory
tar -xzvf hive-0.10.0.tar.gz
Check the extracted files under /home/hduser/hive, and set the ownership:
sudo chown -R hduser:hadoop hive
2. Overwrite the file hive/conf/hive-site.xml with the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
7. HBase
Apache HBase Configuration
1. Extract hbase-0.94.9.tar.gz into the hbase directory
tar -xzvf hbase-0.94.9.tar.gz
Check the extracted files under /home/hduser/hbase, and set the ownership:
sudo chown -R hduser:hadoop hbase
2. In the file hbase/conf/hbase-site.xml, add the following between the <configuration> ... </configuration> tags:
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/hadoop/data/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
3. Overwrite the file hbase/conf/regionservers with the hostnames of the region servers, one per line:
master
slaves
4. Open the hbase directory and run the command below:
bin/start-hbase.sh
8. Example Applications and Organizations Using Hadoop
• A9.com - Amazon: to build Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
• Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2*4-CPU boxes with 4 TB of disk each); used to support research for ad systems and web search.
• AOL: used for a variety of things, ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB of RAM and an 800 GB hard disk, giving a total of 37 TB of HDFS capacity.
• Facebook: to store copies of internal log and dimension data sources and use them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage.
9. References:
1. http://download.nextag.com/apache/hadoop/common/stable/