
EWT Portal Practice Team 2013

Hadoop Cluster Configuration

Table of contents
1. Introduction
2. Prerequisite Software
3. Cluster Configuration on Server1 and Server2
4. Hadoop
5. Flume
6. Hive
7. HBase
8. Organizations Using Hadoop
9. References

1. Introduction: Hadoop, a software framework for data-intensive computing applications.

a. Hadoop?
A software platform that lets one easily write and run applications that process vast amounts of data. It includes:
• HDFS: the Hadoop Distributed File System
• HBase (pre-alpha): online data access
• MapReduce: offline computing engine
Yahoo! is the biggest contributor. Here is what makes Hadoop especially useful:
• Scalable: it can reliably store and process petabytes.
• Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
• Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
• Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.

b. What does it do?
• Hadoop implements Google's MapReduce, using HDFS.
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• Hadoop's target is to run on clusters on the order of 10,000 nodes.
• Written in Java, but it works with other languages; runs on Linux, Windows, and more.

c. HDFS?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS:
• is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• provides high-throughput access to application data and is suitable for applications that have large data sets.
• relaxes a few POSIX requirements to enable streaming access to file system data.

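Applications reach HDFS through the Hadoop FileSystem Java API. The following is a minimal sketch of writing and then reading back a small file, assuming the NameNode URI hdfs://master:9000 configured later in this guide; the file path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode URI; matches the fs.default.name set in core-site.xml below.
            conf.set("fs.default.name", "hdfs://master:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS replicates its blocks across DataNodes automatically.
            Path file = new Path("/user/hduser/hello.txt");   // hypothetical path
            FSDataOutputStream out = fs.create(file, true);
            out.writeUTF("hello from HDFS");
            out.close();

            // Read it back as a stream; HDFS is built for high-throughput streaming reads.
            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();
            fs.close();
        }
    }

Compiled against the hadoop-core jar from the extracted distribution and run with bin/hadoop, the program would pick up the cluster configuration from hadoop/conf.
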
d. MapReduce?
• A programming model developed at Google.
• Sort/merge-based distributed computing.
• Initially intended for Google's internal search/indexing application, but now used extensively by other organizations (e.g., Yahoo!, Amazon.com, IBM).
• A functional programming style (as in LISP) that is naturally parallelizable across a large cluster of workstations or PCs.
• The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)

e. How does MapReduce work?
• The runtime partitions the input and provides it to different Map instances.
• Map(key, value) → (key', value')
• The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key'.
• Each Reduce produces a single (or zero) file output.
• Map and Reduce are user-written functions; a word-count sketch is shown below.

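To make the Map and Reduce contract concrete, here is a minimal word-count sketch against the Hadoop 1.x Java MapReduce API; the class name and the input/output paths are illustrative and not part of the original guide.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: (offset, line) -> (word, 1)
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }

        // Reduce: (word, [1, 1, ...]) -> (word, count); the framework groups values by key.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would be launched with something like bin/hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output; the framework handles partitioning, scheduling, and failure recovery as described above.
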
f. Flume?
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple, extensible data model that allows for online analytic applications.

[Figure: Flume architecture]

g. Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
• Hive: SQL on top of Hadoop.
• Rich data types (structs, lists, and maps).
• Efficient implementations of SQL filters, joins, and group-bys on top of MapReduce.
• Allows users to access Hive data without using Hive.

[Figure: Hive optimizations, efficient execution of SQL on top of MapReduce]

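As an illustration of querying Hive from client code, here is a small sketch using the HiveServer JDBC driver shipped with Hive 0.10. It assumes a Hive server has been started separately (for example with hive --service hiveserver on its default port 10000); the web_log table and its columns are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlExample {
        public static void main(String[] args) throws Exception {
            // JDBC driver for the classic HiveServer (Hive 0.10).
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

            // Assumed connection string: HiveServer on localhost, default port 10000.
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // Assumes a table like CREATE TABLE web_log (ip STRING, url STRING, hits INT)
            // has already been created from the hive shell; HiveQL compiles to MapReduce jobs.
            ResultSet rs = stmt.executeQuery("SELECT url, SUM(hits) FROM web_log GROUP BY url");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }
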
h. HBase?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database is not an RDBMS that supports SQL as its primary access language, and there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "data store" than a "database" because it lacks many of the features found in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.

i. When should I use HBase?
• HBase is not suitable for every problem.
• First, make sure you have enough data. If you have hundreds of millions or billions of rows, HBase is a good candidate. If you only have a few thousand or a few million rows, a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
• Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example; consider moving from an RDBMS to HBase a complete redesign rather than a port.
• Third, make sure you have enough hardware. Even HDFS does not do well with anything less than five DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
• HBase can run quite well stand-alone on a laptop, but this should be considered a development configuration only.

j. What is the difference between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited to storing large files. Its documentation states that it is not, however, a general-purpose file system, and it does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that live on HDFS for high-speed lookups; a small client sketch is shown below.

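To show what these fast record writes and lookups look like in practice, here is a minimal sketch against the HBase 0.94 Java client API. It assumes the hbase-site.xml configured later in this guide is on the classpath; the pageviews table, the stats column family, and the row key are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath (ZooKeeper quorum, root dir, ...).
            Configuration conf = HBaseConfiguration.create();

            // Create a 'pageviews' table with a single column family if it does not exist yet.
            HBaseAdmin admin = new HBaseAdmin(conf);
            if (!admin.tableExists("pageviews")) {
                HTableDescriptor desc = new HTableDescriptor("pageviews");
                desc.addFamily(new HColumnDescriptor("stats"));
                admin.createTable(desc);
            }

            // Write one cell and read it back by row key -- the access pattern HBase adds on top of HDFS.
            HTable table = new HTable(conf, "pageviews");
            Put put = new Put(Bytes.toBytes("row-2013-08-01"));
            put.add(Bytes.toBytes("stats"), Bytes.toBytes("hits"), Bytes.toBytes("42"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row-2013-08-01")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("hits"))));
            table.close();
        }
    }
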
2. Prerequisite Software
To get Hadoop, Flume, HBase, and Hive distributions, download recent stable tar files from one of the Apache download mirrors.
Note: the configuration is set up on two Linux servers (Server1 and Server2).

2.1 Download the prerequisite software from the URLs below (on the Server1 machine):
a. Hadoop: http://download.nextag.com/apache/hadoop/common/stable/
b. Flume: http://archive.apache.org/dist/flume/stable/
c. Hive: http://download.nextag.com/apache/hive/stable/
d. HBase: http://www.eng.lsu.edu/mirrors/apache/hbase/stable/

2.2 Download Java 1.6/1.7: http://www.oracle.com/technetwork/java/javase/downloads/index.html

2.3 Stable versions of the Hadoop components as of August 2013:
• Hadoop 1.1.2
• Flume 1.4.0
• HBase 0.94.9
• Hive 0.10.0

3. Cluster Configuration on Server1 and Server2
Create a user and password with admin permission (on Server1 and Server2).

3.1 Task: add a user with the hadoop group to the system (create the group first with groupadd if it does not already exist):
    useradd -G hadoop hduser

3.2 Task: add a password to hduser:
    passwd hduser

3.3 Open the hosts file (/etc/hosts) on the Server1 system and edit it. For example:
    # 102.54.94.97   rhino.acme.com   master
    # 102.54.94.98   rhino.acme.com   slaves
    (Add the same entries on the Server2 system.)

3.4 Create authentication SSH keys (ssh-keygen) on Server1 and Server2
3.4.1 First log in to Server1 as user hduser and generate a key pair using the following command (note: same steps on Server2):
    ssh-keygen -t rsa -P ""
3.4.2 Upload the generated public key from Server1 to Server2.
Use SSH from Server1 and upload the newly generated public key (id_rsa.pub) to Server2 under hduser's .ssh directory as a file named authorized_keys. (Note: same steps from Server2 to Server1.)
3.4.3 Log in from Server1 to Server2 without a password:
    ssh server1
    ssh server2

4. Hadoop
Create a directory called hadoop under /home/hduser:
    mkdir hadoop
    chmod -R 777 hadoop

a. Extract hadoop-1.1.2.tar.gz into the hadoop directory:
    tar -xzvf hadoop-1.1.2.tar.gz
Check the extracted files under /home/hduser/hadoop, then set ownership:
    sudo chown -R hduser:hadoop hadoop

b. Create the temporary directory and set the required ownership and permissions:
    $ sudo mkdir -p /hduser/hadoop/tmp
    $ sudo chown hduser:hadoop /hduser/hadoop/tmp
    # ...and if you want to tighten up security, chmod from 755 to 750...
    $ sudo chmod 750 /hduser/hadoop/tmp

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML files.

core-site.xml
In file hadoop/conf/core-site.xml:
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/hduser/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:9000</value>
      <description>The name of the default file system. A URI whose scheme and authority
      determine the FileSystem implementation. The URI's scheme determines the config property
      (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used
      to determine the host, port, etc. for a filesystem.</description>
    </property>

mapred-site.xml
In file conf/mapred-site.xml:
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs at. If "local",
      then jobs are run in-process as a single map and reduce task.</description>
    </property>

hdfs-site.xml
In file conf/hdfs-site.xml:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication. The actual number of replications can be
      specified when the file is created. The default is used if replication is not
      specified at create time.</description>
    </property>

Also edit conf/hadoop-env.sh (for example, to set JAVA_HOME) and the conf/masters and conf/slaves files.

c. Format the HDFS filesystem via the NameNode:
    hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop namenode -format

d. Start your single-node cluster. Run the command:
    hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh

e. Stop your single-node cluster. Run the command:
    hduser@ubuntu:~$ /home/hduser/hadoop/bin/stop-all.sh

For clustering, open the Server2 (slave) system:
1. Log in as hduser.
2. Make the directory /home/hduser.
3. Copy the hadoop directory onto the Server2 (slave) system:
    scp -r hduser@master:/home/hduser/hadoop /home/hduser/hadoop
4. Start your multi-node cluster. Run the command:
    hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh
5. Check that the processes have started on both machines (master and slave):
    ps -e | grep java
6. Run the command:
    hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls

5. Flume
Apache Flume Configuration
1. Extract apache-flume-1.4.0-bin.tar into the flume directory:
    tar -xzvf apache-flume-1.4.0-bin.tar
Check the extracted files under /home/hduser/flume, then set ownership:
    sudo chown -R hduser:hadoop flume

2. Open the flume directory and run the commands below:
    a. sudo cp conf/flume-conf.properties.template conf/flume.conf
    b. sudo cp conf/flume-env.sh.template conf/flume-env.sh
    c. Open the conf directory and check that five files are available.

flume.conf
3. Overwrite the file flume/conf/flume.conf with:
    # Define a memory channel called ch1 on agent1.
    agent1.channels.ch1.type = memory

    # Here exec1 is the source name: an exec source tailing a log file.
    agent1.sources.exec1.channels = ch1
    agent1.sources.exec1.type = exec
    agent1.sources.exec1.command = tail -F /var/log/anaconda.log

    # Define an HDFS sink (named HDFS here) and connect it to the other end of the same channel.
    agent1.sinks.HDFS.channel = ch1
    agent1.sinks.HDFS.type = hdfs
    agent1.sinks.HDFS.hdfs.path = hdfs://master:9000/user/root/flumeout.log
    agent1.sinks.HDFS.hdfs.fileType = DataStream

    # Finally, now that we've defined all of our components, tell
    # agent1 which ones we want to activate.
    agent1.channels = ch1
    # The source name can be anything (here exec1 was chosen).
    agent1.sources = exec1
    # The sink name can be anything (here HDFS was chosen).
    agent1.sinks = HDFS

4. Run the command:
    bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
5. Check that the file has been written to HDFS:
    hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
    hadoop fs -cat /user/root/*

6. Hive
Apache Hive Configuration
1. Extract hive-0.10.0.tar into the hive directory:
    tar -xzvf hive-0.10.0.tar
Check the extracted files under /home/hduser/hive, then set ownership:
    sudo chown -R hduser:hadoop hive

2. Overwrite the file hive/conf/hive-site.xml with:
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost/</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
      </property>
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/hive/warehouse</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost/hive_metastore</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
      </property>
    </configuration>

7. HBase
Apache HBase Configuration
1. Extract hbase-0.94.9.tar into the hbase directory:
    tar -xzvf hbase-0.94.9.tar
Check the extracted files under /home/hduser/hbase, then set ownership:
    sudo chown -R hduser:hadoop hbase

hbase-site.xml
2. Overwrite the file hbase/conf/hbase-site.xml with:
    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://master:9000/hbase</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hduser/hadoop/data/zookeeper</value>
    </property>
    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>

regionservers
3. Overwrite the file hbase/conf/regionservers with:
    master
    slaves
4. Open the hbase directory and run the command below:
    bin/start-hbase.sh

8. Example Applications and Organizations Using Hadoop
• A9.com (Amazon): builds Amazon's product search indices and processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
• Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2 x 4-CPU boxes with 4 TB of disk each); used to support research for ad systems and web search.
• AOL: used for a variety of things, ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; the cluster is 50 machines (Intel Xeon, dual processor, dual core, each with 16 GB RAM and an 800 GB hard disk), giving a total of 37 TB of HDFS capacity.
• Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning; a 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage.

9. References
1. http://download.nextag.com/apache/hadoop/common/stable/
2. http://archive.apache.org/dist/flume/stable/
3. http://www.eng.lsu.edu/mirrors/apache/hbase/stable/
4. http://download.nextag.com/apache/hive/stable/
5. http://www.oracle.com/technetwork/java/javase/downloads/index.html
6. http://myadventuresincoding.wordpress.com/2011/12/22/linux-how-to-ssh-between-two-linux-computers-without-needing-a-password/
7. http://hadoop.apache.org/docs/stable/single_node_setup.html
8. http://hadoop.apache.org/docs/stable/cluster_setup.html
