Hadoop
Transcript

  • 1. BIG DATA
    @nuboat Peerapat Asoktummarungsri
    (c) copyright 2013 nuboat in wonderland
    Thursday, April 11, 13
  • 2. Who am I?
    Software Engineer @ FICO
    Blogger @ Thailand Java User Group
    Failed Startuper
    Cat Lover
    A little troll on Twitter
  • 3. What is Big Data?
    "Big data analytics is concerned with the analysis of large volumes of transaction/event data and the behavioral analysis of human/human and human/system interactions." (Gartner)
    "Big data represents the collection of technologies that handle large data volumes well beyond that inflection point and for which, at least in theory, the hardware resources required to manage data by volume track to an almost straight line rather than a curve." (IDC)
  • 4. Structure & Nonstructure Level Example Structure RDBMS Semi-structure XML, JSON Quasi-structure Text Document Unstructure Image, Video(c) copyright 2013 nuboat in wonderlandThursday, April 11, 13
  • 5. Unstructured data generated around the world in 1 minute. (Source: go-globe.com)
  • 6. The challenges associated with Big Data. (Source: oracle.com)
  • 7. Use the Right Tool
    RDBMS                            HADOOP
    Interactive reporting (<1s)      Affordable storage/compute
    Multistep transactions           Structured or not (agility)
    Lots of insert/update/delete     Resilient, automatic scalability
  • 8. The Hadoop ecosystem: HDFS, MapReduce, YARN, Common, HBase, ZooKeeper, Flume, and related projects such as Cassandra, Hive, Sqoop, Pig, Mahout, etc.
  • 9. HDFS: Hadoop Distributed File System
    (Diagram: a file split into blocks 1-5; each block is stored on three of the data nodes A-E, i.e. replication factor = 3.)
  • 10. HDFS: Hadoop Distributed File System
    (Diagram: the same cluster with node C gone; every block remains available from its replicas on the surviving nodes.)
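    Not from the slides: a minimal command-line sketch of inspecting and changing replication for a file; the path /input/data.txt is hypothetical.

    $ hadoop fsck /input/data.txt -files -blocks -locations   # list the file's blocks and the data nodes holding each replica
    $ hadoop fs -setrep -w 3 /input/data.txt                  # change the file's replication factor to 3 and wait until it is met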
  • 11. Master/slave mode only
    (Diagram: a client talks to the master node, which runs the NameNode for HDFS and the JobTracker for MapReduce; the slave nodes each run a DataNode.)
  • 12. Install & Config Install JDK Adding Hadoop System User Config SSH & Passwordless Disabling IPv6 Installing Hadoop, chown, .sh, conf.xml Formatting HDFS Start Hadoop Hadoop Web Console Stop Hadoop(c) copyright 2013 nuboat in wonderlandThursday, April 11, 13
  • 13. 1. Install Java
    [root@localhost ~]# ./jdk-6u39-linux-i586.bin
    [root@localhost ~]# chown -R root:root ./jdk1.6.0_39/
    [root@localhost ~]# mv jdk1.6.0_39/ /usr/lib/jvm/
    [root@localhost ~]# update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.6.0_39/bin/java" 1
    [root@localhost ~]# update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.6.0_39/bin/javac" 1
    [root@localhost ~]# update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.6.0_39/bin/javaws" 1
    [root@localhost ~]# update-alternatives --config java

    There are 3 programs which provide java.
      Selection    Command
    -----------------------------------------------
    *  1           /usr/lib/jvm/jre-1.6.0-openjdk/bin/java
       2           /usr/lib/jvm/jre-1.5.0-gcj/bin/java
    +  3           /usr/lib/jvm/jdk1.6.0_39/bin/java
    Enter to keep the current selection[+], or type selection number: 3

    [root@localhost ~]# java -version
    java version "1.6.0_39"
    Java(TM) SE Runtime Environment (build 1.6.0_39-b04)
    Java HotSpot(TM) Server VM (build 20.14-b01, mixed mode)
  • 14. 2. Create hadoop user
    [root@localhost ~]# groupadd hadoop
    [root@localhost ~]# useradd -g hadoop -d /home/hdadmin hdadmin
    [root@localhost ~]# passwd hdadmin

    3. Config SSH
    [root@localhost ~]# service sshd start
    ...
    [root@localhost ~]# chkconfig sshd on
    [root@localhost ~]# su - hdadmin
    $ ssh-keygen -t rsa -P ""
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ ssh localhost

    4. Disable IPv6
    [root@localhost ~]# vi /etc/sysctl.conf
    # disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
  • 15. 5. Set PATH
    $ vi ~/.bashrc
    # .bashrc

    # Source global definitions
    if [ -f /etc/bashrc ]; then
            . /etc/bashrc
    fi

    # Do not set HADOOP_HOME
    # User specific aliases and functions
    export HIVE_HOME=/usr/local/hive-0.9.0
    export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_39
    export PATH=$PATH:/usr/local/hadoop-0.20.205.0/bin:$JAVA_HOME/bin:$HIVE_HOME/bin
  • 16. 6. Install Hadoop
    [root@localhost ~]# tar xzf hadoop-0.20.205.0-bin.tar.gz
    [root@localhost ~]# mv hadoop-0.20.205.0 /usr/local/hadoop-0.20.205.0
    [root@localhost ~]# cd /usr/local
    [root@localhost ~]# chown -R hdadmin:hadoop hadoop-0.20.205.0
    [root@localhost ~]# mkdir -p /data/hadoop
    [root@localhost ~]# chown hdadmin:hadoop /data/hadoop
    [root@localhost ~]# chmod 750 /data/hadoop
  • 17. 7. Config Hadoop
    $ vi /usr/local/hadoop-0.20.205.0/conf/hadoop-env.sh
    # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
    export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_39
    export HADOOP_HOME_WARN_SUPPRESS="TRUE"

    $ vi /usr/local/hadoop-0.20.205.0/conf/core-site.xml
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
        <description>The name of the default file system. A URI whose scheme and authority
        determine the FileSystem implementation. The URI's scheme determines the config
        property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's
        authority is used to determine the host, port, etc. for a filesystem.</description>
      </property>
    </configuration>
  • 18. $ vi /usr/local/hadoop-0.20.205.0/conf/hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication. The actual number of replications can be
        specified when the file is created. The default is used if replication is not
        specified at create time.</description>
      </property>
    </configuration>

    $ vi /usr/local/hadoop-0.20.205.0/conf/mapred-site.xml
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
        <description>The host and port that the MapReduce job tracker runs at. If "local",
        then jobs are run in-process as a single map and reduce task.</description>
      </property>
    </configuration>
  • 19. 8. Format HDFS
    $ hadoop namenode -format
    13/04/03 13:59:54 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 0.20.205.0
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by hortonfo on Fri Oct  7 06:26:14 UTC 2011
    ************************************************************/
    13/04/03 14:00:02 INFO util.GSet: VM type       = 32-bit
    13/04/03 14:00:02 INFO util.GSet: 2% max memory = 17.77875 MB
    13/04/03 14:00:02 INFO util.GSet: capacity      = 2^22 = 4194304 entries
    13/04/03 14:00:02 INFO util.GSet: recommended=4194304, actual=4194304
    13/04/03 14:00:02 INFO namenode.FSNamesystem: fsOwner=hdadmin
    13/04/03 14:00:02 INFO namenode.FSNamesystem: supergroup=supergroup
    13/04/03 14:00:02 INFO namenode.FSNamesystem: isPermissionEnabled=true
    13/04/03 14:00:02 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
    13/04/03 14:00:02 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
    13/04/03 14:00:02 INFO namenode.NameNode: Caching file names occuring more than 10 times
    13/04/03 14:00:02 INFO common.Storage: Image file of size 113 saved in 0 seconds.
    13/04/03 14:00:02 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
    13/04/03 14:00:02 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
    ************************************************************/
  • 20. 9. Start NameNode, DataNode, JobTracker and TaskTracker
    $ cd /usr/local/hadoop/bin/
    $ start-all.sh
    starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-namenode-localhost.localdomain.out
    localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-datanode-localhost.localdomain.out
    localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-secondarynamenode-localhost.localdomain.out
    starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-jobtracker-localhost.localdomain.out
    localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-tasktracker-localhost.localdomain.out

    $ jps
    27410 SecondaryNameNode
    27643 TaskTracker
    27758 Jps
    27504 JobTracker
    27110 NameNode
    27259 DataNode
  • 21. 10. Browse HDFS via browser: http://localhost:50070/
    NameNode localhost.localdomain:54310
    Started:   Wed Apr 03 14:11:39 ICT 2013
    Version:   0.20.205.0, r1179940
    Compiled:  Fri Oct 7 06:26:14 UTC 2011 by hortonfo
    Upgrades:  There are no upgrades in progress.
    Browse the filesystem | Namenode Logs

    Cluster Summary
    6 files and directories, 1 blocks = 7 total.
    Heap Size is 58.88 MB / 888.94 MB (6%)
    Configured Capacity               : 15.29 GB
    DFS Used                          : 28.01 KB
    Non DFS Used                      : 4.94 GB
    DFS Remaining                     : 10.36 GB
    DFS Used%                         : 0 %
    DFS Remaining%                    : 67.72 %
    Live Nodes                        : 1
    Dead Nodes                        : 0
    Decommissioning Nodes             : 0
    Number of Under-Replicated Blocks : 0
  • 22. MapReduce
    Map:    (k1, v1)       -> list(k2, v2)
    Reduce: (k2, list(v2)) -> list(k3, v3)
  • 23. MapReduce (overview diagram; not captured in the transcript)
  • 24. Map (slide content not captured in the transcript; a hypothetical mapper sketch follows)
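    The mapper code shown on the original slide is missing from the transcript. Below is a minimal sketch of a WordCount mapper against the org.apache.hadoop.mapreduce API shipped with Hadoop 0.20; the package and class name (com.fico.WordcountMapper) are assumptions, not taken from the slides.

    package com.fico;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical WordCount mapper: emits (word, 1) for every token in a line of input.
    public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // key = the word, value = 1
            }
        }
    }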
  • 25. Reduce (slide content not captured in the transcript; a hypothetical reducer sketch follows)
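    Likewise, a minimal sketch of the matching WordCount reducer; the class name is an assumption.

    package com.fico;

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical WordCount reducer: sums all the counts emitted for each word.
    public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));   // (word, total count)
        }
    }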
  • 26. Job (slide content not captured in the transcript; a hypothetical job driver sketch follows)
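    A minimal sketch of the job driver, named to match the class invoked on slide 27 (com.fico.Wordcount); everything beyond that name is an assumption. Using the reducer as a combiner is optional but cheap for word counting.

    package com.fico;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver: wires the mapper and reducer together and submits the job.
    public class Wordcount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");          // 0.20-era Job constructor
            job.setJarByClass(Wordcount.class);
            job.setMapperClass(WordcountMapper.class);
            job.setCombinerClass(WordcountReducer.class);  // optional local pre-aggregation
            job.setReducerClass(WordcountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/*
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount_output_dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }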
  • 27. Execute
    $ hadoop jar ./wordcount.jar com.fico.Wordcount /input/* /output/wordcount_output_dir
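    Not from the slides: before the job can run, the input text must already be in HDFS, and the output directory must not exist yet. A minimal sketch with a hypothetical local file name:

    $ hadoop fs -mkdir /input
    $ hadoop fs -put ./books.txt /input/                                # copy a local text file into HDFS
    $ hadoop fs -cat /output/wordcount_output_dir/part-r-00000 | head   # inspect the result once the job finishes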
  • 28. Bonus material
  • 29. HIVE
    Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
  • 30. HIVE Command Ex.
    hive> CREATE TABLE pokes (foo INT, bar STRING);
    hive> SHOW TABLES;
    hive> DESCRIBE invites;
    hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
    hive> DROP TABLE pokes;
    hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
    hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
  • 31. SQOOP
    With Sqoop, you can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data. A sketch of such an import follows below.
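    Not from the slides: a minimal sketch of a Sqoop import; the JDBC URL, credentials, table, and target directory are all hypothetical.

    $ sqoop import \
        --connect jdbc:mysql://localhost/testdb \
        --username dbuser --password dbpass \
        --table customers \
        --target-dir /input/customers \
        -m 4    # split the import across 4 parallel map tasks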
  • 32. THANK YOU :)