Hadoop
Transcript

  • 1. BIG DATA
    @nuboat Peerapat Asoktummarungsri
    (c) copyright 2013 nuboat in wonderland
    Thursday, April 11, 13
  • 2. Who am I?
    Software Engineer @ FICO
    Blogger @ Thailand Java User Group
    Failed Startuper
    Cat Lover
    A little troll on Twitter
  • 3. What is Big Data?
    "Big data analytics is concerned with the analysis of large volumes of transaction/event data and the behavioral analysis of human/human and human/system interactions." (Gartner)
    "Big data represents the collection of technologies that handle large data volumes well beyond that inflection point and for which, at least in theory, the hardware resources required to manage data by volume track to an almost straight line rather than a curve." (IDC)
  • 4. Structure & Nonstructure Level Example Structure RDBMS Semi-structure XML, JSON Quasi-structure Text Document Unstructure Image, Video(c) copyright 2013 nuboat in wonderlandThursday, April 11, 13
  • 5. Unstructured data generated around the world in 1 minute. (Source: go-globe.com)
  • 6. The challenges associated with Big Data. (Source: oracle.com)
  • 7. Use the Right Tool
    RDBMS                            HADOOP
    Interactive reporting (<1s)      Affordable storage/compute
    Multistep transactions           Structured or not (agility)
    Lots of insert/update/delete     Resilient, automatic scalability
  • 8. The Hadoop ecosystem: HDFS, MapReduce, YARN, Common, HBase, ZooKeeper, Flume, and related projects such as Cassandra, Hive, Sqoop, Pig, Mahout, etc.
  • 9. HDFS: Hadoop Distributed File System
    (Diagram: a file split into blocks 1-5; each block is stored on three of the data nodes A-E, i.e. replication factor = 3.)
  • 10. HDFS: Hadoop Distributed File System
    (Diagram: the same cluster with node C gone; every block remains available from its replicas on the surviving nodes.)
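    Not from the slides: a minimal command-line sketch of inspecting and changing replication for a file; the path /input/data.txt is hypothetical.

    $ hadoop fsck /input/data.txt -files -blocks -locations   # list the file's blocks and the data nodes holding each replica
    $ hadoop fs -setrep -w 3 /input/data.txt                  # change the file's replication factor to 3 and wait until it is met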
  • 11. Master/slave mode only
    (Diagram: a client talks to the master node, which runs the NameNode for HDFS and the JobTracker for MapReduce; the slave nodes each run a DataNode.)
  • 12. Install & Config Install JDK Adding Hadoop System User Config SSH & Passwordless Disabling IPv6 Installing Hadoop, chown, .sh, conf.xml Formatting HDFS Start Hadoop Hadoop Web Console Stop Hadoop(c) copyright 2013 nuboat in wonderlandThursday, April 11, 13
  • 13. 1. Install Java
    [root@localhost ~]# ./jdk-6u39-linux-i586.bin
    [root@localhost ~]# chown -R root:root ./jdk1.6.0_39/
    [root@localhost ~]# mv jdk1.6.0_39/ /usr/lib/jvm/
    [root@localhost ~]# update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.6.0_39/bin/java" 1
    [root@localhost ~]# update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.6.0_39/bin/javac" 1
    [root@localhost ~]# update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.6.0_39/bin/javaws" 1
    [root@localhost ~]# update-alternatives --config java

    There are 3 programs which provide java.
      Selection    Command
    -----------------------------------------------
    *  1           /usr/lib/jvm/jre-1.6.0-openjdk/bin/java
       2           /usr/lib/jvm/jre-1.5.0-gcj/bin/java
    +  3           /usr/lib/jvm/jdk1.6.0_39/bin/java
    Enter to keep the current selection[+], or type selection number: 3

    [root@localhost ~]# java -version
    java version "1.6.0_39"
    Java(TM) SE Runtime Environment (build 1.6.0_39-b04)
    Java HotSpot(TM) Server VM (build 20.14-b01, mixed mode)
  • 14. 2. Create hadoop user
    [root@localhost ~]# groupadd hadoop
    [root@localhost ~]# useradd -g hadoop -d /home/hdadmin hdadmin
    [root@localhost ~]# passwd hdadmin

    3. Config SSH
    [root@localhost ~]# service sshd start
    ...
    [root@localhost ~]# chkconfig sshd on
    [root@localhost ~]# su - hdadmin
    $ ssh-keygen -t rsa -P ""
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ ssh localhost

    4. Disable IPv6
    [root@localhost ~]# vi /etc/sysctl.conf
    # disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
  • 15. 5. Set PATH
    $ vi ~/.bashrc
    # .bashrc

    # Source global definitions
    if [ -f /etc/bashrc ]; then
            . /etc/bashrc
    fi

    # Do not set HADOOP_HOME
    # User specific aliases and functions
    export HIVE_HOME=/usr/local/hive-0.9.0
    export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_39
    export PATH=$PATH:/usr/local/hadoop-0.20.205.0/bin:$JAVA_HOME/bin:$HIVE_HOME/bin
  • 16. 6. Install Hadoop
    [root@localhost ~]# tar xzf hadoop-0.20.205.0-bin.tar.gz
    [root@localhost ~]# mv hadoop-0.20.205.0 /usr/local/hadoop-0.20.205.0
    [root@localhost ~]# cd /usr/local
    [root@localhost ~]# chown -R hdadmin:hadoop hadoop-0.20.205.0
    [root@localhost ~]# mkdir -p /data/hadoop
    [root@localhost ~]# chown hdadmin:hadoop /data/hadoop
    [root@localhost ~]# chmod 750 /data/hadoop
  • 17. 7. Config Hadoop
    $ vi /usr/local/hadoop-0.20.205.0/conf/hadoop-env.sh
    # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
    export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_39
    export HADOOP_HOME_WARN_SUPPRESS="TRUE"

    $ vi /usr/local/hadoop-0.20.205.0/conf/core-site.xml
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
        <description>The name of the default file system. A URI whose scheme and authority
        determine the FileSystem implementation. The URI's scheme determines the config
        property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's
        authority is used to determine the host, port, etc. for a filesystem.</description>
      </property>
    </configuration>
  • 18. $ vi /usr/local/hadoop-0.20.205.0/conf/hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication. The actual number of replications can be
        specified when the file is created. The default is used if replication is not
        specified at create time.</description>
      </property>
    </configuration>

    $ vi /usr/local/hadoop-0.20.205.0/conf/mapred-site.xml
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
        <description>The host and port that the MapReduce job tracker runs at. If "local",
        then jobs are run in-process as a single map and reduce task.</description>
      </property>
    </configuration>
  • 19. 8. Format HDFS
    $ hadoop namenode -format
    13/04/03 13:59:54 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 0.20.205.0
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by hortonfo on Fri Oct  7 06:26:14 UTC 2011
    ************************************************************/
    13/04/03 14:00:02 INFO util.GSet: VM type       = 32-bit
    13/04/03 14:00:02 INFO util.GSet: 2% max memory = 17.77875 MB
    13/04/03 14:00:02 INFO util.GSet: capacity      = 2^22 = 4194304 entries
    13/04/03 14:00:02 INFO util.GSet: recommended=4194304, actual=4194304
    13/04/03 14:00:02 INFO namenode.FSNamesystem: fsOwner=hdadmin
    13/04/03 14:00:02 INFO namenode.FSNamesystem: supergroup=supergroup
    13/04/03 14:00:02 INFO namenode.FSNamesystem: isPermissionEnabled=true
    13/04/03 14:00:02 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
    13/04/03 14:00:02 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
    13/04/03 14:00:02 INFO namenode.NameNode: Caching file names occuring more than 10 times
    13/04/03 14:00:02 INFO common.Storage: Image file of size 113 saved in 0 seconds.
    13/04/03 14:00:02 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
    13/04/03 14:00:02 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
    ************************************************************/
  • 20. 9. Start NameNode, DataNode, JobTracker and TaskTracker
    $ cd /usr/local/hadoop/bin/
    $ start-all.sh
    starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-namenode-localhost.localdomain.out
    localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-datanode-localhost.localdomain.out
    localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-secondarynamenode-localhost.localdomain.out
    starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-jobtracker-localhost.localdomain.out
    localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-tasktracker-localhost.localdomain.out

    $ jps
    27410 SecondaryNameNode
    27643 TaskTracker
    27758 Jps
    27504 JobTracker
    27110 NameNode
    27259 DataNode
  • 21. 10. Browse HDFS via browser: http://localhost:50070/
    NameNode localhost.localdomain:54310
    Started:   Wed Apr 03 14:11:39 ICT 2013
    Version:   0.20.205.0, r1179940
    Compiled:  Fri Oct 7 06:26:14 UTC 2011 by hortonfo
    Upgrades:  There are no upgrades in progress.
    Browse the filesystem | Namenode Logs

    Cluster Summary
    6 files and directories, 1 blocks = 7 total.
    Heap Size is 58.88 MB / 888.94 MB (6%)
    Configured Capacity               : 15.29 GB
    DFS Used                          : 28.01 KB
    Non DFS Used                      : 4.94 GB
    DFS Remaining                     : 10.36 GB
    DFS Used%                         : 0 %
    DFS Remaining%                    : 67.72 %
    Live Nodes                        : 1
    Dead Nodes                        : 0
    Decommissioning Nodes             : 0
    Number of Under-Replicated Blocks : 0
  • 22. MapReduce
    Map:    (k1, v1)       -> list(k2, v2)
    Reduce: (k2, list(v2)) -> list(k3, v3)
  • 23. MapReduce (overview diagram; not captured in the transcript)
  • 24. Map (slide content not captured in the transcript; a hypothetical mapper sketch follows)
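    The mapper code shown on the original slide is missing from the transcript. Below is a minimal sketch of a WordCount mapper against the org.apache.hadoop.mapreduce API shipped with Hadoop 0.20; the package and class name (com.fico.WordcountMapper) are assumptions, not taken from the slides.

    package com.fico;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical WordCount mapper: emits (word, 1) for every token in a line of input.
    public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // key = the word, value = 1
            }
        }
    }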
  • 25. Reduce (slide content not captured in the transcript; a hypothetical reducer sketch follows)
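    Likewise, a minimal sketch of the matching WordCount reducer; the class name is an assumption.

    package com.fico;

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical WordCount reducer: sums all the counts emitted for each word.
    public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));   // (word, total count)
        }
    }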
  • 26. Job (slide content not captured in the transcript; a hypothetical job driver sketch follows)
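    A minimal sketch of the job driver, named to match the class invoked on slide 27 (com.fico.Wordcount); everything beyond that name is an assumption. Using the reducer as a combiner is optional but cheap for word counting.

    package com.fico;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver: wires the mapper and reducer together and submits the job.
    public class Wordcount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");          // 0.20-era Job constructor
            job.setJarByClass(Wordcount.class);
            job.setMapperClass(WordcountMapper.class);
            job.setCombinerClass(WordcountReducer.class);  // optional local pre-aggregation
            job.setReducerClass(WordcountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/*
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount_output_dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }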
  • 27. Execute
    $ hadoop jar ./wordcount.jar com.fico.Wordcount /input/* /output/wordcount_output_dir
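    Not from the slides: before the job can run, the input text must already be in HDFS, and the output directory must not exist yet. A minimal sketch with a hypothetical local file name:

    $ hadoop fs -mkdir /input
    $ hadoop fs -put ./books.txt /input/                                # copy a local text file into HDFS
    $ hadoop fs -cat /output/wordcount_output_dir/part-r-00000 | head   # inspect the result once the job finishes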
  • 28. Bonus material
  • 29. HIVE
    Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
  • 30. HIVE Command Ex.
    hive> CREATE TABLE pokes (foo INT, bar STRING);
    hive> SHOW TABLES;
    hive> DESCRIBE invites;
    hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
    hive> DROP TABLE pokes;
    hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
    hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
  • 31. SQOOP
    With Sqoop, you can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data. A sketch of such an import follows below.
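    Not from the slides: a minimal sketch of a Sqoop import; the JDBC URL, credentials, table, and target directory are all hypothetical.

    $ sqoop import \
        --connect jdbc:mysql://localhost/testdb \
        --username dbuser --password dbpass \
        --table customers \
        --target-dir /input/customers \
        -m 4    # split the import across 4 parallel map tasks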
  • 32. THANK YOU :)