Hadoop Investigation Report
 


    • Hadoop Investigation Report. Preferred Infrastructure. August 25, 2008
    • Contact
      Preferred Infrastructure: E-mail: info@preferred.jp
      NTT Resonant: E-mail: pr@nttr.co.jp
      Copyright (c) NTT Resonant Inc. 2008
    • 2008-08-25
    • Contents
      1 (1.1)
      2 Hadoop
      3 GFS and HDFS
        3.1 GFS (3.1.1, 3.1.2, 3.1.3 HDFS)
        3.2
        3.3 (3.3.1 to 3.3.16)
        3.4 (3.4.1 onward)
        3.5, 3.6 (through 3.6.10, Read-Only)
        3.7
      4 Google MapReduce and Hadoop MapReduce
        4.1 Google MapReduce (4.1.1, 4.1.2, 4.1.3 Hadoop MapReduce)
        4.2
        4.3 (4.3.1 MapReduce, 4.3.2, 4.3.3 Shuffle, 4.3.4 Map and Reduce, 4.3.5 Map, onward)
        4.4 (through 4.4.3)
        4.5 (4.5.1 Combine, 4.5.2, 4.5.3 Map and Shuffle, 4.5.4, 4.5.5 Map)
      5
        5.2 org.apache.hadoop.util: MergeSort, PriorityQueue, ReflectionUtils, RunJar, Tool
        5.3 org.apache.hadoop.io: Writable, SequenceFile, compress
        5.4 org.apache.hadoop.ipc: VersionedProtocol; RPC, Server, Client
        5.5 org.apache.hadoop.net: DNS; Node, NodeBase; NetworkTopology
        5.6 org.apache.hadoop.fs: FileSystem, LocalFileSystem, InMemoryFileSystem, FSOutputSummer and FSInputStream, Path, Trash, FileUtil, FsShell, DU and DF
        5.7 org.apache.hadoop.dfs: ClientProtocol, DatanodeProtocol, NamenodeProtocol, DistributedFileSystem, DFSClient, DataNode, NameNode, FSNamesystem, FSImage and FSEditLog, ReplicationTargetChooser, SecondaryNameNode, Balancer, NamenodeFsck
        5.8 org.apache.hadoop.mapred: JobConf, InputFormat, OutputFormat, JobClient, JobTracker, TaskTracker, StatusHttpServer
      6
        6.3 (6.3.1, 6.3.2)
        6.4
      7
      Figures and listings
        2.1 Google, OSS
        5.1 to 5.5
        5.6 JobConf
        5.7 JobConf
        6.1 bonnie++
        6.2 to 6.5 1G * 100
        6.6 100G (randomwriter.conf)
      Tables
        3.1 Google File System and Hadoop
        4.1 Google MapReduce and Hadoop
        5.1
        6.1
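    • Chapter 1
      1.1
      This document reports on Hadoop [4]. Chapter 2 introduces Hadoop. Chapters 3 and 4 compare Google's Google File System [10] and MapReduce [9] with Hadoop. Chapter 5 covers the Hadoop source code, Chapter 6 covers measurements of Hadoop, and Chapter 7 concludes. The version examined is Hadoop 0.16.4.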
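    • Chapter 2: Hadoop
      Hadoop is developed mainly by Doug Cutting of Yahoo! Inc. and grew out of the Lucene [8] project. Hadoop is an open-source counterpart of Google's Google File System (GFS) and MapReduce. It consists of HDFS (Hadoop Distributed File System) and the Hadoop MapReduce Framework, which correspond to Google's GFS and MapReduce. Figure 2.1 shows the correspondence; BigTable corresponds to hBase.
      Figure 2.1: Google and OSS
      Hadoop is written in Java, and MapReduce programs are normally written in Java. With Hadoop Streaming [5], MapReduce programs can also be written in C/C++, Ruby, Python, and other languages.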
    • Chapter 3: GFS and HDFS
      This chapter compares GFS with HDFS, the Hadoop counterpart of GFS.
      3.1 GFS
      GFS is described in [10].
      3.1.1
      GFS: PC (bullet items: TB; PC).
      3.1.2
      GFS: 64MB; PC; 3.
    • 3.1.3 HDFS
      HDFS corresponds to GFS. HDFS consists of a NameNode and DataNodes.
      3.2
      Table 3.1 compares GFS with Hadoop HDFS.
      Table 3.1: Google File System and Hadoop
    • Table 3.1 (continued): Hadoop, (Read-Only)
      3.3
      3.3.1
      Hadoop: NameNode. Code: DFSClient::mkdirs
    • 3.3.2
      Hadoop: NameNode. Code: DFSClient::delete
      3.3.3
      Hadoop: NameNode. Code: DFSClient::create
      3.3.4
      Hadoop: NameNode. Code: DFSClient::delete
    • 3.3.5 (see 5.6.6)
      Hadoop: delete moves the target into /trash. /trash is emptied by NameNode.emptier.
      3.3.6
      Hadoop: DFSInputStream (5.7.5) contacts the NameNode and reads from 1 DataNode. Code: DFSInputStream::read
      3.3.7
      Hadoop: DFSOutputStream (5.7.5), NameNode, DataNode.
    • Code: DFSOutputStream::writeChunk
      3.3.8
      Hadoop: DFSInputStream (5.7.5), NameNode, DataNode. Code: DFSInputStream::read
      3.3.9
      Hadoop
      3.3.10
      Hadoop: DFSClient, NameNode.
    • Code: DFSClient::rename
      3.3.11
      Hadoop: DFSClient, NameNode. Code: DFSClient::listPaths
      3.3.12
      Hadoop: the user name is taken from whoami and the groups from bash -c groups, as seen with the command-line client (bin/hadoop dfs). Code: DFSClient::getFileInfo
      http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html
    • 3.3.13
      Hadoop: the user name is taken from whoami and the groups from bash -c groups, as seen with the command-line client (bin/hadoop dfs). Code: DFSClient::getFileInfo
      http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html
      3.3.14
      Hadoop: see HADOOP-1700.
      http://issues.apache.org/jira/browse/HADOOP-1700
    • 3.3.15
      Hadoop: FSImage, held by the NameNode.
      3.3.16
      Hadoop: each DataNode sends a HeartBeat to the NameNode. The HeartBeat interval is set by "dfs.heartbeat.interval" (3). Code: DataNode::offerService → NameNode::sendHeartbeat
      3.4
      3.4.1
      Hadoop: DataNode, NameNode, HeartBeat.
    • Code: DataNode::offerService → NameNode::sendHeartbeat
      3.4.2
      Hadoop: run ./bin/hadoop/start-balancer.sh. DF::run (5.6.9) calls df on each DataNode, and the DF result is sent to the NameNode in the HeartBeat. See Balancer:
      https://issues.apache.org/jira/browse/HADOOP-1652
      3.4.3
      Hadoop. Code: FSNameSystem::checkPermission → PermissionChecker::checkPermission
    • 3.4.4
      RPC. Hadoop logs through Commons-Logging [2] with log4j (org.apache.commons.logging.*).
      3.5
      3.5.1
      Hadoop: see 5.5.2 and 5.5.3; configured with "dfs.network.scripts". Code: FSNameSystem::getBlockLocations → NetworkTopology::pseudoSortByDistance
      http://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
    • 3.6
      3.6.1
      Hadoop: the NameNode assigns an id to each Block. Code: FSNamesystem::allocateBlock
      3.6.2
      Hadoop: a checksum is kept for every "io.bytes.per.checksum" bytes (512). NameNode; DFSOutputStream, DFSInputStream.
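The per-chunk checksum idea in 3.6.2 can be illustrated with a minimal, standard-library-only sketch. It assumes CRC32 as the checksum and uses 512 bytes per checksum as quoted for "io.bytes.per.checksum"; the class and method names are hypothetical and are not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical illustration: one CRC32 value per fixed-size chunk of data.
final class ChunkChecksums {
    /** Returns one CRC32 value for each chunk of at most bytesPerChecksum bytes. */
    static List<Long> checksums(byte[] data, int bytesPerChecksum) {
        if (bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("bytesPerChecksum must be positive");
        }
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            sums.add(crc.getValue());
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum differs, or -1 if all match. */
    static int firstCorruptChunk(byte[] data, int bytesPerChecksum, List<Long> expected) {
        List<Long> actual = checksums(data, bytesPerChecksum);
        for (int i = 0; i < actual.size(); i++) {
            if (i >= expected.size() || !actual.get(i).equals(expected.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

With 1300 bytes of data and 512 bytes per checksum, the writer side would store three checksums; a reader recomputing them can tell which 512-byte chunk was damaged without rereading the whole block.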
    • 3.6.3
      Hadoop
      3.6.4
      Hadoop: on write, see 5.7.10. Code: FSNameSystem::pendingTransfers, ReplicationTargetChooser (5.7.10)
      http://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
      3.6.5
      2
    • Hadoop: as in GFS, DataNode to DataNode. Code: DFSOutputStream
      3.6.6
      Hadoop: RPC (5.4.2) and (5.7.5); not RPC for DFSOutputStream and DFSInputStream.
      3.6.7
      Hadoop: FSEditLog (5.7.9). FSImage and FSEditLog are stored under dfs.name.dir.
    • 3.6.8
      Hadoop: SecondaryNameNode and NameNode (5.7.11). Code: SecondaryNameNode::run
      3.6.9
      Hadoop: the NameNode loads FSImage with loadFSImage (5.7.9). Code: FSImage::loadFSImage → FSEditLog::loadFSEdits
      3.6.10 (Read-Only)
    • Hadoop: SecondaryNameNode.
      3.7
      HDFS and GFS; versions 0.16.4 and 0.19. HDFS: DataNode, NameNode.
    • Chapter 4: Google MapReduce and Hadoop MapReduce
      4.1 Google MapReduce
      Google MapReduce is described in [9].
      4.1.1
      Google MapReduce: PC.
      • MapReduce has 3 phases on PCs:
        – Map
        – Shuffle
        – Reduce
      • GFS
    • 4.1.2
      Google MapReduce: M Map tasks and R Reduce tasks run on PCs (M, R, MR).
      4.1.3 Hadoop MapReduce
      Hadoop MapReduce corresponds to Google MapReduce and consists of a JobTracker and TaskTrackers. With HadoopStreaming [5], MapReduce programs can be written in other languages.
      4.2
      Table 4.1 compares Google MapReduce with Hadoop MapReduce.
      Table 4.1: Google MapReduce and Hadoop (Hadoop MapReduce: Shuffle, Map, Reduce, Map)
    • Table 4.1 (continued): Hadoop: Combine, Map, Shuffle, Map
      4.3
      4.3.1 MapReduce
      Hadoop: MapReduce programs are written in Java, or with HadoopStreaming. Components: JobClient, JobTracker, TaskTracker.
    • 4.3.2
      Hadoop: each TaskTracker sends a HeartBeat to the JobTracker. The HeartBeat interval is ( /50 + 1) (JobTracker::getNextHeartbeatInterval); 5. Code: TaskTracker::transmitHeartBeat → JobTracker::heartbeat
      4.3.3 Shuffle
      The Reducer for each key is chosen in the Shuffle phase by Hash.
      Hadoop: the partitioner is set with JobConf.setPartitioner. HashPartitioner and KeyFieldBasedPartitioner are provided.
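As an illustration of the hash-based partitioning named in 4.3.3, here is a minimal sketch of how a key can be mapped to one of the Reduce tasks. It mirrors the widely known `hashCode` modulo scheme; the class name is hypothetical and the code does not depend on Hadoop.

```java
// Hypothetical stand-alone sketch of hash partitioning; not the Hadoop class itself.
final class HashPartitionSketch {
    /**
     * Maps a key to a partition in [0, numReduceTasks).
     * Masking with Integer.MAX_VALUE clears the sign bit so that
     * negative hash codes do not produce a negative partition.
     */
    static int partitionFor(Object key, int numReduceTasks) {
        if (numReduceTasks <= 0) {
            throw new IllegalArgumentException("numReduceTasks must be positive");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the result depends only on the key, every record with the same key reaches the same Reducer, which is the property the Shuffle phase relies on.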
    • 4.3.4 Map and Reduce (continued)
      Hadoop: set with JobConf.setNumReduceTasks and JobConf.setNumMapTasks (5.8.1). Code: JobConf::setNumMapTasks, JobConf::setNumReduceTasks
      4.3.5 Map
      Hadoop: each InputSplit is assigned to a Map (5.8.5).
      4.3.6 MapReduce
      Hadoop. Code: JobConf::setJobPriority
      4.3.7 MapReduce
    • Hadoop: set with JobConf.setInputFormat and JobConf.setOutputFormat (5.8.1).
      TextInputFormat: 1 key-value pair per line.
      SequenceFileAsTextInputFormat: 1 key-value pair per record.
      TextOutputFormat.
      InputFormat and OutputFormat: TextInputFormat, TextOutputFormat, SequenceFile.
      4.3.8
      Hadoop: the Task Counter enumeration defines the following counters, each held as 1 key-value pair.
      • MAP_INPUT_RECORDS: Map
      • MAP_OUTPUT_RECORDS: Map
      • MAP_INPUT_BYTES: Map
      • MAP_OUTPUT_BYTES: Map
      • COMBINE_INPUT_RECORDS: Combine
      • COMBINE_OUTPUT_RECORDS: Combine
      • REDUCE_INPUT_GROUPS: Reduce
      • REDUCE_INPUT_RECORDS: Reduce
      • REDUCE_OUTPUT_RECORDS: Reduce
    • 4.3.9
      Hadoop: Reporter::incrCount on the Reporter.
      http://www.jakobhoman.com/2007/11/quick-tour-of-hadoops-reporter-object.html
      4.4
      4.4.1 MapReduce
      Hadoop: HTTP. The JobTracker serves status over HTTP (5.8.7). CUI: %, JobClient. Code: StatusHttpServer
      4.4.2 MapReduce
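The counters in 4.3.8 and 4.3.9 are enum-keyed tallies. The sketch below shows that idea with plain Java collections; `CounterSketch`, its methods, and the `MyCounter` enum are hypothetical names for illustration and are not the Hadoop Reporter API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of enum-keyed counters; not the Hadoop Reporter API.
final class CounterSketch {
    /** Two of the built-in counter names listed in 4.3.8. */
    enum TaskCounter { MAP_INPUT_RECORDS, MAP_OUTPUT_RECORDS }

    /** A user-defined counter, invented for this example. */
    enum MyCounter { MALFORMED_LINES }

    private final Map<Enum<?>, Long> counts = new HashMap<>();

    /** Adds amount to the counter identified by key. */
    void increment(Enum<?> key, long amount) {
        counts.merge(key, amount, Long::sum);
    }

    /** Returns the current value, or 0 if the counter was never incremented. */
    long get(Enum<?> key) {
        return counts.getOrDefault(key, 0L);
    }
}
```

Using enums as keys keeps built-in and user-defined counters in separate namespaces while letting one table hold both.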
    • Hadoop: setting "mapred.job.tracker" to local in JobConf runs Map, Shuffle, and Reduce in 1 process (5.8.1). Code: JobClient::init → LocalJobRunner
      4.4.3 MapReduce
      Hadoop: ./bin/hadoop job -kill-task and ./bin/hadoop job -list. Code: JobClient
      4.5
      4.5.1 Combine
      Combine is applied to Map output.
      Hadoop: set with JobConf.setCombinerClass (5.8.1). Code: JobConf
    • 4.5.2 Map and Reduce
      Hadoop: see 3.5.1. Task, Map, Reducer, Shuffle. Code: JobInProgress::createCache → InputFormat::getLocations → DistributedFileSystem::getFileBlockLocations
      http://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
      4.5.3 Map and Shuffle
      Hadoop: the Reduce side fetches Map output from the TaskTracker during Shuffle; Fetch: 1 per TaskTracker. Code: ReduceTask.ReduceCopier::fetchOutputs
    • 4.5.4
      I/O of MapReduce.
      Hadoop: SequenceFile key-value data can be compressed with gzip or lzo by setting "mapred.output.compress" to true (5.8.1). Code: OutputFormatBase
      4.5.5 Map
      Map output passed to Shuffle.
      Hadoop: set JobConf.setCompressMapOutput (5.8.1) to true; MapTask reads it from JobConf. Code: MapTask.MapOutputBuffer::MapOutputBuffer
      4.6
      4.6.1
      Hadoop
    • Code: JobInProgress::failedTask
      4.6.2
      Hadoop: SpeculativeTask. Setting JobConf.setMapSpeculativeExecution and JobConf.setReduceSpeculativeExecution to true enables it. The decision is made in TaskInProgress::hasSpeculativeTask (5.8.5). Code: TaskInProgress::hasSpeculativeTask
      4.6.3
      Hadoop: KILL. JobInProgress::completedTask checks alreadyCompletedTask; KILL; completed; SUCCEEDED. Code: JobInProgress::completedTask → TaskInProgress::alreadyCompletedTask
    • 4.6.4
      Hadoop
      4.6.5
      HTML.
      Hadoop: see HADOOP-153.
      http://issues.apache.org/jira/browse/HADOOP-153
      4.7
      Hadoop MapReduce and Google MapReduce.
    • Chapter 5
      Hadoop
      5.1
      Directories under src/:
      Table 5.1
        conf
        dfs        HDFS
        filecache
        fs
        io
        ipc        IPC (Inter Process Communication)
        log
        mapred     MapReduce
        metrics
        net
        record
        security
        tools
        util
    • 5.2 org.apache.hadoop.util
      Hadoop utilities.
      5.2.1 MergeSort
      Used for Map output.
      5.2.2 PriorityQueue
      5.2.3 ReflectionUtils
      Java reflection. Code: ReflectionUtils::newInstance
      5.2.4 RunJar
      Runs a Jar.
      5.2.5 Tool
      MapReduce. Code: ToolRunner::run
      5.3 org.apache.hadoop.io
      5.3.1 Writable
      MapReduce key and value types are serialized through java.io.DataInput and java.io.DataOutput. Implementations: IntWritable, LongWritable, FloatWritable, BytesWritable, ArrayWritable, TwoDArrayWritable, MapWritable.
      5.3.2 SequenceFile
      A file of Key-Value pairs.
    • 5.3.3 compress
      compress: BlockCompressorStream, GzipCodec, LzoCodec.
      5.4 org.apache.hadoop.ipc
      5.4.1 VersionedProtocol
      5.4.2 RPC, Server, Client
      RPC (Remote Procedure Call). Listing 5.1 starts a server.

        Configuration conf = new Configuration();
        // Listen on localhost:8000
        Server server = RPC.getServer(this, "localhost", 8000, conf);
        server.start();

      Listing 5.1

      Listing 5.2 obtains a ClientProtocol client.

        Configuration conf = new Configuration();
        InetSocketAddress addr = new InetSocketAddress("localhost", 8000);
        // Obtain a proxy that implements ClientProtocol
        ClientProtocol client = (ClientProtocol) RPC.waitForProxy(
            ClientProtocol.class, ClientProtocol.versionID, addr, conf);

      Listing 5.2: ClientProtocol

      Listing 5.3 defines ClientProtocol. Values passed through ClientProtocol are Writable or Java types.

        interface ClientProtocol extends org.apache.hadoop.ipc.VersionedProtocol {
            public static final long versionID = 1L;
            HeartbeatResponse heartbeat();
        }

        public class HeartbeatResponse implements org.apache.hadoop.io.Writable {
            String status;

            // Serialize the status field
            public void write(DataOutput out) throws IOException {
                UTF8.writeString(out, status);
            }

            // Restore the status field
            public void readFields(DataInput in) throws IOException {
                this.status = UTF8.readString(in);
            }
        }

      Listing 5.3

      The call itself is shown in Listing 5.4.

        client.heartbeat();

      Listing 5.4

      "ipc.client.connect.max.retries" (10); 60 (FSConstants.READ_TIMEOUT); 1.
    • 5.5 org.apache.hadoop.net
      5.5.1 DNS
      DNS: reverse lookup (reverseDns) and IP addresses (getIPs).
      5.5.2 Node, NodeBase
      "dfs.network.scripts" (see 3.5.1).
      5.5.3 NetworkTopology
      Hadoop Node and rack (isOnSameRack); getDistance; 1.
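To make the Writable contract from Listing 5.3 concrete, the following stand-alone sketch serializes a HeartbeatResponse-like object and reads it back. It is an illustration only: `SimpleWritable` is a hypothetical stand-in for the Hadoop interface, and `writeUTF`/`readUTF` from `java.io` replace the `UTF8` helper so that the example has no Hadoop dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of the write/readFields round trip; not Hadoop code.
final class WritableRoundTrip {
    /** Stand-in for the Writable interface: the same two methods. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    static final class HeartbeatResponse implements SimpleWritable {
        String status = "";

        public void write(DataOutput out) throws IOException {
            out.writeUTF(status);
        }

        public void readFields(DataInput in) throws IOException {
            status = in.readUTF();
        }
    }

    /** Serializes a response with the given status and returns the status read back. */
    static String roundTrip(String status) {
        try {
            HeartbeatResponse sent = new HeartbeatResponse();
            sent.status = status;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                sent.write(out);
            }
            HeartbeatResponse received = new HeartbeatResponse();
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                received.readFields(in);
            }
            return received.status;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The receiver constructs an empty object first and then fills it with `readFields`, which is why types exchanged this way need a no-argument constructor.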
    • 5.6 org.apache.hadoop.fs
      5.6.1 FileSystem
      FileSystem abstracts over file systems such as Amazon S3 (s3). Hadoop selects the file system by URI scheme: hdfs://, file://, Amazon S3 as s3://, and the Kosmos file system [7] as kfs://. createFileSystem (URI) looks up "fs.[scheme].impl"; for example, "fs.hdfs.impl" is org.apache.hadoop.dfs.DistributedFileSystem.

        Configuration conf = new Configuration();
        // Open the input file on HDFS
        FileSystem fs1 = FileSystem.getNamed("hdfs:///", conf);
        Path inFile = new Path("hdfs:///user/kzk/infile");
        FSDataInputStream in = fs1.open(inFile);
        // Create the output file on S3
        FileSystem fs2 = FileSystem.getNamed("s3:///", conf);
        Path outFile = new Path("s3:///user/kzk/outfile");
        FSDataOutputStream out = fs2.create(outFile);
        // Copy the contents (buffer size chosen for illustration)
        byte[] buffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = in.read(buffer)) > 0) {
            out.write(buffer, 0, bytesRead);
        }
        in.close();
        out.close();

      Listing 5.5

      5.6.2 LocalFileSystem
      A FileSystem for the local file system.
      5.6.3 InMemoryFileSystem
      reserveSpace, reserveSpaceWithCheckSum. InMemoryFileSystem is used by ReduceTask for Key and Value data.
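The "fs.[scheme].impl" lookup described in 5.6.1 amounts to building a configuration key from the URI scheme. The sketch below shows that step with a plain `Map` standing in for the Hadoop `Configuration`; the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical sketch of scheme-based lookup; a Map stands in for Configuration.
final class SchemeLookupSketch {
    /** Builds the configuration key for a URI, for example "fs.hdfs.impl". */
    static String implKey(URI uri) {
        String scheme = uri.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + uri);
        }
        return "fs." + scheme + ".impl";
    }

    /** Returns the implementation class name configured for the URI's scheme. */
    static String implClassName(Map<String, String> conf, URI uri) {
        String key = implKey(uri);
        String className = conf.get(key);
        if (className == null) {
            throw new IllegalArgumentException("No file system configured for " + key);
        }
        return className;
    }
}
```

Keeping the mapping in configuration rather than in code is what lets a new scheme such as kfs:// be added without changing callers of FileSystem.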
    • 5.6.4 FSOutputSummer, FSInputStream
      Used by FileSystem.
      5.6.5 Path
      Path.
      5.6.6 Trash
      The HDFS trash (3.3.5); Emptier.
      5.6.7 FileUtil
      copy.
      5.6.8 FsShell
      5.6.9 DU, DF
      Wrappers for the UNIX du and df commands, used by the DataNode.
      5.7 org.apache.hadoop.dfs
      5.7.1 ClientProtocol
      RPC with the NameNode.
      5.7.2 DatanodeProtocol
      RPC between DataNode and NameNode.
      5.7.3 NamenodeProtocol
      RPC between Balancer and NameNode.
    • 5.7.4 DistributedFileSystem
      The FileSystem (5.6.1) for the "hdfs" scheme; it uses DFSClient.
      5.7.5 DFSClient
      DFSClient provides the HDFS operations open(), create(), exists(), listPaths(), and mkdir(). createNamenode obtains the ClientProtocol connection to the NameNode. DFSInputStream and DFSOutputStream read and write HDFS data.
      DFSInputStream
      DFSInputStream contacts the NameNode, then reads from a DataNode through a BlockReader. DFSInputStream blockSeekTo sets up the BlockReader. Data moves over a Socket, not RPC, from the DataNode (DFSInputStream::readBuffer).
      DFSOutputStream
      DFSOutputStream groups data into 64K "Packet" units (512K) and sends them over a Socket. Packets are placed on dataQueue. DataStreamer takes packets from dataQueue, sends them to the DataNode, and moves them to ackQueue. ResponseProcessor receives acks from the DataNode and removes acked packets from ackQueue. On a Datanode error, packets in ackQueue are moved back to dataQueue (DataStreamer::processDatanodeError) and sent again (DataStreamer::run).
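The dataQueue and ackQueue handling described for DFSOutputStream can be sketched as a small single-threaded model. This is a simplified illustration, not the Hadoop implementation: packets are reduced to sequence numbers, there are no threads or sockets, and all names other than dataQueue and ackQueue are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-threaded model of the two packet queues; not Hadoop code.
final class PacketPipelineSketch {
    private final Deque<Integer> dataQueue = new ArrayDeque<>(); // packets waiting to be sent
    private final Deque<Integer> ackQueue = new ArrayDeque<>();  // packets sent but not yet acked

    /** The writer side appends a packet, identified by its sequence number. */
    void enqueue(int seqno) {
        dataQueue.addLast(seqno);
    }

    /** The streamer side "sends" the next packet and parks it on ackQueue; returns -1 if idle. */
    int sendNext() {
        Integer packet = dataQueue.pollFirst();
        if (packet == null) {
            return -1;
        }
        ackQueue.addLast(packet);
        return packet;
    }

    /** The responder side removes the oldest unacked packet once its ack arrives. */
    void ack(int seqno) {
        Integer head = ackQueue.peekFirst();
        if (head == null || head != seqno) {
            throw new IllegalStateException("unexpected ack for packet " + seqno);
        }
        ackQueue.removeFirst();
    }

    /** On an error, unacked packets return to the front of dataQueue in their original order. */
    void onDatanodeError() {
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.removeLast());
        }
    }

    int pending() { return dataQueue.size(); }
    int unacked() { return ackQueue.size(); }
}
```

Keeping sent packets until they are acknowledged is what allows the stream to resend exactly the unconfirmed data after a failure, without asking the writer to produce it again.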
    • 5.7.6 DataNode
      The DataNode registers with the NameNode and then sends a HeartBeat to the NameNode (DataNode::offerService). The HeartBeat is an RPC over DatanodeProtocol. The HeartBeat response carries a DatanodeCommand from the NameNode, so the NameNode instructs the DataNode through HeartBeat replies.
      5.7.7 NameNode
      There is 1 NameNode. The NameNode serves ClientProtocol and DatanodeProtocol, including the DataNode HeartBeat.
      5.7.8 FSNamesystem
      The NameNode delegates ClientProtocol calls to FSNamesystem, which handles the NameNode RPCs. FSNamesystem holds:
      • (1)
      • (2) (of (1))
      • (3)
      • (4) (of (3))
      HDFS: FSDirectory, FSNameSystem, INode.
    • BlocksMap, INode.
      5.7.9 FSImage, FSEditLog
      FSImage; FSEditLog.
      5.7.10 ReplicationTargetChooser
      Chooses the target DataNodes: 2, 1; 3, 1.
      5.7.11 SecondaryNameNode
      SecondaryNameNode and NameNode: "fs.checkpoint.size"; "fs.checkpoint.dir". SecondaryNameNode talks to the NameNode through ClientProtocol.
      5.7.12 Balancer
      Balancer moves data between DataNodes in HDFS (3.4.2).
      5.7.13 NamenodeFsck
      HDFS fsck [3]: DataNode, NameNode.
    • 5.8 org.apache.hadoop.mapred
      5.8.1 JobConf
      JobConf holds the settings of a MapReduce job. The main JobConf settings are:
      • job name (setJobName)
      • Mapper (setMapperClass)
      • Combiner (setCombinerClass)
      • Reducer (setReducerClass)
      • InputFormat (setInputFormat)
      • OutputFormat (setOutputFormat)
      • input path (setInputPath)
      • output path (setOutputPath)
      Listing 5.6 shows a JobConf.

        // Create a new JobConf
        JobConf job = new JobConf(new Configuration(), MyJob.class);
        // Specify various job-specific parameters
        job.setJobName("myjob");
        job.setMapperClass(MyJob.MyMapper.class);
        job.setCombinerClass(MyJob.MyReducer.class);
        job.setReducerClass(MyJob.MyReducer.class);
        job.setInputFormat(SequenceFileInputFormat.class);
        job.setOutputFormat(SequenceFileOutputFormat.class);
        job.setInputPath(new Path("in"));
        job.setOutputPath(new Path("out"));

      Listing 5.6: JobConf

      Further settings are shown in Listing 5.7.
      5.8.2 InputFormat
      InputFormat defines the input of a MapReduce job. An InputFormat:
      • validates the input (validateInput)
      • splits the input for the Mappers (getSplits)
    •
        // Number of Map tasks
        conf.setNumMapTasks(100);
        // Number of Reduce tasks
        conf.setNumReduceTasks(40);
        // Debug script for Map
        conf.setMapDebugScript("/home/kzk/debug/map-fail.sh");
        // Debug script for Reduce
        conf.setReduceDebugScript("/home/kzk/debug/reduce-fail.sh");
        // Compress Map output
        conf.setCompressMapOutput(true);
        // Compress job output
        conf.setBoolean("mapred.output.compress", true);
        // Run MapReduce locally
        conf.set("mapred.job.tracker", "local");
        conf.set("fs.default.name", "local");

      Listing 5.7: JobConf

      • provides a RecordReader for each InputSplit (getRecordReader)
      getSplits returns InputSplit objects such as FileSplit. getRecordReader returns a RecordReader that yields Key-Value pairs; records are read with RecordReader::next.
      The default InputFormat is TextInputFormat. Its getRecordReader returns a LineRecordReader.
      KeyValueTextInputFormat is an InputFormat for Key-Value text. Its getRecordReader returns a KeyValueLineRecordReader.
      5.8.3 OutputFormat
      OutputFormat defines the output of a MapReduce job. An OutputFormat:
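As an illustration of what a Key-Value line reader has to do, the sketch below splits one text line into a key and a value. It assumes a tab as the separator and treats a line without a tab as a key with an empty value; both are assumptions made for this example, and the class name is hypothetical.

```java
// Hypothetical sketch of key/value line splitting; the tab separator is an assumption.
final class KeyValueLineSketch {
    /**
     * Splits a line at the first tab.
     * Returns {key, value}; if the line has no tab, the whole line is the key
     * and the value is the empty string.
     */
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }
}
```

Splitting only at the first separator keeps any further tabs inside the value, so values may themselves contain tab-separated fields.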
    • • checks the output specification (checkOutputSpecs)
      • provides a RecordWriter (getRecordWriter)
      TextOutputFormat is an OutputFormat that writes key\tvalue lines. Output compression is set with OutputFormatBase::setCompressOutput.
      5.8.4 JobClient
      JobClient submits a Job to the JobTracker. JobClient.runJob submits the Job to the JobTracker and waits for the Job.
      5.8.5 JobTracker
      The JobTracker manages Jobs and hands Tasks to TaskTrackers. JobClient calls the JobTracker submitJob RPC. The Job is added to jobInitQueue, and JobInitThread calls JobInProgress::initTasks, which prepares the Job from its InputSplits.
      Each TaskTracker sends a HeartBeat (heartbeat). The reply to the TaskTracker carries TaskTrackerAction objects: LaunchJobAction, KillJobAction, KillTaskAction, ReinitTrackerAction. A Task for the TaskTracker is handed over as a LaunchTaskAction.
      The Task for a TaskTracker is chosen by getNewTaskForTaskTracker. For Map and Reduce, JobInProgress obtainNewMapTask and obtainNewReduceTask call findNewTask. TaskInProgress::hasSpeculativeTask decides on a SpeculativeTask from these conditions:
      • Task
      • SpeculativeTask
      • SPECULATIVE_GAP (2)
      • SPECULATIVE_LAG (60)
      • Task
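The speculative-execution conditions listed in 5.8.5 can be read as a single boolean test. The sketch below is one possible reading, not the Hadoop code: it assumes SPECULATIVE_GAP means a progress gap of 0.2 (the unit after "2" did not survive in the text) and SPECULATIVE_LAG means 60 seconds, and the parameter names are hypothetical.

```java
// Hypothetical reading of the speculative-task conditions; thresholds are assumptions.
final class SpeculationSketch {
    static final double SPECULATIVE_GAP = 0.2;       // assumed: progress is 20 points behind average
    static final long SPECULATIVE_LAG_MS = 60_000L;  // assumed: task has run for at least 60 seconds

    /**
     * Returns true when a second attempt of the task should be started:
     * speculation is enabled, the task is neither completed nor already
     * being speculated, it lags the average progress by the gap, and it
     * has been running for at least the lag.
     */
    static boolean hasSpeculativeTask(boolean enabled,
                                      boolean completed,
                                      boolean alreadySpeculating,
                                      double taskProgress,
                                      double averageProgress,
                                      long startTimeMs,
                                      long nowMs) {
        return enabled
            && !completed
            && !alreadySpeculating
            && (averageProgress - taskProgress >= SPECULATIVE_GAP)
            && (nowMs - startTimeMs >= SPECULATIVE_LAG_MS);
    }
}
```

The lag condition keeps freshly started tasks from being duplicated, while the gap condition limits duplication to tasks that are clearly slower than their peers.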
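    • 5.8.6 TaskTracker
      The TaskTracker runs Tasks. In offerService it sends a HeartBeat to the JobTracker and, on a LaunchTaskAction, starts the Task (startNewTask). startNewTask calls localizeJob, which fetches the job jar from HDFS, and then launchTaskForJob (TaskInProgress::launchTask). launchTask calls localizeTask and createRunner to obtain a TaskRunner. The TaskRunner starts a java process.
      MapTask
      MapTask runs a Map in run. run passes the Map a collector and calls MapRunner::run. MapRunner::run reads records from the RecordReader and calls map with the collector. With no Reduce the collector is DirectMapOutputCollector; otherwise it is MapOutputBuffer.
      MapOutputBuffer::collect receives map output destined for a ReduceTask and adds it to a MergeSorter with MergeSorter::addKeyValue. When the MergeSorter buffer (maxBufferSize, bufferWriter) fills, sortAndSpillToDisk is called. sortAndSpillToDisk sorts with the MergeSorter (pendingSortImpl[i].sort()), applies the Combiner (combine) if set, and writes through a RecordWriter (spill), calling startPartition and endPartition around each Partition for a ReduceTask.
      At the end of run, collector::flush calls mergeParts, which merges the spills for each Partition into 1 file with SequenceFile::Sorter.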
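    • The merged map output is then fetched by the ReduceTask.
      ReduceTask
      ReduceTask runs a Reduce in run. The ReduceTask ReduceCopier calls fetchOutputs to copy Map output to the Reducer and merges the Map outputs into 1. run then calls reduce with a ReduceValuesIterator and the Reduce collector. Output from reduce goes through the collector collect to a RecordWriter.
      A Mapper implements map, and a Reducer implements reduce.
      5.8.7 StatusHttpServer
      JobTracker and TaskTracker use StatusHttpServer to serve status over HTTP (4.4.1). The HTTP server is Jetty [6].
      5.9
      Map, Reduce, UNIX.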
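    • Chapter 6
      This chapter measures HDFS and Hadoop MapReduce.
      6.1
      The cluster has 12 DataNode/TaskTracker machines and 1 NameNode/JobTracker machine, as listed in Table 6.1. Network: 100MBps Ethernet.
      Table 6.1
        CPU            Intel Xeon E5430 2.66 GHz Quad Core
        Memory         16G
        Disk           SAS
        OS             Linux 2.6.18-53.1.14.el5PAE
        NIC            Broadcom NetXtreme II BGM5708 Gigabit Ethernet
        I/O Scheduler  CFQ (Completely Fair Queuing)
      6.1.1
      Disk read/write performance was measured with bonnie++ [1]. The bonnie++ output is in Listing 6.1: block write 80.2MB/sec, block read 94.2MB/sec, random seeks 347.7 per second.
      6.2 HDFS
      HDFS was measured with the MapReduce program TestDFSIO shipped with Hadoop (hadoop-0.16.4-test.jar); 1, 3.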
    •
        $ tar vzxf bonnie++-1.03c.tar.gz
        $ cd bonnie++-1.03c
        $ ./configure
        $ make
        $ ./bonnie++
        Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                            -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
        Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
        tmr001          32G 44896  66 80263  14 39105   7 66683  94 94257  11 347.7   0
                            ------Sequential Create------ --------Random Create--------
                            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                      files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                         16 99920  98 585100  96 116893 100 101574  99 917100 100 121861 100

      Listing 6.1: bonnie++

      6.2.1
      Listing 6.2 writes 100 files of 1G each (see 5.7.10).

        $ ./bin/hadoop jar hadoop-0.16.4-test.jar TestDFSIO -write -nrFiles 100 -fileSize 1000

      Listing 6.2: 1G * 100

      The result is shown in Figure 6.3.
      [Figure 6.3: 1G * 100 (MB per second); y-axis MegaBytes/Sec (35 to 80), x-axis Machines (5 to 12)]
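    • 6.2.2
      Listing 6.4 reads the 100G.

        $ ./bin/hadoop jar hadoop-0.16.4-test.jar TestDFSIO -read -nrFiles 100 -fileSize 1000

      Listing 6.4: 1G * 100

      The result is shown in Figure 6.5.
      [Figure 6.5: 1G * 100 (MB per second); y-axis MegaBytes/Sec (130 to 200), x-axis Machines (5 to 12)]
      6.3
      MapReduce was measured on 100G of data; 1.
      6.3.1
      The 100G input is generated with the Hadoop randomwrite example, configured as in Listing 6.6. Key and Value sizes range from 10 to 1000, the total is 100G, and each Map writes 1G. With Key-Value pairs of about 1KB this is about 100M pairs. The command is shown in Listing 6.7.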
    •
        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
          <property>
            <name>test.randomwrite.min_key</name>
            <value>10</value>
          </property>
          <property>
            <name>test.randomwrite.max_key</name>
            <value>1000</value>
          </property>
          <property>
            <name>test.randomwrite.min_value</name>
            <value>10</value>
          </property>
          <property>
            <name>test.randomwrite.max_value</name>
            <value>1000</value>
          </property>
          <property>
            <name>test.randomwriter.bytes_per_map</name>
            <value>1000000000</value>
          </property>
          <property>
            <name>test.randomwrite.total_bytes</name>
            <value>100000000000</value>
          </property>
        </configuration>

      Listing 6.6: 100G (randomwriter.conf)

        $ ./bin/hadoop jar hadoop-0.16.4-examples.jar randomwriter -conf randomwriter.conf random

      Listing 6.7: 100G

      The results for 100G are shown in Figures 6.8 and 6.9.
    • [Figure 6.8: 100G (seconds); y-axis Sec (1000 to 9000), x-axis Machines (3 to 12)]
      [Figure 6.9: 100G (MB per second); y-axis MegaBytes/Sec (10 to 60), x-axis Machines (3 to 12)]
    • 6.3.2
      The command is shown in Listing 6.10.

        $ ./bin/hadoop jar hadoop-0.16.4-examples.jar sort random radom-sort

      Listing 6.10

      [Figure 6.11: 100G (seconds); y-axis Sec (100 to 900), x-axis Machines (3 to 12)]
      [Figure 6.12: 100G (MB per second); y-axis MegaBytes/Sec (100 to 550), x-axis Machines (3 to 12)]
      The results are shown in Figures 6.11 and 6.12.
    • 6.4
      Hadoop: 12; 1. Hadoop: 3. Hadoop.
    • Chapter 7
      This report compared Hadoop with GFS and Google MapReduce, examined the Hadoop source code, and measured Hadoop on 12 machines.
    • References
      [1] Bonnie++ project homepage. http://www.coker.com.au/bonnie++/.
      [2] Commons logging. http://commons.apache.org/logging/.
      [3] Hadoop dfs user guide. http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html.
      [4] Hadoop project homepage. http://hadoop.apache.org/core/.
      [5] Hadoop streaming documentation. http://hadoop.apache.org/core/docs/current/streaming.html.
      [6] Jetty. http://www.mortbay.org/jetty-6/.
      [7] Kosmos filesystem. http://kosmosfs.sourceforge.net/.
      [8] Lucene project homepage. http://lucene.apache.org/.
      [9] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
      [10] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.