Hadoop and HDFS in CMRI

  China Mobile Research Institute
      WANG, Xu [wangxu(at)chinamobile.com]
Transcript of "20100130 hardoop apache"

  1. Hadoop and HDFS in CMRI
     China Mobile Research Institute
     WANG, Xu [wangxu(at)chinamobile.com]
  2. Apache Hadoop
     - http://hadoop.apache.org/
     - Open source clone of Google's infrastructure
     - De facto standard MapReduce framework; has won the TeraSort benchmark several times
     - Uses: search engines, data mining, log analysis
     - Clusters scale up to 4,000 nodes
     - Users: Yahoo!, Facebook, Cloudera, Baidu, Alibaba, China Mobile
     (Slide footer: internal material, keep confidential)
  3. Hadoop in China 2009
     Beijing, Nov 15, 2009
  4. Subprojects of Hadoop (layer diagram; Google counterparts in parentheses)
     - Data warehouse: Pig, Hive
     - K-V store / column-based DB: HBase (BigTable)
     - Distributed lock: ZooKeeper (Chubby)
     - Hadoop MapReduce (Google MapReduce)
     - HDFS (Google GFS)
     - Hadoop Common (io, ipc, ...)
     - Avro: serialized data format and RPC (ipc)
     - JVM
     (Layer labels in the diagram: Basic Platform, Core)
  5. HDFS Principles
     - Follows the Google GFS paper
     - For big data storage and processing
     - Write once, read frequently
       - Modification is not permitted; append will be supported soon
       - Reads take priority over writes
     - Runs on commodity PCs
       - Hardware may fail at any time
       - Multiple replicas for data safety
  6. HDFS Architecture (figure)
  7. Data in HDFS NameNode’s Memory
     - Namespace info
       - FS hierarchical tree
       - Map(file, blocks)
     - DataNode map
       - Map(living datanode, blocks)
     - Blocks map
       - Map(block, file/datanodes)
     - Other runtime info
       - Locks held by clients
       - Blocks being processed (replication, invalid, ...)
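
The maps listed on this slide can be pictured with a minimal Python sketch. The class and method names (`NameNodeMemory`, `add_block`, `block_report`, `locate`) are hypothetical and are not HDFS's actual Java classes; the sketch only mirrors the three maps named above.

```python
from collections import defaultdict


class NameNodeMemory:
    """Toy model of the three maps on the slide (all names are hypothetical)."""

    def __init__(self):
        self.file_blocks = defaultdict(list)     # Map(file, blocks)
        self.datanode_blocks = defaultdict(set)  # Map(living datanode, blocks)
        self.block_info = {}                     # Map(block, file/datanodes)

    def add_block(self, path, block_id):
        """Record that a file owns a new block."""
        self.file_blocks[path].append(block_id)
        self.block_info[block_id] = {"file": path, "datanodes": set()}

    def block_report(self, datanode, block_ids):
        """Record which known blocks a datanode says it stores."""
        self.datanode_blocks[datanode] = set(block_ids)
        for block_id in block_ids:
            if block_id in self.block_info:
                self.block_info[block_id]["datanodes"].add(datanode)

    def locate(self, path):
        """Return (block, sorted datanodes) pairs for a file, in block order."""
        return [(b, sorted(self.block_info[b]["datanodes"]))
                for b in self.file_blocks[path]]


nn = NameNodeMemory()
nn.add_block("/logs/a", "blk_1")
nn.add_block("/logs/a", "blk_2")
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
print(nn.locate("/logs/a"))
```

Keeping both directions (datanode to blocks, block to datanodes) lets a lookup by file and a reaction to a datanode failure each touch one map instead of scanning everything.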
  8. Persistence of NameNode Data
     - NameNode persistence
       - Namespace: FSImage & EditLog
       - Starting & shutdown
     - Secondary NameNode
       - Checkpoint (merges EditLog into FSImage)
       - Runs periodically (every hour by default)
     - Backup NameNode
       - Introduced in 0.21 (not released yet)
       - "Real-time Secondary NameNode" or remote EditLog
     - The DataNode map and other info exist only in NameNode memory
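
The checkpoint step (merging the EditLog into the FSImage) can be sketched as a replay of logged operations over a snapshot. This is an illustration only: the dict-based image, the tuple-based log, and the operation names `create`, `delete`, and `rename` are assumptions made for the example, not the real on-disk formats.

```python
def checkpoint(fsimage, editlog):
    """Replay the edit log over a copy of the image.

    Returns (new_image, empty_log): after a checkpoint the log starts over.
    """
    image = dict(fsimage)
    for op, path, value in editlog:
        if op == "create":
            image[path] = value             # value: list of block ids
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[value] = image.pop(path)  # value: destination path
        else:
            raise ValueError(f"unknown op: {op}")
    return image, []


fsimage = {"/a": ["blk_1"]}
editlog = [
    ("create", "/b", ["blk_2"]),
    ("rename", "/a", "/c"),
    ("delete", "/b", None),
]
new_image, new_log = checkpoint(fsimage, editlog)
print(new_image, new_log)
```

Edits logged after the last checkpoint live only in the EditLog, which is why losing the log costs at most one checkpoint period of namespace changes.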
  9. High Availability Considerations
     - Availability in mainstream Hadoop
       - The NameNode is a single point of failure (SPOF); a NameNode failure may cause
         - Service interruption for minutes
         - Data loss for one checkpoint period (worst case)
     - Possible solution: DRBD + Linux-HA
       - Mature failover mechanism
       - Service interruption for minutes
       - Almost no data loss
     - Another solution: NameNode Cluster (NNC) extension
       - Continuous service
       - Almost no data loss
       - Requires modifying the code
       - Consistency vs. performance
  10. HDFS+NNC Architecture (figure)
  11. NNC Design
      - Master and slaves: 1:N
      - The master synchronizes the FSNamesystem to the slaves
      - ZooKeeper works as a registry; clients and datanodes can look up the namenode list from it
      - DFSClient can access multiple namenodes for read operations
      - So far, failover is controlled by Linux-HA, which gets namenode status info from ClientProtocol
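
A rough sketch of how a client might use such a registry follows. The `Registry` class is an in-memory stand-in for ZooKeeper, and `Client.pick`, the round-robin read policy, and the rule that writes always go to the master are assumptions made for illustration; the slides do not describe the actual client logic.

```python
import itertools


class Registry:
    """In-memory stand-in for the ZooKeeper registry of namenodes."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def namenodes(self):
        """Current namenode list, master first."""
        return [self.master] + self.slaves

    def promote(self, slave):
        """Failover hook: an external controller replaces the master with a slave."""
        self.slaves.remove(slave)
        self.master = slave


class Client:
    def __init__(self, registry):
        self.registry = registry
        self._counter = itertools.count()

    def pick(self, op):
        """Writes go to the master; reads rotate over all registered namenodes."""
        if op == "write":
            return self.registry.master
        nodes = self.registry.namenodes()
        return nodes[next(self._counter) % len(nodes)]


registry = Registry("nn0", ["nn1", "nn2"])
client = Client(registry)
reads = [client.pick("read") for _ in range(4)]
write_before = client.pick("write")
registry.promote("nn1")            # nn0 is considered failed and dropped
write_after = client.pick("write")
print(reads, write_before, write_after)
```

Because the client asks the registry on every call, a failover only has to update the registry for new requests to reach the promoted node.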
  12. Update Events
      NNU_NOP            // nothing to do
      NNU_BLK            // add or remove a block
      NNU_INODE          // add, remove, or modify an inode (add or remove file; new block allocation)
      NNU_NEWFILE        // start new file
      NNU_CLSFILE        // close new file
      NNU_MVRM           // move or remove file
      NNU_MKDIR          // mkdir
      NNU_LEASE          // add/update or release a lease
      NNU_LEASE_BATCH    // update batch of leases
      NNU_DNODEHB_BATCH  // batch of datanode heartbeats
      NNU_DNODEREG       // dnode register
      NNU_DNODEBLK       // block report
      NNU_DNODERM        // remove dnode
      NNU_BLKRECV        // block received message from datanode
      NNU_REPLICAMON     // replication monitor work
      NNU_WORLD          // bootstrap a slave node
      NNU_MASSIVE        // bootstrap a slave node
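
One way to picture how a slave could consume these events is a small dispatcher that applies each update to its own copy of the namespace. Only the event names come from the slide; the payload fields (`path`, `block`, `add`, `src`, `dst`), the `SlaveNamespace` class, and the handling of just five event types are assumptions made for this sketch.

```python
from enum import Enum, auto


class Update(Enum):
    """A subset of the update events named on the slide."""
    NNU_NOP = auto()
    NNU_MKDIR = auto()
    NNU_NEWFILE = auto()
    NNU_BLK = auto()
    NNU_MVRM = auto()


class SlaveNamespace:
    """Hypothetical slave-side state: directories and file -> block list."""

    def __init__(self):
        self.dirs = {"/"}
        self.files = {}

    def apply(self, event, **payload):
        """Apply one update event received from the master."""
        if event is Update.NNU_NOP:
            return
        if event is Update.NNU_MKDIR:
            self.dirs.add(payload["path"])
        elif event is Update.NNU_NEWFILE:
            self.files[payload["path"]] = []
        elif event is Update.NNU_BLK:
            blocks = self.files[payload["path"]]
            if payload["add"]:
                blocks.append(payload["block"])
            else:
                blocks.remove(payload["block"])
        elif event is Update.NNU_MVRM:
            blocks = self.files.pop(payload["src"])
            if payload.get("dst") is not None:   # a move keeps the blocks
                self.files[payload["dst"]] = blocks
        else:
            raise ValueError(f"unhandled event: {event}")


slave = SlaveNamespace()
log = [
    (Update.NNU_MKDIR, {"path": "/data"}),
    (Update.NNU_NEWFILE, {"path": "/data/f"}),
    (Update.NNU_BLK, {"path": "/data/f", "block": "blk_1", "add": True}),
    (Update.NNU_NOP, {}),
    (Update.NNU_MVRM, {"src": "/data/f", "dst": "/data/g"}),
]
for event, payload in log:
    slave.apply(event, **payload)
print(slave.dirs, slave.files)
```

Applying events strictly in the order the master emitted them is what keeps the slave's state identical to the master's in this sketch.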
  13. Performance and Other Issues
      - Overhead of NameNode synchronization
        - For typical file I/O and MapReduce (sort, wordcount), the NNC system reaches 95% of the performance of Hadoop without NNC
        - For metadata write-only operations (parallel touchz or mkdir), the NNC system reaches 15% of the performance of Hadoop without NNC
      - Performance gain from multiple NameNodes in read-only operations
        - Not observed so far, unfortunately
      - Other design issues
        - Why synchronize from the master to the slaves directly, without an additional delivery node? A delivery node might introduce another SPOF and make the problem more complex.
        - Why not use ZooKeeper for failover? Linux-HA works well, and we are also evaluating whether to switch to ZooKeeper. Any suggestions?
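
To put the two figures side by side, the short calculation below converts relative performance into a slowdown factor. It assumes that "performance" on the slide means throughput, which the slide does not state; the helper name `slowdown` is introduced here for illustration.

```python
def slowdown(relative_performance):
    """Time per operation relative to the baseline, given relative throughput."""
    if not 0 < relative_performance <= 1:
        raise ValueError("relative performance must be in (0, 1]")
    return 1.0 / relative_performance


typical = slowdown(0.95)    # file I/O and MapReduce workloads
metadata = slowdown(0.15)   # parallel touchz / mkdir
print(f"typical: {typical:.2f}x, metadata-only writes: {metadata:.2f}x")
```

Under that assumption, typical workloads take about 5% longer, while metadata-only writes take between six and seven times as long per operation.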
  14. Q&A