20100130 hadoop apache


Published in: Technology

  1. Hadoop and HDFS in CMRI
     China Mobile Research Institute
     WANG, Xu [wangxu(at)chinamobile.com]
  2. Apache Hadoop
     • http://hadoop.apache.org/
     • Open-source clone of Google's infrastructure
     • De facto standard MapReduce framework; has won the TeraSort benchmark several times
     • Uses: search engines, data mining, log analysis
     • Clusters scale up to 4,000 nodes
     • Users: Yahoo!, Facebook, Cloudera, Baidu, Alibaba, China Mobile
     Slide footer: Internal material, keep confidential
  3. Hadoop in China 2009 (Beijing, Nov 15, 2009)
  4. Subprojects of Hadoop (layered diagram; Google counterparts in parentheses)
     • Data warehouse: Pig, Hive
     • K-V store / column-based DB: HBase (BigTable)
     • Distributed lock: ZooKeeper (Chubby)
     • Basic platform: Hadoop MapReduce (Google MapReduce)
     • Core: HDFS (Google GFS)
     • Hadoop Common (io, ipc….)
     • Serialized data format & RPC: Avro (ipc)
     • JVM
  5. HDFS Principles
     • Follows the Google GFS paper
     • Built for big-data storage and processing
       • Write once, read many times
       • Modifying a file is not permitted; append will be supported soon
       • Reads take priority over writes
     • Runs on commodity PCs
       • Hardware may fail at any time
       • Multiple replicas keep data safe
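
The last two points (hardware may fail at any time, so each block is kept as multiple replicas) can be sketched roughly as follows. This is a toy model, not HDFS code: the function names, the replication factor of 3, and the purely random placement are all assumptions made for illustration, and real HDFS placement also considers rack topology.

```python
import random


def place_replicas(live_nodes, replication=3, rng=random):
    """Choose distinct datanodes to hold the replicas of one block (random placement is an assumption)."""
    if len(live_nodes) < replication:
        raise ValueError("not enough live datanodes for the requested replication")
    return set(rng.sample(sorted(live_nodes), replication))


def handle_failure(block_to_nodes, failed, live_nodes, replication=3, rng=random):
    """Drop a failed node and top every affected block back up to `replication` copies."""
    live = set(live_nodes) - {failed}
    for holders in block_to_nodes.values():
        holders.discard(failed)
        missing = replication - len(holders)
        if missing > 0:
            candidates = sorted(live - holders)
            holders.update(rng.sample(candidates, min(missing, len(candidates))))
    return block_to_nodes


# Hypothetical five-node cluster holding four blocks.
nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}
blocks = {f"blk_{i}": place_replicas(nodes) for i in range(4)}
blocks = handle_failure(blocks, failed="dn2", live_nodes=nodes)
```

After `dn2` fails, every block is again held by three distinct surviving nodes, which is the property the slide relies on for data safety.
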
  6. HDFS Architecture
  7. Data in HDFS: the NameNode's memory
     • Namespace info
       • Hierarchical file-system tree
       • Map(file, blocks)
     • DataNode map
       • Map(live datanode, blocks)
     • Blocks map
       • Map(block, file/datanodes)
     • Other runtime info
       • Locks held by clients
       • Blocks being processed (replication, invalidation…)
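
To make the three maps concrete, here is a minimal sketch of how they might relate to each other. The class and method names are hypothetical and do not correspond to the real NameNode classes; the point is only that the file-to-block mapping is namespace state, while the datanode and block-location maps are filled in from what datanodes report.

```python
from collections import defaultdict


class NameNodeMemory:
    """Toy model of the maps listed on the slide (all names are illustrative)."""

    def __init__(self):
        self.file_to_blocks = defaultdict(list)     # Map(file, blocks)
        self.datanode_to_blocks = defaultdict(set)  # Map(live datanode, blocks)
        self.block_to_file = {}                     # Map(block, file)
        self.block_to_datanodes = defaultdict(set)  # Map(block, datanodes)

    def allocate_block(self, path, block_id):
        """Namespace change: append a new block to a file."""
        self.file_to_blocks[path].append(block_id)
        self.block_to_file[block_id] = path

    def block_report(self, datanode, block_ids):
        """A datanode reports the blocks it stores; rebuild its entries from scratch."""
        for old in self.datanode_to_blocks.pop(datanode, set()):
            self.block_to_datanodes[old].discard(datanode)
        for block_id in block_ids:
            if block_id in self.block_to_file:  # ignore blocks that no file owns
                self.datanode_to_blocks[datanode].add(block_id)
                self.block_to_datanodes[block_id].add(datanode)

    def locate(self, path):
        """Return [(block_id, sorted datanodes)] for a file, in block order."""
        return [(b, sorted(self.block_to_datanodes[b])) for b in self.file_to_blocks[path]]


nn = NameNodeMemory()
nn.allocate_block("/logs/a", "blk_1")
nn.allocate_block("/logs/a", "blk_2")
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_2", "blk_9"])  # blk_9 belongs to no file and is ignored
located = nn.locate("/logs/a")
```
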
  8. Persistence of NameNode data
     • NameNode persistence
       • Namespace: FSImage & EditLog
       • Starting & shutdown
     • Secondary NameNode
       • Checkpoint (merges the EditLog into the FSImage)
       • Runs periodically (every 1 hour by default)
     • Backup NameNode
       • Introduced in 0.21 (not released yet)
       • A "real-time Secondary NameNode", or a remote EditLog
     • The DataNode map and other info exist only in NameNode memory
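
The checkpoint step (merging the EditLog into the FSImage) can be illustrated with a small sketch. The data layout here is an assumption chosen for brevity: the image is a dict from path to entry type and the log is a list of tuples, which is not the real on-disk format.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict (path -> "dir" | "file")."""
    op, *args = edit
    if op == "mkdir":
        namespace[args[0]] = "dir"
    elif op == "create":
        namespace[args[0]] = "file"
    elif op == "delete":
        namespace.pop(args[0], None)
    elif op == "rename":
        src, dst = args
        namespace[dst] = namespace.pop(src)
    else:
        raise ValueError(f"unknown edit: {op}")


def checkpoint(fsimage, editlog):
    """Merge the edit log into a new image; return (new_image, empty_log)."""
    new_image = dict(fsimage)
    for edit in editlog:
        apply_edit(new_image, edit)
    return new_image, []


def recover(fsimage, editlog):
    """Startup: load the image, then replay whatever edits follow it."""
    namespace, _ = checkpoint(fsimage, editlog)
    return namespace


image = {"/": "dir"}
log = [
    ("mkdir", "/data"),
    ("create", "/data/a.txt"),
    ("rename", "/data/a.txt", "/data/b.txt"),
]
new_image, new_log = checkpoint(image, log)
```

Recovering from the old image plus the full log gives the same namespace as recovering from the new image with an empty log. Recovering from a checkpoint alone, without the edits written after it, yields only the state as of that checkpoint.
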
  9. High Availability Considerations
     • Availability in mainstream Hadoop
       • The NameNode is a single point of failure (SPOF); a NameNode failure may cause
         • Service interruption for minutes
         • Data loss for one checkpoint period (worst case)
     • Possible solution: DRBD + Linux-HA
       • Mature failover mechanism
       • Service interruption for minutes
       • Almost no data loss
     • Another solution: NameNode Cluster extension
       • Continuous service
       • Almost no data loss
       • Requires modifying the code
       • Consistency vs. performance
  10. HDFS+NNC Architecture
  11. NNC Design
     • Master & slave: 1:N
     • The master synchronizes the FSNamesystem to the slaves
     • ZooKeeper works as a registry; clients and datanodes can look up the namenode list from it
     • DFSClient can access multiple namenodes for read operations
     • So far, failover is controlled by Linux-HA, which gets namenode status info from ClientProtocol
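
The registry lookup and read fan-out described above might look roughly like this from the client side. Everything here is a hypothetical stand-in: `Registry` is a plain in-memory object rather than a ZooKeeper client, routing writes to the master and spreading reads round-robin are assumptions, and `promote` only models the outcome of a failover that the slides say Linux-HA drives.

```python
import itertools


class Registry:
    """Stand-in for the ZooKeeper registry: it only stores the namenode list."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def promote(self, slave):
        """Record a failover result: the given slave becomes the new master."""
        self.slaves.remove(slave)
        self.master = slave


class Client:
    """Hypothetical client that looks up namenodes on every request."""

    def __init__(self, registry):
        self.registry = registry
        self._counter = itertools.count()

    def pick_for_write(self):
        # Assumption: metadata writes always go to the master.
        return self.registry.master

    def pick_for_read(self):
        # Assumption: reads rotate over the master and all slaves.
        nodes = [self.registry.master] + self.registry.slaves
        return nodes[next(self._counter) % len(nodes)]


registry = Registry("nn0", ["nn1", "nn2"])
client = Client(registry)
reads = [client.pick_for_read() for _ in range(3)]
write_before = client.pick_for_write()
registry.promote("nn1")  # nn0 failed; nn1 takes over
write_after = client.pick_for_write()
```
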
  12. Update Events
     • NNU_NOP // nothing to do
     • NNU_BLK // add or remove a block
     • NNU_INODE // add, remove, or modify an inode (add or remove file; new block allocation)
     • NNU_NEWFILE // start new file
     • NNU_CLSFILE // close new file
     • NNU_MVRM // move or remove file
     • NNU_MKDIR // mkdir
     • NNU_LEASE // add/update or release a lease
     • NNU_LEASE_BATCH // update batch of leases
     • NNU_DNODEHB_BATCH // batch of datanode heartbeats
     • NNU_DNODEREG // dnode register
     • NNU_DNODEBLK // block report
     • NNU_DNODERM // remove dnode
     • NNU_BLKRECV // block received message from datanode
     • NNU_REPLICAMON // replication monitor work
     • NNU_WORLD // bootstrap a slave node
     • NNU_MASSIVE // bootstrap a slave node
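
As an illustration of how such events could keep a slave in step with the master, the sketch below models a small subset of them. Only the event names come from the slide; the numeric values, the `(event, path)` payload, the `Slave`/`Master` classes, and the synchronous forwarding are assumptions for the example.

```python
from enum import Enum, auto


class Update(Enum):
    """Subset of the update events; the values are arbitrary."""
    NNU_NOP = auto()      # nothing to do
    NNU_MKDIR = auto()    # mkdir
    NNU_NEWFILE = auto()  # start new file
    NNU_CLSFILE = auto()  # close new file


class Slave:
    """Holds a replica of the namespace: path -> "dir" | "open" | "closed"."""

    def __init__(self):
        self.namespace = {}

    def apply(self, event, path):
        if event is Update.NNU_NOP:
            return
        if event is Update.NNU_MKDIR:
            self.namespace[path] = "dir"
        elif event is Update.NNU_NEWFILE:
            self.namespace[path] = "open"
        elif event is Update.NNU_CLSFILE:
            if self.namespace.get(path) != "open":
                raise ValueError(f"cannot close {path}: not an open file")
            self.namespace[path] = "closed"
        else:
            raise ValueError(f"unhandled event: {event}")


class Master(Slave):
    """Applies each update locally, then forwards it to every slave (1:N)."""

    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def update(self, event, path):
        self.apply(event, path)
        for slave in self.slaves:
            slave.apply(event, path)


slaves = [Slave(), Slave()]
master = Master(slaves)
master.update(Update.NNU_MKDIR, "/d")
master.update(Update.NNU_NEWFILE, "/d/f")
master.update(Update.NNU_NOP, "")
master.update(Update.NNU_CLSFILE, "/d/f")
```
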
  13. Performance and Other Issues
     • The overhead of NameNode synchronization
       • For typical file I/O and MapReduce (sort, wordcount), the NNC system reaches 95% of the performance of Hadoop without NNC
       • For metadata write-only operations (parallel touchz or mkdir), the NNC system reaches 15% of the performance of Hadoop without NNC
     • Performance gain from multiple NameNodes in read-only operations
       • None observed so far, unfortunately
     • Other design issues
       • Why sync from the master to the slaves directly, without an additional delivery node? That may introduce another SPOF and make the problem more complex.
       • Why not use ZooKeeper for failover? Linux-HA works well, and we are also evaluating whether to change to ZK. Any suggestions?
  14. Q&A