Hadoop at a glance



"An Elephan can't jump. But can carry heavy load".
Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations: Amazon.com, Apple, eBay, IBM, ImageShack, LinkedIn, Microsoft, Twitter, The New York Times...

Published in: Technology


  1. HDFS at a glance
     Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
     Instructor: Professor Lothar Piepmayer
  2. Agenda
     1. Design of HDFS
     2.1. HDFS Concepts – Blocks
     2.2. HDFS Concepts – Namenode and Datanode
     3.1. Dataflow – Anatomy of a file read
     3.2. Dataflow – Anatomy of a file write
     3.3. Dataflow – Coherency model
     4. Parallel copying
     5. Demo – Command line
  3. The Design of HDFS
     Very large distributed file system: up to 10K nodes, 1 billion files, 100 PB
     Streaming data access: write once, read many times
     Commodity hardware: files are replicated to handle hardware failure; detect failures and recover from them
  4. Worst fit with
     Low-latency data access
     Lots of small files
     Multiple writers, arbitrary file modifications
  5. HDFS Blocks
     Normal filesystem blocks are a few kilobytes
     HDFS has a large block size: default 64 MB, typically 128 MB
     Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block
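The block arithmetic above can be sketched in a few lines of Python (an illustration, not Hadoop code — function name and sizes are ours):

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes of the HDFS blocks a file occupies.

    The last block may be partial: a file smaller than one block
    uses only as much storage as it actually needs.
    """
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])

# A 150 MB file with 64 MB blocks: two full blocks plus one 22 MB block.
print(split_into_blocks(150))  # [64, 64, 22]
```

The 22 MB tail block consumes 22 MB of datanode storage, not a full 64 MB, which is the point the slide makes about partial blocks.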
  6. HDFS Blocks
     A file is stored in blocks on various nodes in a Hadoop cluster
     HDFS creates several replicas of the data blocks
     Each data block is replicated to multiple nodes across the cluster
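Replication can be sketched with a toy placement function. This is only a round-robin illustration with made-up names — real HDFS placement is rack-aware and considers node load:

```python
import itertools

def place_replicas(blocks, nodes, replication=3):
    """Toy placement: assign each block to `replication` nodes,
    cycling through the cluster round-robin."""
    placement = {}
    ring = itertools.cycle(nodes)
    for block in blocks:
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

cluster = ["node1", "node2", "node3", "node4"]
print(place_replicas(["blk_1", "blk_2"], cluster))
```

With the default replication factor of 3, losing any single node still leaves two live copies of every block, which is how HDFS tolerates commodity-hardware failure.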
  7. HDFS Blocks
     Dhruba Borthakur – Design and Evolution of the Apache Hadoop File System (HDFS.pdf)
  8. Why are blocks in HDFS so large?
     To minimize the cost of seeks: with large blocks, seek time becomes a small fraction of transfer time, so the effective read rate approaches the raw disk transfer rate
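The reasoning behind the large block size is simple arithmetic. With assumed figures (ours, not from the slides) of a 10 ms seek and a 100 MB/s transfer rate, keeping seek overhead to about 1% of transfer time implies blocks of roughly 100 MB:

```python
# Assumed disk characteristics (illustrative, not measured values).
seek_time_s = 0.010          # average seek time: 10 ms
transfer_rate_mb_s = 100     # sustained transfer rate: 100 MB/s
target_overhead = 0.01       # seek should cost ~1% of transfer time

# We want: seek_time = target_overhead * (block_size / transfer_rate)
# Solving for block_size:
block_size_mb = seek_time_s * transfer_rate_mb_s / target_overhead
print(round(block_size_mb))  # ~100 MB, close to the 64/128 MB defaults
```

This is why HDFS defaults sit in the 64–128 MB range rather than the few kilobytes typical of local filesystems.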
  9. Benefits of the block abstraction
     A file can be larger than any single disk in the network
     Simplifies the storage subsystem
     Provides fault tolerance and availability
  10. Namenode & Datanodes
  11. Namenode & Datanodes
      Namenode (master)
      – manages the filesystem namespace
      – maintains the filesystem tree and metadata for all the files and directories in the tree
      Datanodes (slaves)
      – store data in the local file system
      – periodically report back to the namenode with lists of all existing blocks
      Clients communicate with both the namenode and the datanodes
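The division of labor can be modeled as two small data structures (a sketch with hypothetical names, not Hadoop's internal representation): the namenode persists the namespace, while block locations are rebuilt in memory from the datanodes' periodic block reports:

```python
# Namenode's persistent view: file path -> ordered list of block IDs.
namespace = {"/logs/app.log": ["blk_1", "blk_2"]}

# What the datanodes report back: datanode -> blocks it holds locally.
block_reports = {
    "datanode1": ["blk_1", "blk_2"],
    "datanode2": ["blk_1"],
    "datanode3": ["blk_2"],
}

# Invert the reports to answer "which datanodes hold block X?"
locations = {}
for node, blocks in block_reports.items():
    for block in blocks:
        locations.setdefault(block, []).append(node)

print(sorted(locations["blk_1"]))  # ['datanode1', 'datanode2']
```

Keeping block locations out of persistent namenode state is what lets datanodes come and go: the map is simply reconstructed from fresh reports.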
  12. Anatomy of a File Read
  13. Anatomy of a File Read
      Benefits:
      – avoids a bottleneck at the namenode, since clients read data directly from datanodes
      – supports multiple concurrent clients
  14. Writing in HDFS
      (diagram labels: Namenode, Datanode, Block)
  15. Writing in HDFS
      Exceptions on node failure:
      – the pipeline is closed; the partial block and the address of the failed node are removed
      – the namenode arranges a new datanode
  16. Coherency Model
      Data being written is not guaranteed to be visible to readers while copying is in progress
      Use sync() to force visibility
      Call sync() at suitable points in applications
  17. Parallel copying in HDFS
      Transfer data between clusters:
        % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
      Implemented as a MapReduce job, with one file per map
      Each map takes at least 256 MB of data
      The default maximum is 20 maps per node
      Copying between different Hadoop versions is only supported via the webhdfs protocol:
        % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
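The two sizing rules on this slide — at least 256 MB per map, at most 20 maps per node — combine into a simple map-count estimate. This is a sketch of just those two rules; real distcp also considers file boundaries and explicit -m overrides:

```python
def distcp_map_count(total_mb, nodes, min_mb_per_map=256, max_maps_per_node=20):
    """Estimate distcp's map count from the slide's two rules:
    give each map at least 256 MB, capped at 20 maps per node."""
    maps_by_size = max(1, total_mb // min_mb_per_map)
    cluster_cap = nodes * max_maps_per_node
    return min(maps_by_size, cluster_cap)

# Copying 10 GB across a 3-node cluster: 10240 // 256 = 40 maps,
# under the 60-map cluster cap.
print(distcp_map_count(10_240, nodes=3))  # 40
```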
  18. Setup
      Cluster with 3 nodes:
      – 4 GB RAM
      – 2 CPUs @ 2.0 GHz+
      – 100 GB HDD
      Using VMware on 3 different servers
      Network: 100 Mbps
      Operating system: Ubuntu 11.04 (Windows: not tested)
  19. Setup Guide – Single Node
      Requirements: Java runtime, ssh
      http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
      Configuration files:
      – /etc/hadoop/core-site.xml
      – /etc/hadoop/hdfs-site.xml
  20. Cluster
      Configuration files:
      – /etc/hadoop/masters
      – /etc/hadoop/slaves
      http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
  21. Command Line
      Similar to *nix:
      – hadoop fs -ls /
      – hadoop fs -mkdir /test
      – hadoop fs -rmr /test
      – hadoop fs -cp /1 /2
      – hadoop fs -copyFromLocal /3 hdfs://localhost/
      Namenode-specific:
      – hadoop namenode -format
      – start-all.sh
  22. Command Line
      Sorting: the standard method to test a cluster
      – TeraGen: generate dummy data
      – TeraSort: sort
      – TeraValidate: validate the sort result
      Command line:
        hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted
  23. Benchmark Results
      2 nodes, 1 GB data: 0:03:38
      3 nodes, 1 GB data: 0:03:13
      2 nodes, 10 GB data: 0:38:07
      3 nodes, 10 GB data: 0:31:28
      The virtual machines' hard disks are the bottleneck
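Converting the slide's own 10 GB times to seconds makes the scaling story concrete: adding a third node buys about a 1.21x speedup, well short of the ideal 1.5x, consistent with the disk bottleneck noted above:

```python
# 10 GB TeraSort times from the benchmark slide, in seconds.
t_2_nodes = 38 * 60 + 7    # 0:38:07 -> 2287 s
t_3_nodes = 31 * 60 + 28   # 0:31:28 -> 1888 s

speedup = t_2_nodes / t_3_nodes
print(round(speedup, 2))   # 1.21 (ideal 2->3 node scaling would be 1.5)
```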
  24. Who wins…?
  25. References
      Hadoop: The Definitive Guide