Hadoop at a glance

"An Elephan can't jump. But can carry heavy load".
Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations: Amazon.com, Apple, eBay, IBM, ImageShack, LinkedIn, Microsoft, Twitter, The New York Times...

  1. HDFS at a glance
     Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
     Instructor: Professor Lothar Piepmayer
  2. Agenda
     1. Design of HDFS
     2.1. HDFS Concepts – Blocks
     2.2. HDFS Concepts – Namenode and Datanode
     3.1. Dataflow – Anatomy of a file read
     3.2. Dataflow – Anatomy of a file write
     3.3. Dataflow – Coherency model
     4. Parallel copying
     5. Demo – Command line
  3. The Design of HDFS
     Very large distributed file system: up to 10K nodes, 1 billion files, 100 PB
     Streaming data access: write once, read many times
     Commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from
  4. Where HDFS is a poor fit
     Low-latency data access
     Lots of small files
     Multiple writers or arbitrary file modifications
  5. HDFS Blocks
     Blocks in a normal filesystem are a few kilobytes.
     HDFS uses a large block size: 64 MB by default, 128 MB in typical deployments.
     Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of storage.
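     To make the block-size point concrete, here is a minimal sketch against the Java FileSystem API, assuming the Hadoop 1.x API that matches the r1.0.3 docs linked later in the deck (the class name is illustrative, not from the deck):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Prints the nominal block size and actual length of an HDFS file.
        // A file smaller than one block still reports the full block size,
        // but only its real length occupies storage on the datanodes.
        public class BlockSizeCheck {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            System.out.println("block size:  " + status.getBlockSize() + " bytes");
            System.out.println("file length: " + status.getLen() + " bytes");
          }
        }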
  6. HDFS Blocks
     A file is stored as blocks on various nodes in a Hadoop cluster.
     HDFS creates several replicas of the data blocks: each and every block is replicated to multiple nodes across the cluster.
  7. HDFS Blocks
     (Block placement diagram from Dhruba Borthakur, "Design and Evolution of the Apache Hadoop File System")
  8. Why are blocks in HDFS so large?
     To minimize the cost of seeks: when a block is large enough, the time spent transferring it dominates the time spent seeking to its start, so reads proceed at close to the raw disk transfer rate. For example, with a 10 ms seek time and a 100 MB/s transfer rate, a 100 MB block keeps seek overhead around 1%.
  9. Benefits of the block abstraction
     A file can be larger than any single disk in the network
     Simplifies the storage subsystem
     Provides fault tolerance and availability
  10. Namenode & Datanodes
  11. Namenode & Datanodes
      Namenode (master)
      – manages the filesystem namespace
      – maintains the filesystem tree and the metadata for all the files and directories in the tree
      Datanodes (slaves)
      – store data in the local filesystem
      – periodically report back to the namenode with lists of all the blocks they hold
      Clients communicate with both the namenode and the datanodes.
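      This division of labor is visible from a client. The following minimal sketch (same Hadoop 1.x API assumption; class name illustrative) asks the namenode where each block of a file is replicated; the answer comes entirely from the namenode's metadata, and no datanode is contacted:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Lists the datanodes holding each block of the given file.
        public class WhereAreMyBlocks {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path(args[0]));
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
              System.out.println(loc); // offset, length, and replica hosts
            }
          }
        }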
  12. Anatomy of a File Read
  13. Anatomy of a File Read
      Benefits:
      - no "bottleneck" at the namenode, since the data streams directly from the datanodes
      - many clients can read at the same time
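      A minimal read sketch under the same assumptions (Hadoop 1.x API; the file path comes in as an argument): open() fetches block locations from the namenode, and the bytes then stream directly from the datanodes.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        // Copies an HDFS file to stdout. Only the lookup of block
        // locations goes through the namenode; reads go to datanodes.
        public class HdfsCat {
          public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path(args[0]));
            try {
              IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
              IOUtils.closeStream(in);
            }
          }
        }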
  14. Writing in HDFS
      (Diagram: client, namenode, datanode pipeline, blocks)
  15. Writing in HDFS
      Failure handling: if a datanode in the write pipeline fails, the pipeline is closed, the partial block and the address of the failed node are removed, and the namenode arranges a new datanode to take its place.
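      A matching write sketch (same assumptions; the output path is an argument and the payload is a placeholder): create() registers the file with the namenode, while the data itself flows through the datanode pipeline that the failure handling above protects.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Writes a small file to HDFS. The namenode allocates blocks;
        // each block is pushed through a pipeline of datanodes, one per replica.
        public class HdfsWrite {
          public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path(args[0]));
            try {
              out.writeUTF("hello HDFS"); // placeholder payload
            } finally {
              out.close(); // completes the file at the namenode
            }
          }
        }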
  16. Coherency Model
      Data being written is not guaranteed to be visible to other readers while the copy is in progress.
      Call sync() to force the data written so far to become visible.
      Applications that need read-after-write visibility must apply this themselves.
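      A minimal sketch of the sync() call the slide refers to, on the Hadoop 1.x FSDataOutputStream (in later Hadoop releases this call was superseded by hflush()):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Demonstrates sync(): once it returns, the data written so far
        // is visible to readers even though the file is still open.
        public class SyncDemo {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path(args[0]));
            out.writeUTF("first record");
            out.sync(); // force visibility before the block completes
            // ... continue writing; readers can already see "first record"
            out.close();
          }
        }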
  17. Parallel copying in HDFS
      Transfers data between clusters:
        % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
      Implemented as a MapReduce job with one file per map; each map copies at least 256 MB, and the default maximum is 20 maps per node.
      Copying between different HDFS versions is supported only via the webhdfs protocol:
        % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
  18. Setup
      Cluster with 3 nodes, each with:
        4 GB RAM
        2 CPUs @ 2.0 GHz+
        100 GB HDD
      Running as VMware virtual machines on 3 different servers
      Network: 100 Mbps
      Operating system: Ubuntu 11.04 (Windows: not tested)
  19. Setup Guide – Single Node
      Prerequisites: a Java runtime and ssh
      Guide: http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
      Configuration files: /etc/hadoop/core-site.xml and /etc/hadoop/hdfs-site.xml
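      As a quick cross-check that those configuration files are being picked up, a tiny sketch (Hadoop 1.x property name fs.default.name; class name illustrative):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        // Prints the default filesystem URI resolved from core-site.xml.
        // For a single-node setup this is typically an hdfs://localhost
        // address (an assumption, per the single-node guide above).
        public class ConfCheck {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            System.out.println("fs.default.name = " + conf.get("fs.default.name"));
            System.out.println("default FS: " + FileSystem.get(conf).getUri());
          }
        }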
  20. Cluster
      Configuration files: /etc/hadoop/masters and /etc/hadoop/slaves
      Guide: http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
  21. Command Line
      Similar to *nix:
        hadoop fs -ls /
        hadoop fs -mkdir /test
        hadoop fs -rmr /test
        hadoop fs -cp /1 /2
        hadoop fs -copyFromLocal /3 hdfs://localhost/
      Namenode-specific:
        hadoop namenode -format
        start-all.sh
  22. Command Line
      Sorting is the standard method to test a cluster:
        TeraGen: generates dummy data
        TeraSort: sorts it
        TeraValidate: validates the sort result
      Command line:
        hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted
  23. Benchmark Results
      2 nodes, 1 GB data:  0:03:38
      3 nodes, 1 GB data:  0:03:13
      2 nodes, 10 GB data: 0:38:07
      3 nodes, 10 GB data: 0:31:28
      The virtual machines' hard disks are the bottleneck.
  24. Who wins…?
  25. References
      Tom White, Hadoop: The Definitive Guide (O'Reilly)
