Hadoop at a glance
"An Elephan can't jump. But can carry heavy load". …

"An Elephan can't jump. But can carry heavy load".
Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations: Amazon.com, Apple, eBay, IBM, ImageShack, LinkedIn, Microsoft, Twitter, The New York Times...

Published in: Technology
Transcript

  • 1. HDFS at a glance
    Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
    Instructor: Professor Lothar Piepmayer
  • 2. Agenda
    1. Design of HDFS
    2.1. HDFS Concepts – Blocks
    2.2. HDFS Concepts – Namenode and datanode
    3.1. Dataflow – Anatomy of a file read
    3.2. Dataflow – Anatomy of a file write
    3.3. Dataflow – Coherency model
    4. Parallel copying
    5. Demo – Command line
  • 3. The Design of HDFS
    – Very large distributed file system: up to 10K nodes, 1 billion files, 100 PB
    – Streaming data access: write once, read many times
    – Commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from
  • 4. What HDFS is a poor fit for
    – Low-latency data access
    – Lots of small files
    – Multiple writers and arbitrary file modifications
  • 5. HDFS Blocks
    – Blocks in a normal filesystem are a few kilobytes
    – HDFS uses a large block size: 64 MB by default, 128 MB is typical
    – Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of storage
  • 6. HDFS Blocks
    – A file is stored as blocks on various nodes in a Hadoop cluster
    – HDFS creates several replicas of each data block
    – Each data block is replicated to multiple nodes across the cluster
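The placement of those replicas can be sketched with a toy model of Hadoop's default rack-aware policy (replication factor 3): the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. All node and rack names below are made up for illustration.

```python
import random

def place_replicas(writer, topology, rng=random.Random(0)):
    """Toy rack-aware placement. topology: {rack: [nodes]};
    returns three (rack, node) choices for one block."""
    # First replica: on the node that is writing the block.
    writer_rack = next(r for r, ns in topology.items() if writer in ns)
    first = (writer_rack, writer)
    # Second replica: some node in a different rack.
    remote_rack = rng.choice([r for r in topology if r != writer_rack])
    second_node = rng.choice(topology[remote_rack])
    # Third replica: a different node in that same remote rack.
    third_node = rng.choice(
        [n for n in topology[remote_rack] if n != second_node])
    return [first, (remote_rack, second_node), (remote_rack, third_node)]

topology = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
print(place_replicas("node1", topology))
```

This gives one off-rack copy (surviving a whole-rack failure) while keeping two of the three replicas in a single rack to limit cross-rack write traffic.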
  • 7. HDFS Blocks
    Source: Dhruba Borthakur – Design and Evolution of the Apache Hadoop File System (HDFS.pdf)
  • 8. Why are blocks in HDFS so large?
    – To minimize the cost of seeks: with a large block, the time to transfer the block greatly exceeds the time to seek to its start, so the effective read speed approaches the disk transfer rate
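The seek argument can be made concrete with some rough, assumed disk numbers (roughly 10 ms per seek and 100 MB/s sustained transfer — illustrative figures, not from the slides):

```python
SEEK_MS = 10.0             # assumed average disk seek time
TRANSFER_MB_PER_S = 100.0  # assumed sustained disk transfer rate

def seek_overhead(block_mb):
    """Fraction of total block-read time spent seeking."""
    transfer_ms = block_mb / TRANSFER_MB_PER_S * 1000.0
    return SEEK_MS / (SEEK_MS + transfer_ms)

# A 4 KB filesystem block vs. the HDFS default and typical sizes.
for mb in (0.004, 64, 128):
    print(f"{mb} MB block: {seek_overhead(mb):.1%} of read time is seek")
```

With 4 KB blocks nearly all the time goes to seeking; at 64 MB the seek is under 2% of the read, and at 128 MB under 1% — which is what "transfer time ≈ disk transfer rate" means in practice.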
  • 9. Benefits of the block abstraction
    – A file can be larger than any single disk in the network
    – It simplifies the storage subsystem
    – It provides fault tolerance and availability
  • 10. Namenode & Datanodes
  • 11. Namenode & Datanodes
    – Namenode (master): manages the filesystem namespace; maintains the filesystem tree and the metadata for all files and directories in the tree
    – Datanodes (slaves): store data in the local filesystem; periodically report back to the namenode with lists of all blocks they hold
    – Clients communicate with both the namenode and the datanodes
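The namenode/datanode split amounts to a simple data structure: the namenode keeps only metadata (path → ordered block ids, block id → hosting datanodes), never the bytes themselves. A minimal sketch, with all names invented for illustration:

```python
class ToyNamenode:
    """Metadata-only view of a filesystem, as the namenode holds it."""

    def __init__(self):
        self.files = {}            # path -> [block_id, ...] in file order
        self.block_locations = {}  # block_id -> [datanode, ...]

    def add_file(self, path, block_ids, locations):
        self.files[path] = list(block_ids)
        for bid, nodes in zip(block_ids, locations):
            self.block_locations[bid] = list(nodes)

    def get_block_locations(self, path):
        """What a client asks for before reading: each block and its hosts."""
        return [(bid, self.block_locations[bid]) for bid in self.files[path]]

nn = ToyNamenode()
nn.add_file("/logs/a.log", ["blk_1", "blk_2"],
            [["dn1", "dn2", "dn3"], ["dn2", "dn3", "dn4"]])
print(nn.get_block_locations("/logs/a.log"))
```

Because this map is small relative to the data, one namenode can describe petabytes of storage — which is also why HDFS handles lots of small files poorly: every file costs namenode memory regardless of size.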
  • 12. Anatomy of a File Read
  • 13. Anatomy of a File Read
    Benefits:
    – Avoids a bottleneck, since file data is read directly from the datanodes rather than through the namenode
    – Supports many concurrent clients
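The read path above can be sketched in a few lines: the client asks the namenode only for block locations, then pulls the bytes directly from datanodes, so file data never flows through the namenode. The dictionaries standing in for namenode answers and datanode storage are purely illustrative.

```python
def read_file(namenode_locations, datanode_store):
    """namenode_locations: [(block_id, [datanodes])] in file order;
    datanode_store: {datanode: {block_id: bytes}}. Returns file bytes."""
    chunks = []
    for block_id, nodes in namenode_locations:
        node = nodes[0]  # a real client picks the closest replica
        chunks.append(datanode_store[node][block_id])
    return b"".join(chunks)

store = {"dn1": {"blk_1": b"hello "}, "dn2": {"blk_2": b"world"}}
locs = [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]
print(read_file(locs, store))  # b'hello world'
```

Since each client streams from whichever datanodes hold its blocks, reads from many clients spread across the cluster instead of converging on one server.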
  • 14. Writing in HDFS
    (Diagram: namenode, datanodes, blocks)
  • 15. Writing in HDFS
    Exceptions — when a datanode fails during a write:
    – The pipeline is closed; the partial block and the address of the failed node are removed
    – The namenode arranges a new datanode to hold the replica
  • 16. Coherency Model
    – Content being written to a file is not guaranteed to be visible to readers
    – Use sync() to force the written data to become visible
    – Applications should call sync() at suitable points
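The visibility rule can be illustrated with a toy simulation (purely a sketch — real HDFS clients call `sync()` on `FSDataOutputStream`, renamed `hflush()` in later APIs): written bytes sit in a buffer that readers cannot see until the writer explicitly syncs.

```python
class ToyHdfsWriter:
    """Simulates HDFS write visibility: buffered until sync()."""

    def __init__(self):
        self._buffered = b""  # written by the client, not yet published
        self._visible = b""   # what concurrent readers can observe

    def write(self, data):
        self._buffered += data          # readers cannot see this yet

    def sync(self):
        self._visible = self._buffered  # flush: now visible to readers

    def visible_to_readers(self):
        return self._visible

w = ToyHdfsWriter()
w.write(b"record 1\n")
print(w.visible_to_readers())  # b'' -- nothing visible before sync
w.sync()
print(w.visible_to_readers())  # b'record 1\n'
```

The trade-off the slide hints at: syncing after every record makes data durable and visible but adds overhead, so applications sync at record or batch boundaries that make sense for them.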
  • 17. Parallel copying in HDFS
    – Transfer data between clusters:
      % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
    – Implemented as a MapReduce job, with one file per map
    – Each map copies at least 256 MB
    – The default maximum is 20 maps per node
    – Copying between different HDFS versions is only supported via the webhdfs protocol:
      % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
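The two figures on the slide determine roughly how many maps a distcp job runs. A back-of-the-envelope sketch (simplified — real distcp also respects per-file boundaries, so small files skew the split):

```python
MIN_MB_PER_MAP = 256    # each map copies at least this much
MAX_MAPS_PER_NODE = 20  # default cap per cluster node

def distcp_maps(total_mb, nodes):
    """Rough estimate of distcp's map count for a copy of total_mb."""
    by_size = max(1, total_mb // MIN_MB_PER_MAP)
    return min(by_size, nodes * MAX_MAPS_PER_NODE)

print(distcp_maps(10 * 1024, 3))   # 10 GB on 3 nodes  -> 40 maps
print(distcp_maps(100 * 1024, 3))  # 100 GB on 3 nodes -> capped at 60
```

So small copies are limited by the 256 MB floor, while large copies hit the per-node map cap and the cluster size becomes the limit.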
  • 18. Setup
    Cluster with 3 nodes:
    – 4 GB RAM
    – 2 CPUs @ 2.0 GHz+
    – 100 GB HDD
    Using VMware on 3 different servers
    Network: 100 Mbps
    Operating system: Ubuntu 11.04 (Windows: not tested)
  • 19. Setup Guide – Single Node
    – Requirements: Java runtime, ssh
    – http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
    – Configuration files: /etc/hadoop/core-site.xml, /etc/hadoop/hdfs-site.xml
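A typical minimal single-node configuration for the two files the slide names looks like the following (the port 9000 and replication factor 1 are conventional choices from the Hadoop 1.x single-node guide, not values given in the slides):

```xml
<!-- /etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- /etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Replication is set to 1 because a single node cannot host multiple replicas of a block.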
  • 20. Cluster
    – /etc/hadoop/masters
    – /etc/hadoop/slaves
    – http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
  • 21. Command Line
    Similar to *nix:
    – hadoop fs -ls /
    – hadoop fs -mkdir /test
    – hadoop fs -rmr /test
    – hadoop fs -cp /1 /2
    – hadoop fs -copyFromLocal /3 hdfs://localhost/
    Namenode-specific:
    – hadoop namenode -format
    – start-all.sh
  • 22. Command Line
    Sorting is the standard method to test a cluster:
    – TeraGen: generate dummy data
    – TeraSort: sort
    – TeraValidate: validate the sort result
    Command line:
    – hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted
  • 23. Benchmark Results
    – 2 nodes, 1 GB data: 0:03:38
    – 3 nodes, 1 GB data: 0:03:13
    – 2 nodes, 10 GB data: 0:38:07
    – 3 nodes, 10 GB data: 0:31:28
    The virtual machines' hard disks are the bottleneck
  • 24. Who wins…?
  • 25. References
    Hadoop: The Definitive Guide
