Hadoop at a glance

"An elephant can't jump, but it can carry a heavy load."
Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations: Amazon.com, Apple, eBay, IBM, ImageShack, LinkedIn, Microsoft, Twitter, The New York Times...




Usage Rights

© All Rights Reserved


    Hadoop at a glance: Presentation Transcript

    • HDFS at a glance
      Students: An Du, Tan Tran, Toan Do, Vinh Nguyen
      Instructor: Professor Lothar Piepmayer
    • Agenda
      1. Design of HDFS
      2.1 HDFS Concepts - Blocks
      2.2 HDFS Concepts - Namenode and datanodes
      3.1 Dataflow - Anatomy of a file read
      3.2 Dataflow - Anatomy of a file write
      3.3 Dataflow - Coherency model
      4. Parallel copying
      5. Demo - Command line
    • The Design of HDFS
      • Very large distributed file system: up to 10K nodes, 1 billion files, 100 PB
      • Streaming data access: write once, read many times
      • Commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from
    • Worst fit with
      • Low-latency data access
      • Lots of small files
      • Multiple writers, arbitrary file modifications
    • HDFS BlocksNormal Filesystem blocks are few kilobytesHDFS has Large block size  Default 64MB  Typical 128MBUnlike a file system for a single disk. A file in HDFS that is smaller than a single block does not occupy a full block
    • HDFS BlocksA file is stored in blocks on various nodes in hadoop cluster.HDFS creates several replication of the data blocksEach and every data block is replicated to multiple nodes across the cluster.
    • HDFS Blocks
      Dhruba Borthakur - Design and Evolution of the Apache Hadoop File System HDFS.pdf
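    • Why are blocks in HDFS so large?
      • To minimize the cost of seeks
      • => With a large enough block, transferring the data takes much longer than seeking to its start, so a file is read at close to the disk transfer rate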
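The seek-versus-transfer trade-off can be checked with a little arithmetic. The figures below (10 ms seek time, 100 MB/s transfer rate) are illustrative assumptions, not measurements from the slides, and `seek_overhead` is a hypothetical helper.

```python
def seek_overhead(block_size_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of the time to read one block that is spent seeking.

    seek_ms and transfer_mb_per_s are assumed, round-number disk figures.
    """
    transfer_ms = block_size_mb / transfer_mb_per_s * 1000.0
    return seek_ms / (seek_ms + transfer_ms)


# Compare a 4 KB block (typical local filesystem) with HDFS-sized blocks.
for size_mb in (0.004, 64, 128):
    share = seek_overhead(size_mb)
    print(f"{size_mb:>8} MB block: {share:.1%} of read time is seek")
```

Under these assumptions the seek dominates for kilobyte-sized blocks and shrinks to a percent or two of the read time at 64 MB and above.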
    • Benefits of the block abstraction
      • A file can be larger than any single disk in the network
      • Simplifies the storage subsystem
      • Provides fault tolerance and availability
    • Namenode & Datanodes
    • Namenode & Datanodes
      • Namenode (master)
        – manages the filesystem namespace
        – maintains the filesystem tree and metadata for all the files and directories in the tree
      • Datanodes (slaves)
        – store data in the local file system
        – periodically report back to the namenode with lists of all existing blocks
      • Clients communicate with both the namenode and the datanodes
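The division of labour above can be sketched as a toy in-memory model. The class `ToyNamenode` and its methods are hypothetical names for this example and do not mirror Hadoop's real classes or RPC interfaces; the point is only that the namenode keeps metadata and learns block locations from datanode reports.

```python
from collections import defaultdict


class ToyNamenode:
    """Holds metadata only: the namespace and where each block lives."""

    def __init__(self):
        self.files = {}                          # path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> datanode names

    def add_file(self, path, block_ids):
        """Record a file and the ordered blocks that make it up."""
        self.files[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Apply a datanode's periodic report of the blocks it stores."""
        # Forget what this datanode reported earlier, then record the new list.
        for nodes in self.block_locations.values():
            nodes.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return (block id, sorted datanodes) pairs for a file."""
        return [
            (block_id, sorted(self.block_locations[block_id]))
            for block_id in self.files[path]
        ]


namenode = ToyNamenode()
namenode.add_file("/logs/a.txt", ["blk_1", "blk_2"])
namenode.block_report("dn1", ["blk_1", "blk_2"])
namenode.block_report("dn2", ["blk_1"])
namenode.block_report("dn3", ["blk_2"])
print(namenode.locate("/logs/a.txt"))
```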
    • Anatomy of a File Read
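    • Anatomy of a File Read
      Benefits:
      - Avoids a bottleneck at the namenode, because clients read block data directly from the datanodes
      - Many clients can read concurrently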
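A minimal sketch of the read path, assuming a simplified model: the client asks the namenode only for block locations, then fetches the bytes from datanodes itself. All names and data here (`NAMENODE_METADATA`, `DATANODE_STORAGE`, `read_file`) are hypothetical. Real HDFS orders replicas by network proximity; this toy just tries them in list order.

```python
# Namenode side: metadata only (file -> ordered blocks and their replicas).
NAMENODE_METADATA = {
    "/data/report.txt": [
        ("blk_1", ["dn1", "dn2"]),
        ("blk_2", ["dn2", "dn3"]),
    ],
}

# Datanode side: the actual block contents.
DATANODE_STORAGE = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}


def get_block_locations(path):
    """Namenode call: returns locations, never file data."""
    return NAMENODE_METADATA[path]


def read_block(datanode, block_id):
    """Datanode call: returns the bytes of one block."""
    return DATANODE_STORAGE[datanode][block_id]


def read_file(path, unavailable=frozenset()):
    """Read a file block by block, skipping datanodes that are unavailable."""
    data = b""
    for block_id, datanodes in get_block_locations(path):
        for datanode in datanodes:
            if datanode not in unavailable:
                data += read_block(datanode, block_id)
                break
        else:
            raise IOError(f"no live replica for {block_id}")
    return data


print(read_file("/data/report.txt"))
print(read_file("/data/report.txt", unavailable={"dn2"}))
```

The namenode handles one small metadata request per file, while the bulk data traffic is spread across the datanodes.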
    • Writing in HDFSNamenodeDatanodeBlock
    • Writing in HDFSExeptions: Node failed Pipeline close, remove block and addr of failed node Namenode arrange new datanode
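    • Coherency Model
      • Data being written is not guaranteed to be visible to readers while copying
      • Use sync() to make the data written so far visible
      • Apply this in applications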
    • Parallel copying in HDFS
      • Transfers data between clusters:
        % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
      • Implemented as a MapReduce job; each file is copied by a single map
      • Each map copies at least 256 MB
      • The default maximum is 20 maps per node
      • Between clusters running different HDFS versions, use the webhdfs protocol:
        % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
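The two sizing rules above (at least 256 MB per map, at most 20 maps per node) can be combined into a rough estimate of how many maps a copy would get. This is a sketch of the arithmetic only; `distcp_map_count` is a hypothetical helper, and the actual distcp logic may differ between Hadoop versions.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def distcp_map_count(total_bytes, num_nodes,
                     min_bytes_per_map=256 * MB, max_maps_per_node=20):
    """Estimate the number of maps for a copy, per the rules on the slide.

    Hypothetical helper: at least one map, roughly one map per 256 MB,
    capped at 20 maps per cluster node.
    """
    maps_by_size = max(1, total_bytes // min_bytes_per_map)
    return min(maps_by_size, max_maps_per_node * num_nodes)


# 1 GB on a 3-node cluster: limited by data size.
print(distcp_map_count(1 * GB, num_nodes=3))

# 1000 GB on a 100-node cluster: limited by the per-node cap.
print(distcp_map_count(1000 * GB, num_nodes=100))
```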
    • SetupCluster with 03 nodes:  04 GB RAM  02 CPU @ 2.0Ghz+  100G HDDUsing vmWare on 03 different serversNetwork: 100MbpsOperating System: Ubuntu 11.04  Windows: Not tested
    • Setup Guide - Single Nodejava runtime ssh http://hadoop.apache.org/common/docs/r1.0.3/si ngle_node_setup.html/etc/hadoop/core-site.xml/etc/hadoop/hdfs-site.xml
    • Cluster/etc/hadoop/masters/etc/hadoop/slaveshttp://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
    • Command LineSimilar to *nix  hadoop fs -ls /  hadoop fs -mkdir /test  hadoop fs -rmr /test  hadoop fs -cp /1 /2  hadoop fs -copyFromLocal /3 hdfs://localhost/Namedone-specific:  hadoop namenode -format  start-all.sh
    • Command LineSorting: Standard method to test cluster  TeraGen: Generate dummy data  TeraSort: Sort  TeraValidate: Validate sort resultCommand Line:  hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
    • Benchmark Result
      • 2 nodes, 1 GB data: 0:03:38
      • 3 nodes, 1 GB data: 0:03:13
      • 2 nodes, 10 GB data: 0:38:07
      • 3 nodes, 10 GB data: 0:31:28
      • The virtual machines' hard disks are the bottleneck
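To put the timings in perspective, the sketch below computes the speedup from 2 to 3 nodes and compares it with the ideal 1.5x. The timings are taken from the slide; the helper names `to_seconds` and `speedup` are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Timings from the benchmark slide: data size -> {node count: elapsed time}.
RESULTS = {
    "1GB": {2: "0:03:38", 3: "0:03:13"},
    "10GB": {2: "0:38:07", 3: "0:31:28"},
}


def speedup(times, base_nodes=2, more_nodes=3):
    """How many times faster the larger cluster finished."""
    return to_seconds(times[base_nodes]) / to_seconds(times[more_nodes])


for size, times in RESULTS.items():
    print(f"{size}: {speedup(times):.2f}x measured, {3 / 2:.2f}x ideal")
```

Both runs come out well below the ideal 1.5x (about 1.13x and 1.21x), which is consistent with the slide's remark that the virtual machines' disks, not the node count, limit throughput.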
    • Who wins…?
    • References
      • Hadoop: The Definitive Guide