HDFS: Hadoop Distributed Filesystem


Published on

Presentation on 2013-06-27, Workshop on the future of Big Data management, discussing hadoop for a science audience that are either HPC/grid users or people suddenly discovering that their data is accruing towards PB.

The other talks were on GPFS, LustreFS and Ceph, so rather than just do beauty-contest slides, I decided to raise the question of "what is a filesystem?", whether the constraints imposed by the Unix metaphor and API are becoming limits on scale and parallelism (both technically and, for GPFS and Lustre Enterprise in cost).

Then: HDFS as the foundation for the Hadoop stack.

All the other FS talks did emphasise their Hadoop integration, with the Intel talk doing the most to assert performance improvements of LustreFS over HDFSv1 in dfsIO and Terasort (no gridmix?), which showed something important: Hadoop is the application that add DFS developers have to have a story for

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • This is weak chart as it doesn't separate storage scale from workload scale or split availability into it's own dimension. NFS has voluntary locks and can relax both write flushing and read consistency.Andrew FS (not shown: even more relaxed consistency)
  • HDFS is built on the concept that in a large cluster, disk failure is inevitable. The system is designed to change the impact of this from the beeping of pagers to a background hum.Akey part of the HDFS design: copying the blocks across machines means that the loss of a disk, server or even entire rack keeps the data available.
  • There's lots of checksumming going on of the data to pick up corruption -CRCs created at write time (and even verified end-to-end in a cross-machine write), scanned on read time.
  • Rack failures can generate a lot of replication traffic, as every block that was stored in the rack needs to be replicated at least once. The replication still has to follow the constraints of no more than one block copy per server. Much of this traffic is intra-rack, but every block which already has 2x replicas on a single rack will be replicated to another rack if possible.This is what scares ops team. Important: there is no specific notion of "mass failure" or "network partition". Here HDFS only sees that four machines have gone down.
  • HDFS: Hadoop Distributed Filesystem

    1. 1. © Hortonworks Inc. 2013 HDFS: Hadoop Distributed FS Steve Loughran, Hortonworks stevel@hortonworks.com @steveloughran Big Data workshop, June 2013
    2. 2. © Hortonworks Inc. What is a Filesystem? • Durable store of data: write, read, probe, delete • Metadata for organisation: locate, change • A conceptual model for humans • API for programmatic access to data & metadata Page 2
    3. 3. © Hortonworks Inc. Unix is the model & POSIX its API • directories and files: directories have children, files have data • API: open, read, seek, write, stat, rename, unlink, flock • Consistency: all sync()'d changes are globally visible • Atomic metadata operations: mv, rm, mkdir Page 3 Features are also constraints
    4. 4. © Hortonworks Inc Relax constraints  scale and availability Page 4 Scaleandavailability Distance from Unix Filesystem model & API ext4 NFS +cross host locks, sync HDFS +data locality (seek+write) locks S3 +cross-site append metadata ops consistency
    5. 5. © Hortonworks Inc. HDFS: goals • Store Petabytes of web data: logs, web snapshots • Keep per-node costs down to afford more nodes • Commodity x86 servers, storage (SAS), GbE LAN • Open source software: O(1) costs • O(1) operations • Accept failure as a background noise • Support computation in each server Written for location aware applications -MapReduce, Pregel/Giraph & others that can tolerate partial failures Page 5
    6. 6. © Hortonworks Inc. HDFS: what • Open Source: hadoop.apache.org • Java code on Linux, Unix, Windows • Replication rather than RAID –break file into blocks –store across servers and racks –delivers bandwidth and more locations for work • Background work handles failures –replication of under-replicated blocks –rebalancing of unbalanced servers –checksum verification of stored files Location data for work schedulers Page 6
    7. 7. © Hortonworks Inc. Page 7 DataNode DataNode DataNode DataNode ToR Switch DataNode DataNode DataNode DataNode ToR Switch Switch (Job Tracker) ToR Switch 2ary Name Node Name Node file block1 block2 block3 … Hadoop HDFS: replication is the key
    8. 8. Some of largest filesystems ever e.g. Facebook Prineville 45PB in 1 cluster, PUE 1.05
    9. 9. © Hortonworks Inc. And an emergent stack 9 Kafka
    10. 10. © Hortonworks Inc. HDFS: Enterprise Checlist •Auth: Kerberos •Snapshots (in HDFSv2) •NFS (in HDFSv2) •HA metadata server, uses "Zookeeper" Page 10
    11. 11. © Hortonworks Inc. HDFS: what next? •Exabytes in a single cluster. •Cross cluster, cross-site what constraints can be relaxed here? •More efficient cold-data storage •Evolving application needs. •Networking: 2x1GbE, 4x1GbE , 10GbE •Power budgets Page 11
    12. 12. © Hortonworks Inc. HDD  HDD+ SSD  SSD •New solid state storage technologies emerging •When will HDDs go away? •How to take advantage of mixed storage •SSD retains the HDD metaphor, hides the details (access bus, wear levelling) Page 12 We need to give the OS and DFS control of the storage, work with the application
    13. 13. © Hortonworks Inc Download and Play! http://hadoop.apache.org http://hortonworks.com Page 13
    14. 14. © Hortonworks Inc http://hortonworks.com/careers/ Page 14 P.S: we are hiring
    15. 15. © Hortonworks Inc. Replication handles data integrity • CRC32 checksum per 512 bytes • Verified across datanodes on write • Verified on all reads • Background verification of all blocks (~weekly) • Corrupt blocks re-replicated • All replicas corrupt  operations team intervention 2009: Yahoo! lost 19 out of 329M blocks on 20K servers –bugs now fixed Page 15
    16. 16. © Hortonworks Inc. Page 16 DataNode DataNode DataNode DataNode ToR Switch DataNode DataNode DataNode DataNode ToR Switch Switch (Job Tracker) ToR Switch 2ary Name Node Name Node file block1 block2 block3 … Rack/Switch failure