
HDFS: Hadoop Distributed Filesystem


Presentation given on 2013-06-27 at a workshop on the future of Big Data management, discussing Hadoop for a science audience who are either HPC/grid users or people suddenly discovering that their data is accruing towards petabytes.

The other talks were on GPFS, Lustre and Ceph, so rather than just give beauty-contest slides, I decided to raise the question "what is a filesystem?": whether the constraints imposed by the Unix metaphor and API are becoming limits on scale and parallelism, both technically and, for GPFS and Lustre Enterprise, in cost.

Then: HDFS as the foundation for the Hadoop stack.

All the other FS talks did emphasise their Hadoop integration, with the Intel talk doing the most to assert performance improvements of Lustre over HDFSv1 in DFSIO and TeraSort (no GridMix?). That showed something important: Hadoop is the application that all DFS developers have to have a story for.



  1. © Hortonworks Inc. 2013. HDFS: Hadoop Distributed FS. Steve Loughran, Hortonworks (@steveloughran). Big Data workshop, June 2013.
  2. What is a Filesystem?
     • Durable store of data: write, read, probe, delete
     • Metadata for organisation: locate, change
     • A conceptual model for humans
     • API for programmatic access to data & metadata
  3. Unix is the model & POSIX its API
     • Directories and files: directories have children, files have data
     • API: open, read, seek, write, stat, rename, unlink, flock
     • Consistency: all sync()'d changes are globally visible
     • Atomic metadata operations: mv, rm, mkdir
     Features are also constraints.
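The POSIX operations listed on this slide have direct analogues in most standard libraries. As a minimal illustration (a sketch in Java using java.nio, not the raw C calls), here is open/write/stat/seek/read/rename/unlink against a temporary directory:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class PosixStyleOps {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("fsdemo");      // mkdir
        Path file = dir.resolve("data.txt");

        // open + write
        Files.write(file, "hello filesystem".getBytes(StandardCharsets.UTF_8));

        // stat: probe the metadata
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        System.out.println("size=" + attrs.size());

        // seek + read: random access within the file
        try (SeekableByteChannel ch = Files.newByteChannel(file, StandardOpenOption.READ)) {
            ch.position(6);                                  // seek past "hello "
            ByteBuffer buf = ByteBuffer.allocate(10);
            ch.read(buf);
            System.out.println(new String(buf.array(), 0, buf.position(),
                    StandardCharsets.UTF_8));                // prints "filesystem"
        }

        // rename: an atomic metadata operation on a POSIX filesystem
        Path moved = dir.resolve("renamed.txt");
        Files.move(file, moved, StandardCopyOption.ATOMIC_MOVE);
        System.out.println("exists=" + Files.exists(moved));

        // unlink + rmdir
        Files.delete(moved);
        Files.delete(dir);
    }
}
```

It is exactly this set of operations, and the guarantees behind them (atomic rename, globally visible sync), that distributed filesystems must choose to keep or relax.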
  4. Relax constraints → scale and availability. [Chart: scale and availability plotted against distance from the Unix filesystem model & API.]
     • ext4: the local-filesystem baseline
     • NFS: relaxes cross-host locks, sync
     • HDFS: relaxes locks, seek+write; adds data locality
     • S3: relaxes append, metadata ops, consistency; adds cross-site access
  5. HDFS: goals
     • Store petabytes of web data: logs, web snapshots
     • Keep per-node costs down to afford more nodes
     • Commodity x86 servers, storage (SAS), GbE LAN
     • Open source software: O(1) costs
     • O(1) operations
     • Accept failure as background noise
     • Support computation in each server
     Written for location-aware applications (MapReduce, Pregel/Giraph and others) that can tolerate partial failures.
  6. HDFS: what
     • Open source: Java code on Linux, Unix, Windows
     • Replication rather than RAID:
       – break each file into blocks
       – store them across servers and racks
       – delivers bandwidth and more locations for work
     • Background work handles failures:
       – replication of under-replicated blocks
       – rebalancing of unbalanced servers
       – checksum verification of stored files
     Location data is available to work schedulers.
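The replication scheme above follows HDFS's default placement policy for a replication factor of 3: first replica on the writer's node, second on a node in a different rack, third on another node in the second replica's rack. The following is a toy simulation of that rule, not Hadoop's actual placement code:

```java
import java.util.*;

// Toy simulation of HDFS default replica placement (replication = 3):
// replica 1 on the writer's node, replica 2 on a different rack,
// replica 3 on a different node in the same rack as replica 2.
public class ReplicaPlacement {
    record Node(String name, String rack) {}

    static List<Node> place(Node writer, List<Node> cluster, Random rnd) {
        List<Node> replicas = new ArrayList<>();
        replicas.add(writer);                                   // replica 1: local

        List<Node> offRack = cluster.stream()
                .filter(n -> !n.rack().equals(writer.rack())).toList();
        Node second = offRack.get(rnd.nextInt(offRack.size())); // replica 2: remote rack
        replicas.add(second);

        List<Node> sameRack = cluster.stream()
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second)).toList();
        replicas.add(sameRack.get(rnd.nextInt(sameRack.size()))); // replica 3: beside replica 2
        return replicas;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("dn1", "rack1"), new Node("dn2", "rack1"),
                new Node("dn3", "rack2"), new Node("dn4", "rack2"));
        // dn1 writes a block; see where its three replicas land.
        for (Node n : place(cluster.get(0), cluster, new Random())) {
            System.out.println(n.name() + " @ " + n.rack());
        }
    }
}
```

The result always spans exactly two racks: losing any single node, or any single rack and its ToR switch, still leaves a live replica, while two of the three writes stay within one rack to limit cross-rack traffic.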
  7. Hadoop HDFS: replication is the key. [Diagram: a file split into blocks, replicated across DataNodes in multiple racks behind ToR switches, with the NameNode, secondary NameNode and Job Tracker tracking the cluster.]
  8. Some of the largest filesystems ever: e.g. Facebook Prineville, 45PB in a single cluster, PUE 1.05.
  9. And an emergent stack. [Diagram: the software stack built on HDFS, including Kafka.]
  10. HDFS: Enterprise Checklist
     • Auth: Kerberos
     • Snapshots (in HDFSv2)
     • NFS (in HDFSv2)
     • HA metadata server, using Apache ZooKeeper
  11. HDFS: what next?
     • Exabytes in a single cluster
     • Cross-cluster, cross-site: what constraints can be relaxed here?
     • More efficient cold-data storage
     • Evolving application needs
     • Networking: 2x1GbE, 4x1GbE, 10GbE
     • Power budgets
  12. HDD → HDD+SSD → SSD
     • New solid-state storage technologies emerging
     • When will HDDs go away?
     • How to take advantage of mixed storage?
     • SSD retains the HDD metaphor and hides the details (access bus, wear levelling)
     We need to give the OS and DFS control of the storage, working with the application.
  13. Download and Play!
  14. P.S.: we are hiring.
  15. Replication handles data integrity
     • CRC32 checksum per 512 bytes
     • Verified across datanodes on write
     • Verified on all reads
     • Background verification of all blocks (~weekly)
     • Corrupt blocks re-replicated
     • All replicas corrupt → operations team intervention
     2009: Yahoo! lost 19 out of 329M blocks on 20K servers; those bugs are now fixed.
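The per-chunk checksumming above can be sketched in a few lines: every 512-byte chunk of a block gets its own CRC32, so a corrupt read is not only detected but localised to a small range rather than invalidating the whole block. This sketch uses java.util.zip.CRC32 and a simplified in-memory layout, not HDFS's real checksum (.meta) files:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// HDFS-style block integrity checking: one CRC32 per 512-byte chunk.
public class ChunkChecksums {
    static final int CHUNK = 512;

    static List<Long> checksum(byte[] data) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK) {
            CRC32 crc = new CRC32();
            crc.update(data, off, Math.min(CHUNK, data.length - off));
            sums.add(crc.getValue());
        }
        return sums;
    }

    // Returns the index of the first corrupt chunk, or -1 if all chunks verify.
    static int verify(byte[] data, List<Long> expected) {
        List<Long> actual = checksum(data);
        for (int i = 0; i < expected.size(); i++) {
            if (!actual.get(i).equals(expected.get(i))) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] block = new byte[2048];                       // four 512-byte chunks
        for (int i = 0; i < block.length; i++) block[i] = (byte) i;
        List<Long> sums = checksum(block);

        System.out.println("clean: " + verify(block, sums));            // -1

        block[1500] ^= 0xFF;                                 // corrupt a byte in chunk 2
        System.out.println("corrupt chunk: " + verify(block, sums));    // 2
    }
}
```

On a mismatch, a DataNode reports the bad replica to the NameNode, which schedules re-replication from one of the surviving good copies.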
  16. Rack/Switch failure. [Diagram: the cluster from slide 7, with one rack's ToR switch failing; replicas on the other racks keep every block available.]