Dynamic Namespace Partitioning with Giraffa File System
Transcript

  • 1. Dynamic Namespace Partitioning with the Giraffa File System
     Konstantin V. Shvachko (Founder, Altoscale) and Plamen Jeliazkov (UC San Diego)
     June 14, 2012, Hadoop Summit 2012
  • 2. Introduction
     Plamen: recent graduate of UCSD; internship with the Hadoop Platform Team at eBay; wrote the Giraffa prototype.
     Konstantin: founder of Altoscale, with a primary focus on 1. the Altoscale Workbench, a Hadoop & HBase cluster on a public or a private cloud, and 2. Giraffa; Apache Hadoop PMC; HDFS scalability.
  • 3. Contents
     Background
     Motivation
     Architecture
     Main problems and solutions: Bootstrapping, Namespace Partitioning, Rename
  • 4. Giraffa
     Giraffa is a distributed, highly available file system.
     It utilizes features of HDFS and HBase.
     A new open source project in an experimental stage.
  • 5. Origin: Giraffe
     Latin: Giraffa camelopardalis (family Giraffidae, genus Giraffa, species Giraffa camelopardalis).
     Other languages: Arabic Zarafa, Spanish Jirafa, Bulgarian жирафа, Italian Giraffa.
     A favorite of my daughter.
     o As the Hadoop traditions require
  • 6. Apache Hadoop
     A reliable, scalable, high-performance distributed computing system.
     The Hadoop Distributed File System (HDFS): the reliable storage layer.
     MapReduce: a distributed computation framework with a simple computational model.
     Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers.
  • 7. The Design Principles
     Linear scalability: more nodes can do more work within the same time; applies to data size and compute resources.
     Reliability and availability: a drive fails once in 3 years, so its probability of failing today is about 1/1000; several drives fail on a cluster with thousands of drives.
     Move computation to data: minimize expensive data transfers.
     Sequential data processing: avoid random reads.
  • 8. Collocated Hadoop Clusters
     HDFS, a distributed file system: the NameNode handles namespace and block management; DataNodes are block replica containers.
     MapReduce, a framework for distributed computations: the JobTracker handles job scheduling, resource management, and lifecycle coordination; the TaskTracker is the task execution module.
     [Diagram: a NameNode and a JobTracker over worker nodes that each collocate a TaskTracker and a DataNode]
  • 9. Hadoop Distributed File System
     The namespace is a hierarchy of files and directories; files are divided into large blocks (128 MB).
     The namespace (metadata) is decoupled from data: namespace operations stay fast because they are not slowed down by data streaming, and data is streamed directly from the storage source.
     A single NameNode keeps the entire namespace in RAM.
     DataNodes store block replicas as files on local drives; blocks are replicated on 3 DataNodes for redundancy and availability.
     The HDFS client is the point of entry to HDFS: it contacts the NameNode for metadata and serves data to applications directly from DataNodes.
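     A minimal sketch of the client pattern described on this slide, using the standard Hadoop FileSystem API: the metadata call is answered by the NameNode from its in-RAM namespace, while the file contents are streamed directly from DataNodes. The cluster address and file path below are made-up examples.

     import java.io.BufferedReader;
     import java.io.InputStreamReader;

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FileStatus;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;

     public class HdfsReadExample {
       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed cluster address

         FileSystem fs = FileSystem.get(conf);
         Path file = new Path("/data/events/part-00000");    // hypothetical file

         // Metadata-only call: answered by the NameNode.
         FileStatus status = fs.getFileStatus(file);
         System.out.println("len=" + status.getLen() + " blockSize=" + status.getBlockSize());

         // Data call: bytes are streamed directly from the DataNodes holding the replicas.
         try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
           System.out.println(in.readLine());
         }
       }
     }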
  • 10. Scalability Limits
     The single-master architecture is a constraining resource.
     NameNode space limit: 100 million files and 200 million blocks with 64 GB of RAM, which restricts storage capacity to 20 PB. Small-file problem: the block-to-file ratio is shrinking.
     A single NameNode limits linear performance growth: a handful of clients can saturate the NameNode.
     MapReduce framework scalability limit: 40,000 clients, corresponding to a 4,000-node cluster with 10 MapReduce slots per node.
     "HDFS Scalability: The Limits to Growth", USENIX ;login:, 2010.
  • 11. Horizontal to Vertical Scaling
     Horizontal scaling is limited by the single-master architecture.
     Natural growth of compute power and storage density: clusters are composed of more powerful servers.
     Vertical scaling leads to shrinking cluster sizes, while storage capacity, compute power, and cost remain constant.
  • 12. Shrinking Clusters
     [Chart: resources per node (cores, disks, RAM) versus cluster size (number of nodes)]
     2008: Yahoo!, 4,000-node cluster
     2010: Facebook, 2,000 nodes
     2011: eBay, 1,000 nodes
     2013: clusters of 500 nodes
  • 13. Scalability for Hadoop 2.0
     HDFS Federation: independent NameNodes sharing a common pool of DataNodes. The cluster is a family of volumes with a shared block storage layer; the user sees volumes as isolated file systems. ViewFS: the client-side mount table.
     YARN, the new MapReduce framework: dynamic partitioning of cluster resources (no fixed slots) and separation of JobTracker functions: 1. job scheduling and resource allocation are centralized; 2. job monitoring and job life-cycle coordination are decentralized.
     o Coordination of different jobs is delegated to other nodes.
  • 14. Namespace Partitioning
     Static (Federation): directory sub-trees are statically assigned to disjoint volumes; relocating sub-trees without copying is challenging; scale x10: billions of files.
     Dynamic: files and directory sub-trees can move automatically between nodes based on their utilization or load-balancing requirements; files can be relocated without copying data blocks; scale x100: hundreds of billions of files.
     These are orthogonal, independent approaches: federation of distributed namespaces is possible.
  • 15. Distributed Namespaces Today
     Ceph: metadata stored on OSDs; MDSs cache metadata; dynamic partitioning.
     Lustre: plans to release a distributed namespace in 2.4; code ready.
     Colossus from Google (S. Quinlan and J. Dean): 100 million files per metadata server; hundreds of servers.
     VoldFS, CassandraFS, KTHFS (MySQL): prototypes.
  • 16. HBase Overview
     Table: big, sparse, loosely structured; a collection of rows sorted by row keys; rows can have an arbitrary number of columns.
     A table is split horizontally into regions: dynamic table partitioning! RegionServers serve regions to applications.
     Columns are grouped into column families: a vertical partition of tables.
     Distributed cache: regions are loaded into the nodes' RAM for real-time access to data.
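     As a companion to the overview above, a hedged sketch of creating a table with one column family through HBaseAdmin; HBase then splits the table into regions dynamically as it grows. The table name "Namespace" and the column family name "default" are illustrative assumptions.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.HColumnDescriptor;
     import org.apache.hadoop.hbase.HTableDescriptor;
     import org.apache.hadoop.hbase.client.HBaseAdmin;

     public class CreateTableSketch {
       public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         HBaseAdmin admin = new HBaseAdmin(conf);

         // One table with a single column family. No pre-splitting is needed:
         // HBase partitions the table into regions dynamically as it grows,
         // and RegionServers serve those regions to applications.
         HTableDescriptor desc = new HTableDescriptor("Namespace");  // assumed table name
         desc.addFamily(new HColumnDescriptor("default"));           // assumed column family
         admin.createTable(desc);
         admin.close();
       }
     }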
  • 17. HBase API
     HBaseAdmin: administrative functions. Create, delete, and list tables; create, update, and delete columns and column families; split, compact, flush.
     HTable: access table data.
       Result HTable.get(Get g)           // get cells of a row
       void HTable.put(Put p)             // update a row
       void HTable.put(List<Put> p)       // batch update of rows
       void HTable.delete(Delete d)       // delete cells/row
       ResultScanner getScanner(family)   // scan a column family
     Coprocessors: custom actions triggered by update events, like database triggers and stored procedures.
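     A minimal sketch of using the HTable calls listed above to read one row and to scan a column family. The table name, column family, and row key are assumptions for illustration, not Giraffa's actual schema.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.client.Get;
     import org.apache.hadoop.hbase.client.HTable;
     import org.apache.hadoop.hbase.client.Result;
     import org.apache.hadoop.hbase.client.ResultScanner;
     import org.apache.hadoop.hbase.util.Bytes;

     public class HBaseReadExample {
       public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         HTable table = new HTable(conf, "Namespace");
         byte[] family = Bytes.toBytes("default");

         // Get the cells of one row.
         Result row = table.get(new Get(Bytes.toBytes("/user/plamen/file1")));
         System.out.println("cells in row: " + row.size());

         // Scan the column family; each Result is one row of the table.
         ResultScanner scanner = table.getScanner(family);
         try {
           for (Result r : scanner) {
             System.out.println(Bytes.toString(r.getRow()));
           }
         } finally {
           scanner.close();
         }

         table.close();
       }
     }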
  • 18. HBase Architecture
     [Architecture diagram]
  • 19. Giraffa File System
     HDFS + HBase = Giraffa. Goal: build from existing building blocks and minimize changes to existing components.
     1. Store file and directory metadata in an HBase table: dynamic table partitioning into regions, cached in RegionServer RAM for fast access.
     2. Store file data in HDFS DataNodes: data streaming.
     3. Block management: handle communication with DataNodes and perform block replication.
  • 20. Giraffa Requirements
     More files and more data.
     Availability: load balancing of metadata traffic; the same data streaming speed to and from DataNodes; no SPOF.
     Cluster operability and management: the cost of running a larger cluster is the same as for a smaller one.

                            HDFS          Federated HDFS   Giraffa
     Space                  25 PB         120 PB           1 EB = 1000 PB
     Files + blocks         200 million   1 billion        100 billion
     Concurrent clients     40,000        100,000          1 million
  • 21. FAQ: Why HDFS and HBase?
     Building a new file system from scratch is really hard and takes years.
     HDFS: a reliable, scalable block storage with efficient data streaming and automatic data recovery.
     HBase: a natural metadata service with a distributed cache, dynamic partitioning, and automatic metadata recovery.
     Same breed, should be "compatible": HBase stores its data in HDFS, so data and metadata use the same storage.
  • 22. FAQ: Why not store whole files in HBase tables?
     It defeats the main concept of distributed file systems: the decoupling of data and metadata.
     Small files can be stored as rows, but row size is limited by region size, so large files must be split.
     It is technically possible to split any information into rows:
     o Log files: into events
     o Video files: into frames
     o Random bits: into 1K blobs with an offset as the row key
     But that is a different level of abstraction and requires data conversion.
  • 23. FAQ: My Dataset Is Only 1 PB. Do I Still Need Giraffa?
     Availability: distributed access to the namespace for many concurrent clients, not bottlenecked by the performance of a single NameNode.
     "Small files": the block-to-file ratio is decreasing (2 -> 1.5 -> 1.2); no need to aggregate small files into large archives.
  • 24. Building Blocks
     A single table called "Namespace" stores the file ID (row key) and file attributes:
     o name, replication, block size, permissions, times
     o the list of blocks
     o block locations
     Giraffa client: a FileSystem implementation that obtains metadata from HBase and exchanges data with DataNodes.
     Block manager: maintains the flat namespace of blocks; block allocation, replication, and removal; DataNode management; storage for the HBase table.
  • 25. Giraffa Architecture
     [Diagram: an application calls the NamespaceAgent, which reads the HBase Namespace table (rows of path, attrs, block[], DN[][], BM-node) backed by the Block Management Layer (BM servers) and DataNodes]
     1. The Giraffa client gets files and blocks from HBase.
     2. It may directly query the Block Manager.
     3. It streams data to or from DataNodes.
  • 26. Namespace Table
     Row keys identify files and directories as rows in the table. Different key definitions are possible depending on the locality requirement; the key definition is chosen when the file system is formatted; the full-path key is the default.
     Columns:
     o File attributes: local name, owner, group, permissions, access time, modification time, block size, replication, isDir, length
     o The list of blocks of a file: persisted in the table
     o The list of block locations for each block: not persisted, but discovered from block reports
     The directory table maps each directory-entry name to the corresponding row key.
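     A hedged sketch of how a file row in the "Namespace" table could be written, following the column layout on this slide. The column family "file", the qualifier names, and the value encodings are assumptions; Giraffa's actual schema may differ.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.client.HTable;
     import org.apache.hadoop.hbase.client.Put;
     import org.apache.hadoop.hbase.util.Bytes;

     public class NamespaceRowSketch {
       public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         HTable namespace = new HTable(conf, "Namespace");
         byte[] family = Bytes.toBytes("file");               // assumed column family

         // Full-path key: the default key definition.
         Put row = new Put(Bytes.toBytes("/user/plamen/data.txt"));

         // File attributes.
         row.add(family, Bytes.toBytes("name"), Bytes.toBytes("data.txt"));
         row.add(family, Bytes.toBytes("owner"), Bytes.toBytes("plamen"));
         row.add(family, Bytes.toBytes("replication"), Bytes.toBytes((short) 3));
         row.add(family, Bytes.toBytes("blockSize"), Bytes.toBytes(128L * 1024 * 1024));
         row.add(family, Bytes.toBytes("isDir"), Bytes.toBytes(false));

         // The block list is persisted; block locations are not (they come from
         // block reports). A real implementation would serialize block objects;
         // plain ids are used here for illustration only.
         row.add(family, Bytes.toBytes("blocks"), Bytes.toBytes("blk_123_001,blk_234_002"));

         namespace.put(row);
         namespace.close();
       }
     }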
  • 27. Giraffa Client
     GiraffaFileSystem implements FileSystem:
       fs.defaultFS = grfa:///
       fs.grfa.impl = o.a.giraffa.GiraffaFileSystem
     GiraffaClient extends DFSClient; the NamespaceAgent replaces the NameNode RPC.
     [Diagram: GiraffaFileSystem uses GiraffaClient (a DFSClient) with a NamespaceAgent in place of the NameNode connection, talking to the Namespace table and directly to DataNodes]
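     A sketch of how a client might obtain the Giraffa file system through the standard Hadoop FileSystem API, using the two configuration keys shown on this slide. The package name expands the slide's "o.a.giraffa" abbreviation, and the path is a made-up example.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;

     public class GiraffaClientSketch {
       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         conf.set("fs.defaultFS", "grfa:///");
         conf.set("fs.grfa.impl", "org.apache.giraffa.GiraffaFileSystem");  // assumed expansion

         // FileSystem.get resolves the grfa:// scheme to GiraffaFileSystem, whose
         // GiraffaClient talks to HBase for metadata and to DataNodes for data.
         FileSystem fs = FileSystem.get(conf);
         System.out.println(fs.exists(new Path("/user/plamen")));  // hypothetical path
       }
     }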
  • 28. Block Management
     Block Manager: block allocation, deletion, and replication.
     DataNode Manager: processes DataNode block reports and heartbeats; identifies lost nodes.
     Provides storage for the HBase table: a small file system to store HFiles.
     A BMServer is paired on the same node with a RegionServer: a distributed cluster of BMServers with mostly local communication between region and BM servers.
     The NameNode is an initial implementation of the BMServer; a Giraffa block is a single-block file with the same name as the block id.
  • 29. Three Problems
     Bootstrapping: HBase stores tables as files in HDFS.
     Namespace partitioning: retain locality.
     Atomic renames.
  • 30. Bootstrapping
     [Diagram: a Block Manager Server hosts two volumes]
     HBase volume (hbase/, giraffa/, .log, region1, region2): table layout, rare updates.
     Block volume (blk_123_001 on dn-1 dn-2 dn-3; blk_234_002 on dn-11 dn-12 dn-13; blk_345_003 on dn-101 dn-102 dn-103): a flat namespace of blocks.
  • 31. Locality of Reference
     Row keys define the sorting of files and directories in the table: the tree-structured namespace is flattened into a linear array, and the ordered list of files is self-partitioned into regions.
     Retain locality in the linearized structure: files in the same directory are adjacent in the table and belong to the same region, with some exclusions.
     Files of the same directory should be on the same node, to avoid jumping across regions for a simple "ls".
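     A small stand-alone illustration (not Giraffa code) of why full-path row keys keep a directory's entries adjacent: HBase stores rows sorted by key, so lexicographically sorting the paths models the table order. The paths are invented.

     import java.util.Arrays;

     public class FullPathKeyOrdering {
       public static void main(String[] args) {
         String[] rowKeys = {
             "/user/plamen/b.txt",
             "/user/konstantin/a.txt",
             "/user/plamen/a.txt",
             "/user/konstantin/",
             "/user/plamen/",
         };

         // HBase keeps rows sorted by key; lexicographic order groups every
         // directory with its children, so an "ls" touches one contiguous range.
         Arrays.sort(rowKeys);
         for (String key : rowKeys) {
           System.out.println(key);
         }
         // Output:
         // /user/konstantin/
         // /user/konstantin/a.txt
         // /user/plamen/
         // /user/plamen/a.txt
         // /user/plamen/b.txt
       }
     }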
  • 32. Partitioning Example 1
     Straightforward partitioning based on random hashing.
     [Diagram: namespace tree nodes hashed by id across regions T1-T4]
  • 33. Partitioning Example 2
     Partitioning based on lexicographic full-path ordering (the default).
     [Diagram: the namespace tree linearized by full path and split into regions T1-T4]
  • 34. Partitioning Example 3
     Partitioning based on fixed-depth neighborhoods.
     [Diagram: the namespace tree grouped into fixed-depth neighborhoods assigned to regions T1-T4]
  • 35. Atomic Rename
     Giraffa will implement atomic in-place rename; there is no support for an atomic file move from one directory to another.
     A move can then be implemented at the application level (sketched below):
     o Non-atomically move the target file from the source directory to a temporary file in the target directory.
     o Atomically rename the temporary file to its original name.
     o On failure, use a simple 3-step recovery procedure.
     Eventually implement atomic moves: PAXOS or simplified synchronization algorithms.
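     A hedged sketch of the application-level move outlined above, written against the generic Hadoop FileSystem API. The temporary-file naming and the recovery step are this sketch's own simplifications (copy first, delete the source last), not the presenters' exact procedure.

     import java.io.IOException;

     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.FileUtil;
     import org.apache.hadoop.fs.Path;

     public class ApplicationLevelMove {

       /** Move src into dstDir keeping its name, without an atomic cross-directory rename. */
       public static void move(FileSystem fs, Path src, Path dstDir) throws IOException {
         Path tmp = new Path(dstDir, "." + src.getName() + ".moving");  // assumed temp name
         Path dst = new Path(dstDir, src.getName());
         try {
           // Step 1: non-atomically bring the file into the target directory as a
           // temporary file (modeled as a copy; the source is removed only at the end).
           FileUtil.copy(fs, src, fs, tmp, false /* deleteSource */, fs.getConf());

           // Step 2: atomic in-place rename of the temporary file to its final name.
           if (!fs.rename(tmp, dst)) {
             throw new IOException("rename failed: " + tmp + " -> " + dst);
           }

           // Step 3: remove the original; the file now lives under its new path.
           fs.delete(src, false);
         } catch (IOException e) {
           // Simplified recovery: discard the temporary file; the original source
           // file is still intact, so the move can simply be retried.
           fs.delete(tmp, false);
           throw e;
         }
       }
     }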
  • 36. History
     (2008) Idea; study of distributed systems (AFS, Lustre, Ceph, PVFS, GPFS, Farsite, ...); partitioning of the namespace: 4 types of partitioning.
     (2009) Study of scalability limits; NameNode optimization.
     (2010) Design with Michael Stack; presentation at the HDFS contributors meeting.
     (2011) Plamen implements a proof of concept.
     (2012) Rewrite open sourced as an Apache Extras project: http://code.google.com/a/apache-extras.org/p/giraffa/
  • 37. Status
     Design stage; a one-node cluster is running; live demo with Plamen.
  • 38. Thank You!