© Hortonworks Inc. 2011 - 2015
Ozone: An Object Store in HDFS
Jitendra Nath Pandey
jitendra@hortonworks.com
jitendra@apache.org
@jnathp
About me
• Engineering Manager @Hortonworks
– Manager / Architect for HDFS at Hortonworks
• ASF Member
– PMC Member at Apache Hadoop
– PMC Member at Apache Ambari
– Committer at Apache Hive
Outline
• Introduction
• How Ozone fits in HDFS
• Ozone architecture
• Notes on implementation
• Q & A
Introduction
Storage in Hadoop Ecosystem
• File system
– HDFS
• SQL Database
– Hive on HDFS
• NoSQL
– HBase on HDFS
• Object Store
– We need Ozone!
Object Store vs File System
• Object stores offer a lot more scale
– Trillions of objects are common
– Simpler semantics make it possible
• Wide range of object sizes
– A few KB to several GB
Ozone: Introduction
• Ozone: An object store in Hadoop
– Durable
– Reliable
– Highly Scalable
– Trillions of objects
– Wide range of object sizes
– Secure
– Highly Available
– REST API as the primary access interface
Ozone: Introduction
• An Ozone URL
– http://hostname/myvolume/mybucket/mykey
• An S3 URL
– http://hostname/mybucket/mykey
• A Windows Azure URL
– http://hostname/myaccount/mybucket/mykey
Definitions
• Storage Volume
– A notion similar to an account
– Allows admin controls on usage of the object store, e.g. storage quotas
– Different from an account, because HDFS has no user management
– In private clouds, a ‘user’ is often managed outside the cluster
– Created and managed by admins only
• Bucket
– Consists of keys and objects
– Similar to a bucket in S3 or a container in Azure
– Has ACLs for access control
Definitions
• Key
– Unique in a bucket.
• Object
– The values stored in a bucket
– Each corresponds to a unique key within the bucket
REST API
• POST – creates volumes and buckets
– Only admins create volumes
– A bucket can be created by the owner of its volume
• PUT – updates volumes and buckets
– Only admins can change some volume settings
– Buckets have ACLs
• GET
– Lists volumes
– Lists buckets
REST API
• DELETE
– Deletes volumes
– Deletes buckets
• Keys
– PUT: creates a key
– GET: gets the data back
– Streaming read and write
– DELETE: removes the key
(A client sketch follows below.)
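
To make the API concrete, here is a minimal client sketch using Java's built-in HttpClient, following the URL scheme shown earlier. The hostname, the assumption that the volume and bucket already exist, and the absence of authentication headers are all simplifications for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OzoneRestSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://hostname"; // placeholder endpoint

        // PUT creates the key myvolume/mybucket/mykey (volume and bucket
        // are assumed to exist; auth headers are omitted in this sketch).
        HttpRequest put = HttpRequest.newBuilder()
            .uri(URI.create(base + "/myvolume/mybucket/mykey"))
            .PUT(HttpRequest.BodyPublishers.ofString("object data"))
            .build();
        client.send(put, HttpResponse.BodyHandlers.discarding());

        // GET streams the object data back.
        HttpRequest get = HttpRequest.newBuilder()
            .uri(URI.create(base + "/myvolume/mybucket/mykey"))
            .GET()
            .build();
        System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).body());

        // DELETE removes the key.
        HttpRequest del = HttpRequest.newBuilder()
            .uri(URI.create(base + "/myvolume/mybucket/mykey"))
            .DELETE()
            .build();
        client.send(del, HttpResponse.BodyHandlers.discarding());
    }
}
```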
Storing Buckets
• Buckets can grow to millions of objects and several terabytes
– Don’t fit in a single node
– Split into partitions or shards
• Bucket partitions and metadata are distributed and replicated
• Storage Container
– Stores multiple objects
– The unit of replication
– Consistent Replicas
Ozone in HDFS
Where does it fit?
HDFS Federation Extended
[Diagram: HDFS federation extended. Datanodes DN 1 .. DN m provide a common block storage layer holding multiple block pools (Pool 1 .. Pool n). The HDFS namespaces manage their own block pools, and Ozone's block pool management sits alongside them over the same storage.]
Impact on HDFS
• Ozone will reuse datanode storage
– Ozone uses its own block pools, so HDFS and Ozone can share datanodes
• Ozone will reuse the block pool management part of the namenode
– Includes heartbeats and block reports
• A storage container abstraction is added to the datanodes
– Co-exists with HDFS blocks on the datanodes
– New data pipeline
HDFS Scalability
• Scalability of the File System
– Support a billion files
– Namespace scalability
– Block-space scalability
• Namespace scalability is independent of Ozone
– Partial namespace on disk
– Parallel Effort (HDFS-8286)
• Block-space scalability
– Block space constitutes a big part of namenode metadata
– A block map on disk doesn’t work
– We hope to reuse lessons from Ozone’s “many small objects in a storage container” design to allow multiple blocks per storage container
Architecture
How it works
• URL
– http://hostname/myvolume/mybucket/mykey
• Simple Steps
– Full bucket name: ‘myvolume/mybucket’
– Find where bucket metadata is stored
– Fetch bucket metadata
– Check ACLs
– Find where the key is stored
– Read the data
How it works
• All data and metadata are stored in storage containers
– Each storage container is identified by a unique ID (think of a block ID in HDFS)
– A bucket name maps to a container ID
– A key maps to a container ID
• A container ID maps to datanodes (see the sketch below)
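
The two-level mapping suggests a read path like the following sketch; the interface names here are hypothetical stand-ins for the Storage Container Manager and a container client, not actual Ozone classes.

```java
// Hypothetical interfaces illustrating the two-level lookup:
// a name maps to a container ID (partitioning), and a container ID
// maps to the datanodes holding its replicas (container map).
interface StorageContainerManager {
    long lookupContainer(String name);                        // partition function
    java.util.List<String> locateContainer(long containerId); // replica datanodes
}

interface ContainerClient {
    byte[] read(String datanode, long containerId, String key);
}

class KeyReader {
    private final StorageContainerManager scm;
    private final ContainerClient containers;

    KeyReader(StorageContainerManager scm, ContainerClient containers) {
        this.scm = scm;
        this.containers = containers;
    }

    byte[] readKey(String volume, String bucket, String key) {
        String bucketName = volume + "/" + bucket;            // full bucket name
        long metaContainer = scm.lookupContainer(bucketName); // bucket metadata container
        String metaNode = scm.locateContainer(metaContainer).get(0);
        byte[] bucketMeta = containers.read(metaNode, metaContainer, bucketName);
        // ... check ACLs recorded in bucketMeta (omitted) ...
        long dataContainer = scm.lookupContainer(bucketName + "/" + key);
        String dataNode = scm.locateContainer(dataContainer).get(0);
        return containers.read(dataNode, dataContainer, key);
    }
}
```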
Components
[Diagram: a Storage Container Manager alongside several datanodes, each datanode hosting an Ozone Handler.]
New Components
• Storage Container Manager
– Maintains the location of each container (the container map)
– Collects heartbeats and container reports from datanodes
– Serves container locations upon request
– Stores key-partitioning metadata
• Ozone Handler
– A module hosted by datanodes
– Implements the Ozone REST API
– Connects to the Storage Container Manager for key partitioning and container lookup
– Connects to local or remote datanodes to read/write from/to containers
– Enforces authorization checks and administrative limits
Call Flow
[Diagram: the client issues a REST call to the Ozone Handler on one datanode; that handler consults the Storage Container Manager and reads the metadata container.]
Call Flow (continued)
[Diagram: continuing the call flow, the request is redirected to the datanode holding the data, where the data container is read.]
Implementation
Mapping a Key to a Container
• Keys need to be mapped to Container IDs
– Horizontal partitioning of the key space
• Partition function
– Hash partitioning
   - Minimal state to be stored
   - Better distribution, no hotspots
– Range partitioning
   - Sorted keys
   - Provides ordered listing
Hash Partitioning
• The key is hashed
– The hash value is mapped to a container ID
• Prefix matching
– The container ID is the longest matching prefix of the key’s hash
– The Storage Container Manager implements a prefix tree (a code sketch follows the figure below)
• Extendible hashing is needed
– Minimizes the number of keys to be re-hashed when a new container is added
– New containers are added by splitting an existing container
Prefix Matching for Hashes
[Diagram: a bitwise trie for bucket 0xab. Trie nodes branch on hash bits 0/1, containers (0xab000 .. 0xab005) sit at the leaves, and a key hash such as 0xab125 is routed to the container with the longest matching prefix.]
• The Storage Container Manager stores one such trie for each bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
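
A small sketch of the longest-prefix lookup over such a trie, assuming the hashed key is presented as a string of bits; the class names and container IDs are illustrative, not the actual Storage Container Manager code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative bitwise trie: containers sit at the leaves, and a hashed
// key is routed to the container matching the longest prefix of its bits.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>(); // '0' or '1'
    Long containerId;                                    // non-null at leaves
}

class BucketTrie {
    private final TrieNode root = new TrieNode();

    // Register a container for a given bit prefix (splitting a container
    // adds two longer prefixes below its old one).
    void addContainer(String bitPrefix, long containerId) {
        TrieNode node = root;
        for (char bit : bitPrefix.toCharArray()) {
            node = node.children.computeIfAbsent(bit, b -> new TrieNode());
        }
        node.containerId = containerId;
    }

    // Walk the hash bits until the trie ends; the last container seen on
    // the path is the longest matching prefix.
    long lookup(String hashBits) {
        TrieNode node = root;
        long container = -1;
        for (char bit : hashBits.toCharArray()) {
            node = node.children.get(bit);
            if (node == null) break;
            if (node.containerId != null) container = node.containerId;
        }
        return container;
    }
}
```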
Range Partitioning
• Range partitioning
– The container map maintains a range index tree for each bucket
– Each node of the tree corresponds to a key range
– Child nodes split the range of their parent node
– A lookup traverses down the tree to increasingly granular ranges until it reaches a leaf (a code sketch follows the figure below)
Range Index Tree
[Diagram: a range index tree for bucket 0xab. The root covers K1 – K20 and is split into K1 – K10 and K11 – K20, then into finer ranges (K1 – K5, K6 – K10, K11 – K13, K14 – K15, K16 – K20), with containers (0xab000 .. 0xab005) at the leaves. A lookup for key K15 descends to the K14 – K15 leaf.]
• The storage container map consists of an array of such trees, one for each bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
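
For range partitioning, the leaf-level lookup can be sketched with a sorted map from each range's start key to its container; this flattens the tree in the figure down to its leaves, which is enough to show the idea (key ordering is assumed lexicographic, e.g. zero-padded keys).

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative leaf-level range index for one bucket: each entry maps the
// start key of a leaf range to the container holding that range.
class RangeIndex {
    private final TreeMap<String, Long> leaves = new TreeMap<>();

    void addRange(String startKey, long containerId) {
        leaves.put(startKey, containerId);
    }

    // floorEntry finds the greatest start key <= the lookup key,
    // i.e. the leaf range containing the key.
    long lookup(String key) {
        Map.Entry<String, Long> e = leaves.floorEntry(key);
        return e == null ? -1 : e.getValue();
    }
}

// Illustrative use (container assignments assumed): a lookup of "K15"
// returns the container registered for the range starting at "K14".
// RangeIndex idx = new RangeIndex();
// idx.addRange("K01", 0xab000L); idx.addRange("K06", 0xab002L);
// idx.addRange("K11", 0xab001L); idx.addRange("K14", 0xab005L);
// idx.addRange("K16", 0xab003L);
// idx.lookup("K15"); // 0xab005
```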
Storage Container
• A storage unit in the datanode
– A generalization of the HDFS block
– ID, generation stamp, size
– Unit of replication
– Consistent replicas
• Container size
– 1 GB to 10 GB
– Container size affects the scale of the Storage Container Manager
– Larger containers take longer to replicate
– Container size is a system property, not a data property
Storage Container Requirements
• Stores a variety of data, which results in different requirements
• Metadata
– Individual units of data are very small (kilobytes)
– Atomic updates are important
– A get/put API is sufficient
• Object data
– The storage container needs to store object data with a wide range of sizes
– Must support streaming APIs to read/write individual objects
Storage Container Implementation
• Storage container prototype using RocksDB
– An embeddable key-value store
• Replication
– Need ability to replicate while data is being written
– RocksDB supports snapshots and incremental backups for replication
• A hybrid use of RocksDB (sketched below)
– Small objects: keys and objects stored in RocksDB
– Large objects: the object is stored in an individual file; RocksDB holds the key and file path
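
A minimal sketch of that hybrid scheme using the RocksDB Java binding; the size threshold, the file layout, and the "FILE:" pointer convention are assumptions for illustration (a real container would also need an unambiguous way to distinguish inline values from file pointers).

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the hybrid scheme: small objects live in RocksDB directly,
// large objects live in their own files with RocksDB holding key -> path.
class HybridContainer implements AutoCloseable {
    private static final int SMALL_OBJECT_LIMIT = 1 << 20; // assumed 1 MB cutoff

    private final RocksDB db;
    private final Path dataDir;

    HybridContainer(String dbPath, Path dataDir) throws RocksDBException {
        RocksDB.loadLibrary();
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), dbPath);
        this.dataDir = dataDir;
    }

    void put(String key, byte[] value) throws Exception {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        if (value.length <= SMALL_OBJECT_LIMIT) {
            db.put(k, value);                 // small object: stored inline
        } else {
            Path file = dataDir.resolve(Integer.toHexString(key.hashCode()));
            Files.write(file, value);         // large object: spilled to a file
            db.put(k, ("FILE:" + file).getBytes(StandardCharsets.UTF_8));
        }
    }

    byte[] get(String key) throws Exception {
        byte[] v = db.get(key.getBytes(StandardCharsets.UTF_8));
        if (v == null) return null;
        String s = new String(v, StandardCharsets.UTF_8);
        return s.startsWith("FILE:") ? Files.readAllBytes(Path.of(s.substring(5))) : v;
    }

    @Override public void close() { db.close(); }
}
```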
Storage Container Implementation
• Transactions for consistency and reliability
– The storage containers implement a few atomic and persistent operations, i.e. transactions. The container provides reliability guarantees for these operations.
– Commit: promotes an object being written to a finalized object. Once this operation succeeds, the container guarantees that the object is available for reading.
– Put: useful for small writes such as metadata writes.
– Delete: deletes the object. (A hypothetical interface sketch follows below.)
• A new data pipeline for storage containers
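
A hypothetical interface capturing the transaction guarantees above; Ozone's actual container API may differ.

```java
// Hypothetical container interface: each method is an atomic, persistent
// operation (a transaction) with the guarantees described above.
interface TransactionalContainer {
    // Stream bytes into an object that is not yet visible to readers.
    void write(String key, byte[] chunk);

    // Promote the in-flight object to a finalized object; once commit
    // returns, the container guarantees the object is readable.
    void commit(String key);

    // Atomic small write, e.g. for bucket or key metadata.
    void put(String key, byte[] value);

    // Remove the object.
    void delete(String key);
}
```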
Data Pipeline Consistency
• The HDFS consistency mechanism uses two pieces of block state
– Generation stamp
– Block length
• Storage containers use the following two
– Generation stamp
– Transaction ID
• A storage container must persist the last executed transaction.
• The transaction ID is allocated by the leader of the pipeline.
Data Pipeline Consistency
• Upon a restart, a datanode discards all uncommitted data for a storage container
– State is synchronized to the last committed transaction
• When comparing two replicas (see the sketch below)
– The replica with the latest generation stamp is honored
– With the same generation stamp, the replica with the latest transaction ID is honored
– Correctness argument: replicas with the same generation stamp and the same transaction ID must have been written in the same pipeline
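
The replica preference rule can be written down directly; this is a sketch of the rule stated above (with assumed field names), not actual HDFS or Ozone code.

```java
import java.util.Comparator;

// Sketch of replica comparison: prefer the later generation stamp, and
// break ties with the later transaction ID.
record Replica(long generationStamp, long lastTransactionId) {}

class ReplicaOrder {
    static final Comparator<Replica> PREFERENCE =
        Comparator.comparingLong(Replica::generationStamp)
                  .thenComparingLong(Replica::lastTransactionId);

    static Replica best(Replica a, Replica b) {
        return PREFERENCE.compare(a, b) >= 0 ? a : b;
    }
}
```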
Phased Development
• Phase 1
– Basic API
– Storage container machinery, reliability, replication
• Phase 2
– High availability
– Security
– Multipart upload
• Phase 3
– Caching to improve latency
– Object versioning
– Cross-geo replication
Team
• Anu Engineer
– aengineer@hortonworks.com
• Arpit Agarwal
– aagarwal@hortonworks.com
• Chris Nauroth
– cnauroth@hortonworks.com
• Jitendra Pandey
– jitendra@hortonworks.com
Special Thanks
• Sanjay Radia
• Enis Soztutar
• Suresh Srinivas
Thanks!

Editor's Notes

  • #6 (Object Store vs File System): HDFS as a storage system: a great file system, works fantastically for MapReduce, great adoption in enterprises.
  • #7 (Ozone: Introduction): Example: I need to store all my customer documents; a few million customers, each with a few thousand documents. No directory structure is needed. A REST API is the primary access mechanism, with simple access semantics, very large scale (billions of documents), and a wide range of object sizes. A file system forces you to think in terms of files and directories.
  • #21 (Components): Two important questions: What is the partitioning scheme? What does a storage container look like?