© Hortonworks Inc. 2011 - 2015
Ozone: An Object Store in HDFS
Jitendra Nath Pandey
jitendra@hortonworks.com
jitendra@apache.org
@jnathp
About me
• Engineering Manager @Hortonworks
– Manager / Architect for HDFS at Hortonworks
• ASF Member
– PMC Member at Apache Hadoop
– PMC Member at Apache Ambari
– Committer at Apache Hive
Outline
• Introduction
• How Ozone fits in HDFS
• Ozone architecture
• Notes on implementation
• Q & A
Introduction
Storage in Hadoop Ecosystem
• File system
– HDFS
• SQL Database
– Hive on HDFS
• NoSQL
– HBase on HDFS
• Object Store
– We need Ozone!
Object Store vs File System
• Object stores offer a lot more scale
– Trillions of objects are common
– Simpler semantics make it possible
• Wide range of object sizes
– A few KB to several GB
Ozone: Introduction
• Ozone: An object store in Hadoop
– Durable
– Reliable
– Highly Scalable
– Trillions of objects
– Wide range of object sizes
– Secure
– Highly Available
– REST API as the primary access interface
Ozone: Introduction
• An Ozone URL
– http://hostname/myvolume/mybucket/mykey
• An S3 URL
– http://hostname/mybucket/mykey
• A Windows Azure URL
– http://hostname/myaccount/mybucket/mykey
Definitions
• Storage Volume
– A notion similar to an account
– Allows admin controls on usage of the object store, e.g. storage quotas
– Different from an account, because HDFS has no user management
– In private clouds, a ‘user’ is often managed outside the cluster
– Created and managed by admins only
• Bucket
– Consists of keys and objects
– Similar to a bucket in S3 or a container in Azure
– Has ACLs for access control
Definitions
• Key
– Unique in a bucket.
• Object
– The values stored in a bucket
– Each corresponds to a unique key within the bucket
REST API
• POST – creates volumes and buckets
– Only admins create volumes
– A bucket can be created by the owner of its volume
• PUT – updates volumes and buckets
– Only admins can change some volume settings
– Buckets have ACLs
• GET
– Lists volumes
– Lists buckets
REST API
• DELETE
– Deletes volumes
– Deletes buckets
• Keys
– PUT: creates a key
– GET: gets the data back
– Streaming read and write
– DELETE: removes the key
(A client sketch follows below.)
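
To make the API concrete, here is a minimal client sketch using Java's built-in HttpClient, following the URL scheme shown earlier. The hostname, the assumption that the volume and bucket already exist, and the absence of authentication headers are all simplifications for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OzoneRestSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://hostname"; // placeholder endpoint

        // PUT creates the key myvolume/mybucket/mykey (volume and bucket
        // are assumed to exist; auth headers are omitted in this sketch).
        HttpRequest put = HttpRequest.newBuilder()
            .uri(URI.create(base + "/myvolume/mybucket/mykey"))
            .PUT(HttpRequest.BodyPublishers.ofString("object data"))
            .build();
        client.send(put, HttpResponse.BodyHandlers.discarding());

        // GET streams the object data back.
        HttpRequest get = HttpRequest.newBuilder()
            .uri(URI.create(base + "/myvolume/mybucket/mykey"))
            .GET()
            .build();
        System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).body());

        // DELETE removes the key.
        HttpRequest del = HttpRequest.newBuilder()
            .uri(URI.create(base + "/myvolume/mybucket/mykey"))
            .DELETE()
            .build();
        client.send(del, HttpResponse.BodyHandlers.discarding());
    }
}
```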
Storing Buckets
• Buckets can grow to millions of objects and several terabytes
– Don’t fit in a single node
– Split into partitions or shards
• Bucket partitions and metadata are distributed and replicated
• Storage Container
– Stores multiple objects
– The unit of replication
– Consistent Replicas
Ozone in HDFS
Where does it fit?
HDFS Federation Extended
[Diagram: HDFS federation extended. Datanodes DN 1 .. DN m provide a common block storage layer holding multiple block pools (Pool 1 .. Pool n). The HDFS namespaces manage their own block pools, and Ozone's block pool management sits alongside them over the same storage.]
Impact on HDFS
• Ozone will reuse datanode storage
– Ozone uses its own block pools, so HDFS and Ozone can share datanodes
• Ozone will reuse the block pool management part of the namenode
– Includes heartbeats and block reports
• A storage container abstraction is added to the datanodes
– Co-exists with HDFS blocks on the datanodes
– New data pipeline
HDFS Scalability
• Scalability of the File System
– Support a billion files
– Namespace scalability
– Block-space scalability
• Namespace scalability is independent of Ozone
– Partial namespace on disk
– Parallel Effort (HDFS-8286)
• Block-space scalability
– Block space constitutes a big part of namenode metadata
– A block map on disk doesn’t work
– We hope to reuse lessons from Ozone’s “many small objects in a storage container” design to allow multiple blocks per storage container
Architecture
How it works
• URL
– http://hostname/myvolume/mybucket/mykey
• Simple Steps
– Full bucket name: ‘myvolume/mybucket’
– Find where bucket metadata is stored
– Fetch bucket metadata
– Check ACLs
– Find where the key is stored
– Read the data
How it works
• All data and metadata are stored in storage containers
– Each storage container is identified by a unique ID (think of a block ID in HDFS)
– A bucket name maps to a container ID
– A key maps to a container ID
• A container ID maps to datanodes (see the sketch below)
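
The two-level mapping suggests a read path like the following sketch; the interface names here are hypothetical stand-ins for the Storage Container Manager and a container client, not actual Ozone classes.

```java
// Hypothetical interfaces illustrating the two-level lookup:
// a name maps to a container ID (partitioning), and a container ID
// maps to the datanodes holding its replicas (container map).
interface StorageContainerManager {
    long lookupContainer(String name);                        // partition function
    java.util.List<String> locateContainer(long containerId); // replica datanodes
}

interface ContainerClient {
    byte[] read(String datanode, long containerId, String key);
}

class KeyReader {
    private final StorageContainerManager scm;
    private final ContainerClient containers;

    KeyReader(StorageContainerManager scm, ContainerClient containers) {
        this.scm = scm;
        this.containers = containers;
    }

    byte[] readKey(String volume, String bucket, String key) {
        String bucketName = volume + "/" + bucket;            // full bucket name
        long metaContainer = scm.lookupContainer(bucketName); // bucket metadata container
        String metaNode = scm.locateContainer(metaContainer).get(0);
        byte[] bucketMeta = containers.read(metaNode, metaContainer, bucketName);
        // ... check ACLs recorded in bucketMeta (omitted) ...
        long dataContainer = scm.lookupContainer(bucketName + "/" + key);
        String dataNode = scm.locateContainer(dataContainer).get(0);
        return containers.read(dataNode, dataContainer, key);
    }
}
```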
Components
[Diagram: a Storage Container Manager alongside several datanodes, each datanode hosting an Ozone Handler.]
New Components
• Storage Container Manager
– Maintains the location of each container (the container map)
– Collects heartbeats and container reports from datanodes
– Serves container locations upon request
– Stores key-partitioning metadata
• Ozone Handler
– A module hosted by datanodes
– Implements the Ozone REST API
– Connects to the Storage Container Manager for key partitioning and container lookup
– Connects to local or remote datanodes to read/write from/to containers
– Enforces authorization checks and administrative limits
Call Flow
[Diagram: the client issues a REST call to the Ozone Handler on one datanode; that handler consults the Storage Container Manager and reads the metadata container.]
Call Flow (continued)
[Diagram: continuing the call flow, the request is redirected to the datanode holding the data, where the data container is read.]
Implementation
Mapping a Key to a Container
• Keys need to be mapped to Container IDs
– Horizontal partitioning of the key space
• Partition function
– Hash partitioning
   - Minimal state to be stored
   - Better distribution, no hotspots
– Range partitioning
   - Sorted keys
   - Provides ordered listing
Hash Partitioning
• The key is hashed
– The hash value is mapped to a container ID
• Prefix matching
– The container ID is the longest matching prefix of the key’s hash
– The Storage Container Manager implements a prefix tree (a code sketch follows the figure below)
• Extendible hashing is needed
– Minimizes the number of keys to be re-hashed when a new container is added
– New containers are added by splitting an existing container
Prefix Matching for Hashes
[Diagram: a bitwise trie for bucket 0xab. Trie nodes branch on hash bits 0/1, containers (0xab000 .. 0xab005) sit at the leaves, and a key hash such as 0xab125 is routed to the container with the longest matching prefix.]
• The Storage Container Manager stores one such trie for each bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
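
A small sketch of the longest-prefix lookup over such a trie, assuming the hashed key is presented as a string of bits; the class names and container IDs are illustrative, not the actual Storage Container Manager code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative bitwise trie: containers sit at the leaves, and a hashed
// key is routed to the container matching the longest prefix of its bits.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>(); // '0' or '1'
    Long containerId;                                    // non-null at leaves
}

class BucketTrie {
    private final TrieNode root = new TrieNode();

    // Register a container for a given bit prefix (splitting a container
    // adds two longer prefixes below its old one).
    void addContainer(String bitPrefix, long containerId) {
        TrieNode node = root;
        for (char bit : bitPrefix.toCharArray()) {
            node = node.children.computeIfAbsent(bit, b -> new TrieNode());
        }
        node.containerId = containerId;
    }

    // Walk the hash bits until the trie ends; the last container seen on
    // the path is the longest matching prefix.
    long lookup(String hashBits) {
        TrieNode node = root;
        long container = -1;
        for (char bit : hashBits.toCharArray()) {
            node = node.children.get(bit);
            if (node == null) break;
            if (node.containerId != null) container = node.containerId;
        }
        return container;
    }
}
```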
Range Partitioning
• Range partitioning
– The container map maintains a range index tree for each bucket
– Each node of the tree corresponds to a key range
– Child nodes split the range of their parent node
– A lookup traverses down the tree to increasingly granular ranges until it reaches a leaf (a code sketch follows the figure below)
Range Index Tree
[Diagram: a range index tree for bucket 0xab. The root covers K1 – K20 and is split into K1 – K10 and K11 – K20, then into finer ranges (K1 – K5, K6 – K10, K11 – K13, K14 – K15, K16 – K20), with containers (0xab000 .. 0xab005) at the leaves. A lookup for key K15 descends to the K14 – K15 leaf.]
• The storage container map consists of an array of such trees, one for each bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
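
For range partitioning, the leaf-level lookup can be sketched with a sorted map from each range's start key to its container; this flattens the tree in the figure down to its leaves, which is enough to show the idea (key ordering is assumed lexicographic, e.g. zero-padded keys).

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative leaf-level range index for one bucket: each entry maps the
// start key of a leaf range to the container holding that range.
class RangeIndex {
    private final TreeMap<String, Long> leaves = new TreeMap<>();

    void addRange(String startKey, long containerId) {
        leaves.put(startKey, containerId);
    }

    // floorEntry finds the greatest start key <= the lookup key,
    // i.e. the leaf range containing the key.
    long lookup(String key) {
        Map.Entry<String, Long> e = leaves.floorEntry(key);
        return e == null ? -1 : e.getValue();
    }
}

// Illustrative use (container assignments assumed): a lookup of "K15"
// returns the container registered for the range starting at "K14".
// RangeIndex idx = new RangeIndex();
// idx.addRange("K01", 0xab000L); idx.addRange("K06", 0xab002L);
// idx.addRange("K11", 0xab001L); idx.addRange("K14", 0xab005L);
// idx.addRange("K16", 0xab003L);
// idx.lookup("K15"); // 0xab005
```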
Storage Container
• A storage unit in the datanode
– A generalization of the HDFS block
– ID, generation stamp, size
– Unit of replication
– Consistent replicas
• Container size
– 1 GB to 10 GB
– Container size affects the scale of the Storage Container Manager
– Larger containers take longer to replicate
– Container size is a system property, not a data property
Storage Container Requirements
• Stores a variety of data, which results in different requirements
• Metadata
– Individual units of data are very small (kilobytes)
– Atomic updates are important
– A get/put API is sufficient
• Object data
– The storage container needs to store object data with a wide range of sizes
– Must support streaming APIs to read/write individual objects
Storage Container Implementation
• Storage container prototype using RocksDB
– An embeddable key-value store
• Replication
– Need ability to replicate while data is being written
– RocksDB supports snapshots and incremental backups for replication
• A hybrid use of RocksDB (sketched below)
– Small objects: keys and objects stored in RocksDB
– Large objects: the object is stored in an individual file; RocksDB holds the key and file path
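
A minimal sketch of that hybrid scheme using the RocksDB Java binding; the size threshold, the file layout, and the "FILE:" pointer convention are assumptions for illustration (a real container would also need an unambiguous way to distinguish inline values from file pointers).

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the hybrid scheme: small objects live in RocksDB directly,
// large objects live in their own files with RocksDB holding key -> path.
class HybridContainer implements AutoCloseable {
    private static final int SMALL_OBJECT_LIMIT = 1 << 20; // assumed 1 MB cutoff

    private final RocksDB db;
    private final Path dataDir;

    HybridContainer(String dbPath, Path dataDir) throws RocksDBException {
        RocksDB.loadLibrary();
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), dbPath);
        this.dataDir = dataDir;
    }

    void put(String key, byte[] value) throws Exception {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        if (value.length <= SMALL_OBJECT_LIMIT) {
            db.put(k, value);                 // small object: stored inline
        } else {
            Path file = dataDir.resolve(Integer.toHexString(key.hashCode()));
            Files.write(file, value);         // large object: spilled to a file
            db.put(k, ("FILE:" + file).getBytes(StandardCharsets.UTF_8));
        }
    }

    byte[] get(String key) throws Exception {
        byte[] v = db.get(key.getBytes(StandardCharsets.UTF_8));
        if (v == null) return null;
        String s = new String(v, StandardCharsets.UTF_8);
        return s.startsWith("FILE:") ? Files.readAllBytes(Path.of(s.substring(5))) : v;
    }

    @Override public void close() { db.close(); }
}
```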
Storage Container Implementation
• Transactions for consistency and reliability
– The storage containers implement a few atomic and persistent operations, i.e. transactions. The container provides reliability guarantees for these operations.
– Commit: promotes an object being written to a finalized object. Once this operation succeeds, the container guarantees that the object is available for reading.
– Put: useful for small writes such as metadata writes.
– Delete: deletes the object. (A hypothetical interface sketch follows below.)
• A new data pipeline for storage containers
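
A hypothetical interface capturing the transaction guarantees above; Ozone's actual container API may differ.

```java
// Hypothetical container interface: each method is an atomic, persistent
// operation (a transaction) with the guarantees described above.
interface TransactionalContainer {
    // Stream bytes into an object that is not yet visible to readers.
    void write(String key, byte[] chunk);

    // Promote the in-flight object to a finalized object; once commit
    // returns, the container guarantees the object is readable.
    void commit(String key);

    // Atomic small write, e.g. for bucket or key metadata.
    void put(String key, byte[] value);

    // Remove the object.
    void delete(String key);
}
```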
Data Pipeline Consistency
• The HDFS consistency mechanism uses two pieces of block state
– Generation stamp
– Block length
• Storage containers use the following two
– Generation stamp
– Transaction ID
• A storage container must persist the last executed transaction.
• The transaction ID is allocated by the leader of the pipeline.
Data Pipeline Consistency
• Upon a restart, a datanode discards all uncommitted data for a storage container
– State is synchronized to the last committed transaction
• When comparing two replicas (see the sketch below)
– The replica with the latest generation stamp is honored
– With the same generation stamp, the replica with the latest transaction ID is honored
– Correctness argument: replicas with the same generation stamp and the same transaction ID must have been written in the same pipeline
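
The replica preference rule can be written down directly; this is a sketch of the rule stated above (with assumed field names), not actual HDFS or Ozone code.

```java
import java.util.Comparator;

// Sketch of replica comparison: prefer the later generation stamp, and
// break ties with the later transaction ID.
record Replica(long generationStamp, long lastTransactionId) {}

class ReplicaOrder {
    static final Comparator<Replica> PREFERENCE =
        Comparator.comparingLong(Replica::generationStamp)
                  .thenComparingLong(Replica::lastTransactionId);

    static Replica best(Replica a, Replica b) {
        return PREFERENCE.compare(a, b) >= 0 ? a : b;
    }
}
```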
Phased Development
• Phase 1
– Basic API
– Storage container machinery, reliability, replication
• Phase 2
– High availability
– Security
– Multipart upload
• Phase 3
– Caching to improve latency
– Object versioning
– Cross-geo replication
Team
• Anu Engineer
– aengineer@hortonworks.com
• Arpit Agarwal
– aagarwal@hortonworks.com
• Chris Nauroth
– cnauroth@hortonworks.com
• Jitendra Pandey
– jitendra@hortonworks.com
Special Thanks
• Sanjay Radia
• Enis Soztutar
• Suresh Srinivas
Thanks!

Editor's Notes

  • #6 (Object Store vs File System): HDFS as a storage system: a great file system, works fantastically for MapReduce, great adoption in enterprises.
  • #7 (Ozone: Introduction): Example: I need to store all my customer documents; a few million customers, each with a few thousand documents. No directory structure is needed. A REST API is the primary access mechanism, with simple access semantics, very large scale (billions of documents), and a wide range of object sizes. A file system forces you to think in terms of files and directories.
  • #21 (Components): Two important questions: What is the partitioning scheme? What does a storage container look like?