Ozone: An Object Store in HDFS

Hadoop Summit 2015

Jitendra Nath Pandey
jitendra@hortonworks.com
jitendra@apache.org
@jnathp
About me
• Engineering Manager @Hortonworks
  – Manager / Architect for HDFS at Hortonworks
• ASF Member
  – PMC Member at Apache Hadoop
  – PMC Member at Apache Ambari
  – Committer in Apache Hive
Outline
• Introduction
• How Ozone fits in HDFS
• Ozone architecture
• Notes on implementation
• Q & A
Introduction
Storage in Hadoop Ecosystem
• File system
  – HDFS
• SQL database
  – Hive on HDFS
• NoSQL
  – HBase on HDFS
• Object store
  – We need Ozone!
Object Store vs File System
• Object stores offer far more scale
  – Trillions of objects are common
  – Simpler semantics make this possible
• Wide range of object sizes
  – A few KB to several GB
Ozone: Introduction
• Ozone: an object store in Hadoop
  – Durable
  – Reliable
  – Highly scalable: trillions of objects
  – Wide range of object sizes
  – Secure
  – Highly available
  – REST API as the primary access interface
Ozone Introduction
• An Ozone URL (parsing this layout is sketched below)
  – http://hostname/myvolume/mybucket/mykey
• An S3 URL
  – http://hostname/mybucket/mykey
• A Windows Azure URL
  – http://hostname/myaccount/mybucket/mykey
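The three schemes differ only in their path components. As a minimal sketch of the volume/bucket/key layout above (the class name and path handling are illustrative, not Ozone code):

    // Split an Ozone path into volume, bucket, and key.
    // The split limit of 3 keeps any '/' inside the key as part of the key.
    class OzonePathSketch {
      public static void main(String[] args) {
        String path = "/myvolume/mybucket/mykey";
        String[] parts = path.substring(1).split("/", 3);
        System.out.println("volume=" + parts[0]
            + " bucket=" + parts[1] + " key=" + parts[2]);
      }
    }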
Definitions
• Storage Volume
  – A notion similar to an account
  – Allows admin controls on usage of the object store, e.g. a storage quota
  – Different from an account because there is no user management in HDFS
  – In private clouds a ‘user’ is often managed outside the cluster
  – Created and managed by admins only
• Bucket
  – Consists of keys and objects
  – Similar to a bucket in S3 or a container in Azure
  – Has ACLs
Definitions
• Key
  – Unique within a bucket
• Object
  – A value in a bucket
  – Each corresponds to a unique key within a bucket
REST API
• POST – creates volumes and buckets
  – Only an admin creates volumes
  – A bucket can be created by the owner of the volume
• PUT – updates volumes and buckets
  – Only an admin can change some volume settings
  – Buckets have ACLs
• GET
  – Lists volumes
  – Lists buckets
REST API
• DELETE
  – Deletes volumes
  – Deletes buckets
• Keys
  – PUT: creates a key
  – GET: gets the data back
  – Streaming read and write
  – DELETE: removes the key
(A client session using these verbs is sketched below.)
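To make the verb-to-operation mapping concrete, here is a rough sketch of a key lifecycle over HTTP; the hostname and response handling are assumptions, since the slides specify only the verbs:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class OzoneRestSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; the real host and port depend on the deployment.
        URL key = new URL("http://hostname/myvolume/mybucket/mykey");

        // PUT creates the key; the request body is streamed as the object data.
        HttpURLConnection put = (HttpURLConnection) key.openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        try (OutputStream out = put.getOutputStream()) {
          out.write("hello ozone".getBytes("UTF-8"));
        }
        System.out.println("PUT -> " + put.getResponseCode());

        // GET streams the object data back.
        HttpURLConnection get = (HttpURLConnection) key.openConnection();
        try (InputStream in = get.getInputStream()) {
          System.out.println("GET -> " + in.readAllBytes().length + " bytes");
        }

        // DELETE removes the key.
        HttpURLConnection del = (HttpURLConnection) key.openConnection();
        del.setRequestMethod("DELETE");
        System.out.println("DELETE -> " + del.getResponseCode());
      }
    }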
Storing Buckets
• Buckets grow to millions of objects and several terabytes
  – They don’t fit in a single node
  – Split into partitions or shards
• Bucket partitions and metadata are distributed and replicated
• Storage Container
  – Stores multiple objects
  – The unit of replication
  – Consistent replicas
Ozone in HDFS
Where does it fit?
HDFS Federation Extended
[Diagram: datanodes DN 1 … DN m provide common block storage, divided into block pools (Pool 1 … Pool n); HDFS namespaces manage some block pools, while Ozone manages its own block pools over the same storage.]
Impact on HDFS
• Ozone will reuse the DN storage
  – It uses its own block pools so that HDFS and Ozone can share DNs
• Ozone will reuse the Block Pool Management part of the namenode
  – Includes heartbeats and block reports
• The Storage Container abstraction is added to DNs
  – Co-exists with HDFS blocks on the DNs
  – New data pipeline
HDFS Scalability
• Scalability of the file system
  – Support a billion files
  – Namespace scalability
  – Block-space scalability
• Namespace scalability is independent of Ozone
  – Partial namespace on disk
  – A parallel effort (HDFS-8286)
• Block-space scalability
  – Block space constitutes a big part of namenode metadata
  – A block map on disk doesn’t work
  – We hope to reuse the lessons of Ozone’s “many small objects in a storage container” to allow multiple blocks per storage container
Architecture
How it works
• URL
  – http://hostname/myvolume/mybucket/mykey
• Simple steps
  – Full bucket name: ‘myvolume/mybucket’
  – Find where the bucket metadata is stored
  – Fetch the bucket metadata
  – Check ACLs
  – Find where the key is stored
  – Read the data
How it works
• All data and metadata are stored in Storage Containers
  – Each storage container is identified by a unique id (think of a block id in HDFS)
  – A bucket name is mapped to a container id
  – A key is mapped to a container id
• A container id is mapped to datanodes (this two-level lookup is sketched below)
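A minimal sketch of the indirection, assuming simple in-memory maps; the types and method name are illustrative, not Ozone's actual classes:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Resolve a key to datanode locations via its container id.
    class ContainerLookupSketch {
      // Produced by the partition function (hash or range; see later slides).
      Map<String, Long> keyToContainer = new HashMap<>();
      // Maintained by the Storage Container Manager from container reports.
      Map<Long, List<String>> containerToDatanodes = new HashMap<>();

      List<String> locate(String volumeBucketKey) {
        Long containerId = keyToContainer.get(volumeBucketKey); // step 1: key -> container id
        return containerToDatanodes.get(containerId);           // step 2: container id -> DNs
      }
    }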
Components
[Diagram: a Storage Container Manager coordinating a set of DNs, each hosting an Ozone Handler.]
New Components
• Storage Container Manager
  – Maintains the location of each container (the Container Map)
  – Collects heartbeats and container reports from datanodes
  – Serves the location of a container upon request
  – Stores key-partitioning metadata
• Ozone Handler
  – A module hosted by datanodes
  – Implements the Ozone REST API
  – Connects to the Storage Container Manager for key partitioning and container lookup
  – Connects to local or remote datanodes to read from and write to containers
  – Enforces authorization checks and administrative limits
Call Flow
[Diagram: a client sends a REST call to the Ozone Handler on a DN; the handler consults the Storage Container Manager and reads the metadata container.]
Call Flow (continued)
[Diagram: the Ozone Handler redirects the client to the DN holding the data container, and the data is read from there.]
Implementation
Mapping a Key to a Container
• Keys need to be mapped to container IDs
  – Horizontal partitioning of the key space
• Partition function
  – Hash partitioning
    – Minimal state to be stored
    – Better distribution, no hotspots
  – Range partitioning
    – Sorted keys
    – Provides ordered listing
Hash Partitioning
• The key is hashed
  – The hash value is mapped to a container id
• Prefix matching
  – The container id is found by the longest matching prefix of the key’s hash
  – The Storage Container Manager implements a prefix tree
• Extendible hashing is needed
  – It minimizes the number of keys to be re-hashed when a new container is added
  – New containers are added by splitting an existing container
Prefix Matching for Hashes
[Diagram: a bitwise trie for bucket 0xab. Interior trie nodes branch on bit values 0 and 1; containers 0xab000 … 0xab005 sit at the leaves. Key 0xab125 is resolved by following its hash bits down the trie.]
• The Storage Container Manager stores one tree per bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
(A lookup over such a trie is sketched below.)
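As a rough illustration of the longest-prefix lookup, here is a bitwise trie walk; the node layout is an assumption, and a populated trie in which every root-to-leaf path ends at a container is presumed:

    // Illustrative bitwise trie: interior nodes branch on hash bits,
    // leaves hold container ids. Not Ozone's actual data structure.
    class PrefixTrieSketch {
      static class Node {
        Node zero, one;     // children for bit values 0 and 1
        Long containerId;   // non-null only at leaves
      }

      Node root = new Node();

      // Walk the hash from the most significant bit until a leaf is reached;
      // the leaf's container id corresponds to the longest matching prefix.
      Long lookup(long keyHash) {
        Node n = root;
        int bit = 63;
        while (n.containerId == null) {
          n = ((keyHash >>> bit--) & 1) == 0 ? n.zero : n.one;
        }
        return n.containerId;
      }
    }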
Range Partitioning
• The container map maintains a range index tree for each bucket
  – Each node of the tree corresponds to a key range
  – Child nodes split the range of their parent node
  – A lookup traverses down the tree to ever more granular ranges until it reaches a leaf
Range Index Tree
[Diagram: a range index tree for bucket 0xab. The root covers K1 – K20 and is split into K1 – K10 and K11 – K20; further splits (K1 – K5, K6 – K10, K11 – K13, K14 – K15, K16 – K20) end at containers 0xab000 … 0xab005. Key K15 is resolved by descending to the leaf for K14 – K15.]
• The Storage Container map consists of one such tree per bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
(A leaf-level lookup is sketched below.)
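One compact way to express the leaf level of such a tree is a sorted map keyed by each range's lower bound; this java.util.TreeMap sketch is an assumption about one possible realization, not the actual container map:

    import java.util.Map;
    import java.util.TreeMap;

    // Leaf ranges keyed by their lower bound, mapping to container ids,
    // e.g. "K01" -> 0xab000L, "K06" -> 0xab004L, "K11" -> 0xab001L ...
    // (keys zero-padded so lexicographic order matches key order)
    class RangeIndexSketch {
      TreeMap<String, Long> leafRanges = new TreeMap<>();

      Long lookup(String key) {
        // floorEntry finds the greatest lower bound <= key,
        // i.e. the leaf range that contains the key.
        Map.Entry<String, Long> e = leafRanges.floorEntry(key);
        return e == null ? null : e.getValue();
      }
    }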
Storage Container
• A storage unit in the datanode
  – A generalization of the HDFS block
  – Id, generation stamp, size
  – The unit of replication
  – Consistent replicas
• Container size
  – 1 GB to 10 GB
  – Container size affects the scale of the Storage Container Manager
  – Large containers take longer to replicate than an individual block
  – A system property, not a data property
Storage Container Requirements
• Stores a variety of data, which results in different requirements
• Metadata
  – Individual units of data are very small: kilobytes
  – Atomic updates are important
  – A get/put API is sufficient
• Object data
  – The storage container needs to store object data with a wide range of sizes
  – Must support streaming APIs to read/write individual objects
Storage Container Implementation
• Storage container prototyped using RocksDB
  – An embeddable key-value store
• Replication
  – Needs the ability to replicate while data is being written
  – RocksDB supports snapshots and incremental backups for replication
• A hybrid use of RocksDB (sketched below)
  – Small objects: keys and objects stored in RocksDB
  – Large objects: each object is stored in an individual file; RocksDB holds the key and file path
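A minimal sketch of the hybrid layout using the RocksDB Java binding; the size threshold, file naming, and storing the path as the value are assumptions for illustration (a real layout would also tag each entry so reads can tell an inline value from a path):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;

    // Hybrid container sketch: small objects inline in RocksDB,
    // large objects spilled to files with RocksDB holding the path.
    class HybridContainerSketch {
      static final int SMALL_LIMIT = 1 << 20; // assumed 1 MB cutoff
      final RocksDB db;
      final Path dataDir;

      HybridContainerSketch(String dir) throws Exception {
        RocksDB.loadLibrary();
        dataDir = Paths.get(dir);
        db = RocksDB.open(new Options().setCreateIfMissing(true), dir + "/meta");
      }

      void put(String key, byte[] value) throws Exception {
        if (value.length <= SMALL_LIMIT) {
          db.put(key.getBytes("UTF-8"), value);        // small: inline value
        } else {
          Path file = dataDir.resolve(key + ".obj");   // large: one file per object
          Files.write(file, value);
          db.put(key.getBytes("UTF-8"), file.toString().getBytes("UTF-8"));
        }
      }
    }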
Storage Container Implementation
• Transactions for consistency and reliability
  – The storage containers implement a few atomic and persistent operations, i.e. transactions, and provide reliability guarantees for them
  – Commit: promotes an object being written to a finalized object; once this operation succeeds, the container guarantees that the object is available for reading
  – Put: useful for small writes such as metadata writes
  – Delete: deletes the object
• A new data pipeline for storage containers
(The transaction surface is sketched below.)
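The operation set above can be summarized as a small interface; the signatures are assumptions, since the slides name only the operations and their guarantees:

    // Illustrative transaction surface of a storage container.
    // Each call is atomic and persistent once it returns.
    interface ContainerTransactionsSketch {
      // Finalize an in-flight object; afterwards the container
      // guarantees the object is available for reading.
      void commit(String key) throws Exception;

      // Atomic small write, e.g. bucket or key metadata.
      void put(String key, byte[] value) throws Exception;

      // Remove the object.
      void delete(String key) throws Exception;
    }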
Data Pipeline Consistency
• The HDFS consistency mechanism uses two pieces of block state
  – Generation stamp
  – Block length
• Storage containers use the following two
  – Generation stamp
  – Transaction id
• A storage container must persist the last executed transaction.
• The transaction id is allocated by the leader of the pipeline.
Data Pipeline Consistency
• Upon a restart, a datanode discards all uncommitted data for a storage container
  – State is synchronized to the last committed transaction
• When comparing two replicas
  – The replica with the latest generation stamp is honored
  – With the same generation stamp, the replica with the latest transaction id is honored
  – Correctness argument: replicas with the same generation stamp and the same transaction id must have been together in the same pipeline
(The comparison rule is sketched below.)
Phased Development
• Phase 1
  – Basic API
  – Storage container machinery, reliability, replication
• Phase 2
  – High availability
  – Security
  – Multipart upload
• Phase 3
  – Caching to improve latency
  – Object versioning
  – Cross-geo replication
Team
• Anu Engineer – aengineer@hortonworks.com
• Arpit Agarwal – aagarwal@hortonworks.com
• Chris Nauroth – cnauroth@hortonworks.com
• Jitendra Pandey – jitendra@hortonworks.com
Special Thanks
• Sanjay Radia
• Enis Soztutar
• Suresh Srinivas
Thanks!
