© Hortonworks Inc. 2011–2018. All rights reserved
HDFS Scalability and Evolution: HDDS and Ozone
Sanjay Radia,
Founder, Chief Architect, Hortonworks
Anu Engineer
HDFS Engineering, Hortonworks
About the Speakers
Sanjay Radia
• Chief Architect, Founder, Hortonworks
• Apache Hadoop PMC and Committer
Part of the original Hadoop team at
Yahoo! since 2007
• Chief Architect of Hadoop Core at Yahoo!
Prior
• Data center automation, virtualization, Java,
HA, OSs, File Systems
• Startup, Sun Microsystems, INRIA…
• Ph.D., University of Waterloo
Anu Engineer
• Senior Member of HDFS Engineering
at Hortonworks
• Lead developer of Ozone/HDDS
• Apache Hadoop PMC and Committer
Prior
• Part of the founding team that created
Windows Azure
HDFS – What it does well and not so well
HDFS does well
• Scaling – IO + PBs + clients
• Horizontal scaling – IO + PBs
• Fast IO – scans and writes
• Number of concurrent clients 60K++
• Low latency metadata operations
• Fault tolerant storage layer
• Locality
• Replicas/Reliability and parallelism
• Layering – Namespace layer and storage
layer
• Security
But scaling the Namespace is limited
to 500M files (192GB heap)
• Scaling Namespace – 500M FILES
• Scaling Block space
• Scaling Block reports
• Scaling DN’s block management
• Need further scaling of client/RPC 150K++
Ironically, keeping the namespace in memory
is both a strength and a weakness
Proof Points of Scaling Data, IO, Clients/RPC
• Proof points of large data and large clusters
• Single Organizations have over 600PB in HDFS
• Single clusters with over 200PB using federation
• Large clusters over 4K multi-core nodes bombarding a single NN
• Federation is the current scaling solution (for both Namespace & Operations)
• In deployment at Twitter, Yahoo, FB, and elsewhere
Metadata in memory is the strength of the original GFS and HDFS design,
but also its weakness in scaling the number of files and blocks
Scaling HDFS
- with HDDS and Ozone
HDFS Layering
[Diagram: HDFS layering. Namespace layer: NameNodes NN-1 … NN-k, each serving a namespace NS1 … NSk with its own block pool (Block Pool 1 … Block Pool k). Block Management Layer: common storage provided by DataNodes DN 1 … DN m, shared by all block pools.]
Solutions to Scaling Files, Blocks, Clients/RPC
Scale Namespace
• Hierarchical file system
– Cache only the working set of the namespace in memory (see the cache sketch after this list)
– Partition:
- Distributed namespace (transparent
automatic partitioning)
- Volumes (static partitioning)
• Flat Key-Value store
– Cache only the working set of the namespace in memory
– Partition/Shard the space (easy to
hash)
Scale Metadata Clients/RPC
• Multi-thread namespace manager
• Partitioning/Sharding
Slow NN startup
• Cache only the working set in memory
• Shard/partition the namespace
Scale Block Management
• Containers of blocks (2GB-16GB+)
• Will significantly reduce BlockMap
• Reduce Number of Block/Container
reports
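A minimal sketch of the "cache only the working set" idea from the namespace column above, using a bounded LRU map; the class name and sizing are illustrative, not from any actual NameNode design:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative working-set cache: hot namespace entries stay in memory,
 *  cold ones are evicted and re-read from an on-disk store on demand. */
public class WorkingSetCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public WorkingSetCache(int maxEntries) {
    super(16, 0.75f, true);          // accessOrder=true gives LRU behavior
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;      // evict the least-recently-used entry
  }
}
```

Constructed with a bound sized to the available heap (e.g. `new WorkingSetCache<Long, Object>(10_000_000)`), an eviction here would simply force the next access to reload the entry from the persistent metadata store.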
Scaling HDFS
Must Scale both the Namespace and the Block Layer
• Scaling one is not sufficient
Scalable Block layer: Hadoop Distributed Data Storage (HDDS)
• Containers of blocks
• Replicated as a group
• Reduces Block Map
Scale Namespace: Several approaches (not exclusive)
• Partial namespace in memory
• Shard namespace
• Use flat namespace (KV namespace) – easier to implement and scale – Ozone
Evolution towards the new HDFS
[Diagram: HDDS, the scalable storage layer (containers of blocks), underpins two namespaces: a flat KV namespace (Ozone, exposed through OzoneFS, a Hadoop Compatible FS) and a hierarchical namespace (a new scalable NN, i.e., the new HDFS).]
How it all Fits Together
[Diagram: existing HDFS and the new stack side by side, sharing DataNodes and physical storage.
- Existing HDFS: the old NN keeps the entire namespace in memory (File = Bid[]) plus the BlockMap (Bid -> IP address of DN); DataNodes store blocks directly (Bid -> Data) and send Block Reports to the NN.
- New stack: the Ozone Master (K-V flat namespace, File (Object) = Bid[]) and a new scalable HDFS NN (hierarchical namespace, File = Bid[]) both address blocks as Bid = Cid + LocalId. Container Management & Cluster Membership maintains the ContainerMap (CId -> IP address of DN) and receives Container Reports; DataNodes store blocks grouped into containers (Bid -> Data). HDDS is a clean separation of the block layer.]
Ozone FS
Ozone/HDDS can be used separately, or alongside HDFS
• Initially HDFS remains the default FS
• HDFS has many features,
• so it cannot be replaced by OzoneFS on day one
• OzoneFS sits alongside as an additional namespace, sharing DataNodes
• For applications that work with a Hadoop Compatible FS on a K-V store – Hive, Spark …
• How is OzoneFS accessed?
• Use direct URIs for either HDFS or OzoneFS
• Mount in HDFS or in ViewFS (a mount sketch follows below)
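A hedged sketch of the "mount in ViewFS" option using Hadoop's standard ViewFS mount-table properties; the mount-table name, hosts, and the o3fs:// authority format are assumptions for illustration, and the Ozone filesystem jar must be on the classpath:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Mount the existing HDFS namespace and an Ozone bucket side by side.
    // The mount-table name "cluster" and both authorities are illustrative.
    conf.set("fs.viewfs.mounttable.cluster.link./hdfs",
             "hdfs://nn-host:8020/");
    conf.set("fs.viewfs.mounttable.cluster.link./ozone",
             "o3fs://bucket.volume/");
    FileSystem fs = FileSystem.get(URI.create("viewfs://cluster/"), conf);
    // Paths under /ozone now resolve through OzoneFS, /hdfs through HDFS.
    System.out.println(fs.exists(new Path("/ozone")));
  }
}
```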
Scalable Block Layer:
Hadoop Distributed Data Storage (HDDS)
Container: Containers of blocks (2GB-16GB+)
• Replicated as a group
• Each Container has a unique ContainerId
– Every block within a container has a block id
- BlockId = ContainerId, LocalId (see the sketch after this list)
CM – Container manager
• Cluster membership
• Receives container reports from DNs
• Manages container replication
• Maintains the Container Map (Cid -> IPAddr)
Data Nodes – HDFS and HDDS can share DNs
• DataNodes contain a set of containers (just like
they used to contain blocks)
• DataNodes send Container-reports (like block
reports) to CM (Container Manager)
Block Pools
• Just like blocks were in block pools, containers are
also in container pools
– This allows independent namespaces to carve out their
block space
HDDS: Separate layer from namespace layer (strictly separate, not almost)
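A tiny sketch of the block address just described; the class and field names are illustrative, not the actual HDDS types:

```java
/** Sketch of the HDDS block address: a block is named by its container
 *  plus a container-local id, so the master only tracks containers,
 *  not individual blocks. Names are illustrative. */
public final class BlockId {
  private final long containerId; // which 2GB-16GB container holds the block
  private final long localId;     // the block's id within that container

  public BlockId(long containerId, long localId) {
    this.containerId = containerId;
    this.localId = localId;
  }

  /** Locating a block = look up only its container in the ContainerMap. */
  public long getContainerId() { return containerId; }
  public long getLocalId()     { return localId; }
}
```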
HDDS+Ozone – Addressing some of the limitations of HDFS
• Scale Block Management
• Containers of blocks (2GB to 16GB)
• 2-4GB block containers initially => 40-80x
reduction in BR size and CM block map (worked example after this list)
• Reduces BR load on DNs, Masters, Network
• Scale Namespace
• Key Space Manager caches only the working set in
memory
• Future scaling:
• Flat namespace is easy to shard (buckets are
natural sharding points)
• Scale Number of Metadata Clients/RPC
• No single global lock, unlike the NN
• Metadata operations are simpler
• Sharding will help further
• Fault Tolerance
– Blocks – inherits HDFS's block-layer fault tolerance
– Namespace – uses Raft rather than Journal Nodes
• HA is easier
• Manageability
– A GC-bound/overloaded Master is no longer an issue
• caches only the working set
– Journal Nodes disappear – Raft is used
– Faster and more predictable failover
– Fast startup
• Faster upgrades
• Faster failover
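A rough worked example of the 40-80x figure (the ~50MB average block size is an assumption, not stated on the slides): a 2GB container then holds 2GB / 50MB ≈ 40 blocks and a 4GB container ≈ 80, so each container-map entry and each container report replaces roughly 40-80 block-map entries and block-report entries.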
Will OzoneFS's Key-Value Store Work with Hadoop Apps?
• Two years ago – NO!
• Today – Yes!
• Hive, Spark and others are making sure they work on cloud K-V object stores via HCFS
• Customers, too, are ensuring that their apps work on cloud K-V object stores via HCFS
• Lack of real directories and their ACLs: fake directories + bucket ACLs
• S3's eventual consistency is being worked around – S3Guard (Note: OzoneFS is consistent)
• Lack of rename in S3 is being worked around
• Various direct output committers (early versions had issues)
• Netflix Direct Committer; being replaced by Iceberg
• Via Metastore (Databricks has a proprietary version; Hive's approach)
Details of HDDS
Container Structure (Using RocksDB)
[Diagram: container structure. A container holds an index – an embedded LSM store (LevelDB/RocksDB) with entries Key 1 … Key N – plus chunk data files; each index entry records the chunk data file name, offset, and length.]
• An embedded LSM/KV store (RocksDB)
• BlockId is the key,
– the filename of the local chunk file is the value
• Optimizations
– Small blocks (< 1MB) can be stored directly in RocksDB
– Compaction for block data to avoid lots of small files
• But this can evolve over time (a sketch of the index follows below)
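A minimal sketch of the container index described above, using the real RocksDB Java API (org.rocksdb); the path, key encoding, value format, and the exact 1MB inline threshold are illustrative:

```java
import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class ContainerIndexSketch {
  static final int INLINE_LIMIT = 1 << 20; // small blocks (<1MB) stored inline

  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options opts = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(opts, "/tmp/container-42/index")) {
      byte[] key = "blockId-1007".getBytes(StandardCharsets.UTF_8);
      byte[] data = loadBlockData();
      if (data.length < INLINE_LIMIT) {
        db.put(key, data);                 // small block: the value IS the data
      } else {
        // large block: the value points at a chunk file (name:offset:length)
        db.put(key, "chunk_1007.dat:0:67108864"
                    .getBytes(StandardCharsets.UTF_8));
      }
    }
  }

  static byte[] loadBlockData() { return new byte[512 * 1024]; }
}
```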
Replication of Container
• Uses Raft replication instead of the data pipeline, for both data and metadata
• Proven to be correct
• Traditionally Raft is used for small updates and transactions, which fits metadata well
• Performance considerations (see the write-path sketch after this list)
• When writing the metadata into the Raft journal, put the data directly in container
storage
• Raft journal on a separate disk – fast contiguous writes without seeking
• Data spread across the other disks
• Client uses the Raft protocol to write data to the DNs storing the container
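A hedged sketch of the write-path split described above. The interfaces are hypothetical stand-ins (the real implementation builds on a Raft library such as Apache Ratis); the point is that only small metadata records traverse the Raft journal on its dedicated disk, while bulk chunk data goes straight to the container disks:

```java
/** Hypothetical sketch: only small metadata records go through the Raft log
 *  (on its own disk, so appends stay sequential); bulk chunk data is written
 *  directly to container storage spread over the remaining disks. */
interface RaftJournal { void append(byte[] metadataRecord); }
interface ContainerStore { void writeChunk(long containerId, String chunk, byte[] data); }

class ContainerWriter {
  private final RaftJournal journal;   // dedicated disk, sequential appends
  private final ContainerStore store;  // data disks

  ContainerWriter(RaftJournal journal, ContainerStore store) {
    this.journal = journal;
    this.store = store;
  }

  void write(long containerId, String chunkName, byte[] data) {
    store.writeChunk(containerId, chunkName, data);   // bulk data, off the log
    journal.append(encodeCommit(containerId, chunkName, data.length)); // metadata only
  }

  private byte[] encodeCommit(long cid, String chunk, int len) {
    return (cid + ":" + chunk + ":" + len).getBytes();
  }
}
```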
Open and Closed Containers
Open – active writers
• Need at least (NumSpindles × DataNodes) open active containers
• Clients can get locality on writes
• Data is spread across all DataNodes
• Improved IO and a better chance of getting locality
• Keeps DNs and ALL spindles busy
Closed – typically when full, or after a past failure
• Why close a container on failures?
• We originally considered keeping it open and bringing in a new DN
• Wait for the data to copy?
• Decided to close it and have it re-replicated
• Can be reopened later, or merged with another closed container – under design
(A lifecycle sketch follows below.)
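A small sketch of the open/closed lifecycle described above; all names, and the exact placement of the close-on-full rule, are illustrative:

```java
/** Illustrative container lifecycle: a container accepts writes while OPEN
 *  and, once full or after a replica failure, is CLOSED and re-replicated
 *  as a sealed unit instead of being patched in place. */
enum ContainerState { OPEN, CLOSED }

class Container {
  private ContainerState state = ContainerState.OPEN;
  private long usedBytes = 0;
  private final long capacityBytes;

  Container(long capacityBytes) { this.capacityBytes = capacityBytes; }

  void onWrite(long bytes) {
    if (state != ContainerState.OPEN) throw new IllegalStateException("closed");
    usedBytes += bytes;
    if (usedBytes >= capacityBytes) close(); // typical close: container is full
  }

  void onReplicaFailure() { close(); }  // close, then re-replicate the whole container

  private void close() { state = ContainerState.CLOSED; }
}
```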
Details of Ozone
Ozone Master
[Diagram: the client calls Open(Key, …) on the Ozone Master (a K-V namespace in RocksDB; File (Object) = Bid[], Bid = Cid + LocalId) and gets back bId[]; it then calls GetBlockLocations(Bid) on the CM, which holds the ContainerMap (CId -> IP address of DN); the client caches the container map ($$$) and issues Read, Write, … directly to DN1 … DNn.]
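A hedged sketch of the client flow in the diagram; every name and signature here is a hypothetical stand-in for the Ozone Master/CM RPCs, including the Cid extraction from the Bid:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client-side read path: resolve a key to block ids via the
 *  Ozone Master, resolve container locations via the CM (with a local cache,
 *  the "$$$" in the diagram), then read directly from the DataNodes. */
class OzoneClientSketch {
  interface OzoneMaster { long[] open(String key); }        // key -> Bid[]
  interface ContainerManager { String locate(long cid); }   // Cid -> DN address
  interface DataNode { byte[] readBlock(String dnAddr, long bid); }

  private final Map<Long, String> containerMapCache = new HashMap<>();
  private final OzoneMaster om;
  private final ContainerManager cm;
  private final DataNode dn;

  OzoneClientSketch(OzoneMaster om, ContainerManager cm, DataNode dn) {
    this.om = om; this.cm = cm; this.dn = dn;
  }

  byte[] readFirstBlock(String key) {
    long bid = om.open(key)[0];              // Bid = Cid + LocalId
    long cid = bid >>> 32;                   // illustrative Cid extraction
    String addr = containerMapCache.computeIfAbsent(cid, cm::locate);
    return dn.readBlock(addr, bid);          // data path goes straight to the DN
  }
}
```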
Ozone APIs
• Key: /VolumeName/BucketId/ObjectKey (e.g. /Home/John/foo/bar/zoo)
• ACLs at the Volume and Bucket level (the other "directories" are fake; a parsing sketch follows below)
• Future sharding at the bucket level
• => Ozone is consistent (unlike S3)
Ozone Object API (RPC)
S3 Connector
Hadoop FileSystem and Hadoop
FileContext Connectors
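A small sketch of the key layout above: ACLs attach only to the first two path components (volume and bucket), and everything after them is one flat object key; the class is illustrative:

```java
/** Parses /VolumeName/BucketId/ObjectKey; the "directories" inside the
 *  object key are purely cosmetic, so only volume and bucket carry ACLs. */
final class OzoneKey {
  final String volume, bucket, objectKey;

  OzoneKey(String path) {
    String[] parts = path.split("/", 4);   // ["", volume, bucket, objectKey]
    if (parts.length < 4 || !parts[0].isEmpty())
      throw new IllegalArgumentException("expected /volume/bucket/key: " + path);
    volume = parts[1];
    bucket = parts[2];
    objectKey = parts[3];                  // e.g. "foo/bar/zoo" - one flat key
  }
}
// new OzoneKey("/Home/John/foo/bar/zoo")
//   -> volume=Home, bucket=John, objectKey=foo/bar/zoo
```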
Where does the Ozone Master run?
Which node?
• On a separate node with enough memory to cache the working set
• Caching the working set is important for a large number of concurrent clients
• This option gives predictable performance for large clusters
• On the DataNodes
• How much memory is available for caching?
• Note: tasks and other services run on DNs, since they are typically also compute nodes
Where is the storage for the Ozone KV metadata?
• Local disk
• If on a DN, is it a dedicated disk or shared with the DN?
• Use the container storage (it is using RocksDB anyway)
• Spread Ozone volumes across containers to gain performance,
• but this may limit volume size & force more Ozone volumes than the admin wants
Quadra – LUN-like Raw-Block Storage
Used for creating a mountable disk FS volume
Quadra: Raw-Block Storage Volume (LUN)
A LUN-like storage service where the blocks are stored on HDDS
• Volume: a raw-block device that can be used to create a mountable disk on Linux
• Raw blocks are those of the native FS that will use the LUN volume
• Raw-block size is dictated by the native FS, e.g. ext4 (4K)
• Raw blocks are the unit of IO operations by native file systems
• The raw block is the unit of read/write/update to HDDS (see the mapping sketch below)
• Ozone and Quadra share HDDS as a common storage backend
• Current prototype: 1 raw block = 1 HDDS block (but this will change later)
Can be used in Kubernetes for container state
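A hedged sketch of the prototype's address mapping (1 raw block = 1 HDDS block, 4K raw blocks as with ext4); the names are illustrative:

```java
/** Maps a native-FS byte offset to the raw block holding it; in the current
 *  prototype each 4K raw block is one HDDS block, so this index also names
 *  the HDDS block to read or write. */
final class QuadraAddressing {
  static final int RAW_BLOCK_SIZE = 4096;   // dictated by the native FS (ext4)

  /** Which raw block (and thus which HDDS block) covers this byte offset. */
  static long rawBlockIndex(long byteOffset) {
    return byteOffset / RAW_BLOCK_SIZE;
  }

  /** Offset within that raw block where the IO starts. */
  static int offsetInBlock(long byteOffset) {
    return (int) (byteOffset % RAW_BLOCK_SIZE);
  }
}
```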
Quadra
[Diagram: a host runs a Linux distro whose kernel SCSI Initiator talks to a user-space jSCSI Quadra plugin (Ref: http://jscsi.org/); the Quadra Volume Manager exposes a Volume API and coordinates with the SCM; the data lives in containers on the DataNodes. HDDS is Ozone's storage layer.]
Quadra Volume Manager
• Quadra volumes are virtual block devices that are mounted on hosts over SCSI
• The Volume Manager tracks the names of volumes and the storage containers assigned to them
• The Volume Manager talks to the CM to get containers allocated for a volume
• The Volume Manager assigns leases for clients to mount a volume (a lease sketch follows below)
• Volume Manager state is persisted and replicated via Raft
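A minimal sketch of the mount-lease idea; the TTL, the single-writer rule, and all names are assumptions, and the real Volume Manager would persist and replicate this state via Raft rather than keep it in a local map:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustrative lease table: one client may mount a volume at a time, and a
 *  lease must be renewed before it expires or the volume becomes mountable
 *  again. Real state would go through the replicated Raft log. */
class VolumeLeases {
  private static final Duration LEASE_TTL = Duration.ofSeconds(60); // assumption

  record Lease(String clientId, Instant expiry) {}

  private final Map<String, Lease> leases = new HashMap<>();

  synchronized boolean acquire(String volume, String clientId) {
    Lease cur = leases.get(volume);
    if (cur != null && cur.expiry().isAfter(Instant.now())
        && !cur.clientId().equals(clientId)) {
      return false;                        // another client holds the mount
    }
    leases.put(volume, new Lease(clientId, Instant.now().plus(LEASE_TTL)));
    return true;                           // granted or renewed
  }
}
```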
Status
HDDS: Block containers
• 2-4GB block containers initially
– 40-80x reduction in BR size and block map
– Reduces BR pressure on the NN/Ozone Master
• Initial version to scale to tens of billions of blocks
Ozone Master
• Implemented using RocksDB (just like HDDS on the DNs)
• Initial version to scale to 10 billion objects
Current status and steps to GA
• Stabilize HDDS and Ozone
• Measure and improve performance
• Add HA for the Ozone Master and Container Manager
• Add security – security design completed and published
After GA
• Further stabilization and performance improvements
• Transparent encryption
• Erasure coding
• Snapshots (or their equivalent)
• …
Summary
• HDFS scale proven in real production systems
• 4K+ clusters
• Raw Storage >200PB in single federated NN cluster and >30PB in non-federated clusters
• Scales to 60K+ concurrent clients bombarding the NN
• But very large number of small files is a challenge (500M files)
• HDDS + Ozone: Scalable Hadoop Storage
• Retains
• HDFS block-storage fault tolerance
• HDFS horizontal scaling for storage and IO
• HDFS's "move computation to storage" model
• HDDS: Block containers:
• Initially scale to 10B blocks, later to 100B+ blocks (HDFS-7240)
• Ozone – Flat KV namespace + Hadoop Compatible FS (OzoneFS)
• initially scale to 10B files (HDFS-13074)
• The community is working on a hierarchical namespace on HDDS (HDFS-10419)
Thank You
Q&A
