1. Ceph Design Summit – Jewel
-- Hadoop over RGW with SSD Cache Status update
yuan.zhou@intel.com
7/2015
2. Content
• Hadoop over RGW with SSD cache design
• Status update since Infernalis
• Performance of Hadoop over Swift
3. Ceph RGW with SSD cache
[Architecture diagram: on each server in Rack 1 and Rack 2, M/R tasks use the RGWFS FileSystem interface to reach a vanilla RGW instance; the Scheduler uses RGWFS plus the new RGW-Proxy in front of the RGW service; underneath, RADOS runs an SSD Cache Tier over a Base Tier on an isolated network, together with the MONs.]
1. The Scheduler asks the RGW service where a particular block is located (control path); RGW-Proxy returns the closest active RGW instance(s).
2. The Scheduler allocates the task on a server that is near the data.
3. The task accesses the data from the nearby RGW instance (data path).
4. RGW gets/puts data from the Cache Tier (CT), and the CT gets/puts data from the Base Tier (BT) if necessary (data path).
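A minimal sketch of this flow from the client side, assuming the /container/object/block URL scheme shown in the curl example on the next slide; the proxy endpoint, port, and the 64 MB block size are illustrative assumptions, not the final API:

```python
# Sketch of the control path (step 1) and data path (steps 2-3).
# All endpoint names here are illustrative assumptions.
import requests

PROXY = "http://rgw-proxy:8080"  # assumed RGW-Proxy endpoint

def read_block(container, obj, block_no, block_size=64 * 1024 * 1024):
    # Step 1 (control path): ask RGW-Proxy where the block lives;
    # it answers with RGW URLs sorted closest-first (see next slide).
    resp = requests.get(f"{PROXY}/{container}/{obj}/{block_no}")
    closest_rgw = resp.text.split(",")[0].strip()
    # Steps 2-3 (data path): the scheduler places the task near the
    # data; the task range-GETs its block from the closest RGW.
    start = block_no * block_size
    rng = {"Range": f"bytes={start}-{start + block_size - 1}"}
    return requests.get(closest_rgw, headers=rng).content
```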
4. Status update
• RGW-Proxy (done)
• RESTful service based on Python WSGI (see the sketch after this list)
• Returns the block location(s) for a RESTful request, sorted by the distance between the RGW instances and the data OSD
• Example: curl http://RGW-proxy/con/test1/1
• Returns: http://192.168.6.115/con/test1, http://192.168.6.114/con/test1
• RGWFS (70% done)
1. Forked from SwiftFS; initially RGWFS could only talk to a single RGW instance, so each GET/PUT had to go through that one RGW
2. Now, with RGW-Proxy, it can talk to multiple RGW instances
3. We also added a 'block-level location-aware read' feature
• Performance testing
• Baseline performance testing with HDFS and Swift is done
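A minimal WSGI sketch of the request/response contract shown in the curl example above; the lookup function is a stub (the real service sorts locations by distance between RGW instances and the data OSD, see slide 6):

```python
# Sketch of the RGW-Proxy WSGI service; the lookup is stubbed out.
from wsgiref.simple_server import make_server

def lookup_locations(container, obj, block_no):
    # Stub: the real service resolves the block through the head-object
    # manifest and 'ceph osd map', then sorts RGWs closest-first.
    return [f"http://192.168.6.115/{container}/{obj}",
            f"http://192.168.6.114/{container}/{obj}"]

def app(environ, start_response):
    # Request path format, as in the curl example: /<container>/<object>/<block>
    container, obj, block = environ["PATH_INFO"].lstrip("/").split("/")
    body = ", ".join(lookup_locations(container, obj, int(block)))
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()
```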
5. RGWFS – a new adaptor for HCFS
• New filesystem URL: rgw://
1. Forked from HADOOP-8545 (SwiftFS)
2. With this plugin, Hadoop can talk to an RGW cluster
3. A new 'block' concept was added, since Swift doesn't support blocks
• The scheduler can therefore use multiple tasks to access the same file
• Based on the block locations, RGWFS can read from the closest RGW through the range GET API
• For PUT, however, all traffic still goes through a single RGW
[Diagram: the Scheduler's RGWFS FileSystem interface speaks rgw:// to the RGW instances, which store file1 in RADOS.]
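The 'block' concept is what lets the scheduler parallelize over a single object. A minimal sketch of how fixed-size splits could be derived (the 64 MB size follows slide 7; the function name is hypothetical):

```python
# Sketch of the 'block' concept layered on top of Swift-style objects:
# a fixed block size lets the scheduler hand each task its own byte
# range, which RGWFS then serves with an HTTP range GET.
BLOCK_SIZE = 64 * 1024 * 1024  # e.g. the 64 MB chunk size from slide 7

def block_splits(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) splits covering a file of file_size bytes."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

# A 200 MB object yields four splits -> four tasks can read it in parallel.
print(block_splits(200 * 1024 * 1024))
```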
6. RGW-Proxy – give out the closest RGW instance
1. Before a get/put, RGWFS asks RGW-Proxy for the location of each block:
1. A topology file of the cluster is generated beforehand
2. RGW-Proxy first fetches the manifest from the head object (librados + getxattr)
3. Based on the CRUSH map, RGW-Proxy then resolves the location of each block object (ceph osd map)
4. Given the data OSD, RGW-Proxy finds the closest RGW instance via a simple lookup in the topology file
[Diagram: the Scheduler's RGWFS FileSystem interface queries the new RGW-Proxy, which fronts the RGW instances storing file1 in RADOS.]
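A hedged sketch of steps 3 and 4, assuming a JSON topology file keyed by OSD id; the manifest lookup (librados + getxattr) is elided, and only the 'ceph osd map' call and the topology lookup are shown:

```python
# Sketch of resolving a block object to its primary OSD and the
# nearest RGW. The topology-file format is an assumption.
import json
import subprocess

# Assumed format: OSD id -> list of RGW endpoints, nearest first.
TOPOLOGY = json.load(open("/etc/rgw-proxy/topology.json"))

def closest_rgw(pool, block_obj):
    # 'ceph osd map <pool> <obj>' resolves an object to its placement
    # group and acting OSD set via the CRUSH map.
    out = subprocess.check_output(
        ["ceph", "osd", "map", pool, block_obj, "--format=json"])
    primary = json.loads(out)["acting"][0]  # primary OSD in the common case
    return TOPOLOGY[str(primary)][0]  # simple lookup in the topology file
```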
7. RGW (vanilla) – serve the data requests
1. Set up RGW on a Cache Tier so we can use SSDs as the cache
1. With a dedicated chunk size, e.g., 64 MB, since the data are usually quite big
2. Controlled by rgw_max_chunk_size and rgw_obj_stripe_size
2. Based on the account/container/object name, RGW can get/put the content
1. Using range reads to fetch each chunk
3. We'll use write-through mode as a starting point, to bypass the data consistency issue
[Diagram: the Scheduler's RGWFS FileSystem interface talks to the (modified) RGW gateways via the new RGW-Proxy; RADOS serves them from an SSD Cache Tier over the Base Tier, together with the MONs.]
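For reference, a sketch of the cache-tier plumbing using the stock Ceph tiering commands; the pool names are assumptions, and since the stock cache modes of that era are writeback/forward/readonly/readproxy, the deck's 'write-through' starting point is approximated here with writeback (the chunk sizes above would be set via rgw_max_chunk_size / rgw_obj_stripe_size in ceph.conf):

```python
# Sketch: put the RGW data pool behind an SSD cache pool with the
# standard Ceph tiering commands. Pool names are assumptions.
import subprocess

BASE, CACHE = ".rgw.buckets", ".rgw.buckets.cache"  # assumed pool names

def ceph_osd(*args):
    subprocess.check_call(["ceph", "osd", *args])

ceph_osd("tier", "add", BASE, CACHE)                # attach SSD pool as cache tier
ceph_osd("tier", "cache-mode", CACHE, "writeback")  # mode is a policy choice
ceph_osd("tier", "set-overlay", BASE, CACHE)        # route client I/O via the cache
```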
8. HDFS vs Swift
[Deployment diagrams: (a) HDFS – a Name Node plus hosts each co-running a Data Node and MapReduce; (b) Swift with list-endpoints – a Proxy-Server plus hosts each co-running an Object-Server and MapReduce; (c) Swift without list-endpoints – the same hosts, but all data flows through the Proxy-Server.]
IMPACT (less is better): HDFS 1X; Swift with list-endpoints 1.25X; Swift without list-endpoints 1.67X
• The impact of list-endpoints is huge
• Swift's remaining overhead comes from "rename"
9. Rename in Reduce Task
• The output of the reduce function is written to a temporary location in HDFS. After the task completes, the output is automatically renamed from its temporary location to its final location.
• Object storage cannot support rename, so SwiftFS uses "copy and delete" to implement the rename function.
HDFS rename -> change metadata in the Name Node
Swift rename -> copy a new object and delete the old one in Swift
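A minimal sketch of that copy-and-delete rename against the Swift API; PUT with an X-Copy-From header is the standard server-side copy, and auth handling is elided:

```python
# Sketch of rename emulated as server-side copy + delete: two
# data-touching object operations, vs. HDFS's single metadata update.
import requests

def swift_rename(storage_url, token, container, src, dst):
    auth = {"X-Auth-Token": token}
    # PUT with X-Copy-From makes the proxy copy the object's full data.
    requests.put(f"{storage_url}/{container}/{dst}",
                 headers={**auth, "X-Copy-From": f"/{container}/{src}"},
                 data=b"").raise_for_status()
    # Then delete the old object to complete the 'rename'.
    requests.delete(f"{storage_url}/{container}/{src}",
                    headers=auth).raise_for_status()
```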
10. Next Step
• Finish the development (70% done) and complete the performance testing work
• Given the Swift results, we'll need to solve the heavy rename issue, which may require patching RGW
• Open-source code repo (WIP)
• https://github.com/intel-bigdata/MOC
12. Deployment Consideration Matrix
• Storage: tenant- vs. admin-provisioned; disaggregated vs. collocated; HDFS vs. other options
• Compute: VM, container, bare-metal
• Distro/Plugin: Vanilla, CDH, HDP, MapR, Spark, Storm
• Data Processing API: traditional, EDP (Sahara native), 3rd-party APIs
Performance results in the next section
13. Storage Architecture
• Tenant-provisioned (in VMs)
• HDFS in the same VMs as the computing tasks vs. in different VMs
• Ephemeral disk vs. Cinder volume
• Admin-provided
• Logically disaggregated from the computing tasks; physical collocation is a matter of deployment
• For network-remote storage, Neutron DVR is a very useful feature
• A disaggregated (and centralized) storage system has significant value
• No data silos, more business opportunities
• Could leverage the Manila service
• Allows building advanced solutions (e.g., an in-memory overlay layer)
• More vendor-specific optimization opportunities
[Diagram, scenarios #1–#4:]
Scenario #1: computing and data services collocate in the same VMs
Scenario #2: the data service (HDFS) lives in the host world
Scenario #3: the data service lives in a separate VM world
Scenario #4: the data service is on the remote network (legacy NFS, GlusterFS, Ceph*, external HDFS, Swift)
14. Compute Engine
• VM
• Pros: best support in OpenStack; strong security
• Cons: slow to provision; relatively high runtime performance overhead
• Container
• Pros: lightweight, fast provisioning; better runtime performance than VM
• Cons: Nova-docker readiness; Cinder volume support is not ready yet; weaker security than VM; not the ideal way to use containers
• Bare-Metal
• Pros: best performance and QoS; best security isolation
• Cons: Ironic readiness; worst efficiency (e.g., consolidation of workloads with different behaviors); worst flexibility (e.g., migration); worst elasticity due to slow provisioning
Containers seem promising but still need better support.
Determining the appropriate cluster size is always a challenge for tenants, e.g., a small flavor with more nodes or a large flavor with fewer nodes.
Editor's Notes
A new adaptor for RGW (RGWFS): it basically follows SwiftFS, using RESTful requests to read/write the data. [Java] The path handling needs to change.
An RGW-Proxy to do the master selection: much like the Swift-Proxy, it gives out the RGW instance and the object address (IP, port, account/container/object). [Python or Java] Based on the container/object name, this daemon returns the closest RGW instance (and thus the closest rack).
We'll use SSDs as the cache layer here; it is actually a Cache Tier above the Base Tier.
An RGW cache-coherence algorithm: it runs among the RGW instances to keep their caches coherent. [C++] Based on the container/object name, an RGW instance reads the object out of the Ceph cluster.