
Hadoop over RGW



  1. Ceph Design Summit – Jewel: Hadoop over RGW with SSD Cache, status update, 7/2015
  2. Content
     • Hadoop over RGW with SSD cache design
     • Status update since Infernalis
     • Performance of Hadoop over Swift
  3. Architecture overview
     1. The scheduler asks the RGW service where a particular block is located (control path)
        • RGW-Proxy (new) returns the closest active RGW instance(s)
     2. The scheduler places the task on the server that is near the data
     3. The task accesses the data from the nearby RGW (data path)
     4. RGW gets/puts data from the cache tier, and the cache tier gets/puts data from the base tier if necessary (data path)
     (Diagram: two racks of servers, each running M/R tasks over the RGWFS FileSystem Interface and a vanilla RGW; a new RGW-Proxy serves the scheduler; Ceph RADOS behind the gateways has an SSD cache tier over an HDD base tier, with MONs, on an isolated network.)
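The control-path steps above can be sketched from the scheduler's side. This is an illustrative model, not code from the deck: the proxy URL shape follows the `curl http://RGW-proxy/con/test1/1` example on the next slide, and `pick_server` stands in for the scheduler's data-local placement decision.

```python
# Sketch of the scheduler-side control path (steps 1-2). All names,
# URL shapes, and the JSON response format are assumptions.
import json
import urllib.request

def locate_block(proxy_url, container, obj, block_index):
    """Step 1: ask RGW-Proxy where a block lives; instances come back
    sorted nearest-first, e.g. ["server2:7480", "server1:7480"]."""
    url = "%s/%s/%s/%d" % (proxy_url, container, obj, block_index)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

def pick_server(locations, servers):
    """Step 2: prefer a compute server co-located with one of the
    returned RGW instances; otherwise fall back to any server."""
    for loc in locations:
        host = loc.split(":")[0]
        if host in servers:
            return host
    return servers[0]
```

`pick_server` is what makes step 3 a nearby read: the M/R task lands on a host that can reach an RGW instance without crossing racks.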
  4. Status update
     • RGW-Proxy (done)
       • RESTful service based on Python WSGI
       • Returns the block location(s) on request, sorted by the distance between the RGW instances and the data OSD
       • Example: curl http://RGW-proxy/con/test1/1
     • RGWFS (70% done)
       1. Forked from SwiftFS; RGWFS can talk to a single RGW instance
          • But only to that single instance: every GET/PUT has to go through it
       2. With RGW-Proxy, it can now talk to multiple RGW instances
       3. We also added a 'block-level location-aware read' feature
     • Performance testing
       • Baseline performance with HDFS and Swift is done
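Since the slide only shows the curl invocation, a minimal WSGI sketch of such an endpoint may help. The lookup table is a hypothetical stand-in; the real service consults librados and the crushmap (next slides), and the response format here is assumed.

```python
# Minimal WSGI sketch of an RGW-Proxy-style location endpoint.
# LOCATIONS is an illustrative stand-in for the real lookup.
import json

# (container, object, block-index) -> RGW instances, nearest first
LOCATIONS = {
    ("con", "test1", "1"): ["rgw-rack1:7480", "rgw-rack2:7480"],
}

def application(environ, start_response):
    # Expect paths like /con/test1/1, mirroring the curl example.
    parts = tuple(environ.get("PATH_INFO", "").strip("/").split("/"))
    if parts in LOCATIONS:
        body = json.dumps(LOCATIONS[parts]).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"unknown block"]
```

Any WSGI server (e.g. `wsgiref.simple_server`) can host `application` directly.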
  5. RGWFS – a new adaptor for HCFS
     • New filesystem URL: rgw://
       1. Forked from HADOOP-8545 (SwiftFS)
       2. Hadoop can talk to an RGW cluster through this plugin
       3. A new 'block' concept was added, since Swift doesn't support blocks
          • The scheduler can therefore use multiple tasks to access the same file
     • Based on the location, RGWFS reads from the closest RGW through the range GET API
     • For PUT, however, all traffic still goes through a single RGW
     (Diagram: Scheduler and the RGWFS FileSystem Interface speaking rgw:// to RGW instances backed by RADOS.)
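The block concept plus the range GET API together enable parallel reads of one object. A small sketch, assuming the 64 MB chunk size mentioned on the RGW slide; the host and URL layout are illustrative:

```python
# Sketch of a block-aligned range read against an RGW instance.
# BLOCK_SIZE matches the 64 MB chunking discussed later; the URL
# shape and host are assumptions for illustration.
import urllib.request

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB logical blocks

def block_range(block_index, block_size=BLOCK_SIZE):
    """Inclusive byte range [start, end] covered by one block."""
    start = block_index * block_size
    return start, start + block_size - 1

def read_block(rgw_host, container, obj, block_index):
    """Fetch one block from the chosen (closest) RGW via a range GET."""
    start, end = block_range(block_index)
    req = urllib.request.Request(
        "http://%s/%s/%s" % (rgw_host, container, obj),
        headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Each map task can call `read_block` with a different index against a different RGW instance, which is what makes the block-level location-aware read pay off.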
  6. RGW-Proxy – gives out the closest RGW instance
     1. Before a GET/PUT, RGWFS asks RGW-Proxy for the location of each block
        • A topology file of the cluster is generated beforehand
     2. RGW-Proxy first fetches the manifest from the head object (librados + getxattr)
     3. Based on the crushmap, RGW-Proxy then resolves the location of each object block (ceph osd map)
     4. Finally, RGW-Proxy finds the closest RGW from the data OSD info and the topology file (a simple lookup in the topology file)
     (Diagram: Scheduler and the RGWFS FileSystem Interface querying RGW-Proxy, then speaking rgw:// to RGW instances backed by RADOS.)
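Steps 3 and 4 can be sketched as follows. The `ceph osd map` invocation is a real CLI command, but its JSON field names can vary by release; `TOPOLOGY` is a hypothetical stand-in for the generated topology file.

```python
# Sketch of the proxy's lookup: resolve a block's primary OSD with
# `ceph osd map`, then sort RGW instances by topology distance.
# TOPOLOGY and the RGW names are illustrative assumptions.
import json
import subprocess

# host/daemon -> failure-domain bucket (e.g. rack), from the topology file
TOPOLOGY = {"osd.3": "rack1", "rgw-a": "rack1", "rgw-b": "rack2"}

def primary_osd(pool, obj):
    """Step 3: ask the cluster which OSD holds the object.
    Field names in the JSON may differ across Ceph releases."""
    out = subprocess.check_output(
        ["ceph", "osd", "map", pool, obj, "--format=json"])
    info = json.loads(out)
    return "osd.%d" % info["up"][0]  # primary of the up set

def sort_rgws_by_distance(rgws, osd, topology=TOPOLOGY):
    """Step 4: nearest first; an RGW in the same rack as the data
    OSD sorts ahead of the others."""
    rack = topology.get(osd)
    return sorted(rgws, key=lambda r: 0 if topology.get(r) == rack else 1)
```

The sorted list is exactly what the proxy returns to RGWFS on each location request.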
  7. RGW (vanilla) – serves the data requests
     1. Set up RGW on a cache tier so that SSDs serve as the cache
        • With a dedicated chunk size, e.g. 64 MB, since the data sets are usually large
        • Tuned via rgw_max_chunk_size and rgw_obj_strip_size
     2. Based on the account/container/object name, RGW can get/put the content
        • Range reads are used to fetch each chunk
     3. We use write-through mode as a starting point, to bypass the data-consistency issue
     (Diagram: Scheduler and the RGWFS FileSystem Interface in front of the RGW service (modified RGWs plus the new RGW-Proxy); Ceph RADOS behind the gateways with an SSD cache tier over the base tier, plus MONs.)
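A ceph.conf fragment matching the chunking described above might look like this. The section name is illustrative, the values are the slide's 64 MB expressed in bytes, and the option names are taken verbatim from the slide, not verified against a particular Ceph release:

```ini
; Sketch only: section name and values are illustrative, not a
; tested recommendation.
[client.rgw.gateway]
rgw_max_chunk_size = 67108864   ; 64 MB chunks for large, HDFS-style blocks
rgw_obj_strip_size = 67108864   ; option name as given on the slide
```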
  8. HDFS vs Swift
     • Three setups compared:
       • HDFS: Name Node plus Data Nodes, with MapReduce co-located on the data hosts
       • Swift with list-endpoints: Proxy-Server plus Object-Servers, with MapReduce co-located on the object hosts
       • Swift without list-endpoints: same layout, but all data traffic goes through the Proxy-Server
     • Impact on job time (less is better): HDFS 1X, Swift with list-endpoints 1.25X, Swift without list-endpoints 1.67X
     • The list-endpoints impact is huge
     • The remaining Swift overhead comes from 'rename'
  9. Rename in the reduce task
     • The output of the reduce function is written to a temporary location in HDFS; on completion, it is automatically renamed from its temporary location to its final location
     • Object storage cannot rename in place, so SwiftFS implements rename as 'copy and delete'
     • HDFS rename -> a metadata change in the Name Node
     • Swift rename -> copy a new object and delete the old one in Swift
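The cost difference is easy to see when both strategies are modeled on an in-memory store. Both functions below are illustrative models, not HDFS or SwiftFS code:

```python
# Contrast of the two rename strategies on a plain dict "store".

def hdfs_rename(namespace, src, dst):
    """HDFS: a single metadata update in the Name Node; no data moves."""
    namespace[dst] = namespace.pop(src)

def swift_rename(store, src, dst):
    """Swift: copy the full object body, then delete the original.
    Two requests, the whole object travels, and the sequence is not
    atomic, unlike the Name Node's metadata-only rename."""
    store[dst] = store[src]   # server-side copy of the entire payload
    del store[src]            # second request to remove the original
```

With multi-gigabyte reduce outputs, the full-body copy in `swift_rename` is exactly the overhead the previous slide attributes to 'rename'.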
  10. Next steps
     • Finish the development (70% done) and complete the performance testing work
     • Given the Swift results, we need to solve the heavy rename issue, which may require patching RGW
     • Open-source the code repo (WIP)
  11. Q&A
  12. Deployment consideration matrix
     • Storage: tenant- vs. admin-provisioned; disaggregated vs. collocated; HDFS vs. other options
     • Compute: VM, container, bare-metal
     • Distro/plugin: Vanilla, CDH, HDP, MapR, Spark, Storm
     • Data-processing API: traditional EDP (Sahara native), 3rd-party APIs
     • Performance results in the next section
  13. Storage architecture
     • Tenant-provisioned (in VMs)
       • HDFS in the same VMs as the computing tasks vs. in different VMs
       • Ephemeral disk vs. Cinder volume
     • Admin-provided
       • Logically disaggregated from the computing tasks; physical collocation is a matter of deployment
       • For network-remote storage, Neutron DVR is a very useful feature
     • A disaggregated (and centralized) storage system has significant value
       • No data silos, more business opportunities
       • Could leverage the Manila service
       • Allows advanced solutions (e.g. an in-memory overlay)
       • More vendor-specific optimization opportunities
     • Scenario #1: computing and data service collocate in the VMs
     • Scenario #2: data service lives in the host world
     • Scenario #3: data service lives in a separate VM world
     • Scenario #4: data service lives in the remote network (e.g. legacy NFS, GlusterFS, Ceph, external HDFS, Swift)
  14. Compute engine
     • VM
       • Pros: best support in OpenStack; strong security
       • Cons: slow to provision; relatively high runtime performance overhead
     • Container
       • Pros: light-weight, fast provisioning; better runtime performance than VM
       • Cons: Nova-docker readiness; Cinder volume support is not ready yet; weaker security than VM; not the ideal way to use containers
     • Bare-metal
       • Pros: best performance and QoS; best security isolation
       • Cons: Ironic readiness; worst efficiency (e.g. consolidating workloads with different behaviors); worst flexibility (e.g. migration); worst elasticity due to slow provisioning
     • Containers seem promising but still need better support
     • Determining the appropriate cluster size is always a challenge for tenants
       • e.g. a small flavor with more nodes vs. a large flavor with fewer nodes