HERD: Using RDMA Efficiently for Key-Value Services
Presented by Hanjun Xiao
CIS 800/003, Jan 27th, 2015
Background
- Key-Value stores (omitted)
- Remote Direct Memory Access (RDMA)
- RDMA-capable Network Interface Controller (RNIC)
- Comparison with ‘classical Ethernet’
  - Hardware-based network stack
  - Kernel bypass
  - Nearly the same cost
More about RDMA
- Memory semantics vs. channel semantics
  - One-sided: READ/WRITE
  - Two-sided: SEND/RECV
- Queue pair
  - Send/Receive queue
  - Completion queue
- Transport types
  - Reliable Connection (RC)
  - Unreliable Connection (UC)
  - Unreliable Datagram (UD)
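
The transport types differ in which verbs they can carry, which matters when choosing between one-sided and two-sided operations. A minimal Python sketch of that relationship follows. The table reflects standard InfiniBand behavior (atomic operations are left out), and the names `SUPPORTED_VERBS`, `supports`, and `is_one_sided` are illustrative only, not part of any RDMA library.

```python
# Which verbs each transport type can carry (standard InfiniBand behavior).
# Atomics are omitted to mirror the four verbs listed on the slide.
# All names here are hypothetical and exist only for this illustration.
SUPPORTED_VERBS = {
    "RC": {"SEND", "RECV", "WRITE", "READ"},  # Reliable Connection
    "UC": {"SEND", "RECV", "WRITE"},          # Unreliable Connection: no READ
    "UD": {"SEND", "RECV"},                   # Unreliable Datagram: channel verbs only
}

# One-sided (memory) verbs complete without involving the remote CPU.
ONE_SIDED = {"READ", "WRITE"}


def supports(transport, verb):
    """Return True if the given transport type can carry the verb."""
    return verb in SUPPORTED_VERBS[transport]


def is_one_sided(verb):
    """Return True for memory-semantics verbs (READ/WRITE)."""
    return verb in ONE_SIDED
```

For instance, `supports("UC", "READ")` is false: READs need a reliable connection, while WRITEs also work over UC.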
Asymmetric system model
- Server is the bottleneck
- Not at the maximum possible performance
- Current state of the art
  - Aim for low CPU use (Pilaf)
  - Use RDMA reads as a building block (Pilaf, FaRM)
  - Channel verbs are thought to be slower
- What’s NOT the goal
  - Fault tolerance
Approach
- Scalability (~200) and consistency
- Build on MICA (NSDI ’14, to be presented)
- High performance
- Be critical of conventional wisdom
- READ-based?
- Identify the bottleneck and work around it
- Comprehensive experiments
- CPU interrupts
- RTT
Life of a WRITE
Optimized WRITEs
- Inlined
- Unreliable
- Unsignalled
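
Each of the three optimizations removes one step from the life of a WRITE. The Python sketch below is a simplified model of that idea, not a description of any specific RNIC: the step list is an assumption based on how RDMA WRITEs commonly proceed, and `WriteRequest` and `write_steps` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRequest:
    """Hypothetical description of one RDMA WRITE and its options."""
    inlined: bool = False    # payload carried inside the work request itself
    reliable: bool = True    # True models RC, False models UC
    signalled: bool = True   # True asks the RNIC for a completion entry


def write_steps(wr):
    """List the (simplified, assumed) steps a WRITE goes through."""
    steps = ["requester CPU posts the work request to the RNIC (PIO)"]
    if not wr.inlined:
        # Without inlining, the RNIC must fetch the payload separately.
        steps.append("requester RNIC reads the payload from host memory (DMA)")
    steps.append("requester RNIC sends the WRITE packet")
    steps.append("responder RNIC writes the payload into host memory (DMA)")
    if wr.reliable:
        # Reliable transports acknowledge every request.
        steps.append("responder RNIC returns an ACK")
    if wr.signalled:
        # Signalled requests generate a completion queue entry.
        steps.append("requester RNIC writes a completion entry (DMA)")
    return steps


baseline = write_steps(WriteRequest())
optimized = write_steps(WriteRequest(inlined=True, reliable=False, signalled=False))
print(len(baseline), "steps without optimizations,", len(optimized), "with all three")
```

In this model, the fully optimized WRITE keeps only the post, the packet, and the remote memory write. Real deployments still post an occasional signalled request so that send queue slots can be reclaimed.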
WRITE is better!
[Figures: inbound throughput; outbound throughput]
More optimization
- Batching
- Pipelining
- Prefetching
- Pre-allocation
Results: Throughput
Latency vs. Throughput
Latency vs. Throughput
Limitations
- Processes pinned to cores
- Asymmetric system model
- Lack of generality
- More like an engineering effort?
Conclusions
- Some design principles
  - Experiment and analyze
  - Bypass the bottleneck
  - Optimize for the common case
  - Challenge conventional wisdom
- Holistic architectural design of a single data center
  - Distributed storage, CPUs, I/O path
- May be good to take CIS 501!
Future work
- “There is no best design. Name your goals.”
- Recipes for various combinations of goals
- Examine bottlenecks
Thank you!
Discussion
