Offloading Deep Dive
Efstathios Efstathiou
Agenda
Introduction
Definition of offloading (DB view)
Offloading techniques we can use
Demo-Time ☺
Findings
Q&A
Introduction
About me
Married
Linux since 1998
Oracle since 2000
OCM & OCP
Master Database Engineer @BIT since 2014
Definition of offloading (DB view)
In general:
«Everything that saves resources on the database server»
Definition of offloading (DB view)
Examples of offloading implementations
NIC (TCP/IP Offload, iSCSI Offload, Infiniband RDMA, NVMe)
Storage Adapters (RAID Calculation, SCSI)
Math Co-Processors
FPGAs
DMA-Engines
Distributed Computing (e.g. using MPI)
Remote DB Engine (Hadoop Connector, Gluent)
Definition of offloading (DB view)
How is it done on Exadata?
Offloading via the DMA engine of the Infiniband HCA
Enables Remote DMA (RDMA) operations (DB to Cell); a minimal RDMA read sketch follows this slide
The storage cell can be accessed at near-zero CPU cost
Latency of a DMA operation is higher than PIO via the CPU, so it is good for large amounts of data (e.g. DWH) but worse for OLTP
The task can be distributed
A cell is instructed, e.g. via an MPI call, to execute a sub-query and to transmit the start and end memory address of the result to the requester (DB server)
The DB server then only needs to merge the partial results
In this sense the DB server acts more like a client
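A minimal sketch of what such a one-sided RDMA read looks like from the requester's side, using the Linux ibverbs API. This is not the Exadata code path, just an illustration of the mechanism; it assumes an already-connected RC queue pair `qp`, a protection domain `pd`, a completion queue `cq`, and a `remote_addr`/`remote_rkey` pair exchanged out of band (e.g. via rdma_cm or a TCP socket).

```c
/* Sketch: post a one-sided RDMA READ so the remote host's memory is
 * pulled into a local buffer by the HCA's DMA engine, without the
 * remote CPU being involved. QP setup and the out-of-band exchange of
 * remote_addr/remote_rkey are assumed to have happened already.
 * Compile with -libverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int rdma_read_block(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                    uint64_t remote_addr, uint32_t remote_rkey, size_t len)
{
    /* Register a local buffer so the HCA may DMA into it. */
    void *buf = malloc(len);
    if (!buf) return -1;
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) { free(buf); return -1; }

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided read */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* exchanged out of band */
    wr.wr.rdma.rkey        = remote_rkey;

    if (ibv_post_send(qp, &wr, &bad)) { ibv_dereg_mr(mr); free(buf); return -1; }

    /* Poll the completion queue until the transfer has finished. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    int ok = (wc.status == IBV_WC_SUCCESS) ? 0 : -1;

    ibv_dereg_mr(mr);
    free(buf);
    return ok;
}
```

The point of the sketch is the one-sided nature: once the work request is posted, the data movement is handled entirely by the HCA's DMA engine, which is the property that keeps CPU cost on the remote side near zero.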
Offloading techniques we can use
The following devices have a DMA engine:
RDMA-enabled network adapters and Infiniband cards
Intel IOAT DMA chip on Xeon boards (for NVMe SSDs)
PCIe switch cards
PLX-based NVMe controllers
Or the PCIe chip in your Intel Xeon computer ;-)
Lowest latency
Offloading techniques we can use
The following protocols have (R)DMA support:
iSCSI over RDMA (iSER)
NFS over RDMA
NVMe over Fabrics (RDMA-based) or RDMA Block Device (target-side sketch below)
Needs the least CPU
Good starting point
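As an illustration of why NVMe over Fabrics is a good starting point, the target side can be set up entirely through the Linux nvmet configfs interface. The sketch below shows that sequence in C; the NQN `testnqn`, the backing device `/dev/nvme0n1` and the address `192.168.1.10` are placeholders, and the `nvmet` and `nvmet-rdma` kernel modules are assumed to be loaded.

```c
/* Sketch: export a local NVMe namespace as an NVMe over Fabrics (RDMA)
 * target via the kernel's nvmet configfs tree. Run as root; all names,
 * addresses and the backing device are illustrative placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void put(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fprintf(f, "%s\n", val);
    fclose(f);
}

int main(void)
{
    const char *sub = "/sys/kernel/config/nvmet/subsystems/testnqn";
    const char *ns  = "/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1";
    const char *prt = "/sys/kernel/config/nvmet/ports/1";

    /* 1. Create the subsystem and allow any host to connect (demo only). */
    mkdir(sub, 0755);
    put("/sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host", "1");

    /* 2. Add a namespace backed by a local NVMe device. */
    mkdir(ns, 0755);
    put("/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path",
        "/dev/nvme0n1");
    put("/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable", "1");

    /* 3. Create an RDMA port and link the subsystem into it. */
    mkdir(prt, 0755);
    put("/sys/kernel/config/nvmet/ports/1/addr_trtype",  "rdma");
    put("/sys/kernel/config/nvmet/ports/1/addr_adrfam",  "ipv4");
    put("/sys/kernel/config/nvmet/ports/1/addr_traddr",  "192.168.1.10");
    put("/sys/kernel/config/nvmet/ports/1/addr_trsvcid", "4420");
    if (symlink(sub, "/sys/kernel/config/nvmet/ports/1/subsystems/testnqn"))
        perror("symlink");

    puts("Target exported; the initiator connects with `nvme connect -t rdma ...`.");
    return 0;
}
```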
Offloading techniques we can use
Comparison (Native PCIe fabric vs. NVMe over Fabrics)
Native PCIe fabric has significantly lower latency
Setup with PCIe-JBOF is less complex than NVMe over Fabrics
Throughput is identical
Offloading techniques we can use
That PCIe is quite cool… What other tricks can it do?
A DMA engine, just like Infiniband
Connecting multiple PCIe root complexes via a Non-Transparent Bridge (NTB); a BAR-mapping sketch follows this slide
IPoPCIe network protocol, analogous to IPoIB but with much better performance
Device sharing via I/O virtualization (SR-IOV, MR-IOV)
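How a remote root complex's memory becomes reachable through an NTB is easiest to see at the PCIe level: the bridge exposes a BAR on the local side, and accesses into that BAR are translated into the other host's address space. A hedged sketch of mapping such a BAR from user space via sysfs; the device address `0000:03:00.1` and the window size are made-up examples.

```c
/* Sketch: map a PCIe BAR exposed by an NTB (or any PCIe device) into
 * user space via the sysfs resource file, then touch it with plain
 * loads/stores (PIO). The device address 0000:03:00.1 and the 1 MiB
 * window size are illustrative only; requires root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char  *bar = "/sys/bus/pci/devices/0000:03:00.1/resource0";
    const size_t win = 1 << 20;             /* assumed window size */

    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* MAP_SHARED so stores actually reach the device / remote host. */
    volatile uint32_t *p = mmap(NULL, win, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 0xCAFEBABE;                      /* PIO write through the NTB */
    printf("readback: 0x%08x\n", p[0]);     /* PIO read */

    munmap((void *)p, win);
    close(fd);
    return 0;
}
```

The same window can instead be handed to a DMA engine for bulk transfers, which is the trade-off explored in Demo 2.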
Offloading techniques we can use
How do we get the system really fast?
Answer: Memory!
The only question is:
Which memory?
Where is it located?
How is it structured?
Demo-Time ☺
Demo 1: Device Sharing
Description
Host 1 has an SR-IOV-capable NIC
Host 1 initializes a Virtual Function (see the sysfs sketch below)
Through a Non-Transparent Bridge (NTB), Host 2 can access that function by loading the NIC's device driver
https://www.youtube.com/watch?v=GPh0Ms3dfPo
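Creating the Virtual Function on Host 1 uses the generic Linux sysfs SR-IOV interface; a minimal sketch is shown below. The NIC's PCI address `0000:01:00.0` is a placeholder, and the NIC driver and firmware must support SR-IOV.

```c
/* Sketch: enable an SR-IOV Virtual Function on a NIC through the
 * standard Linux sysfs interface, as Host 1 does in the demo.
 * The PCI address 0000:01:00.0 is a placeholder; requires root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/sriov_numvfs";

    FILE *f = fopen(path, "w");
    if (!f) { perror("fopen"); return 1; }

    /* Writing N creates N Virtual Functions; writing 0 removes them. */
    if (fprintf(f, "1\n") < 0) { perror("write"); fclose(f); return 1; }
    fclose(f);

    puts("Requested 1 VF; check `lspci` for the new function.");
    return 0;
}
```

The VF then shows up as an ordinary PCIe function, which is what allows Host 2 to bind the normal NIC driver to it across the NTB.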
Demo-Time ☺
Demo 1: Device Sharing
Expected behaviour
Works as designed ☺
Depending on the PCIe switch chip used, there are device driver dependencies
Demo-Time ☺
Demo 2: DMA-Transfer
Description
Host 1 and Host 2 are fitted with a PCIe switch based host card and connected back to back
The PLX SDK comes with a sample program supporting PIO and DMA transfers
We measure the overall throughput and CPU load (measurement sketch below)
https://www.youtube.com/watch?v=LNPBr3WvuNg
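The PLX SDK sample program itself is not reproduced here; the sketch below only illustrates the measurement side of the demo, i.e. how throughput and CPU cost of a transfer are compared. A plain `memcpy` stands in for the actual transfer call, and buffer size and iteration count are arbitrary.

```c
/* Sketch: measure throughput and CPU cost of a bulk transfer, the two
 * metrics the demo compares between PIO and DMA modes. memcpy() stands
 * in for the real PLX SDK transfer call; sizes are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

#define BUF_SIZE (64UL << 20)   /* 64 MiB per transfer */
#define ITERS    32

int main(void)
{
    char *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);
    if (!src || !dst) return 1;
    memset(src, 0xA5, BUF_SIZE);

    struct timespec t0, t1;
    struct rusage r0, r1;
    getrusage(RUSAGE_SELF, &r0);
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_SIZE);              /* transfer under test */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    getrusage(RUSAGE_SELF, &r1);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double cpu  = (r1.ru_utime.tv_sec - r0.ru_utime.tv_sec)
                + (r1.ru_utime.tv_usec - r0.ru_utime.tv_usec) / 1e6
                + (r1.ru_stime.tv_sec - r0.ru_stime.tv_sec)
                + (r1.ru_stime.tv_usec - r0.ru_stime.tv_usec) / 1e6;
    double gib  = (double)BUF_SIZE * ITERS / (1UL << 30);

    printf("throughput: %.2f GiB/s, CPU busy: %.0f%% of wall time\n",
           gib / wall, 100.0 * cpu / wall);

    free(src); free(dst);
    return 0;
}
```

With a CPU-driven (PIO-style) copy the CPU-busy figure stays near 100% of wall time; with a DMA engine doing the same transfer it drops sharply, which is exactly the contrast shown in the video.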
Demo-Time ☺
Demo 2: DMA-Transfer
Expected behaviour
Large data transfers benefit from DMA (DWH) ☺
Small, time-critical transfers have lower latency with PIO (OLTP)
You’ll need both modes
Demo-Time ☺
Demo 3: Fabric Attached Memory (PCIe) and Oracle RAC
Description
Database and memory hosts are fitted with a PCIe switch based host card and connected to a central PCIe switch
The memory hosts' physical DRAM is expanded with OptaneGrid 3D XPoint into an SDM pool (mirrored via PCIe NTB)
Database servers expose a tiered PMEM device using local DRAM (mirrored via PCIe NTB) and the remote SDM pool (accessed over PCIe NTB); a minimal PMEM access sketch follows the diagram
ASM High Redundancy on top of the PMEM devices, with preferred mirror read and device-mapper path swapping
[Diagram: RAC nodes db0–db2 run ASM on tiered PMEM devices (DRAM expansion); a central PCIe switch and NTB domain connect them to memory hosts mem0–mem2, each providing an SDM pool built from DRAM and OptaneGrid Optane modules.]
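What "exposing a tiered PMEM device" buys the database is direct load/store access instead of a block I/O path. The sketch below shows that access pattern in its simplest form; the DAX mount point `/mnt/pmem` and the mapping size are assumptions, not part of the demo setup.

```c
/* Sketch: direct load/store access to persistent memory through a
 * DAX-mounted filesystem -- the kind of byte-addressable path a PMEM
 * device offers instead of block I/O. The mount point /mnt/pmem and
 * the mapping size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/mnt/pmem/demo.dat", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    /* With a DAX mount this mapping points straight at the media:
     * no page cache, no block layer, just loads and stores. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello, fabric attached memory");
    msync(p, len, MS_SYNC);          /* flush stores for persistence */

    printf("read back: %s\n", p);
    munmap(p, len);
    close(fd);
    return 0;
}
```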
Demo-Time ☺
Demo 3: Fabric Attached Memory (PCIe) and Oracle RAC
16 GB/s throughput per licensable core (4 cores, 8 threads per DB node)
85% of native aggregated memory controller performance
Findings
Generic offloading is possible per se, but different than expected:
Fabric Attached Memory
Yes, the DB is running in memory (mirrored)
The questions are:
In which server's memory (local or remote)?
How do we access it (local memory extension or DMA call)?
How is it constructed (DRAM or Software Defined Memory)?
With the right PCIe switch and storage module combination you get it to work
Any PCIe-capable host can use Fabric Attached Memory per se
An OpenMCCA-compatible PCIe switch (PLX 9700) and high-performance M.2 SSDs such as Optane Memory or fast NVMe modules are required
Q&A
Thanks to our supporters
Contact Information
elgreco@linux.com
Thanks