Offloading Deep Dive
Efstathios Efstathiou
Agenda
Introduction
Definition of offloading (DB view)
Offloading techniques we can use
Demo-Time ☺
Findings
Q&A
Introduction
About me
Married
Linux since 1998
Oracle since 2000
OCM & OCP
Master Database Engineer @BIT since 2014
Definition of offloading (DB view)
In general:
«Everything that saves resources on the database server»
Definition of offloading (DB view)
Examples of offloading implementations
NIC (TCP/IP Offload, iSCSI Offload, Infiniband RDMA, NVMe)
Storage Adapters (RAID Calculation, SCSI)
Math Co-Processors
FPGAs
DMA-Engines
Distributed Computing (e.g. using MPI)
Remote DB Engine (Hadoop Connector, Gluent)
Definition of offloading (DB view)
How is it done on Exadata?
Offloading via the DMA engine of the Infiniband HCA
Enables Remote DMA (RDMA) operations (DB to Cell); a minimal RDMA read sketch follows this slide
The storage cell can be accessed at near-zero CPU cost
Latency of a DMA operation is higher than PIO via the CPU, so it is good for large amounts of data (e.g. DWH) but worse for OLTP
The task can be distributed
A cell is instructed, e.g. via an MPI call, to execute a sub-query and to transmit the start and end memory address of the result to the requester (DB server)
The DB server then only needs to merge the partial results
In this sense the DB server acts more like a client
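A minimal sketch of what such a one-sided RDMA read looks like from the requester's side, using the Linux ibverbs API. This is not the Exadata code path, just an illustration of the mechanism; it assumes an already-connected RC queue pair `qp`, a protection domain `pd`, a completion queue `cq`, and a `remote_addr`/`remote_rkey` pair exchanged out of band (e.g. via rdma_cm or a TCP socket).

```c
/* Sketch: post a one-sided RDMA READ so the remote host's memory is
 * pulled into a local buffer by the HCA's DMA engine, without the
 * remote CPU being involved. QP setup and the out-of-band exchange of
 * remote_addr/remote_rkey are assumed to have happened already.
 * Compile with -libverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int rdma_read_block(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                    uint64_t remote_addr, uint32_t remote_rkey, size_t len)
{
    /* Register a local buffer so the HCA may DMA into it. */
    void *buf = malloc(len);
    if (!buf) return -1;
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) { free(buf); return -1; }

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided read */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* exchanged out of band */
    wr.wr.rdma.rkey        = remote_rkey;

    if (ibv_post_send(qp, &wr, &bad)) { ibv_dereg_mr(mr); free(buf); return -1; }

    /* Poll the completion queue until the transfer has finished. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    int ok = (wc.status == IBV_WC_SUCCESS) ? 0 : -1;

    ibv_dereg_mr(mr);
    free(buf);
    return ok;
}
```

The point of the sketch is the one-sided nature: once the work request is posted, the data movement is handled entirely by the HCA's DMA engine, which is the property that keeps CPU cost on the remote side near zero.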
Offloading techniques we can use
The following devices have a DMA engine:
RDMA-enabled network adapters and Infiniband cards
Intel IOAT DMA chip on Xeon boards (for NVMe SSDs)
PCIe switch cards
PLX-based NVMe controllers
Or the PCIe chip in your Intel Xeon computer ;-)
Lowest latency
Offloading techniques we can use
The following protocols have (R)DMA support:
iSCSI over RDMA (iSER)
NFS over RDMA
NVMe over Fabrics (RDMA-based) or RDMA Block Device (target-side sketch below)
Needs the least CPU
Good starting point
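As an illustration of why NVMe over Fabrics is a good starting point, the target side can be set up entirely through the Linux nvmet configfs interface. The sketch below shows that sequence in C; the NQN `testnqn`, the backing device `/dev/nvme0n1` and the address `192.168.1.10` are placeholders, and the `nvmet` and `nvmet-rdma` kernel modules are assumed to be loaded.

```c
/* Sketch: export a local NVMe namespace as an NVMe over Fabrics (RDMA)
 * target via the kernel's nvmet configfs tree. Run as root; all names,
 * addresses and the backing device are illustrative placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void put(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fprintf(f, "%s\n", val);
    fclose(f);
}

int main(void)
{
    const char *sub = "/sys/kernel/config/nvmet/subsystems/testnqn";
    const char *ns  = "/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1";
    const char *prt = "/sys/kernel/config/nvmet/ports/1";

    /* 1. Create the subsystem and allow any host to connect (demo only). */
    mkdir(sub, 0755);
    put("/sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host", "1");

    /* 2. Add a namespace backed by a local NVMe device. */
    mkdir(ns, 0755);
    put("/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path",
        "/dev/nvme0n1");
    put("/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable", "1");

    /* 3. Create an RDMA port and link the subsystem into it. */
    mkdir(prt, 0755);
    put("/sys/kernel/config/nvmet/ports/1/addr_trtype",  "rdma");
    put("/sys/kernel/config/nvmet/ports/1/addr_adrfam",  "ipv4");
    put("/sys/kernel/config/nvmet/ports/1/addr_traddr",  "192.168.1.10");
    put("/sys/kernel/config/nvmet/ports/1/addr_trsvcid", "4420");
    if (symlink(sub, "/sys/kernel/config/nvmet/ports/1/subsystems/testnqn"))
        perror("symlink");

    puts("Target exported; the initiator connects with `nvme connect -t rdma ...`.");
    return 0;
}
```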
Offloading techniques we can use
Comparison (Native PCIe fabric vs. NVMe over Fabrics)
Native PCIe fabric has significantly lower latency
Setup with PCIe-JBOF is less complex than NVMe over Fabrics
Throughput is identical
Offloading techniques we can use
That PCIe is quite cool… What other tricks can it do?
A DMA engine, just like Infiniband
Connecting multiple PCIe root complexes via a Non-Transparent Bridge (NTB); a BAR-mapping sketch follows this slide
IPoPCIe network protocol, analogous to IPoIB but with much better performance
Device sharing via I/O virtualization (SR-IOV, MR-IOV)
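How a remote root complex's memory becomes reachable through an NTB is easiest to see at the PCIe level: the bridge exposes a BAR on the local side, and accesses into that BAR are translated into the other host's address space. A hedged sketch of mapping such a BAR from user space via sysfs; the device address `0000:03:00.1` and the window size are made-up examples.

```c
/* Sketch: map a PCIe BAR exposed by an NTB (or any PCIe device) into
 * user space via the sysfs resource file, then touch it with plain
 * loads/stores (PIO). The device address 0000:03:00.1 and the 1 MiB
 * window size are illustrative only; requires root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char  *bar = "/sys/bus/pci/devices/0000:03:00.1/resource0";
    const size_t win = 1 << 20;             /* assumed window size */

    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* MAP_SHARED so stores actually reach the device / remote host. */
    volatile uint32_t *p = mmap(NULL, win, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 0xCAFEBABE;                      /* PIO write through the NTB */
    printf("readback: 0x%08x\n", p[0]);     /* PIO read */

    munmap((void *)p, win);
    close(fd);
    return 0;
}
```

The same window can instead be handed to a DMA engine for bulk transfers, which is the trade-off explored in Demo 2.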
Offloading techniques we can use
How do we get the system really fast?
Answer: Memory!
The only question is:
Which memory?
Where is it located?
How is it structured?
Demo-Time ☺
Demo 1: Device Sharing
Description
Host 1 has an SR-IOV-capable NIC
Host 1 initializes a Virtual Function (see the sysfs sketch below)
Through a Non-Transparent Bridge (NTB), Host 2 can access that function by loading the NIC's device driver
https://www.youtube.com/watch?v=GPh0Ms3dfPo
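Creating the Virtual Function on Host 1 uses the generic Linux sysfs SR-IOV interface; a minimal sketch is shown below. The NIC's PCI address `0000:01:00.0` is a placeholder, and the NIC driver and firmware must support SR-IOV.

```c
/* Sketch: enable an SR-IOV Virtual Function on a NIC through the
 * standard Linux sysfs interface, as Host 1 does in the demo.
 * The PCI address 0000:01:00.0 is a placeholder; requires root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/sriov_numvfs";

    FILE *f = fopen(path, "w");
    if (!f) { perror("fopen"); return 1; }

    /* Writing N creates N Virtual Functions; writing 0 removes them. */
    if (fprintf(f, "1\n") < 0) { perror("write"); fclose(f); return 1; }
    fclose(f);

    puts("Requested 1 VF; check `lspci` for the new function.");
    return 0;
}
```

The VF then shows up as an ordinary PCIe function, which is what allows Host 2 to bind the normal NIC driver to it across the NTB.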
Demo-Time ☺
Demo 1: Device Sharing
Expected behaviour
Works as designed ☺
Depending on the PCIe switch chip used, there are device driver dependencies
Demo-Time ☺
Demo 2: DMA-Transfer
Description
Host 1 and Host 2 are fitted with a PCIe switch based host card and connected back to back
The PLX SDK comes with a sample program supporting PIO and DMA transfers
We measure the overall throughput and CPU load (measurement sketch below)
https://www.youtube.com/watch?v=LNPBr3WvuNg
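The PLX SDK sample program itself is not reproduced here; the sketch below only illustrates the measurement side of the demo, i.e. how throughput and CPU cost of a transfer are compared. A plain `memcpy` stands in for the actual transfer call, and buffer size and iteration count are arbitrary.

```c
/* Sketch: measure throughput and CPU cost of a bulk transfer, the two
 * metrics the demo compares between PIO and DMA modes. memcpy() stands
 * in for the real PLX SDK transfer call; sizes are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

#define BUF_SIZE (64UL << 20)   /* 64 MiB per transfer */
#define ITERS    32

int main(void)
{
    char *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);
    if (!src || !dst) return 1;
    memset(src, 0xA5, BUF_SIZE);

    struct timespec t0, t1;
    struct rusage r0, r1;
    getrusage(RUSAGE_SELF, &r0);
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_SIZE);              /* transfer under test */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    getrusage(RUSAGE_SELF, &r1);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double cpu  = (r1.ru_utime.tv_sec - r0.ru_utime.tv_sec)
                + (r1.ru_utime.tv_usec - r0.ru_utime.tv_usec) / 1e6
                + (r1.ru_stime.tv_sec - r0.ru_stime.tv_sec)
                + (r1.ru_stime.tv_usec - r0.ru_stime.tv_usec) / 1e6;
    double gib  = (double)BUF_SIZE * ITERS / (1UL << 30);

    printf("throughput: %.2f GiB/s, CPU busy: %.0f%% of wall time\n",
           gib / wall, 100.0 * cpu / wall);

    free(src); free(dst);
    return 0;
}
```

With a CPU-driven (PIO-style) copy the CPU-busy figure stays near 100% of wall time; with a DMA engine doing the same transfer it drops sharply, which is exactly the contrast shown in the video.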
Demo-Time ☺
Demo 2: DMA-Transfer
Expected behaviour
Large data transfers benefit from DMA (DWH) ☺
Small, time-critical transfers have lower latency with PIO (OLTP)
You’ll need both modes
Demo-Time ☺
Demo 3: Fabric Attached Memory (PCIe) and Oracle RAC
Description
Database and memory hosts are fitted with a PCIe switch based host card and connected to a central PCIe switch
The memory hosts' physical DRAM is expanded with OptaneGrid 3D XPoint into an SDM pool (mirrored via PCIe NTB)
Database servers expose a tiered PMEM device using local DRAM (mirrored via PCIe NTB) and the remote SDM pool (accessed over PCIe NTB); a minimal PMEM access sketch follows the diagram
ASM High Redundancy on top of the PMEM devices, with preferred mirror read and device-mapper path swapping
[Diagram: RAC nodes db0–db2 run ASM on tiered PMEM devices (DRAM expansion); a central PCIe switch and NTB domain connect them to memory hosts mem0–mem2, each providing an SDM pool built from DRAM and OptaneGrid Optane modules.]
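What "exposing a tiered PMEM device" buys the database is direct load/store access instead of a block I/O path. The sketch below shows that access pattern in its simplest form; the DAX mount point `/mnt/pmem` and the mapping size are assumptions, not part of the demo setup.

```c
/* Sketch: direct load/store access to persistent memory through a
 * DAX-mounted filesystem -- the kind of byte-addressable path a PMEM
 * device offers instead of block I/O. The mount point /mnt/pmem and
 * the mapping size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/mnt/pmem/demo.dat", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    /* With a DAX mount this mapping points straight at the media:
     * no page cache, no block layer, just loads and stores. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello, fabric attached memory");
    msync(p, len, MS_SYNC);          /* flush stores for persistence */

    printf("read back: %s\n", p);
    munmap(p, len);
    close(fd);
    return 0;
}
```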
Demo-Time ☺
Demo 3: Fabric Attached Memory (PCIe) and Oracle RAC
16 GB/s throughput per licensable core (4 cores, 8 threads per DB node)
85% of native aggregated memory controller performance
Findings
Generic offloading is possible per se, but different than expected:
Fabric Attached Memory
Yes, the DB is running in memory (mirrored)
The questions are:
In which server's memory (local or remote)?
How do we access it (local memory extension or DMA call)?
How is it constructed (DRAM or Software Defined Memory)?
With the right PCIe switch and storage module combination you get it to work
Any PCIe-capable host can use Fabric Attached Memory per se
An OpenMCCA-compatible PCIe switch (PLX 9700) and high-performance M.2 SSDs such as Optane Memory or fast NVMe modules are required
Q&A
Thanks to our supporters
Contact Information
elgreco@linux.com
Thanks