FlashGrid, NVMe and 100Gbit
Infiniband – The Exadata Killer?
CEO, DATABASE-SERVERS.COM
Agenda
Goal / Objectives
Key components
Kernel, protocols, hardware
FlashGrid stack vs. Exadata
How Exadata works
How FlashGrid works
Benchmarks
Single server
RAC
Conclusion
Q&A
Introduction
About Myself
Married with children
Linux since 1998
Oracle since 2000
OCM & OCP
Chairman of NGENSTOR since 2014
CEO of DATABASE-SERVERS since 2015
Introduction
About DATABASE-SERVERS
Founded in 2015
Custom license-optimized whitebox servers
Standard Operating Environment Toolkit for Oracle/Redhat/Suse Linux
CLX Cloud Systems based on OVM/LXC stack
Performance kits for all brands (HP/Oracle/Dell/Lenovo)
Watercooling
Overclocking
HBA/SSD upgrades
Tuning/Redesign of Oracle engineered systems (ODA, Exadata)
Storage extension with NGENSTOR arrays
Performance kits
Goal / Objectives
Requirements
Maximized I/O throughput and random I/O capabilities at
the least possible CPU usage on the database server
Use only commodity hardware and technologies available
today
No closed source components like Exadata Storage Server
software
Be as compliant as possible with the standard Oracle software stack
Key components: Linux Kernel
Why Oracle has the UEK3/4 Kernel
The current Linux distributions are focused on stability and
certification matrices:
Kernel versions are frozen and features slowly/selectively
backported.
Oracle needs more frequent updates in selected areas for its
engineered systems' performance, especially in the following areas:
Infiniband Stack (OFED)
Network and Block I/O layer
Oracle’s Solution
Compile a newer, patched version of the mainline Linux kernel (UEK)
for their RHEL-compatible distribution, Oracle Linux
Key components: Linux Kernel
UEK3's most important new feature
Multiple SCSI/Block command queues
The Linux storage stack doesn't scale:
~ 250,000 to 500,000 IOPS per LUN
~ 1,000,000 IOPS per HBA
High completion latency
High lock contention and cache line bouncing
Bad NUMA scaling
The request layer can't handle high IOPS or low latency devices
SCSI drivers are tied into the request framework
SCSI-MQ/BLK-MQ are replacements for the request layer (a quick status check is sketched below)
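A minimal Python sketch (an addition, not from the original slides) of how to check this on a running host: it reads the scsi_mod use_blk_mq parameter and the per-device scheduler files, assuming a 3.18+/UEK-style kernel that exposes these sysfs paths.

```python
# Minimal sketch: report whether the multi-queue block/SCSI layers are active.
# Assumes a 3.18+/UEK-style kernel; the sysfs paths may differ per distribution.
from pathlib import Path

def scsi_mq_enabled() -> str:
    """Return scsi_mod's use_blk_mq parameter ('Y'/'N'), or 'unknown' if absent."""
    p = Path("/sys/module/scsi_mod/parameters/use_blk_mq")
    return p.read_text().strip() if p.exists() else "unknown"

def device_schedulers() -> dict:
    """Map each block device to its I/O scheduler line; blk-mq devices typically
    show multi-queue schedulers such as 'none' or 'mq-deadline'."""
    return {f.parents[1].name: f.read_text().strip()
            for f in Path("/sys/block").glob("*/queue/scheduler")}

if __name__ == "__main__":
    print("scsi_mod.use_blk_mq =", scsi_mq_enabled())
    for dev, sched in sorted(device_schedulers().items()):
        print(f"{dev}: {sched}")
```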
Key components: Linux Kernel
Command chain structure: old vs. new
Key components: Linux Kernel
IOPS performance
Key components: Protocols
Infiniband /RDMA
Oracle uses Infiniband as the strategic interconnect
component in all of its engineered systems
Main purposes:
Lower CPU utilization due to hardware offload
Lower latency for small messages like RAC interconnect (solves
scalability issues)
Oracle created the RDS protocol, which runs on top of
Infiniband/RDMA
When you connect to an Exadata Storage Server, you open an RDS
Socket ;-)
The distributed database approach used by Oracle via the iDB protocol relies
on this framework (a user-space RDS probe is sketched below)
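As a small illustration (not part of the original deck), the RDS socket family is visible from plain user space. The sketch below only probes whether the kernel accepts an AF_RDS socket and assumes the rds (and, for Infiniband, rds_rdma) modules are loaded; addresses and ports in the comments are purely illustrative.

```python
# Hedged sketch: probe for RDS (Reliable Datagram Sockets) support from user space.
# Assumes a Linux kernel with the rds / rds_rdma modules loaded.
import socket

def rds_available() -> bool:
    """True if the kernel accepts an AF_RDS socket; Python exposes the constant
    on Linux since 3.3."""
    if not hasattr(socket, "AF_RDS"):
        return False
    try:
        s = socket.socket(socket.AF_RDS, socket.SOCK_SEQPACKET, 0)
    except OSError:
        return False  # rds module not loaded, or not permitted
    s.close()
    return True

if __name__ == "__main__":
    print("RDS sockets usable:", rds_available())
    # With RDS available, datagrams are exchanged with IPv4-style (address, port)
    # tuples, e.g. s.bind(("192.168.10.1", 18634)) then s.sendto(payload, peer);
    # these values are illustrative only.
```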
Key components: Protocols
Components used by Oracle engineered systems
Key components: Protocols
What Oracle says about RDS
Key components: Protocols
TCP/IP vs. Infiniband
Key components: Protocols
Non-Volatile Memory Express (NVMe)
NVM Express is a standardized high performance software
interface for PCIe SSDs
Lower latency: Direct connection to CPU
Scalable performance: 1 GB/s per PCIe lane, i.e. 4 GB/s, 8 GB/s, … per SSD depending on lane count
Industry standards: NVM Express and PCI Express (PCIe) 3.0
Oracle Exadata X5 Storage Servers use NVMe SSDs (Intel DC P3600):
2.6 GB/s read
270,000 IOPS @ 8k
(a sysfs enumeration sketch follows below)
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-p3600-series.html
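For reference, a small Python sketch (an addition, not from the slides) that enumerates the NVMe controllers a server actually sees; the model/firmware_rev attribute names are standard for the Linux nvme driver, but treat the exact sysfs layout as an assumption for your kernel.

```python
# Minimal sketch: enumerate NVMe controllers and namespaces via sysfs.
from pathlib import Path

def list_nvme() -> None:
    base = Path("/sys/class/nvme")
    if not base.exists():
        print("no NVMe controllers visible")
        return
    for ctrl in sorted(base.glob("nvme*")):
        model = (ctrl / "model").read_text().strip() if (ctrl / "model").exists() else "?"
        fw = (ctrl / "firmware_rev").read_text().strip() if (ctrl / "firmware_rev").exists() else "?"
        namespaces = sorted(ns.name for ns in ctrl.glob("nvme*n*"))
        print(f"{ctrl.name}: model={model} firmware={fw} namespaces={namespaces}")

if __name__ == "__main__":
    list_nvme()
```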
Key components: Protocols
Performance comparison
(Comparison diagram: traditional storage stack vs. NVMe stack)
Key components: Hardware
In order to keep up with Exadata we need:
Preferably database servers with the same or similar
specs as Oracle's
Intel Xeon E5 v3 for X5-2-like performance
Intel Xeon E7 v3 for X5-8-like performance
A high speed network card
Preferably with hardware protocol offloading
Determine best operation mode (=> see next slide)
High bandwidth
Low latency
NVMe Drives
Internal or external
Key components: Hardware
Network card operation modes compared
(Comparison diagram: iSCSI over RDMA vs. iSCSI hardware offload (ASIC))
Key Components: Summary
Cooking a decent Exadata killer requires the right
ingredients:
Kernel 3.18+ for BLK-MQ / SCSI-MQ support (e.g. UEK3)
Decent database servers
RDMA capable high speed network with hardware protocol
offloading
High speed flash drives
RDS support linked properly into your Oracle Home (see the relink sketch below)
Clusterware and RAC Support for RDS Over Infiniband (Doc ID
751343.1)
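A hedged sketch of that relink step, wrapped in Python since the deck ships no scripts: the make targets (ipc_rds, ioracle) should be verified against MOS note 751343.1 for your exact release, and the relink run only while the databases on that home are down.

```python
# Hedged sketch: relink the Oracle binary against the RDS IPC library,
# along the lines of MOS note 751343.1. Verify the make targets for your
# release and run only during a maintenance window.
import os
import subprocess

def relink_with_rds(oracle_home: str) -> None:
    lib_dir = os.path.join(oracle_home, "rdbms", "lib")
    env = dict(os.environ, ORACLE_HOME=oracle_home)
    # ipc_rds swaps in the RDS IPC library, ioracle relinks the oracle binary.
    subprocess.run(["make", "-f", "ins_rdbms.mk", "ipc_rds", "ioracle"],
                   cwd=lib_dir, env=env, check=True)

if __name__ == "__main__":
    relink_with_rds(os.environ["ORACLE_HOME"])
```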
FlashGrid Stack vs. Exadata
General workload categorization
OLTP
Driving number is transactions per second / IOPS
Small random IOPS (typically 8k)
Many random IOPS => a lot of CPU burns away
Latency is important
DWH
Driving number is processing/elapsed time
High sequential throughput (GB/s)
Large, merged I/O => low CPU, high adapter saturation
FlashGrid Stack vs. Exadata
Common problems
Database server per core efficiency
A lot of waits/interrupts for network/storage
Adapter remote bandwidth issues
Even 2x 40 Gbit is not enough to move terabytes of data to the
RDBMS engine
I/O Subsystem bottlenecks
Storage array cannot provide enough bandwidth or IOPS
FlashGrid Stack vs. Exadata
Common stack
Oracle Linux
Exadata X5: Oracle Linux 6.7
FlashGrid: Oracle Linux 7.1 or higher
Oracle Grid Infrastructure and ASM
Both recommend 12c
Infiniband / RDMA
Exadata X5: uses QDR (2x40Gbit Cards)
FlashGrid: multiple card vendors supported (Mellanox, Chelsio,
Intel, Solarflare)
FlashGrid Stack vs. Exadata
How Exadata works (simplified)
«Distributed Database»
Idea from the 90s (central DB, remote DB over DB link)
Anyone recall this and the DRIVING_SITE hint?
Work can be split/offloaded to remote databases (e.g. joins)
(Diagram: Exadata database server acting as the client, connected to multiple Exadata Storage Servers)
FlashGrid Stack vs. Exadata
Core advantages of Exadata Storage Cells
We can distribute work amongst them in the same way we
would within a distributed database using DB-Links
We can use multiple data processing engines
Oracle instance on the DB Server
Engines on the Exadata Storage Cells
We save CPU and bandwidth on the actual database server
We have to transfer and process less data, as it is «pre-processed»
and often pre-cached in the Storage Cells' cache structures (a sketch for observing these savings follows this slide)
But:
We have to license the Exadata Storage Cells
We have quite a vendor lock-in due to the Exadata Storage Cells'
unique architecture and proprietary IP
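To make the offloading benefit tangible, here is a hedged sketch (an addition to the deck) of how the savings can be read from the database side via standard Exadata statistics in V$SYSSTAT; it assumes cx_Oracle, placeholder connection details, and SELECT access to the view.

```python
# Hedged sketch: read smart-scan / offload statistics from V$SYSSTAT.
# Connection details are placeholders; the statistic names are standard on Exadata.
import cx_Oracle

STATS = (
    "cell physical IO bytes eligible for predicate offload",
    "cell physical IO interconnect bytes returned by smart scan",
)

def show_offload_stats(user: str, password: str, dsn: str) -> None:
    with cx_Oracle.connect(user, password, dsn) as conn:
        cur = conn.cursor()
        cur.execute("SELECT name, value FROM v$sysstat WHERE name IN (:1, :2)", STATS)
        for name, value in cur:
            print(f"{name}: {value / 1024**3:.1f} GB")

if __name__ == "__main__":
    show_offload_stats("scott", "tiger", "exadata-scan/PDB1")  # illustrative only
```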
FlashGrid Stack vs. Exadata
How FlashGrid works 1/3
FlashGrid Stack vs. Exadata
How FlashGrid works 2/3
Hooks into your existing Oracle Grid Infrastructure stack
Basically a shared nothing storage cluster
Exports local NVMe Drives via iSCSI (either Infiniband or TCP/IP)
Uses Oracle ASM to mirror the local disks exported to all nodes
Creates a mapping for each server to use its local NVMe drives for
reads instead of going over the network
Think of it like setting ASM preferred mirror read per server (see the sketch below)
Scales using Oracle RAC
The more nodes and local disks, the more global bandwidth
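For illustration only, a hedged sketch of the plain-ASM equivalent of that "preferred mirror read per server" idea; FlashGrid sets this up automatically, the parameter (ASM_PREFERRED_READ_FAILURE_GROUPS) is standard ASM, and the disk group, failgroup names and credentials below are assumptions.

```python
# Hedged sketch: prefer the local failure group for reads on this node's ASM instance.
# Disk group / failgroup names and connection details are illustrative assumptions.
import cx_Oracle

def prefer_local_failgroup(asm_dsn: str, sys_password: str,
                           diskgroup: str = "DATA",
                           local_failgroup: str = "NODE1") -> None:
    with cx_Oracle.connect("sys", sys_password, asm_dsn, mode=cx_Oracle.SYSASM) as conn:
        cur = conn.cursor()
        # ALTER SYSTEM takes no bind variables, hence the string formatting.
        cur.execute(
            "ALTER SYSTEM SET asm_preferred_read_failure_groups = "
            f"'{diskgroup}.{local_failgroup}' SCOPE=BOTH"
        )

if __name__ == "__main__":
    prefer_local_failgroup("node1:1521/+ASM1", "change_me")  # illustrative only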
FlashGrid Stack vs. Exadata
How FlashGrid works 3/3
FlashGrid Stack vs. Exadata
Core advantages of FlashGrid
Open source
Considerably cheaper than Exadata
Scale up using Oracle RAC just like Exadata
We are not bound to the restrictions of an engineered
system in terms of hardware and software combinations
100Gbit Infiniband? => yes
Linux Containers with Oracle 12c? => yes
But:
Has no query offloading (yet) like Exadata Storage Cells
Benchmarks
Setup
Tool = CALIBRATE_IO
Official Oracle Tool
No warm up/pre-runs to avoid caching
Not perfect, but good enough to measure IOPS and MB/s (invocation sketched after this setup)
https://db-blog.web.cern.ch/blog/luca-canali/2014-05-closer-look-calibrateio
Oracle Engineered system
Oracle Exadata X5-2 Quarter Rack, HC
Single x86 system
HP ML 350 G9
2x Intel Xeon E5-2699 v3 (18 cores, like X5-2)
3x HP P840 SAS RAID Controller with 4 GB cache, 48 OCZ Intrepid SATA SSDs (Test1+3)
Remote access via iSCSI over RDMA (Test2)
6x Intel P3608 NVMe drives connected (Test4)
FlashGrid
2x HP ML 350 G9
2x Intel Xeon E5-2699 v3 (18 cores, like X5-2)
6x Intel P3608 NVMe drives connected
2x Infiniband 100 Gbit Adapter
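For completeness, a minimal sketch (not from the deck) of driving CALIBRATE_IO from a client with cx_Oracle; the DSN, credentials, disk count and latency target are placeholders to adapt per system.

```python
# Minimal sketch: run DBMS_RESOURCE_MANAGER.CALIBRATE_IO and print the results.
# Requires a SYSDBA connection; num_disks / max_latency_ms are placeholders.
import cx_Oracle

def run_calibrate_io(dsn: str, password: str,
                     num_disks: int = 48, max_latency_ms: int = 10):
    with cx_Oracle.connect("sys", password, dsn, mode=cx_Oracle.SYSDBA) as conn:
        cur = conn.cursor()
        max_iops = cur.var(cx_Oracle.NUMBER)
        max_mbps = cur.var(cx_Oracle.NUMBER)
        actual_latency = cur.var(cx_Oracle.NUMBER)
        cur.callproc("DBMS_RESOURCE_MANAGER.CALIBRATE_IO",
                     [num_disks, max_latency_ms, max_iops, max_mbps, actual_latency])
        return (max_iops.getvalue(), max_mbps.getvalue(), actual_latency.getvalue())

if __name__ == "__main__":
    iops, mbps, latency = run_calibrate_io("db-host/ORCL", "change_me")
    print(f"max_iops={iops} max_mbps={mbps} actual_latency_ms={latency}")
```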
Benchmarks
Oracle Exadata X5-2 Quarter Rack, HC
No cache, no performance for high capacity Storage Cells …
Benchmarks
HP ML 350 G9, 48x SATA disks 1/3
SCSI-MQ=off Hyperthreading=off
Limited by per port speed/controller
and number of controllers
Benchmarks
HP ML 350 G9, 48x SATA disks 2/3
SCSI-MQ=on Hyperthreading=on
Limited by NIC count and port speed
Benchmarks
HP ML 350 G9, 48x SATA disks 3/3
SCSI-MQ=on Hyperthreading=on
Limited by per port speed/controller
and number of controllers
Benchmarks
HP ML 350 G9, NVMe drives
SCSI-MQ=on Hyperthreading=on
Limited by number of NVMe drives
and RAC Nodes. Exadata uses 56
NVMe drives, 7 storage cells and 4
RAC nodes
Benchmarks
FlashGrid, 2xHP ML 350 G9, NVMe drives
Limited by number of NVMe drives
and RAC Nodes. Exadata uses 112
NVMe drives, 14 storage cells and 8
RAC nodes
To match the read performance of an Exadata Full Rack EF, we need
to scale up to 8x HP ML 350 G9 RAC nodes
Conclusion
Having chased Oracle Exadata’s performance for quite
some years, I can conclude that:
Commodity servers
Can keep up with Oracle engineered systems thanks to the newest
network and flash technology
FlashGrid
Offers excellent raw performance
Is simpler to maintain
Has a lower TCO
Is just good enough for the majority of clients
Oracle Exadata
Still has a few areas where it offers unmatched performance thanks
to its proprietary IP, even though its value is declining
Thanks to our partners
Q&A
Contact
elgreco@linux.com
efstathios.efstathiou@database-servers.com
Thank You
