Performance Brief for the HP DL980 (Database Server) and DL380 (ION Data Accelerator™)
April 24, 2013
Copyright Notice

The information contained in this document is subject to change without notice. Fusion-io MAKES NO WARRANTY OF ANY KIND WITH REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Except to correct same after receipt of reasonable notice, Fusion-io shall not be liable for errors contained herein or for incidental and/or consequential damages in connection with the furnishing, performance, or use of this material.

The information contained in this document is protected by copyright. © 2013, Fusion-io, Inc. All rights reserved. Fusion-io, the Fusion-io logo and ioDrive are registered trademarks of Fusion-io in the United States and other countries. The names of other organizations and products referenced herein are the trademarks or service marks (as applicable) of their respective owners. Unless otherwise stated herein, no association with any other organization or product referenced herein is intended or should be inferred.

Fusion-io: 2855 E. Cottonwood Parkway, Box 100, Salt Lake City, UT 84121 USA, (801) 424-5500
CONTENTS

Introduction ... 1
  HARDWARE ... 2
    ION Data Accelerator System ... 2
    Initiator System ... 2
Storage Configuration ... 3
  INITIATOR HBA PLACEMENT ... 3
  ION DATA ACCELERATOR STORAGE POOL CONFIGURATION ... 5
  ION VOLUME CONFIGURATION ... 5
  ION LUN CONFIGURATION ... 6
  MULTIPATH VERIFICATION ... 8
Initiator BIOS Tuning ... 11
  UPDATING THE BIOS FOR NUMA DETECTION ... 12
  POWER MANAGEMENT OPTIONS ... 12
  SYSTEM OPTIONS ... 14
  ADVANCED OPTIONS ... 16
    Setting the Addressing Mode ... 16
    Disabling x2APIC ... 17
Initiator Tuning on Linux ... 18
  MULTIPATHING ... 18
  DISABLING PROCESSOR C-STATES IN LINUX ... 18
  IONTUNER RPM ... 19
    Block Device Tuning with udev Rules ... 20
    Disabling the cpuspeed Daemon ... 21
    Pinning Interrupts ... 21
  VERIFYING THREAD PINNING ... 22
Oracle Tuning ... 25
  HUGEPAGES ... 25
  SYSCTL PARAMETERS ... 25
  ORACLE INITIALIZATION PARAMETERS ... 26
fio Performance Testing ... 27
  PRECONDITIONING FLASH STORAGE ... 27
  TESTING THREAD CPU AFFINITY ... 27
  TEST COMMANDS ... 27
  RESULTS ... 30
  SEQUENTIAL R/W THROUGHPUT AND IOPS ... 31
  RANDOM MIX R/W IOPS ... 32
  RANDOM MIX R/W THROUGHPUT ... 32
Oracle Performance Testing ... 34
  TEST SETUP ... 34
  TEST COMMANDS ... 36
  RESULTS ... 37
Oracle Database Testing ... 38
  READ WORKLOAD TEST – QUEST BENCHMARK FACTORY ... 38
  OLTP WORKLOAD TEST – HEAVY INSERT SCRIPT ... 43
  TRANSACTIONS TEST – SWINGBENCH ... 47
Conclusions ... 48
Glossary ... 49
Appendix A: Tuning Checklist ... 50
Appendix B: Speeding up Oracle Database Performance with ioMemory – an HP Session ... 52
  ARCHITECTURE OVERVIEW ... 52
  ABOUT ION DATA ACCELERATOR ... 53
    ION Data Accelerator Software ... 53
    Fusion-Powered Storage Stack ... 53
    Why ION Data Accelerator? ... 54
  ABOUT ION DATA ACCELERATOR HA (HIGH AVAILABILITY) ... 54
  PERFORMANCE TEST RESULTS: HP DL380 / HP DL980 ... 55
  OVERVIEW OF THE ION DATA ACCELERATOR GUI ... 57
  COMPARATIVE SOLUTIONS ... 60
  BEST PRACTICES ... 61
  BENCHMARK TEST CONFIGURATION ... 62
  RAW PERFORMANCE TEST RESULTS WITH FIO ... 63
    Total IOPS ... 63
    Average Completion Latency (Microseconds) ... 64
    Raw I/O Test: 70% Read, 30% Write ... 64
    Raw I/O Test: 100% Read at 8KB ... 65
    Raw I/O Test: Read Latency (Microseconds) ... 65
  ORACLE WORKLOAD TESTS ... 66
Introduction
________________________________________________________________________

This document describes methods used to maximize performance for Oracle Database Server running on an HP DL980 and for ION Data Accelerator running on an HP DL380. These methods should provide a foundation for tuning with a variety of tests and customer applications.

The non-uniform memory access (NUMA) architecture of the DL980 presents challenges in minimizing data transfers between multiple processor nodes while efficiently distributing I/O processing across available resources. Without any tuning, a configuration capable of as much as 700,000 IOPS may achieve no more than 160,000 IOPS. Likewise, a system capable of bandwidths of up to 7 GB/s may be limited to 3.5 GB/s. Testing performed with an un-tuned initiator may reflect poorly on ION Data Accelerator performance when, in reality, the ION Data Accelerator software is not the problem.

The goals of this document are to:
• Provide an example of what is possible with a specific configuration.
• Provide the tools necessary to improve performance on a variety of DL980 configurations, or with other initiator servers used with ION Data Accelerator.

Depending on the ioDrives and HBAs used, as well as fabric connectivity, you may need to vary the tuning described in this document. A script has been provided to perform the most complex tuning operations, but the steps performed by the script are fully described so you can adapt them for a variety of configurations.

These tuning methods were originally used to maximize performance at the HP European Performance Center in Böblingen. A similar configuration was recreated at Fusion-io in San Jose, and the performance results described in this document come from that testing. Though there were minor variations between the two configurations, similar performance was achieved.

For more details on the features and functionality of ION Data Accelerator, refer to the ION Data Accelerator User Guide.
HARDWARE

This section describes the hardware components used in the performance testing of the ION Data Accelerator appliance with its initiator.

ION Data Accelerator System
• DL380p Gen8 server
• 2 x Intel Xeon E5-2640 CPUs (6 cores each, 2.5 GHz)
• 64 GB RAM
• 3 x 2.41 TB ioDrive2 Duos
• 1 x QLogic 8 Gbit Fibre Channel quad-port HBA
• 2 x QLogic 8 Gbit Fibre Channel dual-port HBAs
• ION Data Accelerator 2.0.0 build 349 (VSL 3.2.3 build 950)

Initiator System
• HP DL980 Gen7 server
• 8 x Intel Xeon E7-4870 CPUs (10 cores each, 2.4 GHz)
• 256 GB RAM
• 3 x Emulex 8 Gbit Fibre Channel dual-port HBAs
• 1 x QLogic 8 Gbit Fibre Channel dual-port HBA
• Red Hat Enterprise Linux 6.3
• Oracle Database 11g Enterprise Edition 64-bit Release 11.2.0.3.0 with ASM
Storage Configuration
________________________________________________________________________

INITIATOR HBA PLACEMENT

The NUMA architecture of the DL980 must be considered when choosing where to place HBAs:
• PCIe slots 7, 8, 9, 10, and 11 are attached to the I/O hub nearest to CPU sockets 0 and 1.
• PCIe slots 1, 2, 3, 4, 5, and 6 are attached to the I/O hub nearest to CPU sockets 2 and 3.
• PCIe slots 12, 13, 14, 15, and 16 are attached to the I/O hub nearest to CPU sockets 4 and 5.

In the configurations used at HP Böblingen and Fusion-io San Jose, two HBAs were placed in slots 1 through 6, and two HBAs were placed in slots 7 through 11. In that configuration, I/O traffic is split between two I/O hubs. Using multiple I/O hubs allows more CPU cores to access data from the HBAs at low cost, but it introduces the risk of transferring data between I/O hubs, which may cause poor performance. It is important to configure volume access such that no single volume is accessed from multiple I/O hubs.

Note that even though a PCIe slot may be equidistant from two nodes, there is still less latency between cores within a node than between CPU cores on separate nodes attached to the same I/O hub.

Although the diagram above shows slots 12 through 16 attached to CPU sockets 6 and 7, other documentation from HP suggests that these slots are attached to nodes 4 and 5. If using the expansion slots, it is best to manually check the location of the PCIe slots.

You can use lspci to find the bus addresses of HBAs in the system:

# lspci | grep "Fibre Channel"
0b:00.0 Fibre Channel: ...
0b:00.1 Fibre Channel: ...
11:00.0 Fibre Channel: ...
11:00.1 Fibre Channel: ...
54:00.0 Fibre Channel: ...
54:00.1 Fibre Channel: ...
60:00.0 Fibre Channel: ...
60:00.1 Fibre Channel: ...

You can then use dmidecode to determine the PCI slot associated with each bus address:

# dmidecode -t slot
...
Handle 0x0908, DMI type 9, 17 bytes
System Slot Information
    Designation: PCI-E Slot 9
    Type: x8 PCI Express 2 x16
    Current Usage: In Use
    Length: Long
    ID: 9
    Characteristics:
        3.3 V is provided
        PME signal is supported
    Bus Address: 0000:0b:00.0
...
Handle 0x090A, DMI type 9, 17 bytes
System Slot Information
    Designation: PCI-E Slot11
    Type: x8 PCI Express 2 x16
    Current Usage: In Use
    Length: Long
    ID: 11
    Characteristics:
        3.3 V is provided
        PME signal is supported
    Bus Address: 0000:11:00.0
...
Handle 0x0901, DMI type 9, 17 bytes
System Slot Information
    Designation: PCI-E Slot 2
    Type: x8 PCI Express 2 x16
    Current Usage: In Use
    Length: Long
    ID: 2
    Characteristics:
        3.3 V is provided
        PME signal is supported
    Bus Address: 0000:54:00.0
...
Handle 0x0905, DMI type 9, 17 bytes
System Slot Information
    Designation: PCI-E Slot 6
    Type: x8 PCI Express 2 x16
    Current Usage: In Use
    Length: Long
    ID: 6
    Characteristics:
        3.3 V is provided
        PME signal is supported
    Bus Address: 0000:60:00.0

ION DATA ACCELERATOR STORAGE POOL CONFIGURATION

A RAID 0 set was created using all three ioDrive2 Duo cards present in the ION Data Accelerator system. This was done with the following CLI command, which creates a storage profile for maximum performance:

admin@/> profile:create max_performance

ION VOLUME CONFIGURATION

Eight volumes of equal size were created from the storage pool, using the following CLI commands:

admin@/> volume:create volume0 841 pool_md0
admin@/> volume:create volume1 841 pool_md0
admin@/> volume:create volume2 841 pool_md0
admin@/> volume:create volume3 841 pool_md0
admin@/> volume:create volume4 841 pool_md0
admin@/> volume:create volume5 841 pool_md0
admin@/> volume:create volume6 841 pool_md0
admin@/> volume:create volume7 841 pool_md0

For ION Data Accelerator configurations with many ioDrives, it may be necessary to use 16 or more volumes to achieve maximum performance.

ION LUN CONFIGURATION

To provide sufficient performance as well as redundancy, LUN access should be provided through multiple ION Data Accelerator targets and multiple initiator cards. Additionally, because of the NUMA architecture characteristics of the DL980, it may be best to localize access for each volume to a single I/O hub. Volumes should be exposed so that traffic is distributed evenly across all ports. The diagram below shows the link configuration that was used at HP Böblingen.

Figure 1. Link configuration used at HP Böblingen

Four ports on the ION Data Accelerator system were connected to eight ports on the DL980 initiator through a switch. On the initiator, two dual-port cards were placed in I/O hub 1 and two in I/O hub 2. Exports were created from the four ports of the ION Data Accelerator to the four ports on each I/O hub of the initiator. Each volume was exported on two links:
• Volume 0: t1 to i1, t4 to i4
• Volume 1: t2 to i2, t3 to i3
• Volume 2: t3 to i7, t2 to i6
• Volume 3: t1 to i5, t4 to i8

The same access pattern was repeated with every set of four subsequent volumes. Notice that access to each volume is localized to a single I/O hub on the initiator.

The diagram below shows the link configuration that was used at Fusion-io San Jose.

Figure 2. Link configuration used at Fusion-io San Jose

Because a switch was unavailable, eight ports on the ION Data Accelerator system were directly connected to eight ports on the initiator. Each volume was exported on two links:
• Volume 0: t1 to i1, t6 to i4
• Volume 1: t3 to i5, t8 to i8
• Volume 2: t2 to i2, t5 to i3
• Volume 3: t4 to i6, t7 to i7

The same access pattern was repeated with every set of four subsequent volumes. Notice that access to each volume is once again localized to a single I/O hub on the initiator.

The following CLI commands were used to create initiator groups and LUNs on the ION Data Accelerator system at Fusion-io San Jose:
admin@/> inigroup:create i1 10:00:00:90:fa:14:a1:fc
admin@/> inigroup:create i2 10:00:00:90:fa:14:a1:fd
admin@/> inigroup:create i3 10:00:00:90:fa:14:f9:d4
admin@/> inigroup:create i4 10:00:00:90:fa:14:f9:d5
admin@/> inigroup:create i5 10:00:00:90:fa:1b:03:c8
admin@/> inigroup:create i6 10:00:00:90:fa:1b:03:c9
admin@/> inigroup:create i7 21:00:00:24:ff:46:bf:ca
admin@/> inigroup:create i8 21:00:00:24:ff:46:bf:cb
admin@/> lun:create -b 512 volume0 i1 21:00:00:24:ff:69:d3:4c
admin@/> lun:create -b 512 volume0 i6 21:00:00:24:ff:46:c0:b5
admin@/> lun:create -b 512 volume1 i3 21:00:00:24:ff:69:d3:4e
admin@/> lun:create -b 512 volume1 i8 21:00:00:24:ff:45:f4:ad
admin@/> lun:create -b 512 volume2 i2 21:00:00:24:ff:69:d3:4d
admin@/> lun:create -b 512 volume2 i5 21:00:00:24:ff:46:c0:b4
admin@/> lun:create -b 512 volume3 i4 21:00:00:24:ff:69:d3:4f
admin@/> lun:create -b 512 volume3 i7 21:00:00:24:ff:45:f4:ac
admin@/> lun:create -b 512 volume4 i1 21:00:00:24:ff:69:d3:4c
admin@/> lun:create -b 512 volume4 i6 21:00:00:24:ff:46:c0:b5
admin@/> lun:create -b 512 volume5 i3 21:00:00:24:ff:69:d3:4e
admin@/> lun:create -b 512 volume5 i8 21:00:00:24:ff:45:f4:ad
admin@/> lun:create -b 512 volume6 i2 21:00:00:24:ff:69:d3:4d
admin@/> lun:create -b 512 volume6 i5 21:00:00:24:ff:46:c0:b4
admin@/> lun:create -b 512 volume7 i4 21:00:00:24:ff:69:d3:4f
admin@/> lun:create -b 512 volume7 i7 21:00:00:24:ff:45:f4:ac

MULTIPATH VERIFICATION

When the steps above have been completed and dm-multipath has been started on the initiator, the multipath command can be used to verify the configuration.
# multipath -ll
mpathhes (23337613362643333) dm-2 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 1:0:0:0 sdd 8:48 active ready running
  `- 2:0:0:0 sdf 8:80 active ready running
mpathhez (23330633436333064) dm-7 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 4:0:0:1 sdk 8:160 active ready running
  `- 7:0:0:1 sdq 65:0 active ready running
mpathhey (23437373930653063) dm-4 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 0:0:0:1 sdc 8:32 active ready running
  `- 3:0:0:1 sdi 8:128 active ready running
mpathhex (26433343437616137) dm-8 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 5:0:0:1 sdm 8:192 active ready running
  `- 6:0:0:1 sdo 8:224 active ready running
mpathhew (23061313364323662) dm-5 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 1:0:0:1 sde 8:64 active ready running
  `- 2:0:0:1 sdg 8:96 active ready running
mpathhev (26432353466383337) dm-6 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 4:0:0:0 sdj 8:144 active ready running
  `- 7:0:0:0 sdp 8:240 active ready running
mpathheu (23637366232363564) dm-3 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 0:0:0:0 sdb 8:16 active ready running
  `- 3:0:0:0 sdh 8:112 active ready running
mpathhet (23632393433663839) dm-9 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 5:0:0:0 sdl 8:176 active ready running
  `- 6:0:0:0 sdn 8:208 active ready running

Notice that there are eight multipath devices, each comprising two LUNs. Each path has a number associated with it, of the form <host>:0:0:<lun#>. The host numbers correspond to specific PCI device ports. A PCI device address can be correlated to a host number by looking in sysfs:

# ls -d /sys/bus/pci/devices/*/host*
/sys/bus/pci/devices/0000:11:00.0/host0
/sys/bus/pci/devices/0000:11:00.1/host1
/sys/bus/pci/devices/0000:0b:00.0/host2
/sys/bus/pci/devices/0000:0b:00.1/host3
/sys/bus/pci/devices/0000:54:00.0/host4
/sys/bus/pci/devices/0000:54:00.1/host5
/sys/bus/pci/devices/0000:60:00.0/host6
/sys/bus/pci/devices/0000:60:00.1/host7

For example, multipath device mpathhet has paths through hosts 5 and 6 (shown by the numbers 5:0:0:0 and 6:0:0:0), which correspond to devices 0000:54:00.1 and 0000:60:00.0. The output from the dmidecode command used in the Initiator HBA Placement section shows that this volume is exposed through HBAs in slots 2 and 6, which are both on the same I/O hub. It is important that each volume presented in multipath is accessed only through HBAs on the same I/O hub.
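The host-to-slot correlation can also be scripted. The following sketch walks sysfs to print, for each SCSI host, its PCI bus address and the NUMA node the kernel reports for that device (these are standard sysfs paths, but the host numbers and nodes shown will differ per system, and `numa_node` may read -1 if the platform does not report locality):

```shell
#!/bin/bash
# Sketch: map each SCSI host to its PCI address and reported NUMA node.
map_fc_hosts() {
    local h dev node found=0
    for h in /sys/bus/pci/devices/*/host*; do
        [ -e "$h" ] || continue          # glob may match nothing
        found=1
        dev=$(basename "$(dirname "$h")")                       # e.g. 0000:54:00.1
        node=$(cat "/sys/bus/pci/devices/$dev/numa_node" 2>/dev/null)
        echo "$(basename "$h") -> PCI $dev (NUMA node ${node:-unknown})"
    done
    [ "$found" -eq 1 ] || echo "no SCSI hosts found under /sys/bus/pci"
}
map_fc_hosts
```

Cross-referencing this output with the dmidecode slot table above shows at a glance whether each multipath device's two hosts share an I/O hub.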
Initiator BIOS Tuning
________________________________________________________________________

The following settings should be applied on the HP DL980 initiator, using the ROM-Based Setup Utility (RBSU) at boot. To enter the RBSU, press F9 during boot (when the F9 Setup option appears on the screen).

UPDATING THE BIOS FOR NUMA DETECTION

In the DL980 BIOS version dated 05/01/2012, a change was made to the SLIT node distances. This may affect performance, so it is recommended that the latest version of the BIOS be used. Incorrect SLIT node distances are a common issue with early BIOS revisions on many platforms.

The BIOS version can be determined from the main BIOS screen. Alternatively, numactl can be used to verify that the node distances match the table below:

# numactl --hardware
...
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  17  17  19  19  19  19
  1:  12  10  17  17  19  19  19  19
  2:  17  17  10  12  19  19  19  19
  3:  17  17  12  10  19  19  19  19
  4:  19  19  19  19  10  12  17  17
  5:  19  19  19  19  12  10  17  17
  6:  19  19  19  19  17  17  10  12
  7:  19  19  19  19  17  17  12  10

POWER MANAGEMENT OPTIONS

To enable maximum performance, disable the HP power management options.

1. Select Power Management Options > HP Power Profile > Maximum Performance.
2. Verify that C-states have been disabled by selecting Power Management Options > Advanced Power Management Options > Minimum Processor Idle Power Core State. "No C-states" should be highlighted in the menu.

C-states may also need to be disabled in Linux, as explained later in this document.

SYSTEM OPTIONS

Intel Hyperthreading may or may not be beneficial to ION Data Accelerator performance. In this test setup, Hyperthreading was enabled. Other system options were set as described below.

1. Enable Hyperthreading by selecting System Options > Processor Options > Intel Hyperthreading Options > Enabled.
2. Disable virtualization if it is not required, by selecting System Options > Processor Options > Intel Virtualization Technology > Disabled.
3. Disable VT-d (Virtualization Technology for Directed I/O) by selecting System Options > Processor Options > Intel VT-d > Disabled.

ADVANCED OPTIONS

Setting the Addressing Mode

The preferred addressing mode depends on the operating system and the amount of memory installed. For all RHEL 5.x installations, use 40-bit addressing. For RHEL 6.x installations, use 40-bit addressing when 1 TB or less of memory is present; otherwise, 44-bit addressing must be used to take advantage of all available memory.

To disable 44-bit addressing, select Advanced Options > Advanced System ROM Options > Address Mode 44-bit > Disabled.

For RHEL 6.x installations with more than 1 TB of memory, enable 44-bit addressing: Advanced Options > Advanced System ROM Options > Address Mode 44-bit > Enabled.

At HP Böblingen, the DL980 contained 1 TB of memory, so 40-bit addressing was sufficient.

Disabling x2APIC

To verify that x2APIC is disabled, select Advanced Options > Advanced System ROM Options > x2APIC Options. The "Disabled" option should be highlighted; select it if it is not.
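As an alternative to numactl, the SLIT distances can be read directly from sysfs; the following is a minimal sketch (the node*/distance files are standard on NUMA-aware Linux kernels, though a single-node system will show only node0):

```shell
#!/bin/bash
# Sketch: print the kernel's view of the SLIT node distance table from sysfs.
show_slit() {
    local d shown=0
    for d in /sys/devices/system/node/node*/distance; do
        [ -r "$d" ] || continue          # glob may match nothing
        shown=1
        printf '%s: %s\n' "$(basename "$(dirname "$d")")" "$(cat "$d")"
    done
    [ "$shown" -eq 1 ] || echo "no NUMA distance information exposed"
}
show_slit
```

Each line lists one node's distances to nodes 0..N; on a correctly detected DL980 the rows should match the numactl table above.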
Initiator Tuning on Linux
________________________________________________________________________

The following settings should be configured in Linux. In some cases, a reboot is required for changes to take effect.

MULTIPATHING

Typically, the preferred queuing technique is to send I/O to the path with the fewest I/Os currently queued. The following is an example of how the multipath.conf file can be configured, using a path_selector of "queue-length 0":

device {
    vendor               "FUSIONIO"
    product              "*"
    path_selector        "queue-length 0"
    rr_min_io_rq         1
    rr_weight            uniform
    no_path_retry        20
    failback             60
    path_grouping_policy multibus
    path_checker         tur
}

Another approach that may provide better results is setting path_selector to "round-robin". The round-robin selector uses fewer CPU cycles, but it does not correct for unbalanced performance characteristics across paths, or for additional load from other devices that may be slowing down one of the paths.

DISABLING PROCESSOR C-STATES IN LINUX

For newer Linux kernels (2.6.32 or later), disabling CPU idle power states can boost performance.
However, these must be disabled at boot time rather than in the BIOS. To disable the C-states, add the intel_idle.max_cstate=0 processor.max_cstate=0 boot parameters to /boot/grub/grub.conf as follows:

title Red Hat Enterprise Linux (2.6.32-279.el6.x86_64)
    root (hd0,0)
    kernel /vmlinuz-2.6.32-279.el6.x86_64 ro root=/dev/mapper/vg_rhel980-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vg_rhel980/lv_root rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_rhel980/lv_swap KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet intel_idle.max_cstate=0 processor.max_cstate=0
    initrd /initramfs-2.6.32-279.el6.x86_64.img

One way to verify that the C-states have been disabled entirely is to check that the cpuidle sysfs files no longer exist:

# ls /sys/devices/system/cpu/cpu0/cpuidle
ls: cannot access /sys/devices/system/cpu/cpu0/cpuidle: No such file or directory

IONTUNER RPM

The tuning suggestions in this section can be performed in one step by installing the iontuner RPM. The RPM is made available on the Fusion-io internal network:
https://confluence.int.fusionio.com/display/ION/Documentation#Documentation-IONPerformanceBrief,HPDL980(INTERNAL-ONLY)

The RPM can be installed with the following command (the RPM version may differ):

# rpm -Uvh iontuner-0.0.2-1.noarch.rpm

If ION LUNs have already been detected by the initiator, a reboot or a reload of the device drivers may be necessary after the RPM install. This serves to complete the tuning that is performed upon device discovery. If in doubt about LUN discovery, reboot.

The tuning described in the following subsections is done by the iontuner RPM and does not need to be performed manually if the RPM has been installed. Detailed steps are provided here to completely describe the RPM's function and to assist those who may need to adapt the steps for unsupported platforms.
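Whichever way the tuning is applied, it is worth reading the tuned attributes back from sysfs after a reboot to confirm they took effect. A minimal sketch (the device names passed at the bottom are examples; substitute the actual ION sd* and dm-* devices on your initiator):

```shell
#!/bin/bash
# Sketch: read back the three tuned queue attributes for a block device.
check_tuning() {
    local dev=$1 attr q="/sys/block/$1/queue"
    [ -d "$q" ] || { echo "$dev: no such block device"; return 0; }
    for attr in scheduler rq_affinity add_random; do
        [ -r "$q/$attr" ] && echo "$dev $attr = $(cat "$q/$attr")"
    done
}
# Example device names only -- replace with your ION LUN paths:
check_tuning sda
check_tuning dm-2
```

For a correctly tuned ION device, scheduler should show noop selected, rq_affinity should read 2 (or 1 on kernels that only support group affinity), and add_random should read 0.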
  • 24. Block Device Tuning with udev Rules

The tuning in this section is performed by the iontuner RPM.

To improve I/O performance, you should tune the I/O scheduling queues on all devices in the data path. This includes both the individual SCSI devices (/dev/sd*) and the device-mapper devices (/dev/dm-*). Three settings changes have been shown to provide a performance benefit under some workloads:

1) Always use the noop I/O scheduler with ION Data Accelerator devices:

# echo noop > /sys/block/<device>/queue/scheduler

2) Use strict block-request affinity. This forces the handling of I/O completion to occur on the same CPU where the request was initiated:

# echo 2 > /sys/block/<device>/queue/rq_affinity

Strict block-request affinity is not available on RHEL 5, and on some kernels group affinity will be used where strict affinity is not supported. After writing '2' to the file, a read of the file will return '1' if only CPU group affinity is available.

3) To get more consistent performance results, disable entropy pool contribution:

# echo 0 > /sys/block/<device>/queue/add_random

The commands above must be run after the multipath devices are configured and detected by the initiator, and they will not persist through a reboot. For this reason, the iontuner RPM uses the Linux udev rules mechanism, which allows sysfs parameters to be set upon device discovery, both at boot time and at run time. The RPM installs the following rules in /etc/udev/rules.d/99-iontuner.rules:

ACTION=="add|change", SUBSYSTEM=="block", ATTR{device/vendor}=="FUSIONIO", ATTR{queue/scheduler}="noop", ATTR{queue/rq_affinity}="2", ATTR{queue/add_random}="0"
ACTION=="add|change", KERNEL=="dm-*", PROGRAM="/bin/bash -c 'cat /sys/block/$name/slaves/*/device/vendor | grep FUSIONIO'", ATTR{queue/scheduler}="noop", ATTR{queue/rq_affinity}="2", ATTR{queue/add_random}="0"

The first rule applies the scheduler, rq_affinity, and add_random changes to all SCSI block devices (/dev/sd*) whose vendor is FUSIONIO. 
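On platforms where the RPM cannot be installed, the three writes can be scripted directly. A minimal sketch under that assumption (tune_queue is a hypothetical helper, not part of the RPM); pass the device's sysfs queue directory so the same function covers both /dev/sd* and /dev/dm-* devices:

```shell
#!/bin/sh
# Hypothetical helper: apply the three queue settings to one block
# device, given its sysfs queue directory.
tune_queue() {
    q=$1
    echo noop > "$q/scheduler"     # noop I/O scheduler
    echo 2    > "$q/rq_affinity"   # strict block-request affinity
    echo 0    > "$q/add_random"    # no entropy pool contribution
}

# Example: tune_queue /sys/block/sdb/queue
```

As noted above, this does not persist across reboots; the udev rules remain the durable mechanism.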
The second rule applies scheduler, rq_affinity, and add_random changes to all DM multipath devices (/dev/dm-*) that are created on top of block devices whose vendor is FUSIONIO. 20
  • 25. Disabling the cpuspeed Daemon

The tuning in this section is performed by the iontuner RPM.

Disabling the cpuspeed daemon on Linux can boost overall performance. To disable the cpuspeed daemon immediately, run this command:

# service cpuspeed stop

To prevent the cpuspeed daemon from running after a reboot, run this command:

# chkconfig cpuspeed off

Pinning Interrupts

The tuning in this section is performed by the iontuner RPM.

To minimize data transfer and synchronization throughout the system, I/O interrupts should be handled on a socket close to the HBA's I/O hub. When manually configuring IRQs, the irqbalance daemon must first be disabled. To disable the irqbalance daemon immediately, run this command:

# service irqbalance stop

To prevent the irqbalance daemon from running after a reboot, run this command:

# chkconfig irqbalance off

IRQs should be pinned for each driver that handles interrupts for ION device I/O. Typically, this is just the HBA driver. Driver IRQs can be identified in /proc/interrupts by matching the IRQ numbers to the driver prefix listed in the same row. The following table shows some common drivers and the prefix used to identify their IRQs:

Driver         Prefix
QLogic FC      qla
Brocade FC     bfa
Emulex FC      lpfc
Emulex iSCSI   beiscsi,eth

The iontuner RPM installs the iontuner service init script. This runs at boot time to distribute IRQs across the CPU cores local to each HBA's I/O hub. Below is an example of the commands issued at 21
  • 26. startup:

echo 00000000,00000000,00000000,00000000,00000001 > /proc/irq/114/smp_affinity
echo 00000000,00000000,00000000,00000000,00000002 > /proc/irq/115/smp_affinity
echo 00000000,00000000,00000000,00000000,00000004 > /proc/irq/116/smp_affinity
echo 00000000,00000000,00000000,00000000,00000008 > /proc/irq/117/smp_affinity
echo 00000000,00000000,00000000,00000000,00000010 > /proc/irq/118/smp_affinity
echo 00000000,00000000,00000000,00000000,00000020 > /proc/irq/119/smp_affinity
echo 00000000,00000000,00000000,00000000,00000040 > /proc/irq/120/smp_affinity
echo 00000000,00000000,00000000,00000000,00000080 > /proc/irq/121/smp_affinity
echo 00000000,00000000,00000000,00000000,00100000 > /proc/irq/134/smp_affinity
echo 00000000,00000000,00000000,00000000,00200000 > /proc/irq/135/smp_affinity
echo 00000000,00000000,00000000,00000000,00400000 > /proc/irq/136/smp_affinity
echo 00000000,00000000,00000000,00000000,00800000 > /proc/irq/137/smp_affinity
echo 00000000,00000000,00000000,00000000,01000000 > /proc/irq/122/smp_affinity
echo 00000000,00000000,00000000,00000000,02000000 > /proc/irq/123/smp_affinity
echo 00000000,00000000,00000000,00000000,04000000 > /proc/irq/124/smp_affinity
echo 00000000,00000000,00000000,00000000,08000000 > /proc/irq/125/smp_affinity

Affinity is set by writing to the /proc/irq/<irq#>/smp_affinity file for a given IRQ. Each IRQ is assigned affinity to a different CPU core on the node nearest to the IRQ's PCIe device. In smp_affinity files, each core is represented by a single bit, with the least significant bit mapping to CPU 0. The IRQs associated with each device driver can be found by reading the /proc/interrupts file. There are ten CPU cores per node. In the example above, eight interrupts (the first eight entries) for the devices in slots 9 and 11 are mapped to node 0, and eight interrupts (the last eight entries) for the devices in slots 2 and 6 are mapped to node 2. 
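The masks above follow directly from the bit layout just described, so they can be generated rather than typed by hand. A sketch (cpu_to_smp_affinity is a hypothetical helper, not part of the iontuner service):

```shell
#!/bin/sh
# Hypothetical helper: print the smp_affinity mask for one CPU core,
# formatted as five comma-separated 32-bit hex groups (160 bits),
# most significant group first, matching the examples above.
cpu_to_smp_affinity() {
    cpu=$1
    group=$((cpu / 32))   # which 32-bit group holds this core's bit
    bit=$((cpu % 32))     # bit position within that group
    mask=""
    for g in 4 3 2 1 0; do
        if [ "$g" -eq "$group" ]; then
            part=$(printf '%08x' $((1 << bit)))
        else
            part=00000000
        fi
        if [ -z "$mask" ]; then mask=$part; else mask="$mask,$part"; fi
    done
    printf '%s\n' "$mask"
}

cpu_to_smp_affinity 0    # core 0 on node 0
cpu_to_smp_affinity 20   # core 20 on node 2
```

For example, echo "$(cpu_to_smp_affinity 20)" > /proc/irq/134/smp_affinity would reproduce the ninth line of the example above.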
On the DL980, each PCIe slot can be efficiently assigned to either of the nodes corresponding to its I/O hub. However, it is important that all processes related to that device be assigned to the same node. Because these settings will not persist through a reboot, the iontuner service runs each time the system is booted.

VERIFYING THREAD PINNING

The tuning in this section was not necessary in the DL980/RHEL 6.3 testing. It is included because it is not yet known whether it may be necessary on other platforms.

To further minimize data transfer and synchronization times throughout the system, it may be beneficial to place critical I/O driver threads on the same socket as the interrupts and the HBA. This may only be necessary with some drivers. For instance, it is helpful with QLogic drivers but is not necessary with Emulex drivers, because no critical work is performed in Emulex driver threads. In the case of the DL980 running RHEL 6.3, the QLogic driver threads always ran on cores local to the HBAs, even though they were not pinned. 22
  • 27. To check where QLogic driver threads are executing, run the following command:

# ps -eo comm,psr | grep qla
qla2xxx_6_dpc 20
qla2xxx_7_dpc 20

The number beside each process indicates the core it is currently executing on. The numbers "6" and "7" in the above example correspond to specific PCI device host numbers. You can correlate a PCI device to a host number by looking in sysfs:

# ls -d /sys/bus/pci/devices/*/host*
/sys/bus/pci/devices/0000:11:00.0/host0
/sys/bus/pci/devices/0000:11:00.1/host1
/sys/bus/pci/devices/0000:0b:00.0/host2
/sys/bus/pci/devices/0000:0b:00.1/host3
/sys/bus/pci/devices/0000:54:00.0/host4
/sys/bus/pci/devices/0000:54:00.1/host5
/sys/bus/pci/devices/0000:60:00.0/host6
/sys/bus/pci/devices/0000:60:00.1/host7

The CPUs local to each PCI device can also be found in sysfs:

# cat /sys/bus/pci/devices/0000:54:00.0/local_cpulist
20-29,100-109

If the device thread is not executing on one of the listed cores, run the following command:

# /usr/sbin/iontuner.py --pinqladriver

The output from the script shows the commands it issued:

taskset -pc 20-29,100-109 947
taskset -pc 20-29,100-109 942

The script assigns CPU affinity for each discovered PID through the taskset command, using the following parameters:

# taskset -pc <CPU list> <PID>

PIDs can be discovered through the ps command, but each driver has its own naming convention for these processes. For example, the following command will show QLogic driver threads:

# ps -eo comm,pid | grep qla
qla2xxx_6_dpc 942
qla2xxx_7_dpc 947

The driver thread should be pinned to the set of cores listed in the device's local_cpulist. 23
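local_cpulist values such as 20-29,100-109 are range lists; taskset -c accepts them as-is, but it is sometimes useful to expand them into an explicit core list (for example, when building per-core affinity assignments by hand). A sketch (expand_cpulist is a hypothetical helper):

```shell
#!/bin/sh
# Hypothetical helper: expand a sysfs-style cpulist such as
# "20-29,100-109" into an explicit comma-separated list of cores.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"   # single entries have no upper bound
    done | paste -sd, -
}

expand_cpulist "20-29,100-109"
```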
  • 28. On the DL980, although every I/O hub is local to two NUMA nodes, only the CPU cores from the lower-numbered node are shown as local to each PCI device. In this example, the first range (20-29) corresponds to the CPU cores in NUMA node 2, and the second range (100-109) corresponds to the hyper-threading cores for NUMA node 2. The second CPU core range will only be present if hyper-threading is enabled. Though the device is also local to NUMA node 3, it is generally sufficient to pin all devices to one of the two NUMA nodes, provided there are enough CPU resources on a single node. Splitting pinning between the two nodes requires extreme precision: pinning resources from one device on two separate nodes can create poor performance, because although both nodes may be local to the device, they are not local to each other. These settings will not persist through a reboot. 24
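Since none of this pinning survives a reboot, it can be worth re-checking thread affinity after the system comes back up. On Linux, a process's current affinity can be read straight from procfs; a minimal sketch (pid_affinity is a hypothetical helper):

```shell
#!/bin/sh
# Hypothetical helper: print the CPU affinity list of a PID, as
# recorded by the kernel in /proc/<pid>/status.
pid_affinity() {
    awk '/^Cpus_allowed_list/ {print $2}' "/proc/$1/status"
}

# Example: show the affinity of the current shell.
pid_affinity $$
```

The output should match the device's local_cpulist for a correctly pinned driver thread.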
  • 29. Oracle Tuning ________________________________________________________________________ The following settings are specific to tuning for Oracle. A reboot is required for the system settings to take effect.

HUGEPAGES

Configuring HugePages reduces the overhead of using large amounts of memory by reducing the page table size for the Oracle System Global Area (SGA). The default HugePage size is 2 MB, compared with the typical page size of 4 KB. With a page size of 2 MB, a 10 GB SGA requires only 5120 pages, compared to roughly 2.6 million 4 KB pages without HugePages. HugePages can be configured in /etc/sysctl.conf:

vm.nr_hugepages=55612
vm.hugetlb_shm_group=501

The number of HugePages used here is based on a recommendation from Oracle. The group should be set to the group ID of the oracle user, which can be determined with the id command:

# id -g oracle
501

After a reboot, the number of available HugePages can be verified:

# cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 55612

SYSCTL PARAMETERS

The following parameters were configured for Oracle in /etc/sysctl.conf:

kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
net.core.rmem_default = 4194304
net.core.rmem_max = 4194304 25
  • 30. net.core.wmem_default = 262144
net.ipv4.ip_local_port_range = 9000 65500
fs.file-max = 6815744
net.core.wmem_max = 1048576
fs.aio-max-nr = 1048576

ORACLE INITIALIZATION PARAMETERS

The following parameters were set in the /opt/oracle/product/11.2.0/dbs/initorcl.ora file:

*.db_block_size=8192
*.db_recovery_file_dest_size=2000G
*.processes=6000
*.db_writer_processes=16
*.dml_locks=80000
*.filesystemio_options='SETALL'
*.open_cursors=8192
*.optimizer_capture_sql_plan_baselines=FALSE
*.parallel_degree_policy='AUTO'
*.parallel_threads_per_cpu=2
*.pga_aggregate_target=8G
*.sga_max_size=50G
*.sga_target=50G
*.use_large_pages='only'
_enable_NUMA_support=TRUE

The _enable_NUMA_support parameter enables Oracle NUMA optimizations. The use_large_pages parameter ensures that each NUMA segment will be backed by HugePages. 26
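The page-count comparison for the 10 GB SGA example above reduces to simple arithmetic, which can be checked in the shell:

```shell
#!/bin/sh
# Page counts for a 10 GB SGA with 2 MB HugePages vs. standard 4 KB pages.
sga_bytes=$((10 * 1024 * 1024 * 1024))
hugepages=$((sga_bytes / (2 * 1024 * 1024)))   # 2 MB pages
smallpages=$((sga_bytes / 4096))               # 4 KB pages
echo "$hugepages $smallpages"                  # prints "5120 2621440"
```

The same arithmetic, applied to the configured 50 GB SGA plus headroom, is one way to sanity-check a vm.nr_hugepages value.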
  • 31. fio Performance Testing ________________________________________________________________________ After performing the configuration described in this document, the fio tool can be used to verify the synthetic performance of the ION Data Accelerator configuration. PRECONDITIONING FLASH STORAGE Running tests immediately after a low-level format of the flash storage is not a meaningful test for the ION Data Accelerator system or any other flash-based storage system. It is always recommended that preconditioning be performed prior to measuring performance. When comparing multiple flash storage solutions, it is necessary to perform the same preconditioning on each system. Improper preconditioning can lead to extremely unrealistic performance comparisons. Preconditioning can be performed by writing a random data pattern to the entire address range of the device, using a consistent block size. A block size of 1MB is recommended. TESTING THREAD CPU AFFINITY Earlier, this document described how to align all I/O to a given LUN on a single socket. This was done by HBA placement, restricted LUN access, target-initiator connections, IRQ affinity, and driver thread affinity. The final component is to force the test threads accessing that LUN onto the same NUMA node as all of the other components. Configuring this will vary depending on the test used. For the fio test, the cpus_allowed parameter can be used as shown in the examples below. TEST COMMANDS The iontuner RPM provides a script that may be used to generate fio job files with optimal NUMA tuning parameters. The RPM is made available on the Fusion-io internal network in the same location as this document: 27
  • 32. https://confluence.int.fusionio.com/display/ION/Documentation#Documentation-IONPerformanceBrief,HPDL980(INTERNAL-ONLY)

A fio job file can be created using the following command format:

# /usr/sbin/iontuner.py --setupfio='<parameters>'

The script generates a job file using fio parameters that have been shown to provide optimal performance results, along with efficient pinning for all test threads. In addition to the built-in parameters, options specified in the <parameters> field as a comma-separated list are also added to the job file. This option should be used to specify read/write balance, random vs. sequential I/O, test length, and any other parameters specific to the workload being tested. For example, the following command can be used to generate a random 4KB read test:

# /usr/sbin/iontuner.py --setupfio='rw=randrw,bs=4k,rwmixread=100,runtime=600,loops=10000,numjobs=1'

This command generates the following job file in /root/iontuner-fio.ini:

[global]
rw=randrw
bs=4k
rwmixread=100
runtime=600
loops=10000
numjobs=1
iodepth=256
group_reporting=1
thread=1
exitall=1
sync=0
direct=1
randrepeat=0
norandommap=1
ioengine=libaio
gtod_reduce=1
iodepth_batch=64
iodepth_batch_complete=64
iodepth_batch_submit=64

[dm-10]
filename=/dev/dm-10
offset=0
size=8409579520
cpus_allowed=20,21,22,23,24,25,26,27,28,29,100,101,102,103,104,105,106,107,108,109

[dm-8]
filename=/dev/dm-8
offset=0
size=8409579520
cpus_allowed=20,21,22,23,24,25,26,27,28,29,100,101,102,103,104,105,106,107,108,109 28
  • 33. [dm-9]
filename=/dev/dm-9
offset=0
size=8409579520
cpus_allowed=0,1,2,3,4,5,6,7,8,9,80,81,82,83,84,85,86,87,88,89

[dm-6]
filename=/dev/dm-6
offset=0
size=8409579520
cpus_allowed=20,21,22,23,24,25,26,27,28,29,100,101,102,103,104,105,106,107,108,109

[dm-7]
filename=/dev/dm-7
offset=0
size=8409579520
cpus_allowed=0,1,2,3,4,5,6,7,8,9,80,81,82,83,84,85,86,87,88,89

[dm-4]
filename=/dev/dm-4
offset=0
size=8409579520
cpus_allowed=20,21,22,23,24,25,26,27,28,29,100,101,102,103,104,105,106,107,108,109

[dm-5]
filename=/dev/dm-5
offset=0
size=8409579520
cpus_allowed=0,1,2,3,4,5,6,7,8,9,80,81,82,83,84,85,86,87,88,89

[dm-3]
filename=/dev/dm-3
offset=0
size=8409579520
cpus_allowed=0,1,2,3,4,5,6,7,8,9,80,81,82,83,84,85,86,87,88,89

The numjobs parameter must be tuned specifically for each configuration. Though one job per volume was optimal in this configuration, ION Data Accelerator configurations with many ioDrives may require four or more jobs per volume to achieve maximum performance. The cpus_allowed parameter specifies the list of CPUs on which each test thread may run. Earlier sections of this document described how to align all I/O to a given volume on a single socket by HBA placement, restricted LUN access, target-initiator connections, IRQ affinity, and driver thread affinity. This final component forces the test threads accessing that volume onto the same NUMA node as all of the other components. To manually determine which CPUs a multipath device should be pinned to, first obtain the host numbers from the multipath command:

# multipath -l
mpathgzu (26364646430613766) dm-3 FUSIONIO,ION LUN
size=174G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw 29
  • 34. `-+- policy='queue-length 0' prio=0 status=active
  |- 2:0:0:0 sdm 8:192 active undef running
  `- 1:0:0:0 sdg 8:96 active undef running
...

The first number listed with each underlying sd* device indicates the host number. The host number can be correlated to a PCI device by looking in sysfs:

# ls -d /sys/bus/pci/devices/*/host*
/sys/bus/pci/devices/0000:11:00.1/host1
/sys/bus/pci/devices/0000:0b:00.0/host2
...

The CPUs local to each PCI device can also be found in sysfs:

# cat /sys/bus/pci/devices/0000:11:00.1/local_cpulist
0-9,80-89
# cat /sys/bus/pci/devices/0000:0b:00.0/local_cpulist
0-9,80-89

If the devices are pathed properly, the local CPU list for each underlying device will be identical. These CPUs should be listed in the cpus_allowed parameter of fio. Information on the other fio parameters used here is available in the fio man page. In addition to creating a job file, the script outputs the command that can be used to run a fio test with the job file. To run the test, copy the output of the script onto the command line:

# fio ./iontuner-fio.ini

The fio test will execute and print test results to the terminal.

RESULTS

The following fio test results are captured in this section, all on the HP DL980 initiator:
• Sequential R/W throughput and IOPS
• Random mix R/W IOPS
• Random mix R/W throughput

All tests were performed with the following elements:
• 3 x 2.41TB ioDrive2 Duos
• 1 x RAID 0 pool
• 8 ION volumes, 2 LUNs per volume 30
  • 35. • 8 direct-connect FC8 target-initiator links, 2 LUNs per initiator-target link
• 1 dm-multipath device per volume
• 1 worker/device, queue depth=256/worker

Preconditioning was performed prior to the set of tests for each block size by using fio to write to the entire range of the device with a 1 MB block size.

SEQUENTIAL R/W THROUGHPUT AND IOPS 31
  • 36. RANDOM MIX R/W IOPS

RANDOM MIX R/W THROUGHPUT 32
  • 37. The results above indicate performance measured and reported by fio; for selected tests, the numbers were compared with the output of the iostat command and found to be comparable. Performance results can vary dramatically depending on the number of ION Data Accelerator volumes used, the number of paths to each volume, and the number of test threads run per volume (determined by the fio numjobs parameter). For this particular configuration, tests were run with a variety of volume, path, and thread counts before determining that 8 volumes, 2 paths per volume, and 1 thread per volume was optimal. This configuration was chosen because it provided the best results for random read IOPS. Depending on the specifics of a configuration and the workload chosen for optimization, other combinations may provide better results.

The above tests report the fastest random read IOPS at around 700,000 IOPS. However, to test initiator capabilities, some benchmarks were performed immediately after formatting the ioDrives. For example, this test achieved 800,000 IOPS:

# /usr/sbin/iontuner.py --setupfio='rw=randrw,bs=4k,rwmixread=100,runtime=600,loops=10000,numjobs=1'

Running immediately after a format is not a meaningful test for the ION Data Accelerator system itself, as reads are not serviced out of flash. Still, this indicates that, given more ioDrives in the ION Data Accelerator, the DL980 could likely have achieved even higher performance numbers. Similarly, the fastest reported combined read and write bandwidth is 6900 MB/s. Shortly after the cards were formatted, greater throughput was possible from the initiator:

# /usr/sbin/iontuner.py --setupfio='rw=randrw,bs=1m,rwmixread=50,runtime=600,loops=10000,numjobs=1'

This test achieved 3740 MB/s read bandwidth and 3750 MB/s write bandwidth, for a total bandwidth of 7490 MB/s. A final indicator of performance limited by the ioDrives is reduced mixed bandwidth performance at some block sizes. 
This is comparable to test results seen with a single ioDrive in a local server. Writing data to the full address range prior to testing is a necessary step to achieve realistic results with an ION Data Accelerator test. These final tests indicate that the NUMA architecture of the DL980 was unlikely to be the limiting factor in these fio results; the DL980 appeared to fully exercise the performance capabilities of the ION Data Accelerator. 33
  • 38. Oracle Performance Testing ________________________________________________________________________ Oracle Orion is a tool for predicting the performance of an Oracle database without having to install Oracle or create a database. It simulates Oracle database I/O workloads using the same I/O software stack as Oracle. Tuning for Orion is very similar to tuning for fio. By running simultaneous copies of Orion's advanced test, it is possible to approximate workloads similar to fio. Alternatively, the Online Transaction Processing (OLTP) and Decision Support System (DSS) tests can be used to synthetically approximate user workloads. Orion can also be used to test mixed large and small block sizes.

TEST SETUP

The Orion tests were run as root, but it was necessary to set the ORACLE_HOME environment variable. To find its value, run the following commands from an Oracle user shell:

# su - oracle
$ echo $ORACLE_HOME
/opt/oracle/product/11.2.0/db_1
$ exit

To set the variable in the root shell, run the following command in the terminal or add it to ~/.bashrc (the specific Oracle version will vary):

# export ORACLE_HOME=/opt/oracle/product/11.2.0/db_1

The iontuner RPM provides a script that can be used to generate Orion test commands with optimal NUMA tuning parameters. The RPM is available on the Fusion-io internal network in the same location as this document: 34
  • 39. Orion .lun files can be created using the following command:

# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='<parameters>'

The script generates commands that have been shown to provide optimal performance results and efficient pinning for all test threads. For example, the following command can be used to generate a 4KB read IOPS test:

# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600'

The script generates .lun files saved in the current directory and outputs the following commands:

taskset -c 20-29,100-109 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-6 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 20-29,100-109 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-7 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 20-29,100-109 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-4 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 20-29,100-109 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-5 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 0-9,80-89 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-2 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 0-9,80-89 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-3 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 0-9,80-89 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-0 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 0-9,80-89 
/opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-1 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &

For this configuration, the best results were obtained by creating a separate .lun file for each volume and running a single Orion test on each volume. Splitting the volumes into separate .lun files made it possible for taskset to run each Orion test with affinity to the CPUs local to the devices being tested. The local CPUs can be determined with the multipath command, using the same method described under Test Commands in the fio Performance Testing section earlier in this document. You can copy and paste the taskset commands into the terminal to run them in parallel. Because the output from Orion displays only the maximum performance of each instance (which may individually occur at different times), the iostat command should be used to read performance as viewed from the initiator devices: 35
  • 40. # iostat -x /dev/dm-*

TEST COMMANDS

The fio tests used for 8KB IOPS were approximated with the following commands:

# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type seq -num_large 0 -num_small 2048 -write 100 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 100 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 75 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 50 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 25 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 8 -duration 600'

The fio tests used for 512KB bandwidth were approximated with the following commands:

# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type seq -num_large 2048 -num_small 0 -write 100 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 0 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 100 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run 
advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 75 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 50 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 25 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 0 -size_large 512 -duration 600'

For running the DSS test, the iontuner.lun file was created with all eight volumes specified. The 36
  • 41. DSS test was run with the following command:

# taskset -c 0-9,80-89,20-29,100-109 ./orion -testname iontuner -run dss

Because all devices were used in a single command, the CPUs local to all of the HBAs were specified to taskset. The OLTP test was run with the following command:

# taskset -c 0-9,80-89,20-29,100-109 ./orion -testname iontuner -run oltp

RESULTS

When running Orion advanced tests that approximated the fio tests at 8KB and 512KB block sizes, the results were almost identical to fio; there was more variation between runs than between the two utilities. Because the previous state of the ioDrives has a large impact on the performance of any test, when comparing test runs it is necessary to sequence the tests in a consistent order and to begin with the same initial ioDrive conditioning. Providing Orion results for these tests would only draw attention to minor variations that offer no additional information about the tuning of the DL980. The advanced tests also exposed an unexpected Orion behavior: for block sizes larger than 512KB, it appears that 512KB accesses are always issued to the devices. The DSS test resulted in a maximum bandwidth of 6039 MB/s. There are many variations of the Orion test that could be experimented with. To get an accurate measurement of maximum performance, it is necessary to run multiple copies of the test and evaluate the results from iostat. With any of the test options that run multiple test points (advanced, OLTP, DSS), there is no guarantee that all of the test copies will run each test point synchronously, which may invalidate results. 37
  • 42. Oracle Database Testing ________________________________________________________________________ For Oracle database testing, a number of tools were used to show the maximum capabilities of the system under a variety of workloads.

READ WORKLOAD TEST – QUEST BENCHMARK FACTORY

For a more realistic Oracle test, a Windows server was connected to the DL980 via an additional Fibre Channel link. An Oracle disk group was created containing all of the ION Data Accelerator volumes. Quest Benchmark Factory was used to create a database on the disk group with the following configuration:
• Size: 300GB
• Logging Mode: ARCHIVELOG

The Oracle components below were placed in one ASM disk group, +DATA, which consisted of 8 LUNs (each 800 GB) enabled with multipathing:
• Redo – 20 redo log members, each 2048 MB in size
• Archivelogs – placed in the default FRA
• FRA – db_recovery_file_dest='+DATA', db_recovery_file_dest_size='3000G'
• UNDO, data, and temporary tablespaces

The ASM +DATA disk group was created with external redundancy and with a default 1MB AU size. SYS, SYSTEM, and second UNDO tablespaces were created in the ADMIN disk group. This was done in order to easily drop and recreate the TEST data and disk groups without having to recreate the database. 38
  • 43. For a read workload test, Quest Benchmark Factory > Database Scalability Job > TPC-H Power Test was used. 39
  • 44. The test was configured for 50 users. 40
  • 45. Performance was evaluated on the DL980 while the TPC-H Power Test was running. Oracle Enterprise Manager was used to show read bandwidth during the test. During the test, Oracle showed a read bandwidth of just over 6000 MB/s. An Automatic Workload Repository (AWR) report was generated during the test. The following excerpts provide details on the I/O performed by the test. 41
  • 46. The AWR report function summary shows a total read bandwidth of 5.8 GB/s averaged over the length of the test. The file statistics show the breakdown of I/O for each file. 42
  • 47. Using 'iostat -mx /dev/dm-*', a snapshot of bandwidth from the ION volumes was verified. An approximate read bandwidth of 755 MB/s was seen on each of the eight volumes, for a total read bandwidth of 6043 MB/s from the ION Data Accelerator server. The avgrq-sz column shows that the average request size was between 512 and 1024 sectors (256 KB and 512 KB). These results are consistent with the bandwidth of approximately 6100 MB/s seen from fio in this block size range. However, it is important to recognize that Oracle performs data transfers of many sizes simultaneously, so the synthetic fixed block size results of fio are not a direct comparison, only an approximation of the capability at this workload.

OLTP WORKLOAD TEST – HEAVY INSERT SCRIPT

Performance was evaluated while running a custom OLTP load generated by a script running heavy insert database transactions on the DL980. Oracle Enterprise Manager was used to show bandwidth and IOPS during the test. 43
During the test Oracle showed a total bandwidth of approximately 4000 MB/s.
An AWR report was generated during the test. The following excerpts provide details on the I/O performed by the test. The AWR report function summary shows a total read bandwidth of 884 MB/s and a write bandwidth of 2.6 GB/s averaged over the length of the test, or 3.5 GB/s combined. The file statistics show the breakdown of I/O for each file.
Using 'iostat -mx /dev/dm-*', a snapshot of bandwidth from the ION volumes was verified. A read bandwidth of 952 MB/s and a write bandwidth of 2505 MB/s were seen, for a total bandwidth of 3457 MB/s from the ION Data Accelerator server. The workload is 22% read and 78% write I/O. The avgrq-sz column shows that the average request size was around 123 sectors, or 61KB. The result from the fio test for a 25% read workload and 64KB block size was 3705MB/s, which is consistent with the results of this test. Once again, it is important to recognize that Oracle performs data transfers of many sizes simultaneously, so the synthetic fixed-block-size results of fio are not a direct comparison, only an approximation of the capability at this workload.
TRANSACTIONS TEST – SWINGBENCH

An Order Entry sample OLTP test was run in Swingbench on the DL980. The test was configured with 100 users and transaction delay disabled. Because of some difficulties with Swingbench that were not related to performance, hyper-threading was disabled for this test. The test resulted in an average of 934,359 transactions per minute (TPM) and a maximum of 1,150,103 TPM. Oracle transactions vary greatly in the I/O they produce on the backend storage, so a specific TPM number such as the one provided by Swingbench is only useful when compared to a number produced by a Swingbench test with the same parameters.
Conclusions
________________________________________________________________________

Prior to tuning, performance on a NUMA system such as the HP DL980 may appear lower than that of systems with less complex architectures. The script used throughout this document for NUMA-specific tuning will be made available to simplify and standardize this tuning process.

Synthetic benchmarks such as fio or Orion provide direct measurement of ION Data Accelerator storage capabilities. The flexibility of these tools is extremely useful when tuning storage configurations and initiator system parameters, and the comparable results achieved by fio and Orion indicate that either tool is sufficient. The configuration used at Fusion-io in San Jose was capable of sustaining 700,000 random IOPS and up to 7GB/s of bandwidth, but there were indicators that the DL980 could have sustained even greater numbers when used in combination with more ioDrives in the ION Data Accelerator.

However, synthetic benchmark performance alone does not guarantee user application performance. Additional system parameters must be tuned for Oracle, and appropriate tests must be used to identify the maximum performance for each specific workload. Oracle produced a read bandwidth of up to 6GB/s and a mixed bandwidth of nearly 3.5GB/s. While these numbers may seem lower than those seen from fio, they are very close to the results of an fio test with a similar read/write balance and average block size. The close proximity of the Oracle results to the fio results indicates that Oracle has been tuned to take full advantage of the performance of the storage. Tests in Swingbench were measured at up to 1,150,103 TPM, but this number is only useful when compared to other Swingbench results.

NUMA support is an active topic in Linux development. As newer distributions become available and their built-in tools improve, it is likely that less manual tuning will be necessary. While the tuning applied by the provided script is not currently persistent, methods are being investigated to provide automatic tuning at boot time as well as upon device discovery. When configured properly, the DL980 is a very powerful Oracle initiator for use with the ION Data Accelerator.
Glossary
________________________________________________________________________

Initiator – An initiator of I/O is analogous to a client in a client/server system. Initiators use a SCSI transport protocol to access block storage over a network. A database or mail server is an example of an initiator.

LUN – Logical Unit Number. Targets furnish containers for I/O that are a contiguous array of blocks identified by logical unit number. A LUN is usually synonymous with a physical disk drive, since initiators perceive it as such. For ION Data Accelerator, a LUN is a volume that has been exported to one or more initiators.

Pool – An aggregation of ioMemory or RAIDset block devices. Block devices can be added to a pool.

Target – The opposite of an initiator; a receiver of I/O operations, analogous to a server in a client/server system. The target for I/O is the provider of (network) storage – a SAN disk array is a traditional target. ION Data Accelerator, by comparison, is an all-flash storage target.

Volume – A logical construct identifying a unit of data storage. A volume is allocated to allow for expandability within the space constraints of a pool. For ION Data Accelerator, a volume is not necessarily directly linked to a physical device.
Appendix A: Tuning Checklist
________________________________________________________________________

The following is a complete checklist of the tuning steps described in this document, for use as a quick reference:
1. Check initiator HBA slot locations.
2. Check the ION storage profile.
3. Verify that a sufficient number of ION volumes are used.
4. Verify that a sufficient number of LUN paths are used.
5. Verify that LUN paths are distributed so all fabric resources are balanced.
6. Verify that all LUNs for each volume are presented only to HBAs within one NUMA node.
7. Update the BIOS and verify that NUMA distances are detected properly.
8. Set the BIOS power profile to Maximum Performance.
9. Verify that C-states are disabled in the BIOS.
10. Enable hyper-threading in the BIOS settings.
11. Disable virtualization and VT-d in the BIOS if not needed.
12. Check the addressing mode in the BIOS.
13. Disable x2APIC in the BIOS.
14. Verify that the multipath path_selector is queue-length.
15. Disable processor C-states with boot parameters.
16. Install the iontuner RPM (tunes block devices with udev rules, disables the cpuspeed daemon, disables the irqbalance daemon, and pins IRQs).
17. Use fio or Orion commands generated by iontuner when testing baseline performance.
18. Configure HugePages for Oracle.
19. Configure sysctl parameters for Oracle.
20. Configure Oracle initialization parameters, including _enable_NUMA_support and use_large_pages.
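Items 14 and 15 of the checklist can be expressed as configuration fragments. The following is a sketch only: the path_selector syntax is standard device-mapper-multipath, and intel_idle.max_cstate / processor.max_cstate are standard Linux kernel parameters, but the exact values and file locations should be verified against the distribution in use.

```
# /etc/multipath.conf (excerpt) – checklist item 14
defaults {
        path_selector    "queue-length 0"
}

# Kernel boot parameters for disabling deep C-states – checklist item 15
# (appended to the kernel line in the boot loader configuration)
intel_idle.max_cstate=0 processor.max_cstate=1
```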
Appendix B: Speeding up Oracle Database Performance with ioMemory – an HP Session
________________________________________________________________________

This appendix is adapted from a session presented at the HP ExpertOne Technology & Solutions Summit, Dec. 2012 in Frankfurt, Germany.

ARCHITECTURE OVERVIEW

The diagram below shows the basic topology for shared NAND flash storage: the ION Data Accelerator connected to database server nodes across a fabric. I/O bottlenecks in a shared storage system can be removed by strategically placing transaction logs, the TempDB, hot (frequently accessed) tables, or the entire database on ioMemory in the ION Data Accelerator.
ABOUT ION DATA ACCELERATOR

An ION Data Accelerator system consists of the following basic components:

ION Data Accelerator Software – runs with a GUI or CLI, transforming tier 1 servers into an open shared flash resource. Up to 20x performance improvement has been achieved, compared to traditional disk-based shared storage systems.

Fusion ioMemory – is proven, tested, reliable, and fast, with thousands of satisfied customers worldwide.

Open System Platforms – ION Data Accelerator software runs on a variety of tier 1 servers, providing industry-leading performance, reliability, and capacity. Hundreds of thousands of these servers are deployed in enterprises today. Supported network protocols include Fibre Channel, SRP/InfiniBand, and iSCSI.

ION Data Accelerator Software

The ION Data Accelerator software running on the host server
• Is optimized for ioMemory
• Works on industry-standard servers
• Supports JBOD, RAID 0, and RAID 10 modes (including spare drives)
• Provides GUI, CLI, SMIS, and SNMP access
• Is easy to configure
• Enables software-defined storage

Fusion-Powered Storage Stack

The following diagram shows the elements of a Fusion-powered software/hardware stack, layered on a tier 1 server:
• Application – your application
• ION Software – transforms the server into a storage target
• VSL (Virtual Storage Layer) – a purpose-built flash access layer
• ioMemory – fast, reliable, cost-effective flash memory in a PCIe form factor
Why ION Data Accelerator?

ION Data Accelerator provides the following advantages:
• It is a highly efficient shared storage target.
• With its low latency, high IOPS, and high bandwidth, it can accelerate writes and reads in a variety of environments, including SAP, SQL, Navision, Oracle, VMware, etc.
• It outperforms even cache hits from storage array vendors.

Because of the increased performance that ION Data Accelerator achieves, customers can
• Support more concurrent users
• Lower response times
• Run queries and reports faster
• Finish batch jobs in less time
• Increase application stability

ABOUT ION DATA ACCELERATOR HA (HIGH AVAILABILITY)

ION Data Accelerator enables a powerful and effective HA (High Availability) environment for your shared storage, when HA licensing is enabled.
The diagram below shows basic LUN access (exported volumes) in an HA configuration.

PERFORMANCE TEST RESULTS: HP DL380 / HP DL980

The following charts show performance results for an HP DL380 target running ION Data Accelerator, with an HP DL980 initiator.
OVERVIEW OF THE ION DATA ACCELERATOR GUI

Summary screen:

Creating a storage profile for the storage pool:
Creating volumes from the storage pool:

Setting up an initiator group (LUN masking) to access volumes:
Managing initiators:

Editing initiator access:
Managing volumes:

COMPARATIVE SOLUTIONS

The diagram below shows a winning solution for ION Data Accelerator and Oracle, compared with a rival array (the diagram shows a 3PAR T400). The database server shown is configured as follows:
• HP DL980
• Red Hat 6
• 64 or 80 cores Intel E7
• 1 TB memory
• Oracle SGA: 700 GB
Redo logs, hot tables, and the TempDB are placed on the HP IO Accelerator, while other applications and tablespaces remain on the array.
The table below illustrates the competitive advantages of ION Data Accelerator, with ION winning on every comparison point:
• Open Systems Server Foundation – Fusion-io relies on time-tested open systems server hardware, while competitors are proprietary.
• Fusion-io Adaptive Flashback vs. Competitor RAID – VSL with Adaptive Flashback provides two orders of magnitude better media error rates.
• ION RAID vs. Competition – ION provides more flexibility with JBOD, RAID-0, and RAID-10 vs. one static configuration option.
• Street Price ($/GB) – Fusion-io delivers a solution estimated to be at least 30% lower cost/GB.
• Price/IOPS – Fusion-io is the clear winner.
• Power – Fusion-io draws less power.

BEST PRACTICES

The following best practices are important to follow in order to achieve top performance for Oracle testing:
• Present 16 to 32 LUNs to the host for maximum performance.
• Use the noop scheduler.
• Use round robin for multipath.conf.
• When using a DL980 as the load generator, make sure to pin the I/O-issuing processes. It matters less which nodes the processes are pinned to, as long as they are pinned.
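The pinning recommendation above can be illustrated with Linux CPU affinity. A minimal sketch using Python's os.sched_setaffinity (Linux-only); the CPU subset chosen here is a placeholder — in practice it would be the CPUs of one NUMA node, obtained for example from numactl --hardware:

```python
import os

# Pin the current process (and the I/O it issues) to a fixed CPU set.
# Which CPUs are chosen matters less than the process staying pinned.
available = sorted(os.sched_getaffinity(0))   # CPUs we are allowed to run on
node_cpus = set(available[:2])                # placeholder for one NUMA node's CPUs

os.sched_setaffinity(0, node_cpus)            # pin this process
assert os.sched_getaffinity(0) == node_cpus
```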
The maximum performance configuration shown below – a DL980 with four HBAs spread across both IOHs, connected through a switch to the ION server's two HBAs – achieved about 700K IOPS.

BENCHMARK TEST CONFIGURATION

Below is a proof-of-concept configuration that can be extended in any direction. A single server can achieve 600K IOPS at a 4KB block size. Below are the system configurations for the storage server (ION Data Accelerator appliance) and the database server.

Storage Server
• DL380p Gen8, 2 socket, 2.9GHz
• 4 x 2.4TB HP IO Accelerator
• 2 x dual-port 8Gbit Fibre Channel
Database Server
• DL980 G7, 8 sockets / 80 cores, 1TB RAM
• 4 x dual-port 8Gbit Fibre Channel

RAW PERFORMANCE TEST RESULTS WITH FIO

(Chart: total IOPS versus number of jobs and queue depth.) ION Data Accelerator with RAID 0, 2 RAIDSETS, 32 LUNs at 4KB block size, 100% read.
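As a back-of-the-envelope check on the configuration above, the quoted 600K IOPS at a 4KB block size implies roughly 2.4 GB/s of read bandwidth (assuming 4 KiB = 4096-byte requests):

```python
# Convert an IOPS figure to bandwidth for a fixed block size.
iops = 600_000
block_bytes = 4096                      # 4 KiB requests (assumption)
bandwidth_mb_s = iops * block_bytes / 1_000_000
print(bandwidth_mb_s)                   # 2457.6 MB/s
```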
(Chart: average completion latency in microseconds, plotted together with IOPS, versus number of jobs.) ION Data Accelerator with RAID 0, 2 RAIDSETS, 32 LUNs at 4KB block size, 100% read, Qdepth = 4.

Raw I/O Test: 70% Read, 30% Write

ION Data Accelerator with RAID 0, 2 RAIDSETS, 16 LUNs at 4KB block size.
Raw I/O Test: 100% Read at 8KB

(Chart: IOPS versus number of jobs and queue depth.) ION Data Accelerator with RAID 0, 2 RAIDSETS, 32 LUNs at 8KB block size.

Raw I/O Test: Read Latency (Microseconds)

(Chart: read latency versus number of jobs and queue depth.) ION Data Accelerator with RAID 0, 2 RAIDSETS, 32 LUNs at 8KB block size.
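The IOPS and latency charts are linked by Little's law: the number of outstanding I/Os equals throughput times mean completion latency. As an illustration (the throughput figure here is hypothetical, not read from the charts):

```python
# Little's law for an I/O benchmark: concurrency = throughput * latency.
jobs = 64
qdepth = 4
outstanding = jobs * qdepth          # 256 I/Os in flight

iops = 640_000                       # hypothetical throughput
latency_us = outstanding / iops * 1e6
print(latency_us)                    # 400.0 microseconds implied
```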
ORACLE WORKLOAD TESTS

The following configuration was used for Oracle workload testing:

Database
• 1TB of data
• Tables from a million to a billion rows

Data Access Pattern
• Sequential write
• Data load (bulk load, real-time)
• Full table scan
• Select data via index
• Update data via index

(Chart: throughput in MB/sec.)
(Charts: MB/sec and IOPS versus number of processes; approximately 2.2 GB/sec random read.)
(Chart: IOPS versus number of processes.)
• Up to 2.5 GB/sec write
• Up to 300 MB/sec redo log
• CPU load: 21% max

Load generator: hammerora from http://hammerora.sourceforge.net
• 1 TB database size
• 80 users
• 10ms delay
CPU load: 33%, with almost no I/O wait.
