I/O Microbenchmarking with Oracle in Mind
 

This presentation, which I gave at the 2006 Hotsos Symposium, discusses a passion of mine: micro-benchmarking things that are actually relevant to one's mission! All too often, I've seen people obsess over results from ad-hoc testing that seem to indicate a problem when, in fact, their tests bear no real resemblance to the demands of their actual workloads! The principles discussed here are also important outside the realm of Oracle.

    Presentation Transcript

    • I/O Microbenchmarking - with Oracle in Mind
      Hotsos Symposium, Dallas, TX, March 7, 2006
      Bob Sneed, Sr. Staff Engineer
      Performance, Availability, and Architecture Engineering (PA2E) Group, Sun Microsystems, Inc.
      Rev 0.7 - 3/7/2006
    • Agenda
      • Preliminaries
      • The Devil is in the Details
      • What Oracle Actually Uses
      • Tool Roundup
      • Use Cases
      Copyright © 2006 by Sun Microsystems, Inc. All rights reserved.
    • Preliminaries
    • About the Presenter: Bob Sneed
      • With Sun since 1996
        > Lots of “fly and fix” or “smoke-jumping” work around Y2K
        > Lots of that hinged on I/O issues and memory management
      • With PA2E since 2000
        > Overall PA2E team does Sun product optimization, modeling for SPARC chip and architecture design, availability modeling, performance tools, and direct work with Oracle and select other ISVs
        > Bob's projects center around “Customer Focus” activities: engineering to actual customer requirements, Best Practices KM, and service delivery (in the performance space)
      • Related publications
        > “Sun/Oracle Best Practices”
        > “Oracle I/O: Supply and Demand”
    • Disclaimers
      Opinions and views expressed herein are those of the author, Bob Sneed, and do not represent any official opinion of Sun Microsystems, Inc.
      I am not a doctor - and I don't even play one on TV.
      If you goof up doing this stuff on your system and destroy all your data – it's not my fault or Sun's.
      There is no warranty, expressed or implied, in the quality of the information herein.
      This material is version 0.x. Further development is planned this year.
      Batteries not included. Your mileage may vary (YMMV).
    • What is Microbenchmarking?
      • A working definition: “The use of small synthetic workloads for the sake of evaluating the relative merit of specific and relevant APIs and configuration options.”
      • Compared to 'simulation', microbenchmarking is 'simpler' – but simulation-capable software can often be used for microbenchmarking (e.g., filebench)
      • The lines are not totally clear between other forms of testing (e.g., 'some testing', 'an exercise', or 'an experiment') and 'microbenchmarking'
      • Mainly, 'micro' implies 'small' ...
    • Why Microbenchmark?
      • Suboptimal I/O configuration is the #1 platform-level root cause for performance complaints in scaled databases on Sun systems worldwide
      • It is an extremely useful and accessible means of gaining insight into the I/O stack and storage configuration options
      • 'Hands-on' empirical methods are best for learning
      • It is much less expensive and less complicated than most real Oracle benchmarks; you do not even need to have Oracle installed
      • Microbenchmarks can provide good 'sanity check' metrics for any given storage configuration
      • It is fun ... once you are doing it ... correctly!
    • Is Microbenchmarking Dangerous?
      • Absolutely! Many bad things can happen, including ...
        > Wasting time testing irrelevant things
        > Wasting money testing irrelevant things
        > Misinterpretation of results leading to bad policy decisions
        > Inadvertent destruction of data (Hey - you break it, you bought it!)
      • But – not microbenchmarking is also dangerous ...
        > Making sub-optimal I/O choices creates risk ...
        > Risk of compromising end-user experience and elevated support costs
        > Risk of avoidable hardware over-provisioning
        > Risk of wasted money on 'high performance' options – that aren't
        > Risk of excessive Oracle consultancy to 'tune around' the I/O stack
          – Of course, for Oracle consultants this might be called 'opportunity'.
        > Risk of avoidable system instability and non-linearity
          – Some choices are far more linear and stable than others!
    • Time is Money
      • End-user experience
        > Good performance enables them to add value to the enterprise
      • Oracle & storage tuning efforts
        > Good platform configuration can avoid some of the costs
      • Support interactions
        > Expensive for customers
        > Expensive for vendors
    • The Devil is in the Details
    • The I/O Stack is Another Session
      • See also: “Oracle I/O: Supply and Demand” - a whitepaper that surveys the I/O stack
        > It's dated, and does not discuss QFS, ODM, ASM, RAC/grid, or the tradeoff of conventional versus direct I/O - but it's a pretty good overview nonetheless
    • Common Microbenchmarking Errors
      • Running irrelevant tests
        > Test should mimic some part of Oracle's operation
      • Failure to pay attention to initial (pre-test) state influences
        > Beware the 'warm cache'!
      • Failure to repeat tests to assure repeatability of results
        > By inference, repeat setting initial state
      • Leaping to conclusions
        > The 'why' of many results may not be obvious
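The repeatability point above can be made concrete with a small harness. The following is a minimal sketch, not part of the original talk: the names `repeat_timed` and `summarize` are hypothetical, and the "reset initial state" step is left as a caller-supplied function because the right reset (remounting a filesystem, flushing a cache, recreating a file) depends entirely on what is being tested.

```python
import statistics
import time


def repeat_timed(test, setup, runs=5):
    """Call setup() then test() `runs` times; return elapsed seconds per run."""
    samples = []
    for _ in range(runs):
        setup()                      # restore the same initial state every run
        start = time.perf_counter()
        test()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation, stdev/mean) for the samples."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    spread = stdev / mean if mean > 0 else 0.0
    return mean, stdev, spread


if __name__ == "__main__":
    # Hypothetical stand-in workload; substitute a real I/O test and a real
    # state-reset step for the two lambdas.
    samples = repeat_timed(test=lambda: sum(range(200_000)), setup=lambda: None)
    mean, stdev, spread = summarize(samples)
    print(f"mean={mean:.6f}s stdev={stdev:.6f}s spread={spread:.1%}")
```

A large spread between runs is itself a finding: it usually means the initial state was not really the same each time, or that something other than the variable under test is influencing the result.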
    • Confusing Lingo!
      • In Solaris ...
        > The opposite of 'asynchronous' is 'blocking'
        > The opposite of 'synchronous' is 'deferred'
        > Oracle LGWR and DBWR default to asynchronously-managed (AIO) operations with synchronous completion criteria; got that?
        > Oracle LGWR and DBWR cannot be made non-synchronous
      • In Oracle, 'Wait Event' names cause much confusion ...
        > 'db file sequential read' events – physically RANDOM I/O
        > 'db file scattered read' events – physically SEQUENTIAL I/O
        > 'db file parallel write' events – physically RANDOM I/O
        > RTFM: “Oracle Wait Interface: A Practical Guide to Performance Diagnostics and Tuning” - BUT: IGNORE the recommendations on Page 134!
    • About 'Latency'
      • Definition: “the elapsed time of a single operation”
      • Four strategies for combating latency
        > Don't do the work (“the best I/O is one that never happens”)
        > Add concurrency (“many hands make light work”)
        > Increase work per operation (“work smarter, not harder”)
        > Improve the physics (“C – it's not just a good idea, it's the law!”)
      • About SCSI spindle write caches ...
        > Disabled by default in Solaris – for safety's sake
        > Enabled by default in Windows and Linux
        > Unsupported: vary using 'format -e' (expert mode) in Solaris ...
    • What Oracle Actually Uses
    • High-Level Viewpoint
      • 'Traditional' I/O access (sole focus today)
        > 'Conventional path'
          – LGWR logs changes
          – DBWR checkpoints dirty blocks to disk
        > 'Direct path'
          – Shadow processes write directly to target files
      • 'New' I/O modes (some other day, some other way)
        > Oracle Disk Manager (ODM)
          – An Oracle-defined API, only implemented by VxFS
          – When used, truss shows lots of ioctl() calls
        > Automatic Storage Management (ASM)
          – <Not yet investigated> (Sorry; I've been busy!)
        > No microbenchmarks known to exist for 'New' I/O methods - at least - not to Bob
    • WARNING: C Code Ahead!
      • Not a C programmer?
        > Please remain seated!
      • Out-of-scope for a DBA?
        > Wrong! This stuff is fundamental!
    • In the Beginning ...
      • UNIX devices were simple ...
        > open()
        > close()
        > read()
        > write()
        > seek()
        > ioctl() - catch-all for other functions
    • Now, APIs Abound ...
      • Diversity serving performance ...
        > open(), open64()
        > close()
        > read(), pread(), aioread(), aio_read(), readv(), lio_listio()
        > write(), pwrite(), aiowrite(), aio_write(), writev(), lio_listio()
        > seek() - integrated in modern I/O calls
        > ioctl() - catch-all for other functions – including Veritas VxFS Oracle Disk Manager (ODM) implementation
        > mmap() - memory-mapped I/O
    • Some Major Oracle I/O Categories
      • LGWR
        > open(...O_DSYNC...), aiowrite(), aiowait(), aio_waitn()
      • DBWR
        > open(...O_DSYNC...), aiowrite(), aiowait(), aio_waitn()
      • MBRC reads
        > pread(), aioread() traditionally with PQO
      • Single-block reads
        > pread()
      • ARCH
        > Deferred writes on output files
      • Direct path writes
        > Deferred pwrite(), with periodic fsync() to flush
      • Plus lots of 'it depends', varying by Oracle version
        > readv(), lio_listio() also used some, but mmap() is not - AFAIK
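For readers who have not used these calls, here is a minimal sketch of two of the patterns listed above: writes through a descriptor opened with O_DSYNC, and a single-block pread() at an explicit offset. It is written in Python rather than C only because Python's `os` module wraps the same POSIX calls one-to-one; the 8 KB `BLOCK` size, the function names, and the fallback used when O_DSYNC is not defined on a platform are all assumptions made for illustration. It does not exercise the Solaris AIO calls (aiowrite(), aiowait()), which Python does not expose.

```python
import os
import tempfile

BLOCK = 8192  # assumed database block size, for illustration only

# O_DSYNC is not defined on every platform; fall back to O_SYNC, then to 0.
DSYNC = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))


def write_blocks_dsync(path, nblocks):
    """Write nblocks BLOCK-sized blocks, each with synchronous completion."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | DSYNC, 0o600)
    try:
        for i in range(nblocks):
            # pwrite() carries its own offset, so no separate seek is needed.
            os.pwrite(fd, bytes([i % 256]) * BLOCK, i * BLOCK)
    finally:
        os.close(fd)


def read_block(path, blockno):
    """Read one block at an explicit offset, i.e. a single-block pread()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, BLOCK, blockno * BLOCK)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blocks.dat")
        write_blocks_dsync(path, 16)
        data = read_block(path, 5)
        print(len(data), data[:1])
```

Running a sketch like this under truss (or strace on Linux) is a quick way to see how closely a test's system calls resemble the ones in the list above.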
    • Some Major Oracle I/O Tunables
      • disk_asynch_io
        > Default TRUE; disabling indicates a dubious filesystem choice
      • db_writer_processes
        > One is usually enough; many cannot achieve the same demand concurrency as one using AIO
        > 10g on SMP uses one per memory locale (lgroup)
      • db_cache_size
        > More efficient and performance-scalable than filesystem cache
      • db_file_multiblock_read_count
        > Size of FFS read operations
        > WARNING: Warps optimizer decisions!
      • small_table_threshold
        > One (of many) ways to leverage large db_cache_size
    • Tool Roundup
    • Tool Categories
      • Included with the Operating System
        > mkfile, cp (inappropriate)
        > dd (sometimes useful)
      • Sun controlled distribution
        > vdbench – portable, basis for SPC-1 standard benchmark
        > Sun StorEdge Analysis Tool (SWAT) – fancy data visualization!
      • 3rd-party
        > vxbench (Veritas)
      • Open source – sophisticated stuff! (find w/ Google)
        > filebench – Sun-promoted, featureful – worth a whole preso!
        > iozone – Popular, featureful, often misused – worth some study!
      • Roll-your-own – simple tools
        > For example, Bob's K.I.S.S. codes: wfile & iox
        > Free code, will be downloadable from http://solarisinternals.com ... (E-Mail Bob.Sneed@Sun.Com with 'iox' in subject line until it gets posted)
    • Why Not Use 'cp' and 'mkfile'?
      • Because they do not resemble Oracle operations at all!
      • Use truss to confirm ...
      <<< Lab/demo goes here >>>
    • Advice: Keep a Scientific Perspective
      • There is always a bottleneck somewhere
        > System memory?
        > CPU speed?
        > HBA/channel/bus speed?
        > Some I/O library or implementation detail?
        > A bug that's already been patched - but not on your system?
        > Fully-cached performance?
        > Actual moving parts?
        > What the storage can do?
        > What the storage is likely to do for Oracle?
      • You can design an experiment to test any of these – and you may find one of these explaining your results – by surprise ...
    • Use Cases
    • Before We Start ...
      • Pay no attention to the absolute numbers today ...
        > This test equipment is cobbled-together parts
        > It's the relative goodness of different options that matters
      • About the test system ...
        > Sun Ultra 60, Dual 450 MHz UltraSPARC II, 1 GB RAM
        > Six-disk Sun UltraSCSI 'multipack' with LVD-160 disks inside
          – You could do most of this stuff with the second internal disk alone
        > All loaded software is downloadable for free
          – Solaris 10, Update 1
          – Sun Studio 11 and GCC for SPARC Systems (GCCFSS cool tools)
          – Various microbenchmarking codes
        > ... all very affordable on eBay these days! (In other words – there is no excuse for not having test equipment in your shop!!)
    • What to test?

                        OS Buffered            Unbuffered
      Concurrent        QFS qwrite             RAW
                        VxFS CQIO              UFS direct I/O
                                               QFS direct,qwrite
                                               VxFS QIO
                                               VxFS ODM
      Non-concurrent    Filesystem defaults    QFS direct I/O
                                               VxFS direct I/O
    • Q.E.D. - Bob's “Eye Chart” ...

                                      Cost     Logging  Write        OS-Level       Kernel AIO  Admin       Performance
                                                        Concurrency  Buffering [3]  (KAIO)      Complexity  Relative to RAW
      RAW                             FREE[1]  N/A      YES          NO             YES         HIGH        BASELINE
      UFS                             FREE     YES [2]  NO           YES            NO          VERY LOW
      UFS direct I/O                  FREE     YES [2]  YES          NO             NO          LOW         SIMILAR
      QFS                             $        N/A      NO           YES            NO          VERY LOW
      QFS qwrite                      -        N/A      YES          YES            NO          LOW
      QFS direct,qwrite,samaio        -        N/A      YES          NO             YES+        LOW         SIMILAR
      VxFS                            $        YES      NO           YES            NO          VERY LOW
      VxFS direct I/O                 -        YES      NO           NO             NO          LOW
      VxFS Quick I/O (QIO)            $++      YES      YES          NO             YES         HIGH        SIMILAR
      VxFS Cached Quick I/O (CQIO)    $++      YES      YES          YES            YES         HIGH
      VxFS Oracle Disk Manager (ODM)  $++      YES      YES          NO             YES         VERY LOW    SIMILAR

      [1] Unless, of course, a 3rd-party volume manager is used, like VxVM
      [2] Not ON by default in all Solaris versions; requires trivial setup
      [3] Includes prefetching, deferred writes, and read re-hits (may help) and overheads of segmap & 'extra copy' (may hurt)
    • A Few Words About Instrumentation
      • iostat -xnzTd
      • mpstat
      • vmstat / vmstat -p
      • prstat -m / prstat -mL
      • mdb -k (allows examining kernel settings)
      • kstat (stats on virtual memory and more)
      • A wall clock (a wristwatch or stopwatch will do)
      • A stethoscope (advanced!)
      • A spy glass (watch the blinking lights)
      • lockstat (see low-level locking)
      • plockstat (see application-level locking)
      • DTrace (Solaris 10) – for really advanced geeks
      • Mainly, though – what the application sees!
    • Fun Things to Test (1/2)
      • Test patterns ...
        > File creation speed versus re-write speed
        > File reading speed
        > Random read, write, and read/write performance
      • Easy stuff to vary ...
        > Native device latency
        > open() mode (esp. O_DSYNC)
        > Locality of demand
        > Concurrency of 'demand'
        > Concurrency of 'supply'
        > OS-level buffering
        > UFS direct I/O usage (affects both buffering and concurrency)
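The first test pattern above, file creation speed versus re-write speed, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the author's wfile tool: the function name `timed_write` is hypothetical, `filesize` is assumed to be a multiple of `writesize`, and the final fsync() is deliberately included in the timing so that deferred writes are not mistaken for completed ones.

```python
import os
import tempfile
import time


def timed_write(path, filesize, writesize, create):
    """Write filesize bytes in writesize chunks, fsync, return elapsed seconds.

    create=True truncates the file first, so every write must allocate space.
    create=False overwrites blocks that already exist (non-allocating writes).
    filesize is assumed to be a multiple of writesize.
    """
    flags = os.O_WRONLY | os.O_CREAT | (os.O_TRUNC if create else 0)
    buf = b"\0" * writesize
    start = time.perf_counter()
    fd = os.open(path, flags, 0o600)
    try:
        for offset in range(0, filesize, writesize):
            os.pwrite(fd, buf, offset)
        os.fsync(fd)                 # include the flush in the measurement
    finally:
        os.close(fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "target.dat")
        t_create = timed_write(path, 4 << 20, 128 << 10, create=True)
        t_rewrite = timed_write(path, 4 << 20, 128 << 10, create=False)
        print(f"create:  {t_create:.4f}s")
        print(f"rewrite: {t_rewrite:.4f}s")
```

On a small file that fits in a hardware or OS cache the two numbers may be nearly identical; the difference tends to show up when the allocating case forces metadata updates that the re-write case avoids.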
    • Fun Things to Test (2/2)
      • Other stuff to vary ...
        > Filesystem choice (UFS, QFS, VxFS)
        > Filesystem block size (esp. with VxFS)
        > Filesystem logging options
        > Filesystem versus raw device performance
        > Volume management options
          – VM choice (SVM, VxVM)
          – RAID options (0, 1, 5)
          – Stripe depth and width
        > UFS noatime option
        > UFS maxcontig tunable
        > HBA/LUN throttle (sd_max_throttle/ssd_max_throttle)
        > scsi_options (Often set incorrectly!)
      • In other words – any place there is a controllable variable!
        > Endless hours of fun!!
    • How Real Benchmark Engineers Do It
      • First – strategically plan and design everything
        > It gets easier with experience and known patterns for success
      • Next - configure the storage LUNs
        > Microbenchmark the LUNs
        > If performance not right, re-configure
      • Next - configure volume management
        > Microbenchmark raw volumes
        > If performance not right, re-configure
      • Next - configure filesystem
        > Microbenchmark filesystem objects
        > If performance not right, re-configure
      • Finally - Install and configure the database
        > If performance not right, back to the drawing board!
    • Sequential Writing – When It Matters
      • When creating data files
        > When creating a database or adding data files
        > When writing filesystem LOBs
        > Especially interesting in certain disaster-recovery scenarios
        > Generally: O_DSYNC allocating writes, 128 KB
      • When writing to REDO logs
        > Interesting when 'log file sync' waits are significant
        > O_DSYNC non-allocating writes, by default using AIO
      • When archiving REDO logs
        > Especially interesting at high REDO rates; potential for 'cannot switch log' database hangs
        > Deferred allocating writes
    • Sequential Writing – Performance Factors
      • Write size
      • O_DSYNC versus deferred
      • Concurrency of demand
      • Allocating versus non-allocating (metadata overhead)
      • Filesystem logging options (metadata efficiency)
        > Especially in space-allocating case!
      • Filesystem code path (e.g., UFS default versus direct)
      • Filesystem tunables (e.g., write throttles)
      • Write latency of target device (hardware caching)
        > Cache size relative to file size
      • Volume management factors
      • Path management factors
      • Interconnect, HBA factors, and target spindle technology
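Two of the factors above, write size and O_DSYNC versus deferred writes, are easy to vary in a small experiment. The sketch below is a rough stand-in for that kind of test and is not the wfile source: `sequential_write` is a hypothetical name, the O_SYNC fallback for platforms without O_DSYNC is an assumption, and the write sizes in the sweep are arbitrary examples.

```python
import os
import tempfile
import time

# O_DSYNC is not defined on every platform; fall back to O_SYNC, then to 0.
DSYNC = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))


def sequential_write(path, filesize, writesize, dsync):
    """Sequentially write filesize bytes; return (elapsed seconds, MB/s).

    dsync=True  -> O_DSYNC on open(), so each write completes synchronously.
    dsync=False -> deferred writes, followed by a single fsync() at the end.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | (DSYNC if dsync else 0)
    buf = b"\0" * writesize
    start = time.perf_counter()
    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        while written < filesize:
            # The slice trims the final write when filesize is not a multiple.
            written += os.write(fd, buf[: filesize - written])
        if not dsync:
            os.fsync(fd)
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return elapsed, (filesize / 1e6) / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "seq.dat")
        for writesize in (512, 8192, 131072):
            for dsync in (True, False):
                secs, mbps = sequential_write(path, 256 << 10, writesize, dsync)
                mode = "O_DSYNC " if dsync else "deferred"
                print(f"{mode} writesize={writesize:>6} {secs:.4f}s {mbps:.1f} MB/s")
```

Starting with the small synchronous case first mirrors the advice to begin with the slow case; the gap between the two modes typically narrows as the write size grows.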
    • Sequential Writing – wfile
      • How to proceed
        > Download
        > Examine
        > Build
        > Experiment
      • wfile – key characteristics
        > Default is O_DSYNC (start with the slow case!)
        > Free code, compact source
        > Simple command-line operation
    • wfile – Usage
        root_stray_10: wfile
        Usage: wfile [{+,-}{sync,dsync,fsync,direct} ...] <file> <filesize> [<writesize>]
        Where:
          '+/-sync'       controls O_SYNC on open() (default OFF)
          '+/-dsync'      controls O_DSYNC on open() (default ON)
          '+fsync_each'   fsync() each write() (default OFF)
          '-fsync_timed'  include final fsync() in times (default ON)
          '-fsync'        suppresses final fsync() (default OFF)
          '+/-direct'     controls directio() mode (default from fs mount option)
        Notes:
          fsync() is called by default after writing unless sync writing modes or '+fsync_each' are used.
          Final fsync is included in reported stats unless suppressed by '-fsync_timed'.
          <writesize> defaults to 512 bytes.
          'k', 'm', and 'g' syntax is allowed for <filesize> and <writesize>
    • wfile – Sample Commands
      <<< Demo/lab goes here >>>
    • Random I/O – When It Matters
      • Writes
        > Checkpoint writes
      • Reads
        > Fetching data and index blocks
    • Random I/O – Performance Factors
      • I/O size
      • Locality of reference
      • Demand concurrency
      • Supply concurrency
      • Exact API used (SUNW AIO, POSIX AIO, writev, listio)
      • All that other stuff ...
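Demand concurrency is the easiest of these factors to vary from a test program. The sketch below issues aligned random reads from a configurable number of threads and reports operations per second. It is only an approximation of the idea: it uses blocking pread() calls from a thread pool, not the SUNW AIO or POSIX AIO code paths named above, and the function name `random_reads`, its parameters, and its defaults are hypothetical. The target file must be at least one `iosize` long.

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def random_reads(path, iosize=8192, iocount=1000, dop=4, seed=0):
    """Issue iocount aligned random preads from dop threads; return IOPS."""
    nblocks = os.path.getsize(path) // iosize      # must be at least 1
    rng = random.Random(seed)
    offsets = [rng.randrange(nblocks) * iosize for _ in range(iocount)]
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=dop) as pool:
            # pread() takes an explicit offset, so threads can share one fd.
            sizes = list(pool.map(lambda off: len(os.pread(fd, iosize, off)),
                                  offsets))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    if any(n != iosize for n in sizes):
        raise IOError("short read encountered")
    return iocount / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rand.dat")
        with open(path, "wb") as f:
            f.write(os.urandom(8 << 20))
        for dop in (1, 4, 16):
            print(f"dop={dop:>2} {random_reads(path, dop=dop):.0f} reads/s")
```

A file that was just written will almost certainly be served from the OS cache, so the demo numbers say more about memory than about storage; the 'warm cache' warning applies here in full, and a meaningful run needs a file larger than the caches in the path or an unbuffered access mode.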
    • Random I/O – iox
      • How to proceed
        > Download
        > Examine
        > Build
        > Experiment
      • iox – key characteristics
        > Specifically exercises SUNW AIO code path ('saio')
        > Free code, compact source
        > Simple command-line operation
        > Emits statistics every 10 seconds (but not 'variance', yet)
    • iox – Usage
        root_stray_11: iox
        iox 0.7
        Usage: iox {option=value[,value] ...} {filename ...}
        Valid options (with defaults in paretheses) are:
          load={read|write|readwrite},{random|sequential}[,saio]
          [open={sync,dsync,direct,create,truncate,append}]
          [close={fsync,remove,truncate}]
          [duration=<seconds> - Run time (forever)
          [interval=<seconds> - Reporting interval (10)
          [filesize=<num>]    - Specify I/O range (file size)
          [iosize=<num>]      - I/O size (8192)
          [iocount=<num>]     - I/O count (infinite)
          [align=<num>]       - I/O alignment constraint (8192)
          [dop=<num>]         - Degree Of Parallelism (4 w/ seq, 256 w/ random)
          [us=<num>]          - Think time - usec per MB (0 seq)
          [pctread=<num>]     - Percent read vs. write (50)
          [timeout=<num>]     - AIO timeout threshold (600 sec)
          [seed=<num>]        - Seed for lrand48 (time())
          [grow=<bool>]       - Allow file to grow (NO)
          [core=<bool>]       - Suppress core on quit (YES)
    • iox – Sample Commands
      <<< Demo/lab goes here >>>
    • Back to the Eye Chart ...

                                      Cost     Logging  Write        OS-Level       Kernel AIO  Admin       Performance
                                                        Concurrency  Buffering [3]  (KAIO)      Complexity  Relative to RAW
      RAW                             FREE[1]  N/A      YES          NO             YES         HIGH        BASELINE
      UFS                             FREE     YES [2]  NO           YES            NO          VERY LOW
      UFS direct I/O                  FREE     YES [2]  YES          NO             NO          LOW         SIMILAR
      QFS                             $        N/A      NO           YES            NO          VERY LOW
      QFS qwrite                      -        N/A      YES          YES            NO          LOW
      QFS direct,qwrite,samaio        -        N/A      YES          NO             YES+        LOW         SIMILAR
      VxFS                            $        YES      NO           YES            NO          VERY LOW
      VxFS direct I/O                 -        YES      NO           NO             NO          LOW
      VxFS Quick I/O (QIO)            $++      YES      YES          NO             YES         HIGH        SIMILAR
      VxFS Cached Quick I/O (CQIO)    $++      YES      YES          YES            YES         HIGH
      VxFS Oracle Disk Manager (ODM)  $++      YES      YES          NO             YES         VERY LOW    SIMILAR

      [1] Unless, of course, a 3rd-party volume manager is used, like VxVM
      [2] Not ON by default in all Solaris versions; requires trivial setup
      [3] Includes prefetching, deferred writes, and read re-hits (may help) and overheads of segmap & 'extra copy' (may hurt)
    • Q&A