[INSIGHT OUT 2011] A23: Database I/O Performance - Measuring and Planning (Alex Gorbachev)

1. Database I/O Performance: Measuring and Planning
   Alex Gorbachev
   Insight-Out Database Symposium, Tokyo, 2011
2. Alex Gorbachev
   • CTO, The Pythian Group
   • Blogger
   • OakTable Network member
   • Oracle ACE Director
   • BattleAgainstAnyGuess.com
   • President, Oracle RAC SIG
3. Why Companies Trust Pythian
   • Recognized leader:
     • Global industry leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and SQL Server
     • Works with over 150 multinational companies such as Western Union, Fox Interactive Media, and MDS Inc. to help manage their complex IT deployments
   • Expertise:
     • One of the world's largest concentrations of dedicated, full-time DBA expertise
   • Global reach & scalability:
     • 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response
4. Why Measure I/O Performance?
   • Diagnostics & troubleshooting
   • Proof of impact
   • Capacity planning and monitoring
   • Platform validation / acceptance testing
5. Instrumentation: Storage Stack vs Oracle Database
   ➡ Oracle DB call: 1. read block, 2. read block, 3. latch free, 4. read block, 5. enqueue, 6. send result
     → we can profile a DB call
   ➡ Storage I/O call: UNKNOWN
     → we cannot profile an I/O call
6. Is Profiling an I/O Call Feasible?
7. Direct Attached Storage Stack
   Illustration from Guttina Srinivas's blog - http://guttinasrinivas.wordpress.com/
8. Simplified Enterprise Storage Stack
   Sample IBM storage stack - http://www.ibm.com/developerworks/tivoli/library/t-snaptsm1/index.html
9. (image-only slide)
10. The storage stack is too complex and heterogeneous to build an end-to-end I/O profile
11. Sources of I/O Performance Measurements
    • Database as an application consuming I/O services - MUST HAVE
    • Drill down into the rest of the I/O stack - complementary:
      • ASM
      • Operating system
      • Storage arrays
      • ...
12. How is I/O Measured in the Database?
    • I/O code paths (syscalls) are instrumented - I/O waits
      • timed_statistics = true
    • Additional statistics are collected
      • I/O size, amount, time spent
    • Granularity at different levels
      • Global, session, datafile, service, module/action
    • Stored in the SGA as cumulative counters - X$ tables
      • Externalized via V$ views
      • Snapshots taken by various tools like Statspack, AWR, Snapper, etc.
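    For example, the cumulative I/O wait counters mentioned above can be read straight from the V$ views; a minimal sketch (instance-level, cumulative since startup, so sample twice and diff for rates):

      SELECT event,
             total_waits,
             ROUND(time_waited_micro / 1e6, 1)                          AS seconds_waited,
             ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1e3, 3) AS avg_wait_ms
      FROM   v$system_event
      WHERE  event IN ('db file sequential read', 'db file scattered read',
                       'direct path read', 'log file parallel write')
      ORDER  BY time_waited_micro DESC;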
13. WHAT Do We Measure?
    • Response time
    • Throughput / bandwidth
    • Skew & patterns
    I/O measurements are almost always aggregate!
14. Reproducible issue? ➡ 10046 trace ➡ response time, skew & patterns
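    A minimal sketch of turning extended SQL trace on and off (level 8 includes wait events; DBMS_MONITOR is the supported 10g+ interface for tracing another session; :sid and :serial are placeholders):

      -- Own session:
      ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';
      -- ... run the workload that reproduces the issue ...
      ALTER SESSION SET EVENTS '10046 trace name context off';

      -- Another session, identified by SID and SERIAL#:
      BEGIN
        DBMS_MONITOR.session_trace_enable(session_id => :sid, serial_num => :serial,
                                          waits => TRUE, binds => FALSE);
      END;
      /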
15. Mr Tools (Method R Tools) - The Time-Saver
16. Example Profile: 4+ hour batch job

    Wait Event / Syscall            DURATION               CALLS    MEAN        MIN         MAX
    ------------------------------  ---------------------  -------  ----------  ----------  -----------
    db file sequential read          11861.295517   81.4%   201940    0.058737    0.000000     5.473023
    log file switch (checkpoint..     1941.262523   13.3%       49   39.617603    0.001443   211.405054
    PL/SQL lock timer                  764.452061    5.2%      765    0.999284    0.000008     1.003142
    log buffer space                     0.149762    0.0%        8    0.018720    0.006973     0.030125
    undo segment extension               0.126689    0.0%       19    0.006668    0.001265     0.033682
    6 others                             0.201454    0.0%       14    0.014390    0.000004     0.059468
    ------------------------------  ---------------------  -------  ----------  ----------  -----------
    TOTAL (11)                       14567.488006  100.0%   202795    0.071834    0.000000   211.405054
17. I/O Response Time Histogram

    Matched event names: db file sequential read
    Options:
      group =
      name  = db file sequential read
      where = 1

    RANGE {min <= e < max}      DURATION               CALLS    MEAN
    --------------------------  ---------------------  -------  ----------
    0.000000     0.000001           0.000000    0.0%        14    0.000000
    0.000001     0.000010           0.000021    0.0%         8    0.000003
    0.000010     0.000100           0.008654    0.0%       180    0.000048
    0.000100     0.001000          41.040579    0.3%     86617    0.000474
    0.001000     0.010000         201.892556    1.7%     36305    0.005561
    0.010000     0.100000        1435.417470   12.1%     66754    0.021503
    0.100000     1.000000        3730.265905   31.4%      9059    0.411775
    1.000000     10.000000       6452.670332   54.4%      3003    2.148741
    10.000000    100.000000         0.000000    0.0%         0
    100.000000   1000.000000        0.000000    0.0%         0
    1000.000000  Infinity           0.000000    0.0%         0
    --------------------------  ---------------------  -------  ----------
    TOTAL (8)                     11861.295517  100.0%   201940    0.058737
18. Datafile Skew?

    Matched event names: db file sequential read
    Options:
      group = $p1
      name  = db file sequential read
      where = 1

    File ID     DURATION               CALLS    MEAN       MIN        MAX
    ----------  ---------------------  -------  ---------  ---------  ---------
    6            2383.052786   20.1%     40086   0.059449   0.000000   4.825304
    10           2131.333101   18.0%     21568   0.098819   0.000029   5.366355
    12           2065.204816   17.4%     35353   0.058417   0.000000   5.104831
    7            1870.332973   15.8%     32955   0.056754   0.000000   4.954959
    11           1711.504204   14.4%     39065   0.043812   0.000000   4.819981
    9            1659.888036   14.0%     23735   0.069934   0.000000   5.473023
    14             36.206148    0.3%      3141   0.011527   0.000063   4.442775
    8               3.532841    0.0%      5877   0.000601   0.000073   0.061977
    13              0.193044    0.0%       126   0.001532   0.000343   0.104574
    1               0.046855    0.0%        32   0.001464   0.000000   0.022407
    3               0.000713    0.0%         2   0.000357   0.000311   0.000402
    ----------  ---------------------  -------  ---------  ---------  ---------
    TOTAL (11)  11861.295517  100.0%    201940   0.058737   0.000000   5.473023
19. Analyzing Datafile Chunks

    Matched event names: db file sequential read
    Options:
      group = $p1*1000000000+int($p2*8192/1024/1024)
      name  = db file sequential read
      where = $ela>0.1

    File Chunk    DURATION               CALLS    MEAN       MIN        MAX
    ------------  ---------------------  -------  ---------  ---------  ---------
    10000008570     175.587622    1.7%       120   1.463230   0.134717   4.373926
    6000000381      173.669439    1.7%       119   1.459407   0.107691   3.713161
    10000008566     157.199899    1.5%       102   1.541175   0.167078   4.366412
    10000008565     147.466754    1.4%        98   1.504763   0.128982   4.538604
    6000008641      139.614461    1.4%        90   1.551272   0.127778   4.799470
    10000008567     120.733972    1.2%        89   1.356561   0.100613   4.564558
    9000008223      107.619815    1.1%        73   1.474244   0.118106   5.473023
    10000008563      95.949235    0.9%        72   1.332628   0.115185   3.580435
    9000008224       90.483791    0.9%        79   1.145364   0.129597   5.468010
    6000006191       86.307121    0.8%        78   1.106502   0.102094   3.876378
    4329 others     8888.304128   87.3%     11142   0.797730   0.100035   5.366355
    ------------  ---------------------  -------  ---------  ---------  ---------
    TOTAL (4339)  10182.936237  100.0%      12062   0.844216   0.100035   5.473023
20. Playing with Chunk Size

    Matched event names: db file sequential read
    Options:
      group = $p1*1000000000+int($p2*8192/1024/1024/16)
      name  = db file sequential read
      where = $ela>0.1

    File Chunk    DURATION               CALLS    MEAN       MIN        MAX
    ------------  ---------------------  -------  ---------  ---------  ---------
    10000000535     846.934923    8.3%       633   1.337970   0.100168   4.564558
    7000000029      315.398085    3.1%       353   0.893479   0.103097   3.670991
    6000000023      280.162428    2.8%       330   0.848977   0.100183   3.713161
    12000000171     261.555298    2.6%       268   0.975953   0.103535   4.014043
    12000000170     193.130501    1.9%       166   1.163437   0.102184   3.937978
    9000000513      175.100649    1.7%       124   1.412102   0.118106   5.473023
    7000000157      173.111037    1.7%       160   1.081944   0.102949   4.237775
    6000000540      140.663440    1.4%        91   1.545752   0.127778   4.799470
    6000000386      130.590608    1.3%       172   0.759248   0.100873   3.876378
    11000000156     122.062914    1.2%       135   0.904170   0.100622   3.748086
    447 others     7544.226354   74.1%      9630   0.783409   0.100035   5.468010
    ------------  ---------------------  -------  ---------  ---------  ---------
    TOTAL (457)   10182.936237  100.0%     12062   0.844216   0.100035   5.473023
21. Time Periods Analysis
    (chart: one-minute average I/O response time, in seconds, plotted across the duration of the batch run)
22. 10046 Trace Is Expensive... NOT!
    • 10046 tracing overhead is insignificant
    • This sample 4+ hour batch - trace <30 MB with 300K+ lines
      • 10x compressed - 3 MB
    • 30 batches per night - <1 GB of traces
      • 10x compressed - 100 MB per night
    One month of complete 10046 trace batch history is only 3 GB compressed
23. Storing 3 GB of data on Amazon S3 costs less than $1 per month
24. What Does 10046 Not Buy You?
    • Throughput
      • Doable, but needs quite a few traces to enable and process
      • No accounting for non-database workload
    • No visibility into how each I/O call translates into "real" I/Os
      • Real I/Os - requests done by the DB server OS?
      • Real I/Os - requests done by a SAN controller?
      • Real I/Os - requests served by a disk controller?
      • Caching impact
25. Measuring Throughput
    • Database
      • AWR & Statspack
    • Host
      • OS tools, like sar, iostat, DTrace
    • Storage array
      • Storage vendor tools, like EMC Symmetrix Performance Analyzer (SPA)
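    As a sketch of the database-side option, read throughput can be derived from AWR snapshot deltas of the cumulative statistic 'physical read total bytes' (single-instance assumption; counters reset on instance restart, so negative deltas mean a bounce):

      SELECT sn.begin_interval_time,
             ROUND( ( st.value - LAG(st.value) OVER (ORDER BY sn.snap_id) )
                    / ( (CAST(sn.end_interval_time AS DATE)
                         - CAST(sn.begin_interval_time AS DATE)) * 86400 )
                    / 1024 / 1024, 1)                       AS read_mb_per_sec
      FROM   dba_hist_sysstat  st
      JOIN   dba_hist_snapshot sn
        ON   sn.snap_id = st.snap_id
       AND   sn.dbid = st.dbid
       AND   sn.instance_number = st.instance_number
      WHERE  st.stat_name = 'physical read total bytes'
      ORDER  BY sn.snap_id;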
26. Average values make sense only if events, as well as response times, are perfectly randomly distributed
27. Don't Be Trapped by Averages!
    • Averaging response times
      • Losing skew info
      • Losing I/O call attributes
        • Sizes, offsets, data blocks
      • Losing scope - what transaction is this I/O request for?
    • Reduced time granularity
      • Traditional Statspack & AWR snaps are hourly
      • sar data is captured every 5 (or 10?) minutes by default
      • SAN stats are usually aggregated as high as 1 hour (SPA - 5 minutes?)
28. Choosing the Aggregation Interval
    • 24-hour running window
      • 95% of transactions should complete within 1 second
      • 99% of transactions should complete within 10 seconds
      • 10 seconds is the timeout, so 1% of transactions can fail and it's OK
    • 24 hours is 86,400 seconds => 1% is 864 seconds (14.4 min)
      • 1-hour intervals => a few minutes of hiccups won't be noticeable
      • 5-minute intervals => significant spikes of I/O response time will likely be noticeable
    • But we really want intervals within the typical transaction response times
29. The Random Arrivals concept applies 100% to I/O calls
    Detecting Random Arrivals rule violations requires an averaging interval close to the response time
30. Monitoring I/O Performance and SLAs
    • How do your transaction SLAs transform into I/O SLAs?
    • Percentile requirements
      • Commit to response time according to percentile requirements at pre-defined throughput and concurrency levels
      • *average* 2000 IOPS with up to 40 concurrent I/Os
      • 99% of I/Os < 10 ms, 99.9% of I/Os < 100 ms
      • 1-minute sliding window
    • Monitoring such SLAs - must average over 1 minute and collect response time histograms
31. Importance of Response Time Histograms
    • Including histograms in the snapshots adds more color to the averaged measures
    • A histogram is an indicator of skew
    • They help in selecting the right measurement interval
    • Histograms can be built on any value - not just response times
      • Histogram of I/O throughput per 5-minute interval to analyze whether we have bursts of I/O activity
    • Histograms in Statspack reports appeared in 10g
    • Histograms in AWR reports appeared in 11g
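    A minimal sketch of pulling such a histogram from the database itself (V$EVENT_HISTOGRAM buckets are cumulative since startup, with power-of-two millisecond upper bounds):

      SELECT wait_time_milli                                       AS bucket_up_to_ms,
             wait_count,
             ROUND(100 * wait_count / SUM(wait_count) OVER (), 1)  AS pct,
             ROUND(100 * SUM(wait_count) OVER (ORDER BY wait_time_milli)
                       / SUM(wait_count) OVER (), 1)               AS cum_pct
      FROM   v$event_histogram
      WHERE  event = 'db file sequential read'
      ORDER  BY wait_time_milli;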
32. A Tool to Collect Short Interval Averages
    • Requirements:
      • 1-minute or shorter intervals
      • Collect system-level I/O waits and stats
      • Collect session-level I/O waits and stats
      • Collect I/O response time histograms (system and session)
      • Nice to have - per service/module/action granularity
    • Production collection example (6 years old)
      • Oracle 9i RAC, HP-UX, 64 cores
      • Thousands of DB calls per second, thousands of I/O calls per second
      • *All* stats and waits with 1-5 minute snaps and at logoff
    • Tanel Poder's Snapper and Sesspack
33. ASH Data for I/O Measurements?
    V$ACTIVE_SESSION_HISTORY & DBA_HIST_ACTIVE_SESS_HISTORY
    • TIME_WAITED => 11.2 documentation is misleading
    • DELTA_TIME
    • DELTA_READ_IO_REQUESTS/BYTES
    • DELTA_WRITE_IO_REQUESTS/BYTES
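    A sketch of using the 11.2 DELTA_* columns to approximate per-session read I/O rates over the last 10 minutes (DELTA_TIME is in microseconds; ASH is sampled, so treat the result as an estimate):

      SELECT session_id,
             ROUND(SUM(delta_read_io_requests)
                   / NULLIF(SUM(delta_time) / 1e6, 0), 1)               AS approx_read_iops,
             ROUND(SUM(delta_read_io_bytes)
                   / NULLIF(SUM(delta_time) / 1e6, 0) / 1024 / 1024, 1) AS approx_read_mbps
      FROM   v$active_session_history
      WHERE  sample_time > SYSTIMESTAMP - INTERVAL '10' MINUTE
        AND  delta_time IS NOT NULL
      GROUP  BY session_id
      ORDER  BY approx_read_iops DESC;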
34. ASH itself is misleading for I/O performance measurements
    Sampling tends to hide short waits, invalidating it for any response time analysis
35. AWR Sources
    • DBA_HIST_EVENT_HISTOGRAM
    • DBA_HIST_FILEMETRIC_HISTORY *
    • DBA_HIST_FILESTATXS
    • DBA_HIST_IOSTAT_DETAIL/FILETYPE/FUNCTION
    • DBA_HIST_SERVICE_STAT
    • DBA_HIST_SESSMETRIC_HISTORY *
    • DBA_HIST_SQLSTAT
    • DBA_HIST_SYSTEM_EVENT
    • DBA_HIST_SYSSTAT
    • DBA_HIST_SYSMETRIC_HISTORY *
    * These views have a granularity of 1 minute
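    For example, DBA_HIST_EVENT_HISTOGRAM keeps the cumulative V$EVENT_HISTOGRAM buckets at every snapshot, so a per-interval histogram is a delta between consecutive snapshots; a single-instance sketch:

      SELECT sn.begin_interval_time,
             eh.wait_time_milli,
             eh.wait_count
               - LAG(eh.wait_count) OVER (PARTITION BY eh.wait_time_milli
                                          ORDER BY eh.snap_id)     AS waits_in_interval
      FROM   dba_hist_event_histogram eh
      JOIN   dba_hist_snapshot sn
        ON   sn.snap_id = eh.snap_id
       AND   sn.dbid = eh.dbid
       AND   sn.instance_number = eh.instance_number
      WHERE  eh.event_name = 'db file sequential read'
      ORDER  BY eh.snap_id, eh.wait_time_milli;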
36. AWR Example - DBA_HIST_SYSMETRIC_HISTORY
    -- Metrics of interest:
    -- Physical Reads Per Sec
    -- Physical Writes Per Sec
    -- I/O Requests per Second
    -- I/O Megabytes per Second
    -- Redo Generated Per Sec
    -- Average Synchronous Single-Block Read Latency

    SELECT begin_time, ROUND(value, 1) v
    FROM   dba_hist_sysmetric_history
    WHERE  metric_name = 'Average Synchronous Single-Block Read Latency'
    ORDER  BY 1;
37. V$SESSION_WAIT_HISTORY?
    • The last 10 wait events for each active session
    • Column WAIT_TIME_MICRO
      • Amount of time waited (in microseconds)
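    A minimal sketch (:sid is a placeholder for the session being watched):

      SELECT sid, seq#, event,
             wait_time_micro / 1000 AS wait_ms
      FROM   v$session_wait_history
      WHERE  sid = :sid
      ORDER  BY seq#;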
38. Measuring at the OS Layer
    • The OS is not really transparent for I/O requests
      • Has I/O request queues
      • Utilizes various I/O schedulers that decide on request priority
      • ASYNC I/O
      • Filesystems and buffered I/O
      • Impact of CPU scheduling
    • Time spent in the OS layer becomes important as we move to SSD and flash storage
    • Difficult to directly associate OS stats with DB stats
39. Measuring at the SAN Layer
    • Normally most of the I/O time is spent on physical disk, but...
      • Read cache impact
      • Write cache impact
      • Cache saturation situations
      • Abnormal situations like controller/switch failure
      • Quality of Service (QoS)
    • Flash-based storage shifts the balance of time again
      • The non-disk component of I/O response time becomes more prominent
    • Difficult to associate SAN stats with OS & DB stats
      • Virtualization kicks in
40. Exadata Storage Cell Measurement
    • Replacement of the SAN layer
    • More than just stats per disk / controller, etc.
      • The Storage Cell now performs more than just I/O functions
    • Much better accountability and association with the database
      • Database segment visibility in flash cache
      • IORM metrics - category, database, consumer groups
      • Flash Cache metrics
      • Cumulative and 1-minute aggregates
    • Some stats are passed back to the database
      • V$SYSSTAT, V$SQL, waits, XML cell stats in V$CELL_STATE
41. Increased Importance of a Low Latency Network
    • With traditional HDD random access times of 5-10 ms
      ➡ Communication overhead is minimal - less than 10%
      • FC storage latencies are in the few hundreds of microseconds
      • NFS-mounted storage adds less than 1 ms of latency
      • The IP stack is heavier on CPU => impact of the OS CPU scheduler
    • Flash read latency is an order of magnitude shorter
      ➡ Suddenly an InfiniBand SAN becomes a necessity!
      • Microsecond latencies
42. Exadata: Flash + InfiniBand = Very Low Latency?
    • Let's check some Exadata 10046 traces...

    Matched event names: cell single block physical read
    Options:
      group =
      name  = cell single block physical read
      where = 1

    RANGE {min <= e < max}    DURATION             CALLS  MEAN
    ------------------------  -------------------  -----  ---------
    0.000000    0.000001      0.000000    0.0%         0
    0.000001    0.000010      0.000000    0.0%         0
    0.000010    0.000100      0.000000    0.0%         0
    0.000100    0.001000      0.191839   95.5%       310   0.000619
    0.001000    0.010000      0.008983    4.5%         3   0.002994
    0.010000    0.100000      0.000000    0.0%         0
43. Exadata: Flash + InfiniBand = Very Low Latency?

    Device:  rrqm/s  wrqm/s     r/s   w/s    rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
    sdn        0.50    0.00  188.50  0.00  1512.00   0.00     16.04      0.10   0.51   0.31   5.85
    sdo        1.50    0.00  170.50  0.00  1376.00   0.00     16.14      0.14   0.79   0.38   6.40
    sdp        2.50    0.00  157.00  0.00  1276.00   0.00     16.25      0.09   0.57   0.41   6.50
    sdq        0.50    0.00  173.50  0.00  1392.00   0.00     16.05      0.11   0.62   0.40   7.00
    sdr        0.50    0.00  166.50  0.00  1336.00   0.00     16.05      0.07   0.41   0.30   4.95
    sds        1.00    0.00  175.50  0.00  1412.00   0.00     16.09      0.08   0.43   0.32   5.60
44. Measuring for Planning: Aggregate Interval
    1. Choose a large-ish interval
    2. Analyze histograms - skewed inside the interval?
    3. If yes, reduce the interval
    4. Repeat steps 1-3 until...
       a) you either see no skew, or...
       b) the business stops caring about skew inside that interval
45. AWR Example - Reads & Writes (IOPS)
    (chart)
46. AWR Example - Throughput (MBPS)
    (chart)
47. AWR Example - Redo Generation (MBPS)
    (chart)
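    Charts like the three above could be produced from DBA_HIST_SYSMETRIC_HISTORY with a query along these lines (metric names as listed on slide 36; Redo Generated Per Sec is in bytes, so divide by 1048576 for MBPS):

      SELECT begin_time,
             metric_name,
             ROUND(value, 1) AS value
      FROM   dba_hist_sysmetric_history
      WHERE  metric_name IN ('Physical Reads Per Sec',
                             'Physical Writes Per Sec',
                             'I/O Megabytes per Second',
                             'Redo Generated Per Sec')
      ORDER  BY begin_time, metric_name;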
48. Measuring for Planning: Distinguish Different Kinds of I/O
    • Random vs sequential I/O
      • If the underlying disks are spinning media
    • Small vs large I/Os
      • Throughput is then measured in either IOPS or MBPS
    • Reads vs writes
      • Sometimes can be generalized as what % of I/Os are writes
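    One way to get this breakdown from the database (11g+) is V$IOSTAT_FUNCTION, which already splits small vs large and read vs write per database function; DBA_HIST_IOSTAT_FUNCTION keeps the history. A sketch:

      SELECT function_name,
             small_read_reqs, large_read_reqs,
             small_write_reqs, large_write_reqs,
             small_read_megabytes + large_read_megabytes   AS read_mb,
             small_write_megabytes + large_write_megabytes AS write_mb
      FROM   v$iostat_function
      ORDER  BY small_read_reqs + large_read_reqs DESC;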
49. Measuring for Planning: Business Function Granularity
    • Measure I/O at the right granularity
      • Ideally per business transaction / function
      • Practical - service, session, module/action, SQL
      • "System" I/O - LGWR, ARCH, DBWR, etc.
        • Indirect association with business transactions
    • Helps build more realistic capacity planning models
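    A sketch of the practical approach: enable aggregation for a service/module with DBMS_MONITOR and read the per-module I/O statistics back (ORDER_ENTRY and BATCH_LOAD are hypothetical names; substitute your own service and module):

      BEGIN
        DBMS_MONITOR.serv_mod_act_stat_enable(service_name => 'ORDER_ENTRY',
                                              module_name  => 'BATCH_LOAD');
      END;
      /

      SELECT module, action, stat_name, value
      FROM   v$serv_mod_act_stats
      WHERE  service_name = 'ORDER_ENTRY'
        AND  stat_name IN ('physical reads', 'physical writes', 'user I/O wait time');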
50. Capacity planning measurements from the database view alone are enough
51. Oracle Database CALIBRATE_IO

    DBMS_RESOURCE_MANAGER.CALIBRATE_IO
      (<DISKS>, <MAX_LATENCY>, iops, mbps, lat);

    • iops - max reads per second (random single block)
    • lat  - actual average single-block latency at the iops rate
    • mbps - max MB/s throughput (large reads)

    Simplistic; read-only; needs a database; outputs max only; requires ASYNC I/O
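    A minimal sketch of running it from SQL*Plus (the disk count and latency limit below are placeholders; the procedure requires asynchronous I/O to be enabled and timed_statistics = true):

      SET SERVEROUTPUT ON
      DECLARE
        l_max_iops PLS_INTEGER;
        l_max_mbps PLS_INTEGER;
        l_latency  PLS_INTEGER;
      BEGIN
        DBMS_RESOURCE_MANAGER.calibrate_io(
          num_physical_disks => 56,    -- placeholder: physical spindles behind the database
          max_latency        => 20,    -- placeholder: acceptable single-block latency, ms
          max_iops           => l_max_iops,
          max_mbps           => l_max_mbps,
          actual_latency     => l_latency);
        DBMS_OUTPUT.put_line('iops=' || l_max_iops ||
                             ' mbps=' || l_max_mbps ||
                             ' latency=' || l_latency);
      END;
      /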
52. ORION - ORacle I/O Numbers
    • Free tool from Oracle simulating database-like I/Os
      • No database required
      • Same I/O libraries / code path
      • Still requires ASYNC I/O
    • Very flexible
      • Large vs small I/Os; flexible sizes; mixed
      • Random vs sequential I/O patterns; mixed
      • Configurable write I/O %
      • Can simulate ASM striping layout
53. ORION Example 1: Scalability Anomaly
    HP blades, HP Virtual Connect Flex10, big NetApp box, 100 disks
54. ORION Example 1: Impact of Large IOs
    HP blades, HP Virtual Connect Flex10, big NetApp box, 100 disks
55. ORION Example 1: Write IO Impact
    HP blades, HP Virtual Connect Flex10, big NetApp box, 100 disks
56. ORION Example 2: Initial Run - Failed Expectations
    NetApp NAS, 1 Gbit Ethernet, 42 disks
    (charts: IOPS and latency in ms vs. number of outstanding I/Os, for read-only and read-write runs)
57. ORION Example 2: Tune-Up Results
    Switched from Intel to Broadcom NICs
    (charts: IOPS and latency in ms vs. number of outstanding I/Os after the change)
58. ORION Example 3: RAID5
59. ORION Example 3: RAID10
60. Presenting measurements: Visualization is the Key
61. Q&A
    Email me - gorbachev@pythian.com
    Read my blog - http://www.pythian.com
    Follow me on Twitter - @AlexGorbachev
    Join the Pythian fan club on Facebook & LinkedIn