NFS Tuning for Oracle, Kyle Hailey, http://dboptimizer.com, April 2013
Intro Who am I    why NFS interests me DAS / NAS / SAN    Throughput    Latency NFS configuration issues*    Networ...
Who is Kyle Hailey 1990 Oracle     90 support     92 Ported v6     93 France     95 Benchmarking     98 ST Real Worl...
Who is Kyle Hailey 1990 Oracle       90 support       92 Ported v6       93 France       95 Benchmarking       98 ST...
Who is Kyle Hailey 1990 Oracle       90 support       92 Ported v6       93 France       95 Benchmarking       98 ST...
Typical ArchitectureProduction      Development            QA           UAT  Instanc              Instanc       Instanc   ...
Database VirtualizationSource                      ClonesProduction                 Development      QA         UAT  Insta...
Which to use?
DAS is out of the picture
Fibre Channel
NFS - available everywhere
NFS is attractive but is it fast enough?
DAS vs NAS vs SAN     attach   Agile Cheap   maintenanc spee                            e          dDA   SCSI     no    ye...
speedEthernet• 100Mb 1994• 1GbE - 1998• 10GbE – 2003• 40GbE – 2012• 100GE –est. 2013Fibre Channel• 1G 1998• 2G 2003• 4G – ...
Ethernet vs Fibre Channel
Throughput vs Latency
Throughput8 bits = 1 Byte 1GbE        ~= 100 MB/sec    30-60MB/sec typical, single threaded, mtu 1500    90-115MB clean...
Throughput : netio   Server                                  Client                                                       ...
Routers: traceroute$ traceroute 172.16.100.1441 172.16.100.144 (172.16.100.144) 0.226 ms 0.171 ms 0.123 ms$ traceroute 101...
Wire Speed – where is the hold up?                   ms      us   ns           0.000 000 000                              ...
4G FC vs 10GbEWhy would FC be faster?8K block transfer times       8GB FC            = 10us       10G Ethernet = 8us
More stack morelatency                    Oracle                     NFS                     TCP                      NIC ...
Oracle and SUN benchmark200us overhead more for NFS • 350us without Jumbo frames • 1ms because of network topology        ...
8K block NFS latency overhead   8k wire transfer latency    1 GbE -> 80 us    10 GbE -> 8 us   Upgrading 1GbE to 10GbE  ...
NFS why the bad reputation? Given 2 % overhead why the reputation? Historically slower Bad Infrastructure  1. Network t...
Performance issues1. Network   a) Topology   b) NICs and mismatches   c) Load2. TCP   a) MTU “Jumbo Frames”   b) window   ...
1. a) Network Topology Hubs Routers Switches
HUBs - no HUBs allowed  Layer Name            Level       Device  7      Application  6      Presentation  5      Session ...
Multiple Switches – generally fast Two types of Switches    Store and Forward        1GbE 50-70us        10GbE 5-35us ...
Routers - as few as possible Routers can add 300-500us latency If NFS latency is 350us (typical non-tuned) system Then ...
Every router adds~ 0.5 ms6 routers = 6 * .5 =                        103ms                         8     Network Overhead ...
1. b) NICs and Hardware mismatchNICs    10GbE    1GbE      Intel (problems?)      Broadcomm (ok?)Hardware Mismatch: Sp...
Hardware mismatch  Full Duplex - two way simultaneous GOOD  Half Duplex - one way but both lines BAD
1. c) Busy Network Traffic can congest network    Caused drop packets and retransmissions    Out of order packets    C...
Busy Network Monitoring Visibility difficult from any one machine    Client    Server    Switch(es)$ netstat -s -P tcp...
Busy Network Testing Netio is available here:        http://www.ars.de/ars/ars.nsf/docs/netioOn Server box netio –s –tOn T...
2. TCP Configurationa) MTU (Jumbo Frames)b) TCP windowc) TCP congestion window
2. a) MTU 9000 : Jumbo Frames MTU – maximum Transfer Unit   Typically 1500   Can be set 9000   All components have to ...
Jumbo Frames : MTU 90008K block transfer  Test 1                       Test 2Default MTU 1500               Now with MTU 9...
2. b) TCP Socket Buffer Set max    Socket Buffer    TCP Window If maximum is reached, packets are dropped.            ...
TCP window sizes max data send/receive before acknowledgement   Subset of the TCP socket sizesCalculating    = latency *...
2. c) Congestion windowdelta      bytes        bytes     unack     cong          sendus         sent         recvd     byt...
3. NFS mount options      a) Forcedirectio – be careful      b) rsize , wsize - increase      c) Actimeo=0, noac – don’t u...
3 a) Forcedirectio Skip UNIX cache read directly into SGA Controlled by init.ora    Filesystemio_options=SETALL (or di...
Direct I/OExample query   77951 physical reads , 2nd execution   (ie when data should already be cached) 5 secs => no dir...
Oracle data block cache (SGA)   Oracle data block cache (SGA)   Unix file system Cache          Unix file system Cache    ...
Direct I/O Advantages    Faster reads from disk    Reduce CPU    Reduce memory contention    Faster access to data al...
Oracle data block         Oracle data block    cache (SGA)               cache (SGA)                                      ...
SGA Buffer               UNIX Cache Hits                                    Cache                                        H...
3. b)ACTIMEO=0 , NOAC   Disable client side file attribute cache   Increases NFS calls   increases latency   Avoid on ...
3. c)rsize/wsize NFS transfer buffer size Oracle says use 32K Platforms support higher values and can significantly  im...
Max rsize , wsize 1M    Sun Solaris Sparc    HP/UX    Linux 64K    AIX    Patch to increase to 512         http://...
NFS Overhead Physical vs Cached IOStorage(SAN)            9            8 Cache      7            6            5           ...
NFS Overhead Physical vs Cached IO 100 us + 6.0 ms 100 us + 0.1 ms -> 2x slower SAN cache is expensive -> use it for write...
Advanced LINUX    Rpc queues    Iostat.py Solaris    Client max read size 32k , boost it 1M    NFS server default th...
Issues: LINUX rpc queue  On LINUX, in /etc/sysctl.conf modify      sunrpc.tcp_slot_table_entries = 128  then do      sysct...
Linux tools: iostat.py$ ./iostat.py -1172.16.100.143:/vdb17 mounted on /mnt/external: op/s rpc bklog 4.20    0.00read:   o...
Solaris Client max read size nfs3_bsize    Defaults to 32K    Set it to 1M      mdb -kw      > nfs3_bsize/D        nfs3...
Issues: Solaris NFS Server threadssharectl get -p servers nfssharectl set -p servers=512 nfssvcadm refresh nfs/server
TcpdumpAnalysis     LINUX     Solaris          ms   msOracl                     Oraclee      58       47                  ...
Oraclesnoop / tcpdump    TCP                  Network        snoop                   TCP                   NFS
Wireshark : analyze TCP dumps  yum install wireshark  wireshark + perl     find common NFS requests         NFS client...
tcpdump -s 512 host server_machine -w client.cap                                                   Oraclesnoop / tcpdump  ...
parsetcp.pl nfs_server.cap nfs_client.cap  Parsing nfs server trace: nfs_server.cap  type    avg ms count    READ : 44.60,...
parsetcp.pl server.cap client.cap   Parsing NFS server trace: server.cap   type    avg ms count     READ : 1.17, 9042   Pa...
Linux   Solaris   tool     source                                               “db file sequential                       ...
netperf.shmss: 1448 local_recv_size (beg,end): 128000 128000 local_send_size (beg,end):  49152   49152remote_recv_size (be...
Summary Test    Traceroute : routers and latency    Netio : throughput NFS close to FC , if Network topology clean   ...
Conclusion: Give NFS some love                       NFS         gigabit switch can be anywhere from         10 to 50 time...
dtraceList the names of traceable probes:dtrace –ln provider:module:function:name• -l = list instead of enable probes     ...
http://cvs.opensolaris.org/source/
Dtracetcp:::send, tcp:::receive{   delta= timestamp-walltime;    walltime=timestamp;    printf("%6d %6d %6d %8d  %8s %8d %...
DelayedAcknowledgement
Solaris   ndd -set /dev/tcp       tcp_max_buf 8388608   ndd -set /dev/tcp       tcp_recv_hiwat 4194304   ndd -set /dev/...
Direct I/O ChallengesDatabase Cache usage over 24 hours           DB1                DB2    DB3           Europe          ...
Slide notes:
  • http://www.zdnetasia.com/an-introduction-to-enterprise-data-storage-62010137.htm : A 16-port gigabit switch can be anywhere from 10 to 50 times cheaper than an FC switch and is far more familiar to a network engineer. Another benefit of iSCSI is that because it uses TCP/IP, it can be routed over different subnets, which means it can be used over a wide area network for data mirroring and disaster recovery.
  • http://www.unifiedcomputingblog.com/?p=108 : Transfer-time arithmetic: 4 Gb/s / 8 = 500 MB/s = 500 KB/ms (50 KB per 100 us), so an 8K data block (call it 10K on the wire) takes ~20 us; 10 Gb/s / 8 = 1250 MB/s = 1250 KB/ms (125 KB per 100 us), so the same 10K takes ~8 us. 1, 2, 4, and 8 Gb Fibre Channel all use 8b/10b encoding: 8 bits of data get encoded into 10 transmitted bits, with the extra two bits used for data integrity. So how much of an 8 Gb link actually carries user data? FC link speeds are somewhat of an anomaly in that they are faster than the stated rate: original 1 Gb FC is really 1.0625 Gb/s, and each generation has kept that base and multiplied it, so 8 Gb FC is 8 x 1.0625 = 8.5 Gb/s raw, and 8.5 x 0.80 = 6.8 Gb/s of usable bandwidth on an 8 Gb FC link. So 8 Gb FC = 6.8 Gb usable (850 MB/s), while 10 Gb Ethernet = 9.7 Gb usable (1212 MB/s). With FCoE you add about 4.3% overhead over a typical FC frame: a full FC frame is 2148 bytes and a maximum-sized FCoE frame is 2240 bytes, so the FCoE hit is significantly less than the encoding loss in 1/2/4/8G FC. FCoE/Ethernet headers take a 2148-byte full FC frame and encapsulate it into a 2188-byte Ethernet frame, a little under 2% overhead. So take 10 Gb/s, knock it down for 64/66b encoding (10 x 0.9696 = 9.69 Gb/s), then take off another 2% for Ethernet headers (9.69 x 0.98), and you are at about 9.49 Gb/s usable.
  • http://virtualgeek.typepad.com/virtual_geek/2009/06/why-fcoe-why-not-just-nas-and-iscsi.html : NFS and iSCSI are great, but there is no getting away from the fact that they depend on TCP retransmission mechanics (and in the case of NFS over UDP, on even higher layers of the protocol stack, though UDP is not supported in VMware environments today). Because of the intrinsic model of the protocol stack, the higher you go, the longer the latencies of various operations. One example (and it is only one): loss of connection or failover always takes seconds, and normally many tens of seconds (assuming the target fails over instantly, which is not the case for most NAS devices). Doing it in a shorter timeframe would be bad, because the target is an IP address, and it is normal for an IP address to be unreachable for seconds.

    1. 1. NFS Tuning for Oracle - Kyle Hailey, http://dboptimizer.com, April 2013
    2. 2. Intro: Who am I, why NFS interests me. DAS / NAS / SAN: throughput, latency. NFS configuration issues*: network topology, TCP config, NFS mount options. (* = for non-RAC, non-dNFS)
    3. 3. Who is Kyle Hailey 1990 Oracle  90 support  92 Ported v6  93 France  95 Benchmarking  98 ST Real World Performance 2000 Dot.Com 2001 Quest 2002 Oracle OEM 10g Success! First successful OEM design
    4. 4. Who is Kyle Hailey 1990 Oracle  90 support  92 Ported v6  93 France  95 Benchmarking  98 ST Real World Performance 2000 Dot.Com 2001 Quest 2002 Oracle OEM 10g 2005 Embarcadero  DB Optimizer
    5. 5. Who is Kyle Hailey 1990 Oracle  90 support  92 Ported v6  93 France  95 Benchmarking  98 ST Real World Performance 2000 Dot.Com 2001 Quest 2002 Oracle OEM 10g 2005 Embarcadero  DB Optimizer  Delphix. When not being a geek: have a little 4-year-old boy who takes up all my time
    6. 6. Typical Architecture: Production, Development, QA, and UAT each run their own instance, database, and file system.
    7. 7. Database Virtualization: the production source (instance, database, file system on Fibre Channel) is cloned into virtual databases served over NFS to the Development, QA, and UAT instances.
    8. 8. Which to use?
    9. 9. DAS is out of the picture
    10. 10. Fibre Channel
    11. 11. NFS - available everywhere
    12. 12. NFS is attractive but is it fast enough?
    13. 13. DAS vs NAS vs SAN
           attach           agile   cheap   maintenance   speed
      DAS  SCSI             no      yes     difficult     fast
      NAS  NFS (Ethernet)   yes     yes     easy          ??
      SAN  Fibre Channel    yes     no      difficult     fast
    14. 14. Speed. Ethernet: 100Mb 1994, 1GbE 1998, 10GbE 2003, 40GbE 2012, 100GbE est. 2013. Fibre Channel: 1G 1998, 2G 2003, 4G 2005, 8G 2008, 16G 2011.
    15. 15. Ethernet vs Fibre Channel
    16. 16. Throughput vs Latency
    17. 17. Throughput. 8 bits = 1 byte. 1GbE ~= 100 MB/sec (30-60 MB/sec typical, single threaded, MTU 1500; 90-115 MB/sec with a clean topology; 125 MB/sec max). 10GbE ~= 1000 MB/sec. Throughput ~= size of the pipe.
    18. 18. Throughput: netio. Throughput ~= size of the pipe. Server: netio -s -t ; Client: netio -t server
      Packet size 1k bytes: 51.30 MByte/s Tx, 6.17 MByte/s Rx.
      Packet size 2k bytes: 100.10 MByte/s Tx, 12.29 MByte/s Rx.
      Packet size 4k bytes: 96.48 MByte/s Tx, 18.75 MByte/s Rx.
      Packet size 8k bytes: 114.38 MByte/s Tx, 30.41 MByte/s Rx.
      Packet size 16k bytes: 112.20 MByte/s Tx, 19.46 MByte/s Rx.
      Packet size 32k bytes: 114.53 MByte/s Tx, 35.11 MByte/s Rx.
    19. 19. Routers: traceroute$ traceroute 172.16.100.1441 172.16.100.144 (172.16.100.144) 0.226 ms 0.171 ms 0.123 ms$ traceroute 101.88.123.1951 101.88.229.181 (101.88.229.181) 0.761 ms 0.579 ms 0.493 ms2 101.88.255.169 (101.88.255.169) 0.310 ms 0.286 ms 0.279 ms3 101.88.218.166 (101.88.218.166) 0.347 ms 0.300 ms 0.986 ms4 101.88.123.195 (101.88.123.195) 1.704 ms 1.972 ms 1.263 ms Tries each step 3 times Last line should be the total latency latency ~= Length of pipe
    20. 20. Wire Speed - where is the hold up? Light = 0.3 m/ns; wire ~= 0.2 m/ns (5 ns/m, ~5 us/km). Across a data center (10 m) = 50 ns; LA to SF = 3 ms; LA to London = 30 ms. Physical 8K random disk read ~= 7 ms; physical small sequential write ~= 1 ms.
    21. 21. 4G FC vs 10GbE. Why would FC be faster? 8K block transfer times: 8Gb FC = 10us, 10GbE = 8us.
    22. 22. More stack, more latency. The NFS path runs Oracle -> NFS -> TCP -> NIC -> network -> NIC -> TCP -> NFS -> file system -> cache/spindle, a longer stack than FC.
    23. 23. Oracle and Sun benchmark: 200us more overhead for NFS (8K blocks, 1GbE with jumbo frames, Solaris 9, Oracle 9.2); 350us without jumbo frames; 1ms because of network topology. Database Performance with NAS: Optimizing Oracle on NFS, revised May 2009, TR-3322, http://media.netapp.com/documents/tr-3322.pdf
    24. 24. 8K block NFS latency overhead. 8K wire transfer latency: 1 GbE -> 80 us; 10 GbE -> 8 us. Upgrading 1GbE to 10GbE: 80 us - 8 us = 72 us faster per 8K transfer. 8K NFS on 1GbE was 200us, so on 10GbE it should be 200 - 72 = 128us. (0.128 ms / 7 ms) * 100 = 2% latency increase over DAS.
    25. 25. NFS why the bad reputation? Given 2 % overhead why the reputation? Historically slower Bad Infrastructure 1. Network topology and load 2. NFS mount options 3. TCP configuration Compounding issues  Oracle configuration  I/O subsystem response
    26. 26. Performance issues1. Network a) Topology b) NICs and mismatches c) Load2. TCP a) MTU “Jumbo Frames” b) window c) congestion window3. NFS mount options
    27. 27. 1. a) Network Topology Hubs Routers Switches
    28. 28. HUBs - no HUBs allowed
      Layer   Name           Level      Device
      7       Application
      6       Presentation
      5       Session
      4       Transport
      3       Network        IP addr    Routers
      2       Datalink       MAC addr   Switches
      1       Physical       Wire       Hubs
      Hubs: broadcast repeaters; risk collisions; bandwidth contention
    29. 29. Multiple switches - generally fast. Two types of switches: store and forward (1GbE 50-70us, 10GbE 5-35us) and cut-through (10GbE 0.3-0.5us).
    30. 30. Routers - as few as possible. Routers can add 300-500us of latency. If NFS latency is 350us (a typical non-tuned system), then each router multiplies the latency 2x, 3x, 4x, etc.
    31. 31. Every router adds ~0.5 ms; 6 routers = 6 * 0.5 = 3 ms. [Chart: network overhead added to the 6.0 ms average physical NFS I/O; fast topology 0.2 ms vs slow topology 3 ms.]
    32. 32. 1. b) NICs and hardware mismatch. NICs: 10GbE or 1GbE; Intel (problems?), Broadcom (ok?). Hardware mismatch: speeds and duplex are often negotiated. Example (Linux):
      $ ethtool eth0
      Settings for eth0:
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Half
      Check that the values are as expected.
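    If the negotiated values are wrong, a minimal sketch of how to re-negotiate or pin them with ethtool (the interface name is a placeholder; forcing speed/duplex only works if the switch port is configured to match):
      ethtool -r eth0                                       # restart autonegotiation
      ethtool -s eth0 speed 1000 duplex full autoneg off    # or pin the speed and duplex explicitly
      ethtool eth0 | egrep 'Speed|Duplex'                   # confirm the result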
    33. 33. Hardware mismatch. Full duplex (two-way simultaneous): GOOD. Half duplex (one direction at a time): BAD.
    34. 34. 1. c) Busy Network. Traffic can congest the network, causing dropped packets and retransmissions, out-of-order packets, and collisions on hubs (probably not with switches). Analysis: netstat, netio.
    35. 35. Busy Network Monitoring Visibility difficult from any one machine  Client  Server  Switch(es)$ netstat -s -P tcp 1TCP tcpRtoAlgorithm = 4 tcpRtoMin = 400 tcpRetransSegs = 5986 tcpRetransBytes = 8268005 tcpOutAck =49277329 tcpOutAckDelayed = 473798 tcpInDupAck = 357980 tcpInAckUnsent = 0 tcpInUnorderSegs =10048089 tcpInUnorderBytes =16611525 tcpInDupSegs = 62673 tcpInDupBytes =87945913 tcpInPartDupSegs = 15 tcpInPartDupBytes = 724 tcpRttUpdate = 4857114 tcpTimRetrans = 1191 tcpTimRetransDrop = 6 tcpTimKeepalive = 248
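    The -s -P tcp syntax above is Solaris-specific. A rough Linux-side equivalent, assuming the stock net-tools netstat (exact counter names vary by kernel version):
      netstat -s | grep -i retrans     # retransmission counters climbing under load point to congestion or loss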
    36. 36. Busy Network Testing. Netio is available here: http://www.ars.de/ars/ars.nsf/docs/netio
      On the server box: netio -s -t
      On the target box: netio -t server_machine
      NETIO - Network Throughput Benchmark, Version 1.31 (C) 1997-2010 Kai Uwe Rommel
      TCP server listening. TCP connection established ...
      Receiving from client, packet size 32k ... 104.37 MByte/s
      Sending to client, packet size 32k ... 109.27 MByte/s
      Done.
    37. 37. 2. TCP Configurationa) MTU (Jumbo Frames)b) TCP windowc) TCP congestion window
    38. 38. 2. a) MTU 9000: Jumbo Frames. MTU = Maximum Transfer Unit. Typically 1500; can be set to 9000. All components have to support it, otherwise errors and/or hangs.
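    A quick way to confirm jumbo frames actually work end to end before depending on them; a sketch for a Linux client (interface and server names are placeholders):
      ip link set dev eth1 mtu 9000        # the switch ports and the NFS server must be raised as well
      ping -M do -s 8972 -c 3 nfs_server   # 8972 = 9000 - 20 (IP) - 8 (ICMP); failures mean part of the path is still at MTU 1500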
    39. 39. Jumbo Frames: MTU 9000. 8K block transfer. Test 1, default MTU 1500: after the 164-byte request, the 8K read returns as five 1448-byte frames plus a 952-byte frame, roughly 560 us in total. Test 2, with MTU 9000: the same read returns as a single 8324-byte frame in about 273 us. Change the MTU with: # ifconfig eth1 mtu 9000 up. Warning: MTU 9000 can hang if any of the hardware in the connection is configured only for MTU 1500.
    40. 40. 2. b) TCP Socket Buffer. Set the maximums for the socket buffer and the TCP window; if the maximum is reached, packets are dropped. LINUX:
      Socket buffer sizes:
        sysctl -w net.core.wmem_max=8388608
        sysctl -w net.core.rmem_max=8388608
      TCP window sizes:
        sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
        sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608"
    41. 41. TCP window sizes: the maximum data sent/received before an acknowledgement (a subset of the TCP socket sizes). Calculating: window = latency * throughput. Example: 1 ms latency on a 1Gb NIC = 1 Gb/sec * 0.001 s = 1 Mb; 1 Mb * 1 byte / 8 bits = 125 KB.
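    The same bandwidth-delay arithmetic as a quick shell check (assumes bc is installed; substitute your own link speed and measured round-trip latency):
      echo "10^9  / 8 * 0.001" | bc -l    # 1 Gb/s at 1 ms  -> 125000 bytes, ~125 KB
      echo "10^10 / 8 * 0.001" | bc -l    # 10 Gb/s at 1 ms -> 1250000 bytes, ~1.25 MB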
    42. 42. 2. c) Congestion window
      delta us   bytes sent   bytes recvd   unack bytes   cong window   send window
      31         1448                       139760        144800        195200
      33         1448                       139760        144800        195200
      29         1448                       144104        146248        195200
      31                      0             145552        144800        195200
      41         1448                       145552        147696        195200
      30                      0             147000        144800        195200
      22         1448                       147000         76744        195200
      Data collected with DTrace; note that the congestion window size is drastically lowered. Similar data could be collected with snoop (load the data into Wireshark).
    43. 43. 3. NFS mount options a) Forcedirectio – be careful b) rsize , wsize - increase c) Actimeo=0, noac – don’t use (unless on RAC)Sun Solaris rw,bg,hard,rsize=32768,wsize=32768,vers=3,[forcedirectio or llock],nointr,proto=tcp,suidAIX rw,bg,hard,rsize=32768,wsize=32768,vers=3,cio,intr,timeo=600,proto=tcpHPUX rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,proto=tcp, suid, forcedirectioLinux rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,tcp,actimeo=0
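    For illustration, a hypothetical Linux /etc/fstab entry built from the options above (server name and paths are placeholders; actimeo=0 is deliberately omitted for single-instance databases, as discussed a couple of slides below):
      nfs_server:/export/oradata  /u02/oradata  nfs  rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,tcp  0 0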
    44. 44. 3 a) Forcedirectio. Skip the UNIX cache and read directly into the SGA. Controlled by init.ora: filesystemio_options=SETALL (or DIRECTIO), except on HP-UX. Sun Solaris: the forcedirectio mount option forces direct I/O but is not required, since filesystemio_options will set direct I/O without the mount option; the same applies on AIX and Linux. HP-UX: forcedirectio is the only way to set direct I/O; filesystemio_options has no effect.
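    A sketch of setting and checking the parameter from the database side (standard Oracle syntax; cross-check against the platform notes above):
      $ sqlplus / as sysdba
      SQL> show parameter filesystemio_options
      SQL> alter system set filesystemio_options='SETALL' scope=spfile;
      SQL> -- the new value takes effect after the instance is restarted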
    45. 45. Direct I/O. Example query: 77951 physical reads on the 2nd execution (i.e. when the data should already be cached). 5 secs => no direct I/O; 60 secs => direct I/O. Why use direct I/O?
    46. 46. By default a read is buffered in both the Oracle data block cache (SGA) and the Unix file system cache; with direct I/O the Unix file system cache can't be used.
    47. 47. Direct I/O Advantages  Faster reads from disk  Reduce CPU  Reduce memory contention  Faster access to data already in memory, in SGA Disadvantages  Less Flexible  Risk of paging , memory pressure  Can’t share memory between multiple databases http://blogs.oracle.com/glennf/entry/where_do_you_cache_oracle
    48. 48. [Diagram: three configurations of the Oracle data block cache (SGA) and Unix file system cache, with run times of 5 seconds, 60 seconds, and 2 seconds.]
    49. 49. Cache hierarchy: a miss in the SGA buffer cache falls through to the Unix file system cache (< 0.2 ms); a miss there falls through to the NAS/SAN storage cache (< 0.5 ms); a miss there becomes a disk read (~6 ms).
    50. 50. 3. b) ACTIMEO=0, NOAC. Disables the client-side file attribute cache, which increases NFS calls and increases latency. Avoid on single-instance Oracle. One Metalink note says it is required on LINUX (and calls it "actime"); another Metalink note says it should be taken off. => It should be taken off.
    51. 51. 3. c) rsize/wsize: the NFS transfer buffer size. Oracle says use 32K, but platforms support higher values that can significantly improve throughput. Sun Solaris: rsize=32768,wsize=32768, max is 1M. AIX: rsize=32768,wsize=32768, max is 64K. HPUX: rsize=32768,wsize=32768, max is 1M. Linux: rsize=32768,wsize=32768, max is 1M. On full table scans, using 1M has halved the response time compared to 32K. db_file_multiblock_read_count has to be large enough to take advantage of the size.
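    A sketch of a Linux mount bumped to 1 MB transfers together with the matching multiblock read setting (names and paths are placeholders; check the platform maximums on the next slide first):
      mount -t nfs -o rw,bg,hard,rsize=1048576,wsize=1048576,vers=3,timeo=600,tcp nfs_server:/export/oradata /u02/oradata
      # with an 8 KB block size, 1 MB reads need db_file_multiblock_read_count >= 1048576 / 8192 = 128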
    52. 52. Max rsize , wsize 1M  Sun Solaris Sparc  HP/UX  Linux 64K  AIX  Patch to increase to 512  http://www-01.ibm.com/support/docview.wss?uid=isg1IV24594  http://www-01.ibm.com/support/docview.wss?uid=isg1IV24688  http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=6100-08-00-1241  http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=7100-02-00-1241
    53. 53. NFS Overhead: Physical vs Cached I/O. [Chart: SAN storage latency split into cache vs physical reads, with the NFS + network overhead added, compared for physical vs cached I/O across a single switch, multiple switches, and routers.]
    54. 54. NFS Overhead: Physical vs Cached I/O. 100 us on top of a 6.0 ms physical read is noise; 100 us on top of a 0.1 ms cached read -> 2x slower. SAN cache is expensive -> use it for write cache. Target (client) cache is cheaper -> add more. Takeaway: move cache to the client boxes (cheaper, quicker) and save the SAN cache for writeback.
    55. 55. Advanced LINUX  Rpc queues  Iostat.py Solaris  Client max read size 32k , boost it 1M  NFS server default threads 16 Tcpdump
    56. 56. Issues: LINUX rpc queue On LINUX, in /etc/sysctl.conf modify sunrpc.tcp_slot_table_entries = 128 then do sysctl -p then check the setting using sysctl -A | grep sunrpc NFS partitions will have to be unmounted and remounted Not persistent across reboot
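    One commonly used way to make the setting survive reboots is to pass it as a module option so it is applied whenever sunrpc loads (the file name is arbitrary; verify that /sys/module/sunrpc/parameters/tcp_slot_table_entries exists on your kernel first):
      echo "options sunrpc tcp_slot_table_entries=128" > /etc/modprobe.d/sunrpc.conf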
    57. 57. Linux tools: iostat.py$ ./iostat.py -1172.16.100.143:/vdb17 mounted on /mnt/external: op/s rpc bklog 4.20 0.00read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000
    58. 58. Solaris Client max read size: nfs3_bsize. Defaults to 32K; set it to 1M.
      mdb -kw
      > nfs3_bsize/D
      nfs3_bsize: 32768
      > nfs3_bsize/W 100000
      nfs3_bsize: 0xd = 0x100000
      >
    59. 59. Issues: Solaris NFS server threads
      sharectl get -p servers nfs
      sharectl set -p servers=512 nfs
      svcadm refresh nfs/server
    60. 60. Tcpdump Analysis (ms)
                             LINUX   Solaris
      Oracle                  58      47
      NFS / TCP (client)      ?       ?
      Network                 ?       ?
      TCP / NFS (server)      ?       ?
      NFS server              .1      2
      (backed by cache/SAN)
    61. 61. snoop / tcpdump: capture on both sides of the wire, the client stack (Oracle -> NFS -> TCP, captured with tcpdump) and the server stack (TCP -> NFS, captured with snoop).
    62. 62. Wireshark : analyze TCP dumps  yum install wireshark  wireshark + perl  find common NFS requests  NFS client  NFS server  display times for  NFS Client  NFS Server  Deltahttps://github.com/khailey/tcpdump
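    For a quick look before running the perl scripts, tshark (Wireshark's command-line tool) can filter the captures; a sketch assuming the capture files named on the next slide and a tshark version that accepts -Y for display filters:
      tshark -r client.cap -Y nfs | head                        # NFS calls and replies seen by the client
      tshark -r client.cap -Y nfs -T fields -e rpc.xid | head   # RPC transaction IDs, the natural key for matching calls to replies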
    63. 63. snoop / tcpdump. On the client (Linux): tcpdump -s 512 host server_machine -w client.cap. On the server (Solaris): snoop -q -d aggr0 -s 512 -o server.cap client_machine
    64. 64. parsetcp.pl nfs_server.cap nfs_client.cap Parsing nfs server trace: nfs_server.cap type avg ms count READ : 44.60, 7731 Parsing client trace: nfs_client.cap type avg ms count READ : 46.54, 15282 ==================== MATCHED DATA ============ READ type avg ms server : 48.39, client : 49.42, diff : 1.03, Processed 9647 packets (Matched: 5624 Missed: 4023)
    65. 65. parsetcp.pl server.cap client.cap Parsing NFS server trace: server.cap type avg ms count READ : 1.17, 9042 Parsing client trace: client.cap type avg ms count READ : 1.49, 21984 ==================== MATCHED DATA ============ READ type avg ms count server : 1.03 client : 1.49 diff : 0.46
    66. 66.
                    Linux    Solaris   Tool          Source
      Oracle        58 ms    47 ms     oramon.sh     "db file sequential read" wait (basically a timing of "pread" for 8K random reads)
      TCP (client)  1.5 ms   45 ms     tcpparse.sh   tcpdump on LINUX, snoop on Solaris
      Network       0.5 ms   1 ms                    delta
      TCP (server)  1 ms     44 ms     tcpparse.sh   snoop
      NFS server    .1 ms    2 ms      DTrace        nfs:::op-read-start / op-read-done
    67. 67. netperf.shmss: 1448 local_recv_size (beg,end): 128000 128000 local_send_size (beg,end): 49152 49152remote_recv_size (beg,end): 87380 3920256remote_send_size (beg,end): 16384 16384mn_ms av_ms max_ms s_KB r_KB r_MB/s s_MB/s <100u <500u <1ms <5ms <10ms <5 .08 .12 10.91 15.69 83.92 .33 .38 .01 .01 .12 .54 .10 .16 12.25 8 48.78 99.10 .30 .82 .07 .08 .15 .57 .10 .14 5.01 8 54.78 99.04 .88 .96 .15 .60 .22 .34 63.71 128 367.11 97.50 1.57 2.42 .06 .07 .01 .35 .93 .43 .60 16.48 128 207.71 84.86 11.75 15.04 .05 .10 .90 1.42 .99 1.30 412.42 1024 767.03 .05 99.90 .03 .08 .03 1.30 2.251.77 2.28 15.43 1024 439.20 99.27 .64 .73 2.65 5.35
    68. 68. Summary. Test: traceroute (routers and latency), netio (throughput). NFS is close to FC if the network topology is clean. Mount: rsize/wsize at the maximum; avoid actimeo=0 and noac; use noatime; jumbo frames. Drawbacks: requires a clean topology; NFS failover can take tens of seconds (the Oracle 11g dNFS client handles mount issues transparently).
    69. 69. Conclusion: Give NFS some love. A gigabit switch can be anywhere from 10 to 50 times cheaper than an FC switch. http://github.com/khailey
    70. 70. dtrace. List the names of traceable probes: dtrace -ln provider:module:function:name
      -l = list instead of enable probes
      -n = specify a probe name to trace or list
      -v = set verbose mode
      Example: dtrace -ln tcp:::send
      $ dtrace -lvn tcp:::receive
      5473   tcp   ip   tcp_output   send
      Argument Types
        args[0]: pktinfo_t *
        args[1]: csinfo_t *
        args[2]: ipinfo_t *
        args[3]: tcpsinfo_t *
        args[4]: tcpinfo_t *
    71. 71. http://cvs.opensolaris.org/source/
    72. 72. Dtrace
    tcp:::send
    {
        delta = timestamp - walltime;
        walltime = timestamp;
        printf("%6d %6d %6d %8d  %8s %8d %8d %8d %8d %d\n",
            args[3]->tcps_snxt - args[3]->tcps_suna,
            args[3]->tcps_rnxt - args[3]->tcps_rack,
            delta/1000,
            args[2]->ip_plength - args[4]->tcp_offset,
            "",
            args[3]->tcps_swnd,
            args[3]->tcps_rwnd,
            args[3]->tcps_cwnd,
            args[3]->tcps_retransmit);
    }
    tcp:::receive
    {
        delta = timestamp - walltime;
        walltime = timestamp;
        printf("%6d %6d %6d %8s / %-8d %8d %8d %8d %8d %d\n",
            args[3]->tcps_snxt - args[3]->tcps_suna,
            args[3]->tcps_rnxt - args[3]->tcps_rack,
            delta/1000,
            "",
            args[2]->ip_plength - args[4]->tcp_offset,
            args[3]->tcps_swnd,
            args[3]->tcps_rwnd,
            args[3]->tcps_cwnd,
            args[3]->tcps_retransmit);
    }
    73. 73. Delayed Acknowledgement
    74. 74. Solaris ndd -set /dev/tcp tcp_max_buf 8388608 ndd -set /dev/tcp tcp_recv_hiwat 4194304 ndd -set /dev/tcp tcp_xmit_hiwat 4194304 ndd -set /dev/tcp tcp_cwnd_max 8388608 mdb -kw > nfs3_bsize/D nfs3_bsize: 32768 > nfs3_bsize/W 100000 nfs3_bsize: 0xd = 0x100000 > add it to /etc/system for use on reboot set nfs:nfs3_bsize= 0x100000
    75. 75. Direct I/O Challenges. [Chart: database cache usage over 24 hours for DB1 (Europe), DB2 (US), and DB3 (Asia).]
