Getput suite

A Swift Benchmarking Tool

Mark Seger
Hewlett Packard
Cloud Services

4/19/2013 Getput Swift Performance Tools

Problem Statement
• Performance Measurements
– Consistent/standard mechanisms for controlled experiments
– Ability to easily modify test parameters
– Minimal installation, configuration and use
– Easy to compare results of multiple runs
– Easy to clean up when done
• Benchmarking – run performance tests at scale
– Repeat tests while increasing demand for resources
– Parallel tests must be coordinated: start/finish together


Getput Suite
• Multiple tools organized in a hierarchy
– getput: actual workhorse, runs tests on single client
– gpmaster: coordinates running getput on multiple clients
– gpsuite: defines suites of tests to minimize the number of switches you need to specify
– yourscript: can call gpsuite multiple times when desired


getput.py
• Uses swiftclient library
• Lots of switches, lots of different behaviors
– Standalone
• Basic: creds, cname, oname, size, num/runtime, tests, rep count
• More: processes, container type: shared/byproc/bynode, latency
details, operation logging, and still more
– Multi-node (controlled by gpmaster)
• start time, rank


gpmaster.py
• Coordinates running of getput on multiple clients
– Assures all start together and finish approx together
– Summarizes results as a single line
– Unlike getput, runs only 1 test at a time; running multiple tests is a job for gpsuite
• More required switches than getput
– Credentials file
– Rank
– Start time
– Hosts file or single client name, may need ssh key too
– And a few more…
• But rarely run by itself!
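
One way to get the "all start together" behavior is to hand every client the same absolute start time and have each one sleep until it arrives. The sketch below assumes the client clocks are synchronized (for example via NTP); the name `wait_for_start` is hypothetical and is not taken from getput or gpmaster.

```python
import time

def wait_for_start(start_epoch, now=time.time, sleep=time.sleep):
    """Block until the shared start time; return the seconds waited."""
    delay = start_epoch - now()
    if delay > 0:
        sleep(delay)
        return delay
    return 0.0  # start time already passed: begin immediately

if __name__ == "__main__":
    # Every client would receive the same start_epoch from the coordinator.
    waited = wait_for_start(time.time() + 0.2)
    print(f"waited {waited:.2f}s before starting the test")
```

Passing the clock and sleep functions as parameters keeps the helper easy to test without real delays.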

gpsuite.py
• Removes complexity of running gpmaster
• Think of macros: gpsuite –suite full
– Sets of object sizes, e.g. 1k, 10k, 100k, etc.
– Numbers of threads, e.g. 1, 2, 4, 8, etc.
• Distributes threads across multiple clients
• Some runs can take hours with a single command
• Cleans up after each run
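
A suite can be pictured as the cross-product of object sizes and thread counts, with each thread count spread over the available clients. The sketch below illustrates that idea only; the slides do not say how gpsuite actually divides threads, and `split_threads` and `expand_suite` are hypothetical names.

```python
from itertools import product

def split_threads(total_threads, clients):
    """Spread threads as evenly as possible; clients that get none are dropped."""
    base, extra = divmod(total_threads, len(clients))
    plan = {}
    for i, client in enumerate(clients):
        count = base + (1 if i < extra else 0)
        if count:
            plan[client] = count
    return plan

def expand_suite(sizes, thread_counts, clients):
    """Yield one (object size, per-client thread plan) pair per test run."""
    for size, threads in product(sizes, thread_counts):
        yield size, split_threads(threads, clients)

if __name__ == "__main__":
    for size, plan in expand_suite(["1k", "10k"], [1, 2, 4], ["c1", "c2"]):
        print(size, plan)
```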


Getput Output
Earliest versions
Inst  Start     End       Seconds  Tests  Num  MB/S  IOPS   Errs
0     13:59:20  13:59:29  8.57     put    100  0.11  11.67  0
0     13:59:29  13:59:33  4.03     get    100  0.24  24.83  0
0     13:59:33  13:59:34  1.80     del    100  0.54  55.68  0


Added latency range in later versions
Inst  Start     End       Seconds  Tests  Num  MB/S  IOPS   Latency  LatRange    Errs
0     13:59:20  13:59:29  8.57     put    100  0.11  11.67  0.085    0.02-00.22  0
0     13:59:29  13:59:33  4.03     get    100  0.24  24.83  0.040    0.04-00.05  0
0     13:59:33  13:59:34  1.80     del    100  0.54  55.68  0.018    0.01-00.05  0


Added CPU and started playing with compression in more recent versions
Inst  Start     End       Seconds  Tests  Num  MB/S  IOPS   Latency  LatRange    Errs  Procs  OSize  %CPU  Comp
0     13:59:20  13:59:29  8.57     put    100  0.11  11.67  0.085    0.02-00.22  0     1      10k    0.30  no
0     13:59:29  13:59:33  4.03     get    100  0.24  24.83  0.040    0.04-00.05  0     1      10k    0.39  no
0     13:59:33  13:59:34  1.80     del    100  0.54  55.68  0.018    0.01-00.05  0     1      10k    0.58  no
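
The MB/S and IOPS columns follow from Num, OSize and Seconds. The sketch below reproduces the put row as a worked example; it assumes binary multiples (10k = 10240 bytes, 1 MB = 2^20 bytes), which match the rounded values in the table. The helper names are hypothetical and this is not getput's own code.

```python
def parse_size(text):
    """Convert a size such as '10k' or '100m' to bytes (binary multiples assumed)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = text[-1].lower()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)

def rates(num_ops, obj_size, seconds):
    """Return (MB/s, IOPS) for num_ops operations on objects of obj_size bytes."""
    iops = num_ops / seconds
    mb_per_sec = num_ops * obj_size / seconds / 1024 ** 2
    return mb_per_sec, iops

# The put row: 100 objects of 10k in 8.57 seconds.
mbs, iops = rates(100, parse_size("10k"), 8.57)
print(f"{mbs:.2f} MB/s, {iops:.2f} IOPS")  # 0.11 MB/s, 11.67 IOPS
```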


Eventually added latency distribution histogram
Latency  LatRange    Errs  Procs  OSize  0.0   0.1  0.2  0.3  0.4  0.5
0.106    0.02-00.36  0     10     10k    527   396  67   10   0    0
0.041    0.01-00.07  0     10     10k    1000  0    0    0    0    0
0.031    0.01-00.16  0     10     10k    964   36   0    0    0    0
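
The histogram columns look like lower bounds of 0.1-second buckets. A minimal sketch of that kind of bucketing follows; the bucket width, the number of buckets, and the choice to let the last bucket absorb anything larger are assumptions, not details confirmed by the slides.

```python
def latency_histogram(latencies, width=0.1, buckets=6):
    """Count latencies per bucket; the last bucket also holds anything larger."""
    counts = [0] * buckets
    for value in latencies:
        # Values that sit exactly on a bucket edge are subject to float rounding.
        index = min(int(value / width), buckets - 1)
        counts[index] += 1
    return counts

if __name__ == "__main__":
    sample = [0.05, 0.15, 0.25, 0.9]  # illustrative values, not measured data
    print(latency_histogram(sample))
```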


Observations
• Swift scaling is excellent
– With multiple clients, performance grows close to linearly
– With a single client and multiple threads
• Smaller objects scale very well, even with lots of threads
• Larger objects hit either a CPU or a network wall!
• Both compression and encryption cost CPU
– Limits large-object bandwidth, less so with smaller ones
– Early testing: turning compression off gives up to a 2X boost for large objects
• Similar behavior when using http instead of https
– Only just started looking at changing ciphers

Recommendation: make compression, ssl and cipher choice optional in swiftclient


Look at the network during tests
This is always true for incompressible objects: upload speed ~= network bandwidth
segerm@az1-nv-compute-0001:~$ ./getput.py -cc -oo -n1 -s100m -tp --comp
Inst  Start     End       Seconds  Tests  Num  MB/S   IOPS  Latency  LatRange
0     15:52:15  15:52:20  5.85     put    1    17.10  0.17  5.800    5.80-05.80

segerm@az1-nv-compute-0001:~$ collectl
waiting for 1 second sample...
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0  1342   1078      0      0     20      2      0      4     70      56
   0   0   261    304      0      0     20      2      0      3      0       2
   1   0   580    578      0      0      0      0      0      5      0       3
   3   0  4697    780      0      0      0      0    135   2010  15956   11517
   4   0  5859   1324      0      0      0      0    138   2345  19037   13708
   4   0  5168    609      0      0     48      6    138   2354  19036   13706
   4   0  5597    993      0      0      4      1    138   2351  19053   13717
   4   0  5129    538      0      0      0      0    139   2366  19053   13716
   3   0  4579   1070      0      0      0      0    107   1817  14554   10495
   0   0   154    201      0      0     20      2      0      1      0       1
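
The comparison between the getput rate and the collectl network column can be checked with a little arithmetic. The sketch below assumes collectl is reporting per-second rates (the command shows a one-second sample) and that 1 MB = 1024 KB; the function names are hypothetical.

```python
def upload_mb_per_sec(num_objects, size_mb, seconds):
    """Application-level upload rate as seen by the benchmark."""
    return num_objects * size_mb / seconds

def kb_to_mb_per_sec(kb_per_sec):
    """Convert a KBOut sample to MB/s."""
    return kb_per_sec / 1024

put_rate = upload_mb_per_sec(1, 100, 5.85)   # one 100m object in 5.85 s
wire_rate = kb_to_mb_per_sec(19053)          # peak KBOut sample during the put
print(f"put: about {put_rate:.1f} MB/s, wire: about {wire_rate:.1f} MB/s")
```

The two figures land close together, with the wire rate slightly higher, which is what one would expect once protocol overhead and the ramp-up and ramp-down samples are taken into account.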


Compression can be your friend too
…but only for compressible objects
segerm@az1-nv-compute-0001:~$ ./getput.py -cc -oo -n5 -s100m -tp --otype s --comp
Inst  Start     End       Seconds  Tests  Num  MB/S   IOPS  Latency  LatRange
0     16:00:19  16:00:29  10.33    put    5    48.42  0.48  2.060    2.03-02.09

   0   0   223    292      0      0     56      9      0      1      0       1
   1   0   618    565      0      0      0      0     14     20      2      16
   3   0  1380    694      0      0      0      0     14    167    605     317
   4   0  1846   1194      0      0      0      0     11    165    508     304
   3   1  9799   1008      0      0     12      2    173   2949    848    2949
   4   1 11071    996      0      0      0      0    198   3377    607    3376

Look what the proxy is doing
   1   0  1512    523      0      0     16      3      5     36      8      34
   8   2  6377    892      0      0      0      0    658   6588 171130  117279
   7   2  5488   1835      0      0      8      1    519   4933 150290  103175
   6   2  8772   6113      0      0      0      0    744   8679 162089  114059

3 Obj Servers
Another reason to make compression optional!

Let’s talk about latency
• Latency metrics originally based on averages
– Like coarse monitoring, great for trends but poor for exceptions
– Soon realized more detail was needed
• Consider the following. What does it really mean?
– Is the only problem that one entry of 0.083?


On closer inspection
• The first 4 entries don’t look too bad
• Even the bottom one isn’t that horrible


Ranges shed more light
• Even though the first 4 lines have close latencies, look at
their max values
• Now we know why line 5 is so bad
• Even line 6 has a very high max


But even that’s not enough
• Min/Max doesn’t tell us how many outliers there are
• Lines 2/4 have almost 50 in the .5 bucket
• Line 5 has 6 PUTs >4 seconds
• Line 6 is all over the place


Example 1: Latency of 0.04 too high!
• When looking at 1k, 10k and 100k GETS, noticed IOPS for 10k
were lower!
– Great reason to look at more than MB/sec
• After much digging discovered this only applied to object sizes
7888 -> 22469 bytes
– This could only have been found by running sets of tests and looking
very closely at the numbers
• What’s going on here?
– We run pound on proxies to support multiple connection ports
– Proxy does fast get and passes data to pound over loopback address
– Max segsize for loopback >> network MSS
– Eventlet uses 8192 byte buffers
– Nagle algorithm: writes of more than 8192 bytes but less than about 8192+MSS hit a delayed ACK
• Does eventlet need bigger buffers? Should Nagle be turned off?
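
For reference, turning Nagle off on a TCP socket is a single socket option. The sketch below only shows that option on a plain Python socket; where it would need to be applied in the proxy, eventlet or pound is not something the slides specify, and `make_nodelay_socket` is a hypothetical helper name.

```python
import socket

def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

if __name__ == "__main__":
    s = make_nodelay_socket()
    # A non-zero value means small writes are sent without waiting for ACKs.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    s.close()
```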

Example 2: Latency 0.5
• Observed a number of these in small object PUTs
• Caused by a proxy timeout connecting to obj server
• Might be worth looking into ways to reduce the timeout and/or
avoid re-contacting a non-responsive server


Example 3: Latency 6 Secs
• These occur less frequently, but do happen
• Traced back to disk error on object server
• BUT the other 2 object servers responded in < 1sec
• Think about how many IOPS are being lost!

Might it be worth it to return after 2 successes?
Maybe at least ignore writes to that disk?
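
The "return after 2 successes" idea can be sketched as a quorum write: start all replica writes in parallel and report success as soon as enough of them have completed. This is an illustration of the suggestion only, not how the Swift proxy is implemented; `write_with_quorum` and the writer callables are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def write_with_quorum(writers, quorum=2):
    """Run all replica writes in parallel; return True once `quorum` succeed."""
    pool = ThreadPoolExecutor(max_workers=len(writers))
    futures = [pool.submit(writer) for writer in writers]
    successes = 0
    try:
        for future in as_completed(futures):
            if future.exception() is None:
                successes += 1
            if successes >= quorum:
                return True
        return False  # every write finished but too few succeeded
    finally:
        # Do not block on stragglers; they finish in the background.
        pool.shutdown(wait=False)
```

With three replicas and one slow disk, the caller gets its answer as soon as the two healthy writes finish, while the slow write is left to complete on its own.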


So what’s next for latency?
• Investigate why some ops have even longer latencies
• Added another switch to getput: --logops
– Extended put_object() to return transaction ID
– Writes detailed log records for every operation
– Makes it possible for longer latency transactions to be traced
segerm@az1-nv-compute-0000:~$ more /tmp/getput-p-0-1363878303.log
15:05:03.522 1363878303.521659 1363878303.459080 0.062547 eb4194b73e46f52f774a63fa552755d4 o-0-1-1
15:05:03.574 1363878303.574005 1363878303.521702 0.052291 eb4194b73e46f52f774a63fa552755d4 o-0-1-2
15:05:03.627 1363878303.627218 1363878303.574032 0.053174 eb4194b73e46f52f774a63fa552755d4 o-0-1-3
15:05:03.686 1363878303.686175 1363878303.627244 0.058918 eb4194b73e46f52f774a63fa552755d4 o-0-1-4
15:05:03.747 1363878303.746874 1363878303.686201 0.060661 eb4194b73e46f52f774a63fa552755d4 o-0-1-5
15:05:03.804 1363878303.804106 1363878303.746900 0.057194 eb4194b73e46f52f774a63fa552755d4 o-0-1-6
15:05:03.866 1363878303.866148 1363878303.804133 0.061979 eb4194b73e46f52f774a63fa552755d4 o-0-1-7
15:05:03.932 1363878303.931911 1363878303.866175 0.065724 eb4194b73e46f52f774a63fa552755d4 o-0-1-8

Recommendation: GET, PUT and DEL calls should return transaction IDs
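
Log records like the ones above can be filtered for slow operations with a few lines of code. The sketch below assumes the fourth whitespace-separated field is the latency in seconds and the last field is the object name, which is how the sample appears to be laid out; the other fields are left uninterpreted and `slow_operations` is a hypothetical name.

```python
def slow_operations(lines, threshold):
    """Return (object_name, latency) for records slower than threshold seconds."""
    slow = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed records
        latency = float(fields[3])
        if latency > threshold:
            slow.append((fields[5], latency))
    return slow

sample = [
    "15:05:03.522 1363878303.521659 1363878303.459080 0.062547 eb4194b73e46f52f774a63fa552755d4 o-0-1-1",
    "15:05:03.574 1363878303.574005 1363878303.521702 0.052291 eb4194b73e46f52f774a63fa552755d4 o-0-1-2",
]
print(slow_operations(sample, 0.06))  # [('o-0-1-1', 0.062547)]
```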


swcmd: a nifty helper utility
• One challenge of benchmarking can be lots of
containers and objects needing cleanup
– Can have dozens to 100s of containers
– Can have Ks to 100Ks of objects
– Swift client is too slow for deletes!
• The swift client utility could use some more functionality
– How about displaying numbers of objects in containers?
– Container sizes and even dates?
– The same things when listing containers
– What about parallel or even wildcard listing/deletes?
• swcmd only parallelizes for >1K objects in a container
• It uses multiprocessing and can hit 300-400 deletes/sec
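
The "only parallelize large containers" rule can be sketched as follows. This is not swcmd's code: `delete_one` is a hypothetical stand-in for the real Swift delete call, the worker count is arbitrary, and the sketch uses the thread-backed pool from the multiprocessing package so that any callable can be passed in.

```python
from multiprocessing.pool import ThreadPool

PARALLEL_THRESHOLD = 1000  # the slides say parallelism kicks in above 1K objects

def delete_all(names, delete_one, workers=8):
    """Delete every object; use a worker pool only for large containers."""
    if len(names) <= PARALLEL_THRESHOLD:
        for name in names:
            delete_one(name)
        return "serial"
    with ThreadPool(workers) as pool:
        pool.map(delete_one, names)  # blocks until every delete has run
    return "parallel"
```

Small containers skip the pool entirely, since starting workers would cost more than the deletes themselves.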


Examples
swcmd ls
63482 61M 2013-03-21 16:19:12 qc-1363882747
49 4G 2013-03-09 13:13:36 vlat-1362834811
0 0 2013-03-20 22:05:06 vlat-1363817101
1 10 2013-03-15 13:58:37 xxx-0-0
1 200M 2013-03-11 12:28:16 xyxxy
2 200M 2013-03-11 12:29:01 xyzzy
2901 702M 2013-02-12 16:34:19 zzz

swcmd -p ls xyz # list containers starting with xyz
swcmd -f rc zzz # force removal of zzz even though not empty
swcmd -p pf x # force removal of ALL containers starting with x
swcmd rm xyzzy/xyzzy # remove specific object

Recommendation: add these types of features to the swift utility


Questions?

