Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

In this session, we explain how to measure the key performance-impacting metrics in a cloud-based application. With specific examples of good and bad tests, we show how to get reliable measurements of CPU, memory, and disk performance, and how to map benchmark results to your application. We also cover the importance of selecting tests wisely, repeating tests, and measuring variability.


  1. 1. Best Practices for Benchmarking and Performance Analysis in the Cloud Robert Barnes, Amazon Web Services November 15, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. 2. Benchmarks: Measurement Demo
     • How many ways to measure? At least 20…
     [Demo slide shows a set of differing measurements: 4, 6, 3, 4, 3]
  3. 3. Cloud Benchmarks: Prequel
     • The best benchmark
     • Absolute vs. relative measures
     • Fixed time or fixed work
     • What's different?
     • Use a good AMI
     [Chart: average CPU result and coefficient of variance (0–60%) for several CentOS 5.4 AMIs, including the AWS-provided one, and an Ubuntu 12.04 AMI (AMI IDs truncated on the slide)]
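     Coefficient of variance (C.O.V. = standard deviation ÷ mean) is the variability statistic used throughout these slides. A minimal sketch of computing it from a file holding one benchmark score per line; the file name scores.txt and the awk one-liner are illustrative, not from the deck:

     # mean, population standard deviation, and C.O.V. from one score per line
     awk '{ n++; sum += $1; sumsq += $1*$1 }
          END { mean = sum/n; sd = sqrt(sumsq/n - mean*mean);
                printf "mean=%.2f sd=%.2f cov=%.2f%%\n", mean, sd, 100*sd/mean }' scores.txt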
  4. 4. Scenario: CPU-based Instance Selection
     • Application runs on premises
     • Primary requirement is integer CPU performance
     • Application is complex to set up, no benchmark tests exist, limited time
     • What instance would work best?
     1. Choose a synthetic benchmark
     2. Baseline: build, configure, tune, and run it on premises
     3. Run the same test (or tests) on a set of instance types (a launch sketch follows below)
     4. Use results from the instance tests to choose the best match
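     One hedged way to script step 3, assuming the AWS CLI is installed and configured; the AMI ID is a placeholder, and key-pair/security-group options are omitted:

     # Launch 10 instances of each candidate type from the same base AMI
     for T in m3.xlarge m2.2xlarge c3.xlarge cc2.8xlarge; do
       aws ec2 run-instances --image-id ami-XXXXXXXX --instance-type "$T" --count 10
     done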
  5. 5. Testing CPU • Choose a benchmark – geekbench, UnixBench, sysbench(cpu), and SPEC CPU2006 Integer • How do you know when you have a good result? • Tests run on 9 instance types – 10 instances of each of the 9 types launched – Tests run a minimum of 4 times on each instance – Ubuntu 13.04 base AMI
  6. 6. geekbench Overview
     • Workloads in 3 categories
       – 13 Integer tests: AES, Twofish, SHA1, SHA2, BZip2 compress, BZip2 decompress, JPEG compress, JPEG decompress, PNG compress, PNG decompress, Sobel, LUA, Dijkstra
       – 10 Floating Point tests: Black-Scholes, Mandelbrot, Sharpen image, Blur image, SGEMM, DGEMM, SFFT, DFFT, N-Body, Ray trace
       – 4 Memory tests: STREAM copy, STREAM scale, STREAM add, STREAM triad
     • Commercial product (64-bit)
     • No source code
     • Runs single and multi-CPU
     • Fast setup, fast runtime
  7. 7. geekbench Script
     SEQNO=$1
     GBTXT=gbtest.txt
     DL=+
     ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
     TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
     OUTID=$ID$DL$TYPE$DL
     START=$(date +%s.%N)
     ./geekbench_x86_64 --no-upload >$GBTXT
     END=$(date +%s.%N)
     DIFF=$(echo "$END - $START" | bc)
     OUTNAME=$OUTID$SEQNO$DL$DIFF$DL$GBTXT
     mv $GBTXT $OUTNAME
     …
     grep "Geekbench Score" i-*$GBTXT >gbresults.txt
     cat gbresults.txt | sed s/:// | awk '/i-/ {print $1";"$4";"$5}' >gbresults.csv
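     The methodology calls for at least four runs per instance; a wrapper sketch, assuming the script above is saved as gb.sh (a hypothetical file name):

     # Repeat the benchmark, passing the sequence number as $1 so each
     # run's output file gets a unique name
     for SEQNO in 1 2 3 4; do
       ./gb.sh "$SEQNO"
     done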
  8. 8. geekbench
     geekbench     1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
     m3.xlarge     0.93        1.04%   2.04        2.31%   2.06
     m3.2xlarge    0.93        1.40%   3.80        1.46%   2.08
     m2.xlarge     0.80        2.84%   1.54        4.06%   1.99
     m2.2xlarge    0.80        1.34%   2.82        1.21%   2.04
     m2.4xlarge    0.76        2.28%   5.11        1.71%   2.01
     c3.large      1.13        0.93%   1.32        0.71%   1.76
     c3.xlarge     1.13        0.39%   2.51        1.81%   1.74
     c3.2xlarge    1.13        0.19%   4.88        0.25%   1.70
     cc2.8xlarge   1.00        0.71%   15.46       1.93%   2.21
  9. 9. geekbench – Run Variance
     geekbench 1CPU ratio, m3.xlarge:
     instance     ratio  C.O.V.
     instance-1   0.93   0.31%
     instance-2   0.97   0.23%
     instance-3   0.94   0.17%
     instance-4   0.94   0.10%
     instance-5   0.94   0.32%
     instance-6   0.94   0.10%
     instance-7   0.93   0.25%
     instance-8   0.93   0.38%
     instance-9   0.94   0.11%
     instance-10  0.94   0.09%
  10. 10. geekbench – Integer Portion
     gb-integer    1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
     c3.large      1.12        0.50%   1.37        0.43%   NA
     c3.xlarge     1.13        0.38%   2.72        0.41%   NA
     c3.2xlarge    1.12        0.38%   5.35        0.51%   NA
     cc2.8xlarge   1.00        0.20%   17.88       3.31%   NA

     geekbench     1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
     c3.large      1.13        0.93%   1.32        0.71%   1.76
     c3.xlarge     1.13        0.39%   2.51        1.81%   1.74
     c3.2xlarge    1.13        0.19%   4.88        0.25%   1.70
     cc2.8xlarge   1.00        0.71%   15.46       1.93%   2.21
  11. 11. UnixBench Overview
     • Default: the BYTE Index
       – 12 workloads, run 2 times (roughly 29 minutes each time): integer computation, floating point computation, system calls, file system calls
       – Geomean of results relative to a baseline produces a System Benchmarks Index Score
     • Open source – must be built
       – Must be patched for > 16 CPUs
  12. 12. UnixBench Script
     SEQNO=$1
     UBTXT=ubtest.txt
     DL=+
     ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
     TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
     FN=$ID$DL$TYPE$DL$SEQNO$DL$UBTXT
     COPIES=`cat /proc/cpuinfo | grep processor | wc -l`
     ./Run -c 1 -c $COPIES >$FN
     …
     grep "System Benchmarks Index Score" i-*$UBTXT >ubresults.txt
     cat ubresults.txt | sed s/".txt:System Benchmarks Index Score"// | awk '/i-/ {print $1";"$2}' >ubresults.csv
  13. 13. UnixBench
     UnixBench     1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
     m3.xlarge     1.38        1.90%   2.49        1.36%   28.25
     m3.2xlarge    1.42        1.85%   4.21        1.99%   28.29
     m2.xlarge     0.40        5.82%   0.76        1.28%   28.30
     m2.2xlarge    0.42        1.71%   1.23        1.75%   28.32
     m2.4xlarge    0.48        3.31%   2.02        1.71%   28.34
     c3.large      1.10        1.33%   1.91        1.54%   28.17
     c3.xlarge     1.06        1.48%   2.85        1.26%   28.21
     c3.2xlarge    1.10        0.54%   4.50        1.02%   28.96
     cc2.8xlarge   1.00        2.97%   6.44        2.65%   30.20
  14. 14. UnixBench – Dhrystone 2
     UB-Integer    1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
     c3.large      1.05        0.24%   1.10        0.30%   0.17
     c3.xlarge     1.05        0.27%   2.20        0.28%   0.17
     c3.2xlarge    1.05        0.07%   4.34        0.23%   0.17
     cc2.8xlarge   1.00        0.10%   15.54       0.95%   0.17

     UnixBench     1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
     c3.large      1.10        1.33%   1.91        1.54%   28.17
     c3.xlarge     1.06        1.48%   2.85        1.26%   28.21
     c3.2xlarge    1.10        0.54%   4.50        1.02%   28.96
     cc2.8xlarge   1.00        2.97%   6.44        2.65%   30.20
  15. 15. SPEC CPU2006 Overview
     • Competitive (reviewed)
     • Commercial (site) license required
     • Source code provided, must be built
     • Highly customizable
     • Full "reportable" run takes 5+ hours
     • Published results on www.spec.org
  16. 16. SPEC CPU2006 Overview
     Benchmark        Language  Category
     400.perlbench    C         Programming language
     401.bzip2        C         Compression
     403.gcc          C         C compiler
     429.mcf          C         Combinatorial optimization
     445.gobmk        C         Artificial intelligence
     456.hmmer        C         Search gene sequence
     458.sjeng        C         Artificial intelligence
     462.libquantum   C         Physics / quantum computing
     464.h264ref      C         Video compression
     471.omnetpp      C++       Discrete event simulation
     473.astar        C++       Path-finding algorithms
     483.xalancbmk    C++       XML processing
  17. 17. SPEC CPU2006 Integer Script
     SEQNO=$1
     CPATH="/cpu2006/result"
     COPIES=`cat /proc/cpuinfo | grep processor | wc -l`
     SITXT=estspecint.txt
     DL=+
     ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
     TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
     FN=$ID$DL$TYPE$DL$SEQNO$DL$SITXT
     runspec --noreportable --tune=base --size=ref --rate=$COPIES --iterations=1 \
       400 403 445 456 458 462 464 471 473 483
     grep "_base" $CPATH/CINT*.ref.csv | cut -d, -f1-2 > $FN
     grep "total seconds elapsed" $CPATH/CPU*.log | awk '/finished/ {print $9}' >>$FN
  18. 18. Estimated SPEC CPU2006 Integer
     Est. SPECint   1CPU ratio  C.O.V.  RT (min)  NCPU ratio  C.O.V.  RT (min)
     m3.xlarge      1.01        1.06%   54.39     2.24        1.15%   104.18
     m3.2xlarge     1.01        1.67%   54.49     4.25        1.63%   109.22
     m2.xlarge      0.76        1.97%   70.83     1.39        2.45%   85.37
     m2.2xlarge     0.79        0.94%   68.85     2.76        1.24%   85.42
     m2.4xlarge     0.78        0.16%   68.73     5.21        1.26%   89.91
     c3.large       1.11        1.95%   50.00     1.25        1.47%   94.22
     c3.xlarge      1.10        1.96%   50.29     2.39        1.28%   97.66
     c3.2xlarge     1.08        0.87%   50.87     4.67        0.25%   100.22
     cc2.8xlarge    1.00        0.29%   54.92     14.92       0.52%   125.74
  19. 19. Sysbench Overview
     • Designed as a quick system test of MySQL servers
     • Test categories: fileio, cpu, memory, threads, mutex, oltp
     • Source code provided, must be built
     • Very simplistic defaults – tuning recommended
  20. 20. Sysbench Script
     COPIES=`cat /proc/cpuinfo | grep processor | wc -l`
     TDS=$(($COPIES * 2))
     STXT=sysbenchcpu.txt
     DL=+
     ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
     TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
     FN=$ID$DL$TYPE$DL$TDS$DL$STXT
     sysbench --num-threads=$TDS --max-requests=30000 --test=cpu \
       --cpu-max-prime=100000 run > $FN
     …
     grep "total time:" i-*$STXT | cut -d, -f1-2 > sbresults.txt
  21. 21. Sysbench – CPU
     sysbench      Default ratio  C.O.V.  RT (min)  Tuned ratio  C.O.V.  RT (min)
     m3.xlarge     3.21           1.44%   0.06      1.69         1.29%   3.86
     m3.2xlarge    6.41           1.38%   0.03      3.38         1.41%   1.93
     m2.xlarge     1.59           0.75%   0.11      0.80         0.23%   8.16
     m2.2xlarge    3.19           0.64%   0.06      1.60         0.76%   4.07
     m2.4xlarge    8.83           0.62%   0.02      4.71         0.20%   1.38
     c3.large      1.78           0.26%   0.10      0.91         0.09%   7.13
     c3.xlarge     3.55           0.53%   0.05      1.83         0.02%   3.57
     c3.2xlarge    6.55           8.45%   0.03      3.54         3.31%   1.85
     cc2.8xlarge   25.34          2.30%   0.01      13.69        1.10%   0.48
  22. 22. Summary: CPU Comparison
     Instance      GB     GB Int  UB    UB Int  Est. SPECInt  sysbench default  sysbench tuned
     m3.xlarge     2.04   2.01    2.49  1.88    2.24          3.21              1.69
     m3.2xlarge    3.80   3.96    4.21  3.77    4.25          6.41              3.38
     m2.xlarge     1.54   1.52    0.76  1.59    1.38          1.59              0.80
     m2.2xlarge    2.82   3.02    1.23  3.19    2.76          3.19              1.60
     m2.4xlarge    5.11   5.54    2.02  6.48    5.21          8.83              4.71
     c3.large      1.32   1.37    1.91  1.10    1.25          1.78              0.91
     c3.xlarge     2.51   2.72    2.85  2.20    2.39          3.55              1.83
     c3.2xlarge    4.88   5.35    4.50  4.34    4.67          6.55              3.54
     cc2.8xlarge   15.46  17.88   6.44  15.54   14.92         25.34             13.69
  23. 23. Scenario: Memory Instance Selection
     • Application runs on premises
     • Primary requirement: memory throughput of 20K MB/sec
     • What instance would work best?
     1. Choose a synthetic benchmark
     2. Baseline: build, configure, tune, and run it on premises
     3. Run the same test (or tests) on a set of instance types
     4. Use results from the instance tests to choose the best match
  24. 24. Testing Memory • Choose a benchmark: – stream, geekbench, sysbench(memory) • How do you know when you have a good result? • Tests run on 9 instance types – Minimum of 10 instances launched – Tests run a minimum of 3 times on each instance – Ubuntu 13.04 base AMI
  25. 25. Stream* Overview
     • Synthetic measure of sustainable memory bandwidth
       – Published results at www.cs.virginia.edu/stream/top20/Bandwidth.html
       – Must be built
       – By default, runs 1 thread per CPU
       – Use stream-scaling to automate array size and thread scaling: https://github.com/gregs1104/stream-scaling

     name    kernel                 bytes/iter  FLOPS/iter
     COPY    a(i) = b(i)            16          0
     SCALE   a(i) = q*b(i)          16          1
     SUM     a(i) = b(i) + c(i)     24          1
     TRIAD   a(i) = b(i) + q*c(i)   24          2

     * McCalpin, John D.: "STREAM: Sustainable Memory Bandwidth in High Performance Computers"
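     A build-and-run sketch for stream, assuming the C source from the STREAM site; the array-size macro is STREAM_ARRAY_SIZE in recent stream.c versions (older versions use N), and the value shown is illustrative:

     # Build with OpenMP and an array large enough to defeat the caches
     gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
     export OMP_NUM_THREADS=$(grep -c processor /proc/cpuinfo)
     ./stream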
  26. 26. Memory Scripts
     TDS=`cat /proc/cpuinfo | grep processor | wc -l`
     export OMP_NUM_THREADS=$TDS
     MTXT=stream.txt
     DL=+
     ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
     TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
     FN=$ID$DL$TYPE$DL$TDS$DL$MTXT
     ./stream | egrep "Number of Threads requested|Function|Triad|Failed|Expected|Observed" > $FN

     MTXT=sysbench-mem.txt
     FN=$ID$DL$TYPE$DL$TDS$DL$MTXT
     ./sysbench --num-threads=$TDS --test=memory run >$FN
  27. 27. Memory Comparison
     (MB/s)        Stream Triad  geekbench Memory-Triad  sysbench (default)
     m3.xlarge     23640.56      15375.64                302.95
     m3.2xlarge    26046.17      14999.27                603.40
     m2.xlarge     18766.58      17365.76                528.16
     m2.2xlarge    22421.91      17600.00                1019.08
     m2.4xlarge    19634.50      14405.82                1576.30
     c3.large      11434.83      9967.96                 2116.84
     c3.xlarge     21141.30      13972.65                2643.33
     c3.2xlarge    30235.78      20657.49                2944.91
     cc2.8xlarge   55200.86      37067.32                1195.90

     sysbench memory defaults:
     --memory-block-size [1K]
     --memory-total-size [100G]
     --memory-scope {global,local} [global]
     --memory-hugetlb [off]
     --memory-oper {read,write,none} [write]
     --memory-access-mode {seq,rnd} [seq]
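     The low sysbench numbers above reflect its 1K default block size, which measures per-block overhead more than bandwidth. A sketch of a more bandwidth-oriented run; the 1M block size is an illustrative tuning choice, not a value from the deck:

     # Stream memory in large blocks instead of the default 1K chunks
     sysbench --num-threads=$(grep -c processor /proc/cpuinfo) \
       --test=memory --memory-block-size=1M --memory-total-size=100G run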
  28. 28. Testing Disk I/O
     • Storage options: Amazon EBS, Amazon EBS PIOPs, ephemeral, hi1.4xlarge local storage
     • Test parameters: read %, write %, sequential, random, queue depth
     • Storage configuration: volume(s), RAID, LVM
     • I/O metrics: IOPs, throughput, latency
  29. 29. Benchmarking PIOPs
     • Launch an Amazon EBS-optimized instance
     • Create provisioned IOPS volumes
     • Attach the volumes to the Amazon EBS-optimized instance
     • Pre-warm volumes (a sketch follows below)
     • Tune queue depth and latency against IOPs
     [Chart: latency (usec, 0–1200) for 1-disk and 2-disk PIOPS 2K volumes at queue depths 1 and 2, across seq. read, seq. write, mixed seq. read/write, random read, random write, and mixed random read/write]
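     For pre-warming, a minimal sketch is to touch every block of the volume once before measuring; /dev/xvdf is a placeholder device name, and note that a write pass (if=/dev/zero of=/dev/xvdf) also pre-warms but destroys any data on the volume:

     # Read every block once so first-touch penalties don't pollute the results
     sudo dd if=/dev/xvdf of=/dev/null bs=1M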
  30. 30. Testing Disk I/O Examples
     • disk copy: cp file1 /disk1/file1
     • dd: dd if=/dev/zero of=/data1/testfile1 bs=1048 count=1024000
     • fio – flexible I/O tester: fio simple.cfg

     simple.cfg:
     [global]
     clocksource=cpu
     randrepeat=0
     ioengine=libaio
     direct=1
     group_reporting
     size=1G

     [xvdd-fill]
     filename=/data1/testfile1
     refill_buffers
     scramble_buffers=1
     iodepth=4
     rw=write
     bs=2m
     stonewall

     [xvdd-1disk-write-1k-1]
     time_based
     ioscheduler=deadline
     iodepth=1
     rate_iops=4080
     ramp_time=10
     filename=/data1/testfile1
     runtime=30
     bs=1k
     rw=write
  31. 31. Summary Disk I/O
     Command                                         Seconds  MB/sec
     cp f1 f2                                        17.248   59.37
     rm -rf f2; cp f1 f2                             .853     1200.47
     cp f1 f3                                        .880     1164.96
     dd if=/dev/zero bs=1048 count=1024000 of=d1     .722     1419.01
     dd if=/dev/urandom bs=1048 count=1024000 of=d2  79.710   12.84
     fio simple.cfg                                  NA       61.55
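     The jump from 59 MB/sec on the first cp to roughly 1200 MB/sec on the repeat is the Linux page cache, not the disk. A sketch of keeping runs honest, using the standard drop_caches knob (requires root):

     # Flush dirty pages and drop the page cache between runs so each copy
     # actually touches the device rather than RAM
     sync
     echo 3 | sudo tee /proc/sys/vm/drop_caches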
  32. 32. Beyond Simple Disk I/O
     Random PIOPs, 16 disks, 1M I/O   MBps
     read                             1006.73
     write                            904.03
     r70w30                           1005.91
  33. 33. Summary
     If benchmarking your application is not practical, synthetic benchmarks can be used if you are careful.
     • Choose the best benchmark that represents your application
     • Analysis – what does "best" mean?
     • Run enough tests to quantify variability
     • Baseline – what is a "good result"?
     • Samples – keep all of your results – more is better!
  34. 34. Please give us your feedback on this presentation ENT305 As a thank you, we will select prize winners daily for completed surveys!
