Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

CMP301_Deep Dive on Amazon EC2 Instances

9,155 views

Published on

Amazon EC2 provides a broad selection of instance types to accommodate a diverse mix of workloads. In this session, we provide an overview of the Amazon EC2 instance platform, key platform features, and the concept of instance generations. We dive into the current generation design choices of the different instance families, including the General Purpose, Compute Optimized, Storage Optimized, Memory Optimized, and GPU instance families. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.

CMP301_Deep Dive on Amazon EC2 Instances

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deep Dive on Amazon EC2 Instances Featur ing Per for mance Optimization B est Pr actices A d a m B o e g l i n , S r H P C S o l u t i o n s A r c h i t e c t C M P 3 0 1 N o v e m b e r 2 8 , 2 0 1 7
  2. 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What to expect from the session • Understanding the factors that going into choosing an Amazon EC2 instance • Defining system performance and how it is characterized for different workloads • How Amazon EC2 instances deliver performance while providing flexibility and agility • How to make the most of your EC2 instance experience through the lens of several instance types
  3. 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. API EC2 EC2 Amazon Elastic Compute Cloud (EC2) is big Instances Networking Purchase options
  4. 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EC2 instances Host server Hypervisor Guest 1 Guest 2 Guest n
  5. 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • First launched in August 2006 • M1 instance • “One size fits all” In the past M1
  6. 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EC2 instances history 2006 m1.small 2007 m1.large m1.xlarge 2008 c1.xlarge c1.medium 2009 m2.2xlarge m2.4xlarge 2010 m2.xlarge cc1.4xlarge t1.micro cg1.4xlarge 2011 cc2.8xlarge 2012 m1.medium hi1.4xlarge hs1.8xlarge m3.2xlarge m3.xlarge 2013 cr1.8xlarge g2.2xlarge c3.8xlarge c4.4xlarge c3.2xlarge c3.xlarge c3.large i2.xlarge i2.2xlarge i2.4xlarge i2.8xlarge 2014 m3.medium m3.large r3.large r3.xlarge r3.2xlarge r3.4xlarge r3.8xlarge t2.micro t2.small t2.medium 2015 g2.8xlarge c4.large c4.xlarge c4.2xlarge c4.4xlarge c4.8xlarge d2.xlarge d2.2xlarge d2.4xlarge d2.8xlarge t2.large m4.large m4.xlarge m4.2xlarge m4.4xlarge m4.10xlarge t2.nano x1.32xlarge 2016 m4.16xlarge x1.16xlarge p2.xlarge p2.8xlarge p2.16xlarge r4.large r4.xlarge r4.2xlarge r4.4xlarge r4.8xlarge r4.16xlarge 2017 x1e.32xlarge g3.4xlarge g3.8xlarge g3.16xlarge i3.large i3.xlarge i3.2xlarge i3.4xlarge i3.8xlarge i3.16xlarge f1.2xlarge f1.16xlarge c5.19xlarge c5.9xlarge c5.4xlarge c5.2xlarge c5.xlarge p3.16xlarge p3.8xlarge p3.2xlarge x1e.16xlarge
  7. 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Instance generation c5.xlarge Instance family Instance size
  8. 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. EC2 instance families General purpose Compute optimized C4 Storage and I/O optimized I3 P3 Accelerated Memory optimized R4C5 M4 D2 X1 G3 F1 C3
  9. 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s a virtual CPU? (vCPU) • A vCPU is typically a hyper-threaded physical core* • On Linux, “A” threads enumerated before “B” threads • On Windows, threads are interleaved • Divide vCPU count by two to get core count • Cores by EC2 and Amazon RDS DB instance type: https://aws.amazon.com/ec2/virtualcores/ * The “T” family is special
  10. 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  11. 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Disable hyper-threading if you need to • Useful for FPU heavy applications • Use ‘lscpu’ to validate layout • Hot offline the “B” threads for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' 'n' | sort -un); do echo 0 | sudo tee /sys/devices/system/cpu/cpu${cpunum}/online Done • Set grub to only initialize the first half of all threads maxcpus=20
  12. 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  13. 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Instance sizing c5.18xlarge 2 - c5.9xlarge ≈ 4 - c5.4xlarge ≈ 8 - c5.2xlarge ≈
  14. 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Resource allocation • All resources assigned to you are dedicated to your instance with no over-commitment* • All vCPUs are dedicated to you • Memory allocated is assigned only to your instance • Network resources are partitioned to avoid “noisy neighbors” • Curious about the number of instances per host? Use “dedicated hosts” as a guide. *Again, the “T” family is special
  15. 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “Launching new instances and running tests in parallel is easy…[when choosing an instance] there is no substitute for measuring the performance of your full application.” —EC2 documentation
  16. 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Timekeeping explained • Timekeeping in an instance is deceptively hard • gettimeofday(), clock_gettime(), QueryPerformanceCounter() • The TSC • CPU counter, accessible from userspace • Requires calibration, vDSO • Invariant on Sandy Bridge+ processors • Xen pvclock; does not support vDSO • On current generation instances, use TSC as clocksource
  17. 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <time.h> #define BILLION 1E9 int main(){ float diff_ns; struct timespec start, end; int x; clock_gettime(CLOCK_MONOTONIC, &start); for ( x = 0; x < 100000000; x++ ) { struct timeval tv; gettimeofday(&tv, NULL); } clock_gettime(CLOCK_MONOTONIC, &end); diff_ns = (BILLION * (end.tv_sec - start.tv_sec)) + (end.tv_nsec - start.tv_nsec); printf ("Elapsed time is %.4f secondsn", diff_ns / BILLION ); return 0; } Benchmarking—time intensive application
  18. 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. [centos@ip-192-168-1-77 testbench]$ strace -c ./test Elapsed time is 10.0336 seconds % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 99.99 3.322956 2 2001862 gettimeofday 0.00 0.000096 6 16 mmap 0.00 0.000050 5 10 mprotect 0.00 0.000038 8 5 open 0.00 0.000026 5 5 fstat 0.00 0.000025 5 5 close 0.00 0.000023 6 4 read 0.00 0.000008 8 1 1 access 0.00 0.000006 6 1 brk 0.00 0.000006 6 1 execve 0.00 0.000005 5 1 arch_prctl 0.00 0.000000 0 1 munmap ------ ----------- ----------- --------- --------- ---------------- 100.00 3.323239 2001912 1 total Using the Xen clocksource
  19. 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. [centos@ip-192-168-1-77 testbench]$ strace -c ./test Elapsed time is 2.0787 seconds % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 32.97 0.000121 7 17 mmap 20.98 0.000077 8 10 mprotect 11.72 0.000043 9 5 open 10.08 0.000037 7 5 close 7.36 0.000027 5 6 fstat 6.81 0.000025 6 4 read 2.72 0.000010 10 1 munmap 2.18 0.000008 8 1 1 access 1.91 0.000007 7 1 execve 1.63 0.000006 6 1 brk 1.63 0.000006 6 1 arch_prctl 0.00 0.000000 0 1 write ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000367 53 1 total Using the TSC clocksource
  20. 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Tip: Use TSC as clocksource On Linux: # cat /sys/devices/system/cl*/cl*/current_clocksource xen Change with: # echo tsc > /sys/devices/system/cl*/cl*/current_clocksource Or add to grub: tsc=reliable clocksource=tsc On Windows 2008 R2 and newer: • Nothing to do, it selects the best clocksource automatically
  21. 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • c4.8xlarge, d2.8xlarge, m4.10xlarge, m4.16xlarge, p2.16xlarge, x1.16xlarge, x1.32xlarge, etc • By entering deeper idle states, non- idle cores can achieve up to 300MHz higher clock frequencies • But… deeper idle states require more time to exit, may not be appropriate for latency-sensitive workloads • Linux: limit c-state by adding “intel_idle.max_cstate=1” to grub • Windows: no options to control c states P-state and C-state control
  22. 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Tip: P-state control for AVX2 • If an application makes heavy use of AVX2 on all cores, the processor may attempt to draw more power than it should • Processor will transparently reduce frequency • Frequent changes of CPU frequency can slow an application sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo“ See also: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processor_state_control.html
  23. 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. T2 instances • Lowest cost Amazon EC2 instance at $0.0065 per hour • Burstable performance • Fixed allocation enforced with CPU credits Model vCPU Baseline CPU credits/ hour Maximum CPU credit balance Memory (GiB) t2.nano 1 5% 3 72 .5 t2.micro 1 10% 6 144 1 t2.small 1 20% 12 288 2 t2.medium 2 40% (of 200% max) 24 576 4 t2.large 2 60% (of 200% max) 36 864 8 t2.xlarge 4 90% (of 400% max) 54 1296 16 t2.2xlarge 8 135% (of 800% max) 81 1944 32
  24. 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How credits work Baseline rate Credit balance Burst rate • A CPU credit provides the performance of a full CPU core for one minute • An instance earns CPU credits at a steady rate • An instance consumes credits when active • Credits expire (leak) after 24 hours
  25. 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Tip: Monitor CPU credit balance
  26. 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. X1 instances • Largest memory instance with 4 TB of DRAM • Quad socket, Intel E7 processors with 128 vCPUs Model vCPU Memory (GiB) Local storage Network x1.32xlarge 128 1952 2x 1920GB SSD 25Gbps x1.16xlarge 64 976 1x 1920GB SSD 10Gbps x1e.32xlarge 128 3904 2x 1920GB SSD 25Gbps x1e.16xlarge 64 1952 1x 1920GB SSD 10Gbps x1e.8xlarge 32 976 960 GB SSD Up to 10Gbps x1e.4xlarge, x1e.2xlarge, x1e.xlarge… In-memory databases, big data processing, HPC workloads
  27. 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Non-uniform memory access • Each processor in a multi-CPU system has local memory that is accessible through a fast interconnect • Each processor can also access memory from other CPUs, but local memory access is a lot faster than remote memory • Performance is related to the number of CPU sockets and how they are connected—Intel QuickPath Interconnect (QPI) NUMA
  28. 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. r4.16xlarge QPI 244GB 244GB 32 vCPUs 32 vCPUs
  29. 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. x1e.32xlarge QPI QPI QPIQPI QPI 976GB 976GB 976GB 976GB 32 vCPU’s 32 vCPU’s 32 vCPU’s 32 vCPU’s
  30. 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Tip: Kernel support for NUMA balancing • Linux Kernel 3.8+ and Windows Datacenter 2003+ support NUMA balancing • Not always the most efficient • Overhead of managing memory can slow down your app • Does your app have more memory that fits in a single socket? • Linux: set “numa=off” in grub to disable NUMA awareness • Do you have many processes or a footprint less than a single socket? • Linux: use “numactl” to restrict them to specific cores or nodes • Example: numactl --cpunodebind=0 --membind=0 ./myapp.run • Windows: use processor affinity to lock applications to specific cores
  31. 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Operating systems impact performance • Memory intensive web application • Created many threads • Rapidly allocated/deallocated memory • Comparing performance of RHEL6 and RHEL7 • Notice high amount of “system” time in top • Found a benchmark tool (ebizzy) with a similar performance profile • Traced it’s performance with “perf”
  32. 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. On RHEL6 [ec2-user@ip-172-31-12-150-RHEL6 ebizzy-0.3]$ sudo perf stat ./ebizzy -S 10 12,409 records/s real 10.00 s user 7.37 s sys 341.22 s Performance counter stats for './ebizzy -S 10': 361458.371052 task-clock (msec) # 35.880 CPUs utilized 10,343 context-switches # 0.029 K/sec 2,582 cpu-migrations # 0.007 K/sec 1,418,204 page-faults # 0.004 M/sec 10.074085097 seconds time elapsed
  33. 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. RHEL6 flame graph output www.brendangregg.com/flamegraphs.html
  34. 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. On RHEL7 [ec2-user@ip-172-31-7-22-RHEL7 ~]$ sudo perf stat ./ebizzy-0.3/ebizzy -S 10 425,143 records/s real 10.00 s user 397.28 s sys 0.18 s Performance counter stats for './ebizzy-0.3/ebizzy -S 10': 397515.862535 task-clock (msec) # 39.681 CPUs utilized 25,256 context-switches # 0.064 K/sec 2,201 cpu-migrations # 0.006 K/sec 14,109 page-faults # 0.035 K/sec 10.017856000 seconds time elapsed Up from 12,400 records/s! Down from 1,418,204!
  35. 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. RHEL7 flame graph output
  36. 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Split driver model Hardware Driver Domain Guest Domain Guest Domain VMM Frontend driver Frontend driver Backend driver Device Driver Physical CPU Physical Memory Network Device Virtual CPU Virtual Memory CPU Scheduling Sockets Application 1 23 4 5
  37. 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Device pass through: Enhanced networking • SR-IOV eliminates need for driver domain • Physical network device exposes virtual function to instance • Requires a specialized driver, which means: • Your instance OS needs to know about it • Amazon EC2 needs to be told your instance can use it
  38. 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. After enhanced networking Hardware Driver Domain Guest Domain Guest Domain VMM NIC Driver Physical CPU Physical Memory SR-IOV Network Device Virtual CPU Virtual Memory CPU Scheduling Sockets Application 1 2 3 NIC Driver
  39. 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Next generation of enhanced networking • Hardware checksums • Multi-queue support • Receive side steering • 25 Gbps in a placement group • New open source Amazon network driver Elastic network adapter
  40. 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Network performance • 25 Gigabit and 10 Gigabit • Measured one-way, double that for bi-directional (full duplex) • High, moderate, low–a function of the instance size and EBS optimization • Not all created equal–test with iperf if it’s important! • Use placement groups when you need high and consistent instance-to- instance bandwidth • All traffic limited to 5 Gb/s when exiting EC2
  41. 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • R4, I3, G3, F1, X1, P3, C5 • Largest sizes provide consistent 10 Gbps and 25 Gbps • Smaller sizes • Up to 10 Gbps with baseline • Accrue credits when network usage below baseline • Full bandwidth without placement group, but do need multiple streams • Single stream limited to 10 Gbps in placement group • 5 Gbps per stream across AZ’s, but up to 25 Gbps total ENA on R4 and beyond
  42. 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. P3 GPU Instances • NVIDIA Tesla V100 Volta Based GPU • 1 PetaFLOP of computational performance in a single instance • Provides up to 14X performance improvement over P2 for Machine Learning use-cases • Up to 2.6X performance improvement over P2 for High-Performance Computing use-cases P3 Instance Size GPUs Accelerator (V100) GPU Peer to Peer GPU Memory (GB) vCPUs Memory (GB) Network Bandwidth EBS Bandwidth P3.2xlarge 1 1 No 16 8 61 Up to 10Gbps 1.7Gbps P3.8xlarge 4 4 NVLink 64 32 244 10Gbps 7Gbps P3.16xlarge 8 8 NVLink 128 64 488 25Gbps 14Gbps
  43. 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. F1 FPGA instances • Up to 8 Xilinx Virtex UltraScale Plus VU9p FPGAs in a single instance with four high-speed DDR-4 per FPGA • Largest size includes high performance FPGA interconnects via PCIe Gen3 (FPGA Direct), and bidirectional ring (FPGA Link) • Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing F1 Instance size FPGAs FPGA Link FPGA Direct vCPUs Memory (GiB) NVMe instance storage Network bandwidth f1.2xlarge 1 - 8 122 1 x 470 Up to 10 Gigabit f1.16xlarge 8 Y Y 64 976 4 x 940 25 Gbps
  44. 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s new with C5? • Custom KVM based hypervisor • More resources for EC2 instances • ENA Only • No netfront (vif) driver failback • Install ENA driver on r4 or similar instance • Disable “Predictable Network Interface Names” • Add “net.ifnames=0” to grub • Remove “/etc/udev/70-persistent-net.rules” before creating AMI • NVME based EBS storage • Use the most recent kernel/NVME driver that your operating system supports! • You’ll need to have the NVME driver available in initramfs • Ensure you’re using UUIDs or LABELs in /etc/fstab • NVME Drivers have timeouts that “blkfront” does not • Maximum of 27 PCI Devices • EBS Volumes + ENI’s • ACPI Signals for shutdown (use acpid)
  45. 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 2009–Longer ago than you think • The 2.6.32 Linux Kernel for RHEL6 was released 2009 • Doesn’t support latest AWS networking (ENA) out of the box • Outdated Xen virtualization drivers spend more time in the hypervisor • Older NVME driver does not perform as well for IO Linux: use 3.10+ kernel (or newer for NVME instances) • Current Amazon Linux • Ubuntu 14.04 or later • RHEL/Centos 7 or later • Or use the 3.10 (kernel-lt) or 4.x (kernel-ml) version available from ELREPO in RHEL/Centos 6 Windows: • Windows Server 2008 R2, Windows Server 2012 R2, Windows Server 2016
  46. 46. EBS performance http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html Instance type EBS- optimized by default Max EBS bandwidth (Mbps)* Expected EBS throughput (MB/s)** Max. IOPS (16 KB I/O size)** Max network bandwidth r4.16xlarge Yes 14,000 1,750 75,000 25 Gb/s i3.16xlarge Yes 14,000 1,750 65,000 25 Gb/s m4.16xlarge Yes 10,000 1,250 65,000 25 Gb/s p2.16xlarge Yes 10,000 1,250 65,000 25 Gb/s x1.32xlarge Yes 10,000 1,250 65,000 10 Gb/s i3.8xlarge Yes 7,000 850 32,500 10 Gb/s r4.8xlarge Yes 7,000 875 37,500 10 Gb/s • Instance size affects throughput • Match your volume size and type to your instance • RAID multiple EBS volumes together to achieve:
  47. 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Summary: Getting the most out of EC2 instances • Use modern, up to date operating systems • Timekeeping: use TSC • C state and P state controls • Monitor T2 CPU credits • NUMA balancing • Enhanced networking • Profile your application
  48. 48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Virtualization themes • Bare metal performance goal, and in many scenarios already there • History of eliminating hypervisor intermediation and driver domains • Hardware-assisted virtualization • Scheduling and granting efficiencies • Device pass through
  49. 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Next steps • Visit the Amazon EC2 documentation • Launch an instance and try your app!
  50. 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you!

×