CPN302
Your Linux AMI: Optimization and Performance
(Intro)
Thor Nolen, Ecosystem Solutions Architect
November 15, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Linux In EC2 Is About Choice
• Agnostic
• Easy to deploy, configure, and update
Instance Type Selection
• Choose but be flexible
• Be careful running at the edge of what your instance type can handle
Instance Type Consideration
• Linux AMI choice is no longer just distribution and version
• PV or HVM
Virtualization Type
PV
• Operating System is “aware” of its virtual environment
• Requires OS modifications

HVM
• Leverages processor capabilities to deliver full virtualization
• Can use an unmodified Operating System
  – But PV network and storage drivers are recommended
Enhanced Networking
• i2 and C3
• HVM only
• Requires download, compile, and install of drivers
PV or HVM?
• There are performance differences
• Determine your metrics, test, and measure
• Application / workload testing will guide which variant is best for you
Linux Partner Ecosystem in EC2
Please give us your feedback on this presentation (CPN302).
As a thank you, we will select prize winners daily for completed surveys!
Your Linux AMI:
Optimization and Performance
Coburn Watson
Manager Cloud Performance, Netflix, Inc.
November 15, 2013

Netflix, Inc.
• World's leading internet television network
• ~40 million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent notables: increased originals catalog
AMI Performance @ Netflix
Why Tune the AMI?
• @ Netflix: 10's of 1000's of instances running globally
  – “A rising tide lifts all ships”
• Large variability in production workloads
  – OLTP (majority of REST-based services)
  – Batch/pre-compute (think movie recommendations…)
  – Cassandra
  – EVCache (memcached tier)
• Cloud environments have inherent performance variability
  – Improve resilience to such variability
• Deployment model affords ease of customization
Baking Performance Into the Base
• Aminator – open source AMI bakery
• Broad propagation of standard performance tunings
  – Apache, Tomcat configurations
• Focused application of workload-specific configurations
  – Primarily kernel and OS optimizations: CPU scheduling, memory management, network, IO
Linux Kernel Tuning - Benefits
• Effectively drive key instance resource dimensions
• Improved efficiency at scale saves big $
• Tuning process drives identification of the ideal instance type
• Readily available advanced Linux tools (e.g. perf, systemtap) provide deep insight into the kernel and the application:
  – Top-down analysis: review of application interaction with system resources
  – Bottom-up analysis: system resource usage of the application
Kernel Tuning Trade-Offs
• Kernel subsystems are inter-dependent
  – tuning in one area may improve efficiency at the expense of another
• 80/20 rule:
  – 80%: improvement gained by application refactoring and tuning
  – 20%: OS tuning, infrastructure improvement, etc.
• Tuning tailors the system for a specific workload
  – Other workloads may perform worse

The tuning objective is to align system resources to application requirements in order to improve overall system performance.
Linux Performance Tools
Metrics of Interest
• Performance analysis focus:

Resource | Characteristic
CPU | utilization, saturation, process priority, affinity, NUMA
Memory | physical/virtual memory usage, swapping, page cache
Network IO | network stack congestion, latency, throughput
Block IO | block layer and device latency, throughput, file system
Scalability | concurrency, parallelism, shared resources, lock contention
Basic Tools

Tool | Description
vmstat, dstat | Report system-wide CPU utilization, saturation, memory and swap usage. Overview of kernel events like syscalls, context switches, interrupts, etc.
mpstat | Reports per-CPU utilization, hard/soft interrupts, virtualization overhead (%steal, %guest)
top, atop, htop, nmon | Report per-process/thread state, scheduling priorities, CPU usage, etc. atop is similar to top but keeps historical data for trend analysis. htop and nmon provide similar stats with a graphical view.
iostat | IO latency/throughput at the driver and the block layer. Device utilization
sar | Keeps historical data about CPU, memory, network, and IO usage
uptime | Reports CPU saturation (threads waiting for CPU)
Basic Tools, cont.

Tool | Description
free | Free memory and swap. Counts page cache memory as free
/proc/meminfo | Memory, swap, and file system statistics. Kernel memory usage, statistics for the conservative memory allocation policy, HugeTLB, etc.
pidstat | Per-process/thread CPU usage, context switches, memory, swap, IO usage
ps, pstree | Per-process/thread CPU and memory usage
/proc, /sys file systems | /proc: stats about processes, threads, scheduling, kernel stacks, memory, etc. /sys: device-specific stats: disk, NIC, etc.
netstat, iptraf | TCP/IP statistics, routing, errors, network connectivity, and NIC stats. iptraf shows real-time TCP/IP network traffic
nicstat, ping, ifconfig | NIC stats, network connectivity, netmask, subnet, etc.
Advanced Tools

Tool | Description
blktrace | Profiles the Linux block layer and reports events like merge, plug, split, remapped, etc. Reports PID, block number, IO size, timestamp, etc.
slabtop | Kernel memory usage and statistics for the various kernel caches in use
pmap | Dumps all memory segments in the process address space: heap, stack, mmap
pstack, jstack | Dump the application's user-level stack trace. jstack includes Java methods, tid, pid, thread states
iotop | Per-process/thread IO statistics. Reports application time spent blocking on IO
/proc/net/softnet_stat | Per-CPU backlog queue throttling (netdev_max_backlog) stats
/proc/interrupts, /proc/softirqs | Tell which CPU is processing device interrupts. softirqs provides information about softirq processing for the network stack and the Linux block layer
tcpdump, wireshark | Network sniffers. Capture network traffic (libpcap format) for post-analysis. Wireshark can be used to analyze tcpdump and ethereal traces
ethtool | NIC low-level statistics: NIC speed, full duplex, transmit descriptors, ring buffer
Advanced Tools, cont.

Tool | Description
perf | Application and kernel profiling and tracing tool. Reports top kernel and application CPU-bound routines and stack traces. Captures hardware events (CPU cache, TLB misses, etc.) and software (kernel, application) static and dynamic events to perform low-level profiling and tracing.
systemtap | Application and kernel profiling and tracing tool. Allows inserting trace points into the kernel and application dynamically to capture low-level profiling data for performance analysis and debugging. Scripting language similar to C and Perl.
latencytop | Kernel blocking events due to locks, IO, condition variables. Dumps the kernel stack
strace | Reports information about system calls issued by the application: type of system call, arguments, return value, errno, and elapsed time
numastat | NUMA-related latency stats on the HVM platform
AMI Tuning
Use Case: CFS Scheduler Tuning
• Goal:
  – Improve batch and compute-intensive processing:
    • Increase the time slice and/or process priority in order to reduce context switches
    • The longer a process runs on the CPU, the better the use of CPU caches
• Tunables:
  – Change the scheduling policy of the workload: # chrt -a -b -p 0 <PID>
  OR
  – Set CFS tunables to increase the time slice at a system-wide level
    • sched_latency_ns: default 6ms * (1 + log2(ncpus)). Ex: 4 CPU cores = 18ms. Set it higher
    • sched_min_granularity_ns: default 0.75ms * (1 + log2(ncpus)). Ex: 4 CPU cores = 2.25ms. Set it higher
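As a sketch, the two approaches above might be applied on a live instance like this (root required; the PID 1234 and the 4-core values are illustrative, not from the deck):

```shell
# Move a hypothetical process (PID 1234) to the batch scheduling policy:
chrt -a -b -p 0 1234

# Or widen the CFS time slice system-wide (values sized for 4 vCPUs):
echo 36000000 > /proc/sys/kernel/sched_latency_ns          # 36 ms
echo 5000000  > /proc/sys/kernel/sched_min_granularity_ns  # 5 ms
```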
Use Case: Page Cache Tuning
• Goal:
  – Increase application write throughput
  – Reduce IO flooding by writing steadily rather than in bulk
• Tunables:
  – dirty_ratio = 60
  – dirty_background_ratio = 5
  – dirty_expire_centisecs = 30000
  – swappiness = 0
• Page cache hit/miss ratio:
  – systemtap (ioblock.request, vfs.read probes)
  – the fincore command can be used to find which pages of a file are in the page cache
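One way to apply the values above (root required; the vm.* prefixes are the standard sysctl names for these tunables):

```shell
# Page cache / writeback tunables from the slide:
sysctl -w vm.dirty_ratio=60              # throttle writers later
sysctl -w vm.dirty_background_ratio=5    # wake flusher threads earlier
sysctl -w vm.dirty_expire_centisecs=30000  # dirty data may age 300 s (note: centisecs)
sysctl -w vm.swappiness=0                # disable periodic swapping
```

To persist across reboots, add the same `key = value` lines to /etc/sysctl.conf.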
Use Case: Linux Block Layer Tuning
• Goal:
  – Queue more data to the SSD device to achieve higher throughput
  – Better sequential read IO throughput by fetching more data
  – Distribute IO processing across multiple CPUs
• Tunables:
  – /sys/block/<dev>/queue/nr_requests = 256
  – /sys/block/<dev>/queue/read_ahead_kb = 256
  – /sys/block/<dev>/queue/scheduler = noop
  – /sys/block/<dev>/queue/rq_affinity = 2
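A minimal sketch of setting these sysfs attributes (root required; the device name xvdb is hypothetical, and read-ahead is assumed to live at queue/read_ahead_kb):

```shell
dev=xvdb
echo 256  > /sys/block/$dev/queue/nr_requests    # deeper per-device request queue
echo 256  > /sys/block/$dev/queue/read_ahead_kb  # larger sequential read-ahead
echo noop > /sys/block/$dev/queue/scheduler      # FIFO scheduler, least overhead
echo 2    > /sys/block/$dev/queue/rq_affinity    # complete IO on the issuing CPU
```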
Use Case: Memory Allocation Tuning
• Goal:
  – Avoid running out of memory while running a production load
  – Do not allow memory over-commit that may result in OOM
• Tunables:
  – overcommit_memory = 2
  – overcommit_ratio = 80
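A sketch of enabling strict overcommit and verifying the resulting limit (root required for the sysctl writes):

```shell
sysctl -w vm.overcommit_memory=2   # strict overcommit: no over-allocation
sysctl -w vm.overcommit_ratio=80   # commit limit = 80% of RAM + swap

# Verify: CommitLimit and Committed_AS are exported in /proc/meminfo
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```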
Use Case: Network Stack Tuning
• Goal:
  – Increase network stack throughput
  – Larger TCP receive and congestion windows
  – Scale network stack processing across multiple CPUs
• Tunables:
  – tcp_slow_start_after_idle = 0
  – tcp_fin_timeout = 10
  – tcp_early_retrans = 1
  – rmem_max, wmem_max = 16777216 or higher
  – tcp_wmem, tcp_rmem = 8388608 12582912 16777216 or higher
  – netdev_max_backlog = 5000
  – txqueuelen = 5000
  – rps_sock_flow_entries = 32768
  – /sys/class/net/eth?/queues/rx-0/rps_flow_cnt = 32768
  – /sys/class/net/eth?/queues/rx-0/rps_cpus = 0xf
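A sketch of applying the tunables above with their full sysctl names (root required; eth0 and the middle tcp_*mem value of 12582912 are assumptions, the latter read as a 12 MB default between the slide's min and max):

```shell
sysctl -w net.ipv4.tcp_slow_start_after_idle=0
sysctl -w net.ipv4.tcp_fin_timeout=10
sysctl -w net.ipv4.tcp_early_retrans=1
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="8388608 12582912 16777216"
sysctl -w net.ipv4.tcp_wmem="8388608 12582912 16777216"
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.rps_sock_flow_entries=32768
ip link set eth0 txqueuelen 5000
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
echo f     > /sys/class/net/eth0/queues/rx-0/rps_cpus   # CPUs 0-3
```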
Netflix AMI Tuning Roadmap
Future tuning activity:
• M3 class instances support both HVM and PV, allowing easy validation of the performance gain of HVM versus PV
• Study Cassandra workloads on SSD-based systems
• Tune the Linux block layer and compare the performance of different IO schedulers: noop, CFQ, deadline
• Test file system (XFS, EXT4, BTRFS) performance on various workloads running on SSD instances
• Test network performance with new TCP/IP and network stack features: TCP Early Retransmit, TCP Proportional Rate Reduction, and the RFS/RPS features
• Capture low-level performance metrics using perf, systemtap, and JVM profiling tools
Appendix:
Perf and SystemTap
Profiling and Tracing Benefits
• Fine-grained measurements and low-level statistics help with difficult-to-solve performance issues
• Isolate hot spots, resource usage, and contention in the application and kernel
• Gain comprehensive insight into application and kernel behavior
SystemTap and Perf Benefits
• Insert trace points into the running application and kernel without adding any debug code
• Low overhead; processing is done in kernel space, with no stopping/restarting of the application
• Help build custom tools to fill observability gaps
• Analyze throughput and latency across the application and all kernel subsystems
• Unified view of user (application) and kernel events
SystemTap and Perf Benefits (cont.)
SystemTap and perf can track all sorts of events at system-wide, process, and thread levels:
• Time spent in system calls and kernel functions; arguments passed, return values, errno
• Dump application and kernel stack traces at any point in the execution path
• Time spent in various process states: blocking on IO, locks, resources, and waiting for CPU
• Top CPU-bound user and kernel functions
• Low-level TCP stats not possible with standard tools
• Low-level IO and network activity; page cache hit/miss rates
• Monitor page faults, memory allocation, memory leaks
• Aggregate results when a large amount of data needs to be collected and analyzed
Perf and SystemTap Packages
• Perf:
  – apt-get update
  – apt-get install linux-tools-common
  – apt-get install linux-base
  – apt-get install linux-tools-$(uname -r)
• SystemTap: install kernel debug packages and kernel headers exactly matching your kernel version
  – kernel debug packages: http://ddebs.ubuntu.com/pool/main/l/linux/
  – apt-get install linux-headers-$(uname -r)
  – apt-get install systemtap
SystemTap and Perf Events
Perf and SystemTap capture events generated from various sources:
• Hardware events (perf only): if running on a bare-metal system, perf can access hardware events generated by the PMU (Performance Monitoring Unit). Examples: CPU cache/TLB loads, references and misses, IPC (CPU stall cycles), branches, etc.
• Software events: events like page faults, cpu-clock, context switches, etc.
• Static trace events: trace points coded into the entry and exit of kernel functions. Examples: syscalls, net, sched, irq, etc.
• Dynamic trace events: dynamic trace points that can be inserted on-the-fly (hot patching) into application and kernel functions via the break point engine (kprobes). No kernel or application debug compilation, pauses, etc.
perf: Sub-commands
• perf top: top user and kernel routines
• perf top -G: include call graphs
• perf stat: hardware events
• perf stat: software events
• perf stat: net events
• perf probe: add a new event
• perf record: record events
• perf report: process recorded events
• perf record: record specific events
• perf report: dump full stack traces
• perf programming
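Illustrative invocations for the sub-commands above (the perf package and root privileges are generally required; the event lists and durations are examples, not from the deck):

```shell
perf top -G                                            # live profile with call graphs
perf stat -e cycles,cache-misses -- sleep 5            # hardware event counts
perf stat -e page-faults,context-switches -- sleep 5   # software event counts
perf probe --add tcp_sendmsg                           # add a dynamic probe event
perf record -a -g -- sleep 10                          # record system-wide with stacks
perf report --stdio                                    # process the recorded events
```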
SystemTap
• SystemTap provides a scripting language similar to C and Perl and follows an event-action model:
  – Event: a trace or probe point of interest
    • Example: system calls, kernel functions, profiling events, etc.
  – Action: what to do when the event of interest occurs
    • Example: print the app name and PID whenever the write() syscall is invoked
• The idea behind SystemTap is to name an event (probe) and provide a handler to perform an action in the event context
  – A probe point is like a break point, but instead of stopping the kernel/application at the break point, SystemTap causes a branch (jump) to the probe handler routine to perform the action
• A script can have multiple probes and associated handlers. Data is accumulated in a buffer and then dumped to standard output
SystemTap – Runs as a Kernel Module
• When a SystemTap script is executed, it is converted into a .c file and compiled as a Linux kernel module (.ko)
• The module is loaded into the kernel, and probes are inserted by hot patching the running kernel and application
• The module is unloaded when Ctrl-C is pressed or exit() is invoked from the module
• SystemTap scripts use the file extension .stp and contain probes and handlers written in the format:
  probe event { statements }
• When run as a script, the first line should name the interpreter:
  #!/usr/bin/env stap
• Or run from the command line:
  # stap script.stp
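Putting the pieces together, a minimal script matching the slide's own example (print app name and PID on each write() syscall) might look like this; it assumes systemtap and matching kernel debug packages are installed, and must run as root:

```shell
# Write out the script, then run it with stap.
cat > write_trace.stp <<'EOF'
#!/usr/bin/env stap
# Event: entry to the write() syscall. Action: print who called it.
probe syscall.write {
  printf("%s (pid %d) called write()\n", execname(), pid())
}
# Stop automatically after 5 seconds instead of waiting for Ctrl-C.
probe timer.s(5) { exit() }
EOF
stap write_trace.stp
```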
SystemTap: Events
SystemTap trace points can be placed at various locations in the kernel:
  – syscall: system call entry and return
    • Example: syscall.read, syscall.read.return
  – vfs: VFS function entry and return
  – kernel.function: kernel function entry and return
    • Example: kernel.function("do_fork"), kernel.function("do_fork").return
  – module.function: kernel module function entry and return

• Other events:
  – begin: fires at the start of the script
  – end: fires when the script exits
  – timer: fires periodically
SystemTap: Functions
Commonly used functions:
• tid(): the ID of the current thread
• uid(): the ID of the current user
• cpu(): the current CPU number
• gettimeofday_s(): the number of seconds since the UNIX epoch (January 1, 1970)
• probefunc(): the function being probed
• pid(): the process ID
• execname(): the executable name
• thread_indent(): provides indentation to nicely format printing of function call entry and return
• target(): the PID specified on the command line
• print_backtrace(): prints the complete stack trace
• print_regs(): prints the CPU registers
• kernel_string(): useful to print char data in kernel data structures
Appendix:
AMI Tuning
CFS Scheduler Tuning
• CFS scheduler:
  – Provides a fair share of CPU resources to all running tasks
  – Tasks are assigned weights (priority) to control the time a task can run on the CPU
    • Involuntary context switch: a task has consumed its time slice or is preempted by a higher-priority task
    • A task voluntarily relinquishes the CPU when it blocks on a resource: IO (disk, net), locks, etc.
• CFS supports various scheduling policies: FIFO, BATCH, IDLE, OTHER (default), RR
CFS Tunables – Compute-Intensive Workload
• The performance goal of a batch workload is to complete the given task in the shortest time possible. The SCHED_BATCH policy is more appropriate for batch processing workloads
• A task running with the SCHED_BATCH policy gets a bigger time slice and thus is not involuntarily context-switched as frequently, which allows compute tasks to run longer and make better use of CPU caches

CFS Tunables – Compute-Intensive Workload
CFS tunables can also be set to reduce context-switching activity:
• sched_latency_ns: the period in which each runnable task should run once. A larger value offers a bigger CPU slice, which may improve compute performance; interactive application performance may suffer
  Default: 6ms * (1 + log2(ncpus)). Example: 4 CPU cores = 18ms (default). Change it to 36ms
• sched_min_granularity_ns: a floor on the minimum amount of CPU time each task should get. A larger value helps compute workloads
  Default: 0.75ms * (1 + log2(ncpus)). Example: 4 CPU cores = 2.25ms (default). Change it to 5ms

Internal testing at Netflix shows a 2-5% performance improvement for compute-intensive tasks when running the workload with the SCHED_BATCH policy as compared to SCHED_OTHER.
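The default formulas above can be checked with a quick computation; this reproduces the slide's 4-core example:

```shell
# Derived CFS defaults for an instance with 4 vCPUs:
#   sched_latency_ns         = 6 ms   * (1 + log2(ncpus))
#   sched_min_granularity_ns = 0.75 ms * (1 + log2(ncpus))
ncpus=4
awk -v n="$ncpus" 'BEGIN {
  f = 1 + log(n)/log(2)                               # 1 + log2(ncpus)
  printf "sched_latency_ns: %g ms\n", 6 * f           # -> 18 ms
  printf "sched_min_granularity_ns: %g ms\n", 0.75 * f  # -> 2.25 ms
}'
```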
Avoid the OOM Killer
To overcome memory and swap shortages, the Linux kernel may kill processes to free memory. This mechanism is called the Out-Of-Memory (OOM) Killer.

• Heuristic overcommit – overcommit_memory=0 (default): allows overcommitting a reasonable amount of memory as determined by free memory, swap, and other heuristics. No reservation of memory and swap, so memory and swap may run out before the application uses all of its memory. This may result in application failure due to OOM.
• Always overcommit – overcommit_memory=1: allows unbounded overcommit. A memory allocation (malloc) of any size will succeed. As in the heuristic case, memory and swap may run out and trigger the OOM killer.
• Strict overcommit – overcommit_memory=2, overcommit_ratio=80: prevents overcommit. The kernel does not count free memory or swap when making decisions about the commit limit. When the application calls malloc(1GB), the kernel reserves (deducts) 1GB from free memory and swap. This guarantees that memory committed to the application will be available when needed, preventing OOM because no overcommit is allowed.
Avoid the OOM Killer (continued)
• When strict overcommit is enforced, the total memory that can be allocated system-wide is restricted to:
  overcommit limit = physical memory x overcommit_ratio + swap
  where overcommit_ratio = 50% (default). Tune overcommit_ratio to 80%
• A new program may fail to allocate memory even when the system reports plenty of free memory and swap. This is due to memory and swap already being reserved on behalf of processes.
• This feature does not affect memory use by the file system page cache. Page cache memory is always counted as free.
• Use /proc/meminfo statistics to monitor memory that has already been committed:
  CommitLimit: total amount of memory that can be allocated system-wide
  Committed_AS: memory already committed on behalf of applications
  Memory available = CommitLimit - Committed_AS

Any attempt to allocate memory beyond this available amount will fail when strict overcommit is used.
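Worked example of the commit-limit formula above, for a hypothetical instance with 8 GiB of RAM, 2 GiB of swap, and overcommit_ratio=80 (the sizes are illustrative):

```shell
# CommitLimit = physical_memory * overcommit_ratio/100 + swap  (all in kB)
mem_kb=$((8 * 1024 * 1024))    # 8 GiB RAM
swap_kb=$((2 * 1024 * 1024))   # 2 GiB swap
ratio=80
limit_kb=$(( mem_kb * ratio / 100 + swap_kb ))
echo "CommitLimit: ${limit_kb} kB"   # -> CommitLimit: 8808038 kB
```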
Tuning for Higher Throughput

• dirty_ratio: throttles writes when dirty pages in the file system cache reach 40% of memory. For write-intensive workloads, increase it to 60-80%
• dirty_background_ratio: wakes up pdflush when dirty pages reach 10% of total memory. Reducing the value (to 5%) allows pdflush to wake up earlier, which may keep dirty page growth in check
• dirty_expire_centisecs: how long data can stay dirty in the page cache (30 seconds). Increase it to 60-300 seconds on large-memory systems to prevent heavy IO to storage due to the short deadline. The drawback is that an unexpected outage may result in loss of data not yet committed
• swappiness: controls Linux's periodic swapping activity. A large value favors growing the page cache by stealing inactive application pages, and may improve application write throughput. Setting the value to zero disables periodic swapping; zero is recommended for latency-sensitive applications
Linux Block Layer – IO Tuning
• sysfs (/sys) is used to set device-specific attributes (tunables): /sys/block/<dev>/queue/..
• nr_requests: limits the number of IO requests queued per device (default 128). To improve IO throughput, consider doubling this value for RAID (multiple-disk) devices or SSDs
• scheduler: VM instances use the Xen virtualization layer and thus have no knowledge of the underlying disk geometry. The noop IO scheduler is recommended since it is FIFO and has the least overhead
• read_ahead: improves sequential IO performance. A larger value means more data is fetched into the page cache, improving application IO throughput

[diagram: device request queue showing the noop IO scheduler and nr_requests]
Block Layer: IO Affinity
• The Linux IO affinity feature distributes IO processing work across multiple CPUs
• When the application blocks on IO, the kernel records the CPU and dispatches the IO. When the IO is marked complete by the storage driver, the block layer performs completion processing on the same CPU that originally issued the IO
• This feature is very helpful when dealing with high IOPS rates, such as on SSD systems, since IO completion processing is distributed across multiple CPUs

• rq_affinity = 1 (default): the block layer migrates IO completions to the CPU group that originally submitted the request
• rq_affinity = 2: forces IO completion on the exact CPU that originally issued the IO, bypassing the "group" logic. This option maximizes distribution of IO completions
RPS/RFS - Network Performance and Scalability
• RPS (Receive Packet Steering) and RFS (Receive Flow Steering) can help the system scale better by distributing network stack processing across multiple CPUs
• Without this feature, network stack processing is restricted to the same CPU that serviced the NIC interrupt, which may induce latency and lower network throughput
• The NIC driver calls netif_rx() to enqueue the packet for processing. The RPS function get_rps_cpu() selects the appropriate queue that should process the request, and thus distributes the work across multiple CPUs
• RPS makes its decision via a hash lookup that uses a CPU bitmask to decide which CPU should process the packet
• RFS steers processing to the CPU where the application thread that eventually consumes the data is running. It uses the hash as an index into the network flow lookup table that maps flows to CPUs. This improves CPU cache locality
RPS/RFS - Network Performance and Scalability (continued)

• net.core.rps_sock_flow_entries = 32768: the global flow table mapping flows to their desired CPUs. Each table value is a CPU index that is updated during socket calls
• /sys/class/net/eth?/queues/rx-0/rps_flow_cnt = 32768: the number of entries in the per-queue flow table. The right value is determined by the number of active connections; 32768 is a good start for a moderately loaded server. For a single-queue device (as in the case of AWS instances), the two tunables should be set to the same value. net.core.rps_sock_flow_entries must be set for this to work
• /sys/class/net/eth?/queues/rx-0/rps_cpus = 0xf: a bitmask of CPUs. Zero disables the feature (packets are processed on the interrupted CPU). Set it to all CPUs, or to the CPUs of the same NUMA node (on large servers). The value 0xf causes CPUs 0-3 to do network stack processing
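A sketch of enabling RPS/RFS on a single-queue interface as described above (root required; eth0 is assumed):

```shell
# Global flow table; must be set for RFS to work at all:
sysctl -w net.core.rps_sock_flow_entries=32768

# Single-queue device: per-queue flow count matches the global table size.
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
echo f     > /sys/class/net/eth0/queues/rx-0/rps_cpus   # steer to CPUs 0-3

# Observe per-CPU processed/dropped/steered counters:
cat /proc/net/softnet_stat
```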
Network Stack Tuning
Packet transmit path:
• The network stack converts the application payload written into the socket buffer into TCP segments (or UDP datagrams), calculates the best route, and then writes the packet into the NIC driver queue
• QoS is provided by inserting various queue disciplines (FIFO, RED, CBQ, ..). The queue size is set by txqueuelen
• The NIC driver processes packets one by one, writing (DMA) to the NIC transmit descriptors. In the case of Xen, the packet is written into the Xen shared IO ring (Xen split device driver model)
Network Stack Tuning
Packet receive path:
• The device writes (DMA) the packet into kernel memory and raises an interrupt
• In the case of Xen, the packet is written into the shared IO ring and a notification is sent via the event channel
• The NIC driver's interrupt handler copies the packet into the input queue (a per-CPU queue). The queue is maintained by the network stack and its size is set by netdev_max_backlog
• Packets are processed on the same CPU that received the interrupt. If the RPS/RFS feature is enabled, network stack processing is distributed across multiple CPUs
• The packet is eventually written to the socket buffer; the application wakes up and processes the payload
TCP Congestion and Receiver Advertised Window
TCP tuning requires understanding some critical parameters:

• Receiver window (rwnd), sender window (swnd), congestion window (cwnd): cwnd controls the number of packets a sender can send without needing an acknowledgment. The TCP cwnd starts at 10 segments (slow start) and increases exponentially until it reaches the receiver advertised window (rwnd). Thus cwnd will continue to grow if rwnd and swnd are set to large values. However, setting rwnd and swnd too large may result in packet loss due to congestion, which may cut cwnd to half of rwnd, or back to the TCP slow start value, resulting in lower throughput. The Proportional Rate Reduction (PRR) and Early Retransmit (ER) features (kernel 3.2) help recover from packet losses quickly by retransmitting early and pacing out retransmissions across received ACKs during TCP fast recovery
• Bandwidth-delay product (BDP): rwnd and swnd should be set larger than the BDP; otherwise, TCP throughput will be limited. BDP = link bandwidth x RTT = 1000 Mbps x 0.001 s / 8 ≈ 128 KB
• Socket buffer sizes (tcp_wmem, tcp_rmem, rmem_max, wmem_max): limit the amount of data the application can send to / receive from the network stack. To improve application throughput, the socket buffer should be set large enough to fully utilize the TCP window
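The BDP arithmetic above can be reproduced directly (1000 Mbps link, 1 ms RTT, as in the slide):

```shell
# BDP = link bandwidth (bits/s) * RTT (s) / 8 bits-per-byte
bw_mbps=1000
rtt_ms=1
bdp_bytes=$(( bw_mbps * 1000000 * rtt_ms / 1000 / 8 ))
echo "BDP: ${bdp_bytes} bytes"   # -> BDP: 125000 bytes (~128 KB of window needed)
```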
Network Stack Tunables: Higher Throughput

• tcp_slow_start_after_idle = 0 (default: 1, enabled): prevents the TCP slow start value (10 segments) from being used as the new window for connections that have sat idle for 3 seconds. Better throughput due to continued use of the receiver advertised window instead of slow start
• tcp_fin_timeout = 10 (default: 60 sec): limits the number of connections in the TCP TIME_WAIT state to avoid running out of available ports. Recommended for sites with a high socket churn rate where the server application initiates the connection close. TIME_WAIT timeout = 2 * tcp_fin_timeout
• tcp_early_retrans = 1 (default: 0, disabled): allows fast retransmit to trigger after 2 duplicate ACKs (instead of 3 or more) for the same segment. Allows the connection to recover quickly from packet loss or network congestion. http://research.google.com/pubs/pub37486.html
• netdev_max_backlog = 5000 (default: 1000 packets): packets received by the NIC driver are queued into a per-CPU input queue for network stack processing. Packets are dropped if the input queue is full, causing TCP retransmits
Network Stack Tunables: Higher Throughput (cont.)

• txqueuelen = 5000 (default: 1000): controls the amount of data that can be queued by the network stack for NIC driver processing. For latency-sensitive applications, consider reducing the value (less buffering) so that TCP congestion avoidance kicks in early in case of packet loss
• rmem_max, wmem_max = 16777216 or higher: maximum receive and send socket buffer sizes for all protocols. Set the same as the tcp_wmem and tcp_rmem maximums. This sets the maximum TCP receive window size. The larger the receive buffer, the more data can be sent before requiring an acknowledgment. Caution: larger buffers may cause memory pressure
• tcp_wmem, tcp_rmem = 8388608 12582912 16777216 or higher: control socket send and receive buffer sizes as a triplet:
  min: minimum socket buffer size under memory pressure (default: 4096)
  default: initial socket buffer size (receive: 87380 | send: 16384)
  max: maximum socket buffer size (auto-tuned)

Open vSwitch와 Mininet을 이용한 가상 네트워크 생성과 OpenDaylight를 사용한 네트워크 제어실험Open vSwitch와 Mininet을 이용한 가상 네트워크 생성과 OpenDaylight를 사용한 네트워크 제어실험
Open vSwitch와 Mininet을 이용한 가상 네트워크 생성과 OpenDaylight를 사용한 네트워크 제어실험
 
macvlan and ipvlan
macvlan and ipvlanmacvlan and ipvlan
macvlan and ipvlan
 
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxConAnatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
 
RBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason DillamanRBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason Dillaman
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
 
Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...
Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...
Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...
 
OVS v OVS-DPDK
OVS v OVS-DPDKOVS v OVS-DPDK
OVS v OVS-DPDK
 
Ovs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offloadOvs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offload
 
ALSS14: Xen Project Automotive Hypervisor (Demo)
ALSS14: Xen Project Automotive Hypervisor (Demo)ALSS14: Xen Project Automotive Hypervisor (Demo)
ALSS14: Xen Project Automotive Hypervisor (Demo)
 
OpenStack Networking
OpenStack NetworkingOpenStack Networking
OpenStack Networking
 
F9: A Secure and Efficient Microkernel Built for Deeply Embedded Systems
F9: A Secure and Efficient Microkernel Built for Deeply Embedded SystemsF9: A Secure and Efficient Microkernel Built for Deeply Embedded Systems
F9: A Secure and Efficient Microkernel Built for Deeply Embedded Systems
 
MP BGP-EVPN 실전기술-1편(개념잡기)
MP BGP-EVPN 실전기술-1편(개념잡기)MP BGP-EVPN 실전기술-1편(개념잡기)
MP BGP-EVPN 실전기술-1편(개념잡기)
 
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache JamesRoom 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
Room 1 - 1 - Benoit TELLIER - On premise email inbound service with Apache James
 

Similar to Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCoburn Watson
 
(CMP402) Amazon EC2 Instances Deep Dive
(CMP402) Amazon EC2 Instances Deep Dive(CMP402) Amazon EC2 Instances Deep Dive
(CMP402) Amazon EC2 Instances Deep DiveAmazon Web Services
 
20160503 Amazed by AWS | Tips about Performance on AWS
20160503 Amazed by AWS | Tips about Performance on AWS20160503 Amazed by AWS | Tips about Performance on AWS
20160503 Amazed by AWS | Tips about Performance on AWSAmazon Web Services Korea
 
Deep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance PerformanceDeep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance PerformanceAmazon Web Services
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의DzH QWuynh
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
Ceph Day Beijing - Ceph all-flash array design based on NUMA architecture
Ceph Day Beijing - Ceph all-flash array design based on NUMA architectureCeph Day Beijing - Ceph all-flash array design based on NUMA architecture
Ceph Day Beijing - Ceph all-flash array design based on NUMA architectureCeph Community
 
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureCeph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureDanielle Womboldt
 
WSO2 Customer Webinar: WEST Interactive’s Deployment Approach and DevOps Prac...
WSO2 Customer Webinar: WEST Interactive’s Deployment Approach and DevOps Prac...WSO2 Customer Webinar: WEST Interactive’s Deployment Approach and DevOps Prac...
WSO2 Customer Webinar: WEST Interactive’s Deployment Approach and DevOps Prac...WSO2
 
z/VM Performance Analysis
z/VM Performance Analysisz/VM Performance Analysis
z/VM Performance AnalysisRodrigo Campos
 
CMP301_Deep Dive on Amazon EC2 Instances
CMP301_Deep Dive on Amazon EC2 InstancesCMP301_Deep Dive on Amazon EC2 Instances
CMP301_Deep Dive on Amazon EC2 InstancesAmazon Web Services
 
Running ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionRunning ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionSearce Inc
 
Optimizing elastic search on google compute engine
Optimizing elastic search on google compute engineOptimizing elastic search on google compute engine
Optimizing elastic search on google compute engineBhuvaneshwaran R
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...Ryousei Takano
 
Right-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual MachineRight-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual Machineheraflux
 
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...VMworld
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Ontico
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...Amazon Web Services
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internalsTokyo Azure Meetup
 

Similar to Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013 (20)

CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
(CMP402) Amazon EC2 Instances Deep Dive
(CMP402) Amazon EC2 Instances Deep Dive(CMP402) Amazon EC2 Instances Deep Dive
(CMP402) Amazon EC2 Instances Deep Dive
 
20160503 Amazed by AWS | Tips about Performance on AWS
20160503 Amazed by AWS | Tips about Performance on AWS20160503 Amazed by AWS | Tips about Performance on AWS
20160503 Amazed by AWS | Tips about Performance on AWS
 
Deep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance PerformanceDeep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance Performance
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Ceph Day Beijing - Ceph all-flash array design based on NUMA architecture
Ceph Day Beijing - Ceph all-flash array design based on NUMA architectureCeph Day Beijing - Ceph all-flash array design based on NUMA architecture
Ceph Day Beijing - Ceph all-flash array design based on NUMA architecture
 
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureCeph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
 
WSO2 Customer Webinar: WEST Interactive’s Deployment Approach and DevOps Prac...
WSO2 Customer Webinar: WEST Interactive’s Deployment Approach and DevOps Prac...WSO2 Customer Webinar: WEST Interactive’s Deployment Approach and DevOps Prac...
WSO2 Customer Webinar: WEST Interactive’s Deployment Approach and DevOps Prac...
 
Deep Dive on Amazon EC2
Deep Dive on Amazon EC2Deep Dive on Amazon EC2
Deep Dive on Amazon EC2
 
z/VM Performance Analysis
z/VM Performance Analysisz/VM Performance Analysis
z/VM Performance Analysis
 
CMP301_Deep Dive on Amazon EC2 Instances
CMP301_Deep Dive on Amazon EC2 InstancesCMP301_Deep Dive on Amazon EC2 Instances
CMP301_Deep Dive on Amazon EC2 Instances
 
Running ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionRunning ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in Production
 
Optimizing elastic search on google compute engine
Optimizing elastic search on google compute engineOptimizing elastic search on google compute engine
Optimizing elastic search on google compute engine
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
 
Right-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual MachineRight-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual Machine
 
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internals
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfTejal81
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Muhammad Tiham Siddiqui
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud DataEric D. Schabell
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and businessFrancesco Corti
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfInfopole1
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updateadam112203
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)codyslingerland1
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingFrancesco Corti
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox
 
CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024Brian Pichman
 

Recently uploaded (20)

Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and business
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 
AI Workshops at Computers In Libraries 2024

Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

  • 1. CPN302 Your Linux AMI: Optimization and Performance (Intro) Thor Nolen, Ecosystem Solutions Architect November 15, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. Linux In EC2 Is About Choice • Agnostic
  • 3. Linux In EC2 Is About Choice • Agnostic • Easy to deploy, configure, and update
  • 5. Instance Type Selection • Choose but be flexible
  • 6. Instance Type Selection • Choose but be flexible • Be careful running the edge of what your instance type can handle
  • 7. Instance Type Consideration • Linux AMI choice is not just distribution and version
  • 8. Instance Type Consideration • Linux AMI choice is no longer just distribution and version • PV or HVM
  • 9. Virtualization Type PV • Operating System is “aware” of its virtual environment • Requires OS modifications
  • 10. Virtualization Type PV • Operating System is “aware” of its virtual environment • Requires OS modifications HVM • Leverages processor capabilities to deliver full virtualization • Can use an unmodified Operating System
  • 11. Virtualization Type PV • Operating System is “aware” of its virtual environment • Requires OS modifications HVM • Leverages processor capabilities to deliver full virtualization • Can use an unmodified Operating System – But PV network and storage drivers recommended
  • 13. Enhanced Networking • i2 and C3 • HVM only
  • 14. Enhanced Networking • i2 and C3 • HVM only • Requires download, compile, and install of drivers
  • 15. PV or HVM? • There are performance differences
  • 16. PV or HVM? • There are performance differences • Determine your metrics, test, and measure
  • 17. PV or HVM? • There are performance differences • Determine your metrics, test, and measure • Application / workload testing will guide which variant is best for you
  • 19. Please give us your feedback on this presentation CPN302 As a thank you, we will select prize winners daily for completed surveys!
  • 20. Your Linux AMI: Optimization and Performance Coburn Watson Manager Cloud Performance, Netflix, Inc. November 15, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 21. Netflix, Inc. • World's leading Internet television network • ~40 million subscribers in 40+ countries • Over a billion hours streamed per month • Approximately 33% of all US Internet traffic at night • Recent notables: increased originals catalog
  • 22. AMI Performance @ Netflix
  • 23. Why Tune the AMI? • @ Netflix: 10's of 1000's of instances running globally – "Rising Tide Lifts All Ships" • Large variability in production workloads – OLTP (majority of REST-based services) – Batch/Pre-Compute (think movie recommendations…) – Cassandra – EVCache (memcached tier) • Cloud environments have inherent performance variability – Improve resilience to such variability • Deployment model affords ease of customization
  • 24. Baking Performance Into the Base • Aminator – Open Source AMI bakery • Broad propagation of standard performance tunings – Apache, Tomcat configurations • Focused application of workload-specific configurations – Primarily kernel and OS optimizations: CPU Scheduling, Memory Management, Network, IO
  • 25. Linux Kernel Tuning - Benefits • Effectively drive key instance resource dimensions • Improved efficiency at scale saves big $ • Tuning process drives identification of the ideal instance type • Readily available advanced Linux tools (e.g. perf, systemtap) provide deep insight into the kernel and the application: – Top-Down Analysis: Review of application interaction with system resources – Bottom-Up Analysis: System resource usage of the application
  • 26. Kernel Tuning Trade-Offs • Kernel subsystems are inter-dependent – tuning in one area may improve efficiency at the expense of another • 80/20 Rule: 80%: improvement gained by application refactoring and tuning; 20%: OS tuning, infrastructure improvement, etc. • Tuning tailors the system for a specific workload – other workloads may perform worse. The tuning objective is to align system resources to application requirements in order to improve overall system performance.
  • 28. Metrics of Interest • Performance analysis focus – CPU: utilization, saturation, process priority, affinity, NUMA – Memory: physical/virtual memory usage, swapping, page cache – Network IO: network stack congestion, latency, throughput – Block IO: block layer and device latency, throughput, file system – Scalability: concurrency, parallelism, shared resources, lock contention
  • 29. Basic Tools – vmstat, dstat: Report system-wide CPU utilization, saturation, memory and swap usage; overview of kernel events like syscalls, context switches, interrupts, etc. – mpstat: Reports per-CPU utilization, hard/soft interrupts, virtualization overhead (%steal, %guest) – top, atop, htop, nmon: Report per-process/thread state, scheduling priorities, CPU usage, etc.; atop is similar to top but keeps historical data for trend analysis; htop and nmon provide similar stats with a graphical view – iostat: IO latency/throughput at the driver and the block layer; device utilization – sar: Keeps historical data about CPU, memory, network, and IO usage – uptime: Reports CPU saturation (threads waiting for CPU)
  • 30. Basic Tools, cont. – free: Free memory and swap; counts page cache memory as free – /proc/meminfo: Memory, swap, and file system statistics; kernel memory usage, statistics for the conservative memory allocation policy, HugeTLB, etc. – pidstat: Per-process/thread CPU usage, context switches, memory, swap, IO usage – ps, pstree: Per-process/thread CPU and memory usage – /proc, /sys: /proc has stats about processes, threads, scheduling, kernel stacks, memory, etc.; /sys reports device-specific stats (disk, NIC, etc.) – netstat, iptraf: TCP/IP statistics, routing, errors, network connectivity, and NIC stats; iptraf shows real-time TCP/IP network traffic – nicstat, ping, ifconfig: NIC stats, network connectivity, netmask, subnet, etc.
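The /proc interfaces listed above can be read directly with standard tools; as a minimal sketch (Linux only), here is how the memory-commit counters referenced later in the deck can be pulled out of /proc/meminfo:

```shell
# Read a few memory counters straight from /proc/meminfo (values are in kB)
grep -E '^(MemTotal|MemFree|CommitLimit|Committed_AS):' /proc/meminfo
```

The same pattern works for /proc/net/softnet_stat and the other /proc files mentioned on the next slides.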
  • 31. Advanced Tools – blktrace: Profiles the Linux block layer and reports events like merge, plug, split, remapped, etc.; reports PID, block number, IO size, timestamp, etc. – slabtop: Kernel memory usage and statistics for the various kernel caches in use – pmap: Dumps all memory segments in the process address space: heap, stack, mmap – pstack, jstack: Dump the application's user-level stack trace; jstack output contains Java methods, tid, pid, thread states – iotop: Per-process/thread IO statistics; reports application time spent blocking on IO – /proc/net/softnet_stat: Per-CPU backlog queue throttling (netdev_max_backlog) stats – /proc/interrupts, /proc/softirqs: Tell which CPU is processing device interrupts; softirqs provides information about softirq processing for the network stack and the Linux block layer – tcpdump, wireshark: Network sniffers; capture network traffic (libpcap format) for post analysis; Wireshark can be used to analyze tcpdump and ethereal traces – ethtool: NIC low-level statistics: NIC speed, full duplex, transmit descriptors, ring buffer
  • 32. Advanced Tools, cont. – perf: Application and kernel profiling and tracing tool; reports top kernel and application CPU-bound routines and stack traces; captures hardware events (CPU cache, TLB misses, etc.) and software (kernel, application) static and dynamic events to perform low-level profiling and tracing – systemtap: Application and kernel profiling and tracing tool; allows inserting trace points into the kernel and application dynamically to capture low-level profiling data for performance analysis and debugging; scripting language similar to C and Perl – latencytop: Kernel blocking events due to locks, IO, condition variables; dumps the kernel stack – strace: Reports information about system calls generated by the application: type of system call, arguments, return value, errno, and elapsed time – numastat: NUMA-related statistics on the HVM platform
  • 34. Use Case: CFS scheduler tuning • Goal: – Improve batch and compute-intensive processing: • Increase the time slice and/or process priority in order to reduce context switches • The longer a process runs on the CPU, the better the use of CPU caches • Tunables: – Change the scheduling policy of the workload: # chrt -a -b -p 0 <PID> OR – Set CFS tunables to increase the time slice at a system-wide level • sched_latency_ns: 6ms * (1 + log2(ncpus)) Ex: 4 CPU cores = 18ms. Set it higher • sched_min_granularity_ns: 0.75ms * (1 + log2(ncpus)) Ex: 4 CPU cores = 2.25ms. Set it higher
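On sysctl-based systems the two CFS knobs can be persisted in a drop-in file; this is an illustrative sketch (the file name is hypothetical, and the 36ms/5ms values are the 4-core example given later in the deck, not a universal recommendation):

```
# /etc/sysctl.d/60-cfs-batch.conf (hypothetical file name)
# Larger scheduler time slices for a 4-core compute-heavy instance
kernel.sched_latency_ns = 36000000
kernel.sched_min_granularity_ns = 5000000
```

Apply with `sysctl -p /etc/sysctl.d/60-cfs-batch.conf` as root; note these tunables are only exposed on kernels built with scheduler debugging support.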
  • 35. Use Case: CFS scheduler tuning
  • 36. Use Case: Page Cache Tuning • Goal: – Increase application write throughput – Reduce IO flooding by writing consistently rather than in bulk • Tunables: – dirty_ratio=60 – dirty_background_ratio=5 – dirty_expire_centisecs=30000 – swappiness=0 • Page cache hit/miss ratio: – systemtap (ioblock_request, vfs_read probes) – The fincore command can be used to find which pages of a file are in the page cache
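These four tunables live under the vm.* sysctl namespace; a hypothetical drop-in sketch using the slide's write-heavy values (verify against your own IO profile before adopting them):

```
# /etc/sysctl.d/60-pagecache.conf (hypothetical file name)
# Write-heavy workload values from the slide above
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 30000
vm.swappiness = 0
```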
  • 37. Use Case: Linux Block Layer Tuning • Goal: – Queue more data to the SSD device to achieve higher throughput – Better sequential read IO throughput by fetching more data – Distribute IO processing across multiple CPUs • Tunables: /sys/block/<dev>/queue/nr_requests=256 /sys/block/<dev>/queue/read_ahead_kb=256 /sys/block/<dev>/queue/scheduler=noop /sys/block/<dev>/queue/rq_affinity=2
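These sysfs attributes do not persist across reboots, so they are typically applied from a boot script; a sketch assuming a hypothetical SSD device xvdb (substitute your own device name, run as root):

```
# rc.local-style fragment; the device name xvdb is an assumption
echo 256  > /sys/block/xvdb/queue/nr_requests
echo 256  > /sys/block/xvdb/queue/read_ahead_kb
echo noop > /sys/block/xvdb/queue/scheduler
echo 2    > /sys/block/xvdb/queue/rq_affinity
```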
  • 38. Use Case: Memory Allocation Tuning • Goal: – Avoid running out of memory while running a production load – Do not allow memory overcommit that may result in OOM • Tunables: – overcommit_memory=2 – overcommit_ratio=80
  • 39. Use Case: Network Stack Tuning • Goal: – Increase network stack throughput – Larger TCP receive and congestion windows – Scale network stack processing across multiple CPUs • Tunables: tcp_slow_start_after_idle=0 rmem_max, wmem_max=16777216 or higher tcp_fin_timeout=10 tcp_wmem, tcp_rmem=8388608 1258291 16777216 or higher tcp_early_retrans=1 rps_sock_flow_entries=32768 netdev_max_backlog=5000 /sys/class/net/eth?/queues/rx-0/rps_flow_cnt=32768 txqueuelen=5000 /sys/class/net/eth?/queues/rx-0/rps_cpus=0xf
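Most of the tunables above are net.* sysctls (the rx-0 entries are sysfs paths and are set separately); a hypothetical drop-in sketch using the slide's values, which slides 79-80 discuss individually:

```
# /etc/sysctl.d/60-network.conf (hypothetical file name) — values from the slide
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_early_retrans = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 8388608 1258291 16777216
net.ipv4.tcp_wmem = 8388608 1258291 16777216
net.core.rps_sock_flow_entries = 32768
net.core.netdev_max_backlog = 5000
```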
  • 41. Future tuning activity • M3 class instances support both HVM and PV: easy validation of the performance gain with HVM versus PV • Study Cassandra workloads on SSD-based systems • Tune the Linux block layer and compare the performance of different IO schedulers: noop, CFQ, deadline • Test file system (XFS, EXT4, BTRFS) performance on various workloads running on SSD instances • Test network performance with new TCP/IP and network stack features: TCP early retransmit, TCP Proportional Rate Reduction, and RFS/RPS • Capture low-level performance metrics using perf, systemtap, and JVM profiling tools
  • 42. Please give us your feedback on this presentation CPN302 As a thank you, we will select prize winners daily for completed surveys!
  • 44. Profiling and Tracing Benefits • Fine-grained measurements and low-level statistics to help with difficult-to-solve performance issues • Isolate hot spots, resource usage, and contention in the application and kernel • Gain comprehensive insight into application and kernel behavior
  • 45. SystemTap and Perf Benefits • Insert trace points into the running application and kernel without adding any debug code • Lower overhead; processing is done in kernel space with no stopping/starting of the application • Help build custom tools to fill observability gaps • Analyze throughput and latency across the application and all kernel subsystems • Unified view of user (application) and kernel events
  • 46. SystemTap and Perf Benefits (cont.) SystemTap and Perf can track all sorts of events at system-wide, process, and thread levels: • Time spent in system calls and kernel functions; arguments passed, return values, errno • Dump application and kernel stack traces at any point in the execution path • Time spent in various process states: blocking on IO, locks, resources, and waiting for CPU • Top CPU-bound user and kernel functions • Low-level TCP stats not possible with standard tools • Low-level IO and network activity; page cache hit/miss rates • Monitor page faults, memory allocation, memory leaks • Aggregate results when a large amount of data needs to be collected and analyzed
  • 47. Perf and SystemTap packages • Perf: – apt-get update – apt-get install linux-tools-common – apt-get install linux-base – apt-get install linux-tools-$(uname -r) • SystemTap: Install kernel debug packages and kernel headers exactly matching your kernel version – kernel debug packages: http://ddebs.ubuntu.com/pool/main/l/linux/ – apt-get install kernel-headers-$(uname -r) – apt-get install systemtap
  • 48. SystemTap and Perf Events Perf and SystemTap capture events generated from various sources: • Hardware Events (perf only): If running on a bare-metal system, perf can access hardware events generated by the PMU (Performance Monitoring Unit). Examples: CPU cache/TLB loads, references and misses, IPC (CPU stall cycles), branches, etc. • Software Events: Events like page faults, cpu-clock, context switches, etc. • Static Trace Events: Trace points coded into the entry and exit of kernel functions. Examples: syscalls, net, sched, irq, etc. • Dynamic Trace Events: Dynamic trace points that can be inserted on-the-fly (hot patching) into application and kernel functions via the break-point engine (kprobes). No kernel or application debug compilation or pauses required.
  • 50. perf top: Top User and Kernel Routines
  • 54. perf stat: Net Events
  • 55. perf probe: Add a New Event
  • 56. perf record - Record Events
  • 57. perf report – Process Recorded Events
  • 58. perf record – Record Specific Events
  • 59. perf report – Dump Full Stack Traces
  • 61. SystemTap • SystemTap supports a scripting language similar to C and Perl and follows an event-action model: – Event: Trace or probe point of interest • Example: system calls, kernel functions, profiling events, etc. – Action: What to do when the event of interest occurs • Example: Print app name and PID whenever the write() syscall is invoked • The idea behind SystemTap is to name an event (probe) and provide a handler to perform an action in the event context – a probe point is like a break point, but instead of stopping the kernel/application at the break point, SystemTap causes a branch (jump) to the probe handler routine to perform the action • A script can have multiple probes and associated handlers. Data is accumulated in a buffer and then dumped to standard out.
  • 62. SystemTap – Runs as a Kernel Module • When a SystemTap script is executed, it is converted into a .c file and compiled as a Linux kernel module (.ko) • The module is loaded into the kernel and probes are inserted by hot patching the running kernel and application • The module is unloaded when Ctrl-C is pressed or exit() is invoked from the script • SystemTap scripts use the file extension .stp and contain probes and handlers written in the format: probe event { statements } • When run as a script, the first line should name the interpreter: #!/usr/bin/env stap • Or run it from the command line: # stap script.stp
  • 63. SystemTap: Events SystemTap trace points can be placed at various locations in the kernel: – syscall: system call entry and return • Example: syscall.read, syscall.read.return – vfs: VFS function entry and return – kernel.function: kernel function entry and return • Example: kernel.function("do_fork"), kernel.function("do_fork").return – module.function: kernel module entry and return • Other events: – begin: fires at the start of the script – end: fires when the script exits – timer: fires periodically
  • 64. SystemTap: Functions Commonly used functions: • tid(): The ID of the current thread • uid(): The ID of the current user • cpu(): The current CPU number • gettimeofday_s(): The number of seconds since the UNIX epoch (January 1, 1970) • probefunc(): The name of the function being probed • pid(): The ID of the current process • execname(): Executable name • thread_indent(): Provides indentation to nicely format printing of function call entry and return • target(): The PID specified on the command line • print_backtrace(): Prints the complete stack trace • print_regs(): Prints CPU registers • kernel_string(): Useful to print char-type members of data structures
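Putting a few of these functions together, here is a minimal illustrative script (not from the deck) that traces write(2) callers for five seconds; it requires root and the packages from slide 47:

```
# trace-writes.stp -- run as root: stap trace-writes.stp
probe syscall.write {
  printf("%s[%d] on CPU %d: write(fd=%d, count=%d)\n",
         execname(), pid(), cpu(), $fd, $count)
}
probe timer.s(5) { exit() }   # stop after five seconds
```

The $fd and $count variables are the arguments of the probed syscall, made available by the syscall tapset.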
  • 66. CFS Scheduler Tuning • CFS scheduler: – Provides a fair share of CPU resources to all running tasks – Tasks are assigned weights (priority) to control the time a task can run on the CPU • Involuntary context switch: a task has consumed its time slot or is preempted by a higher-priority task • A task voluntarily relinquishes the CPU when it blocks on a resource: IO (disk, net), locks, etc. • CFS supports various scheduling policies: FIFO, BATCH, IDLE, OTHER (default), RR
  • 67. CFS Tunable – Compute Intensive Workload • The performance goal of a batch workload is to complete the given task in the shortest time possible; the SCHED_BATCH policy is more appropriate for batch processing workloads • A task running with the SCHED_BATCH policy gets a bigger time slice and thus is not involuntarily context-switched as frequently, which allows compute tasks to run longer and get better use of CPU caches
  • 68. CFS Tunable – Compute Intensive Workload CFS tunables can also be set to reduce context-switching activity: • sched_latency_ns: the period in which each runnable task should run once. A larger value offers a bigger CPU slice, which may improve compute performance; interactive application performance may suffer. Default: 6ms * (1 + log2(ncpus)). Example: 4 CPU cores = 18ms (default). Change it to 36ms • sched_min_granularity_ns: threshold on the minimum amount of CPU time each task should get. A larger value helps compute workloads. Default: 0.75ms * (1 + log2(ncpus)). Example: 4 CPU cores = 2.25ms (default). Change it to 5ms. Internal testing at Netflix shows a 2-5% performance improvement for compute-intensive tasks when running the workload with the SCHED_BATCH policy as compared to SCHED_OTHER.
  • 69. Avoid OOM Killer To overcome memory and swap shortages, the Linux kernel may kill processes to free memory. This mechanism is called the Out-Of-Memory (OOM) Killer. – Heuristic overcommit (overcommit_memory=0, default): Allows overcommitting a reasonable amount of memory as determined by free memory, swap, and other heuristics. No reservation of memory and swap, so memory and swap may run out before the application uses all of its memory. This may result in application failure due to OOM. – Always overcommit (overcommit_memory=1): Allows unbounded overcommit. A memory allocation (malloc) of any size will be successful. As in the heuristic case, memory and swap may run out and trigger the OOM killer. – Strict overcommit (overcommit_memory=2, overcommit_ratio=80): Prevents overcommit. It does not count free memory or swap when making decisions about the commit limit. When the application calls malloc(1GB), the kernel reserves (deducts) 1 GB from free memory and swap. This guarantees that memory committed to the application will be available if needed, preventing OOM because no overcommit is allowed.
  • 70. Avoid OOM Killer (continued) • When strict overcommit is enforced, the total memory that can be allocated system-wide is restricted to: overcommit limit = physical memory x overcommit_ratio + swap, where overcommit_ratio=50% (default). Tune overcommit_ratio to 80% • A new program may fail to allocate memory even when the system is reporting plenty of free memory and swap. This is due to memory and swap already reserved on behalf of other processes. • This feature does not affect memory used by the file system page cache. Page cache memory is always counted as free. • Use /proc/meminfo statistics to monitor memory already committed: CommitLimit: total amount of memory that can be allocated system-wide; Committed_AS: memory already committed on behalf of applications; available commit = CommitLimit - Committed_AS. Any attempt to allocate memory beyond that difference will fail when strict overcommit is used.
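The commit-limit formula can be sanity-checked with shell arithmetic; the instance sizes below are hypothetical, not from the deck:

```shell
# Strict-overcommit limit for a hypothetical 16 GB instance with 2 GB swap
phys_mb=16384
swap_mb=2048
overcommit_ratio=80                         # vm.overcommit_ratio
commit_limit_mb=$(( phys_mb * overcommit_ratio / 100 + swap_mb ))
echo "CommitLimit: ${commit_limit_mb} MB"   # 13107 + 2048 = 15155 MB
```

Compare the computed value against the CommitLimit line in /proc/meminfo after setting the tunables.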
  • 71. Tuning for Higher Throughput – dirty_ratio: Throttles writes when dirty pages in the file system cache reach 40% of memory. For write-intensive workloads, increase it to 60-80% – dirty_background_ratio: Wakes up pdflush when dirty pages reach 10% of total memory. Reducing the value (to 5%) allows pdflush to wake up early, which may keep dirty page growth in check – dirty_expire_centisecs: Data can stay dirty in the page cache for 30 seconds. Increase it to 60-300 seconds on large-memory systems to prevent heavy IO to storage due to the short deadline. The drawback is that an unexpected outage may result in loss of data not yet committed – swappiness: Controls Linux's periodic swapping activity. A large value favors growing the page cache by stealing application inactive pages; setting the value to zero disables periodic swapping. A large value may improve application write throughput; a value of zero is recommended for latency-sensitive applications
  • 72. Linux Block Layer – IO Tuning • sysfs (/sys) is used to set device-specific attributes (tunables): /sys/block/<dev>/queue/.. • nr_requests: Limits the number of IO requests queued per device (default 128). To improve IO throughput, consider doubling this value for RAID (multiple-disk) devices or SSDs • scheduler: VM instances use the Xen virtualization layer and thus have no knowledge of the underlying disk geometry. The noop IO scheduler is recommended given that it is FIFO and has the least overhead • read_ahead_kb: Improves sequential IO performance. A larger value means more data is fetched into the page cache, improving application IO throughput
  • 73. Block Layer: IO Affinity • The Linux IO affinity feature distributes IO-processing work across multiple CPUs • When the application blocks on IO, the kernel records the CPU and dispatches the IO. When the IO is marked completed by the storage driver, the block layer performs IO completion processing on the same CPU that originally issued the IO • This feature is very helpful when dealing with high IOPS rates, such as on SSD systems, since IO completion processing is distributed across multiple CPUs – rq_affinity=1 (default): the block layer migrates IO completions to the CPU group that originally submitted the request – rq_affinity=2: forces the IO completion onto the exact CPU that originally issued the IO, bypassing the "group" logic. This option maximizes distribution of IO completions
  • 74. RPS/RFS - Network Performance and Scalability • RPS (Receive Packet Steering) and RFS (Receive Flow Steering) help the system scale better by distributing network stack processing across multiple CPUs • Without this feature, network stack processing is restricted to the same CPU that serviced the NIC interrupt, which may induce latency and lower network throughput • The NIC driver calls netif_rx() to enqueue the packet for processing. The RPS function get_rps_cpu() selects the appropriate queue that should process the request, distributing the work across multiple CPUs • RPS makes its decision via a hash lookup that uses a CPU bitmask to decide which CPU should process the packet • RFS steers the processing to the CPU where the application thread that eventually consumes the data is running. It uses the hash as an index into a network flow lookup table that maps flows to CPUs. This improves CPU cache locality.
  • 75. RPS/RFS - Network Performance and Scalability (continued) – net.core.rps_sock_flow_entries=32768: Global flow table mapping the desired CPU to each flow. Each table value is a CPU index that is updated during socket calls – /sys/class/net/eth?/queues/rx-0/rps_flow_cnt=32768: Number of entries in the per-queue flow table. The value is determined by the number of active connections; 32768 is a good start for a moderately loaded server. For a single-queue device (as in the case of AWS instances), the two tunables should be set to the same value. net.core.rps_sock_flow_entries must be set for this to work – /sys/class/net/eth?/queues/rx-0/rps_cpus=0xf: Set as a hex bitmask of CPUs. Disabled when set to zero (packets are processed on the interrupted CPU). Set to all CPUs, or to CPUs that are part of the same NUMA node (on large servers). A value of 0xf causes CPUs 0-3 to do network stack processing
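As with the block-layer attributes, the rx-0 entries are sysfs files set at boot; a sketch for a single-queue eth0 on a 4-vCPU instance (the interface name and CPU mask are assumptions, run as root):

```
# boot-time fragment; rps_cpus takes the hex bitmask without a 0x prefix
sysctl -w net.core.rps_sock_flow_entries=32768
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
echo f     > /sys/class/net/eth0/queues/rx-0/rps_cpus   # CPUs 0-3
```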
  • 76. Network Stack Tuning Packet Transmit Path: • The network stack converts the application payload written to the socket buffer into TCP segments (or UDP datagrams), calculates the best route, and then writes the packet into the NIC driver queue • QoS is provided by inserting various queueing disciplines (FIFO, RED, CBQ, ...). The queue size is set by txqueuelen • The NIC driver processes packets one by one, writing (DMA) to NIC transmit descriptors. In the case of Xen, the packet is written into the Xen shared IO ring (Xen split device driver model)
  • 77. Network Stack Tuning Packet Receive Path: • The device writes (DMA) the packet into kernel memory and raises an interrupt. In the case of Xen, the packet is written into the shared IO ring and a notification is sent via the event channel • The NIC driver interrupt handler copies the packet into the input queue (a per-CPU queue). The queue is maintained by the network stack and its size is set by netdev_max_backlog • Packets are processed on the same CPU that received the interrupt. If the RPS/RFS feature is enabled, network stack processing is distributed across multiple CPUs • The packet is eventually written to the socket buffer; the application wakes up and processes the payload
  • 78. TCP Congestion and Receiver Advertised Window TCP tuning requires understanding some critical parameters: – receiver window (rwnd), sender window (swnd), congestion window (cwnd): cwnd controls the number of packets a sender can send without needing an acknowledgment. The TCP cwnd starts with 10 segments (slow start) and increases exponentially until it reaches the receiver's advertised window size (rwnd). Thus the TCP cwnd will continue to grow if rwnd and swnd are set to large values. However, setting rwnd and swnd too large may result in packet loss due to congestion, which may cut cwnd to half of rwnd or back to the TCP slow-start value, resulting in slower throughput. The Proportional Rate Reduction (PRR) and Early Retransmit (ER) features (kernel 3.2) help recover from packet losses quickly by retransmitting early and pacing out retransmissions across received ACKs during TCP fast recovery – Bandwidth-delay product (BDP): rwnd and swnd should be set larger than the BDP, otherwise TCP throughput will be limited. Example: BDP = link bandwidth * RTT = 1000 Mbit/s * 0.001 sec / 8 = 125 KB – Socket buffer sizes (tcp_wmem, tcp_rmem, rmem_max, wmem_max): Limit the amount of data the application can send/receive to/from the network stack. To improve application throughput, the socket buffer should be set large enough to utilize the TCP window fully
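The BDP example can be checked with integer shell arithmetic (keeping the RTT in microseconds avoids floating point); the link speed and RTT mirror the slide's example:

```shell
# BDP = bandwidth * RTT, in bytes; 1 Gbit/s link, 1 ms RTT as on the slide
bandwidth_bps=1000000000      # bits per second
rtt_us=1000                   # round-trip time in microseconds
bdp_bytes=$(( bandwidth_bps / 8 * rtt_us / 1000000 ))
echo "BDP: ${bdp_bytes} bytes"   # 125000 bytes, ~122 KiB
```

Socket buffers (tcp_rmem/tcp_wmem maximums) should comfortably exceed this figure for the link to stay full.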
  • 79. Network Stack Tunables: Higher Throughput – tcp_slow_start_after_idle (default: 1, enabled). Set to 0 (disable): prevents the TCP slow-start value (10 segments) from being used as the new window for connections that have been idle for 3 seconds. Better throughput due to continued use of the receiver's advertised window instead of slow start – tcp_fin_timeout (default: 60 sec). Set to 10 sec: limits the time connections spend in the TCP TIME_WAIT state to avoid running out of available ports. Recommended for sites with a high socket churn rate where the server application initiates the connection close. TIME_WAIT timeout = 2 * tcp_fin_timeout – tcp_early_retrans (default: 0, disabled). Set to 1 (enable): allows fast retransmit to trigger after 2 duplicate ACKs (instead of 3 or more) for the same segment. Allows the connection to recover quickly from packet loss or network congestion. http://research.google.com/pubs/pub37486.html – netdev_max_backlog (default: 1000 packets). Set to 5000: packets received by the NIC driver are queued into a per-CPU input queue for network stack processing. Packets will be dropped if the input queue is full, causing TCP retransmits
  • 80. Network Stack Tunables: Higher Throughput – txqueuelen (default: 1000). Set to 5000: controls the amount of data that can be queued by the network stack for NIC driver processing. For latency-sensitive applications, consider reducing the value (less buffering) so that TCP congestion avoidance kicks in early in case of packet loss – rmem_max, wmem_max: 16777216 or higher. Maximum receive and send socket buffer sizes for all protocols; set the same as the tcp_wmem and tcp_rmem maximums. This sets the maximum TCP receive window size. The larger the receive buffer, the more data can be sent before requiring an acknowledgement. Caution: larger buffers may cause memory pressure – tcp_wmem, tcp_rmem: 8388608 1258291 16777216 or higher. Control socket receive and send buffer sizes. Triplet: Min: minimum socket buffer size during memory pressure (default: 4096); Default: default socket buffer size (receive: 87380, send: 16384); Max: maximum socket buffer size (auto-tuned)