Servers and Processes: Behavior and Analysis
The Next 90 Minutes

Introduction

Servers, a mental model

Getting hands on

Processes

Wrapping it up
Caveats
Tutorial aimed at people barely familiar
with Linux consoles

Little server knowledge is assumed

Many advanced things are glossed over

...but feel free to ask!

The slides will be available online
Your Presenter

Mark Smith <mark@dreamwidth.org>

Co-founded Dreamwidth Studios, but
works at Bump Technologies
(http://bu.mp/)

Spent time at Google, Mozilla, others

Sysadmin, MySQL DBA, engineer, ...
Servers
Servers
Machines that take input and make output

Made up of components: RAM, CPU, I/O

Each component has various capacities

Systems Administration: the
understanding, care, and feeding of all
these disparate components (among other
things)
Components

Capacity

Latency

Throughput

Full state

Thrash state
RAM
Capacity measured in bytes (GB usually)

Latency measured in nanoseconds

Throughput measured in bytes/second

Full state: can’t add more, but no real loss
of performance

Thrash state: not very relevant
Disk (Rotational)
Capacity measured in bytes (GB or TB)

Latency measured in milliseconds

Throughput measured in bytes/second

Full state: can’t add more, but otherwise
fine

Thrash state: server and process
starvation, performance drops drastically
Disk (SSD)
Capacity measured in bytes (GB or TB)

Latency measured in microseconds (roughly
60x faster than rotational disks)

Throughput measured in bytes/second

Full state: can’t add more, but otherwise
fine

Thrash state: obviated by lack of rotation
CPU
Capacity measured in clock cycles per
second, also known as hertz (MHz, GHz,
etc)

Throughput and latency of a CPU are very
advanced things most sysadmins don’t
need to worry about (e.g., optimizing for L1
cache and local RAM in NUMA systems)

Full/thrash state: system/process
starvation
Network
Capacity not relevant

Latency measured in milliseconds (usually)

Throughput measured in bits/second and
usually 1 Gbps (10 Gbps becoming
common)

Full state: dropped packets, behavior
depends on protocol (i.e., TCP or UDP)

Thrash state: not relevant
Timing Comparisons

1 second - tick, tock, tick, tock, ...

1,000 milliseconds (ms) per second

1,000,000 microseconds (µs) per second

1,000,000,000 nanoseconds (ns) per
second
Timing (Part 2)

One seek on a rotational disk is ~6ms

SSD seeks are about 100µs: 60x faster
than a rotational seek

RAM seeks are about 60ns: 1,666x faster
than an SSD seek (100,000x faster than a
rotational seek!)
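
The ratios on this slide follow directly from the three latency figures. A quick check in Python, using only the slide's approximate numbers (the constant and function names are made up for illustration):

```python
# Approximate seek latencies from the slide, all converted to nanoseconds.
ROTATIONAL_NS = 6_000_000   # ~6 ms
SSD_NS = 100_000            # ~100 µs
RAM_NS = 60                 # ~60 ns


def times_faster(slow_ns, fast_ns):
    """Return how many times faster the second latency is than the first."""
    return slow_ns / fast_ns


print(f"SSD vs rotational: {times_faster(ROTATIONAL_NS, SSD_NS):,.0f}x")
print(f"RAM vs SSD:        {times_faster(SSD_NS, RAM_NS):,.1f}x")
print(f"RAM vs rotational: {times_faster(ROTATIONAL_NS, RAM_NS):,.0f}x")
```

The RAM vs SSD ratio is not a whole number; the slide's 1,666x is that value truncated.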
Hands On Time!
SSH to the VM

Open your local terminal (PuTTY in
Windows, iTerm/Terminal/etc in Mac OS
X, whatever you like in Linux)

ssh -p 2222 demo@182.255.123.52

Password is “demo”

Please be nice :)
It’s dark in here.

Heartbeat the machine

uptime      How’s it doing?

free -m     How’s the RAM?

df -h       How’re the disks?
Load Average

It’s a seat-of-the-pants number

Rule of thumb: low is good, high might be
bad

You have to learn how your machines
work for this number to mean much
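
One way to make the number a little easier to compare between machines is to divide it by the core count. This is a common convention rather than something from the slides, so treat the sketch below as an assumption; `load_per_core` is a hypothetical helper name.

```python
import os


def load_per_core(load, cores):
    """Divide a load average by the number of cores (at least 1)."""
    return load / max(cores, 1)


if __name__ == "__main__":
    try:
        # The same three numbers that uptime prints: 1, 5 and 15 minute averages.
        one, five, fifteen = os.getloadavg()
    except (AttributeError, OSError):
        print("load average is not available on this platform")
    else:
        cores = os.cpu_count() or 1
        print(f"1-minute load {one:.2f} on {cores} cores "
              f"= {load_per_core(one, cores):.2f} per core")
```

A per-core value well above 1 suggests work is queueing, but what counts as "high" still depends on what is normal for your machines.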
Top of the World

Easy way to see what’s running and what
is consuming the most resources

top

Press “P” to sort by Processor usage

Press “M” to sort by Memory usage
Exhibit #1

Now I will do something on the machine

Run through your heartbeat steps again:
uptime, free -m, df -h, top

Remember to sort top by P and M

What has changed? What is going on?
Results #1

You probably noticed 1-cpu.pl

It’s pushing the CPU to 100%

Is it broken? Is this bad?

Know your software and systems (very
important to know what normal is)
Exhibit #2

Now I will do something else

Run through your heartbeat steps again:
uptime, free -m, df -h, top

Remember to sort top by P and M

What has changed? What is going on?
Results #2

Lots of memory is being consumed

It’s some 2-memory.pl command

Does the machine feel sluggish? Each
command takes a second to start and
stop?

What is going on here?
vmstat
The vmstat tool tells us useful things
about the state of the kernel and resource
usage

Try: vmstat -SM 1

Watch while I run the test again

Note the si/so and bi/bo columns

Now notice the CPU columns on the right
Swap
RAM is a finite resource

Not all RAM is used equally

Kernel tracks usage of pages

Kernel can write RAM to disk and free it up

This is called swapping: you store RAM on
disk. Remember the timing slide!
Swap (Part 2)
Swap is useful mostly on consumer
machines

In most server environments, swap is
death

Disks are hundreds to thousands of times
(or more!) slower than RAM

Generally, any active swapping is bad
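
"Active swapping" can be spotted by sampling the kernel's swap counters twice and looking at the difference. On Linux, /proc/vmstat exposes cumulative `pswpin` and `pswpout` page counts. The sketch below parses that format from text; the two sample strings are invented for illustration, not captured from the demo VM.

```python
def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before, after):
    """Return (pages swapped in, pages swapped out) between two samples."""
    return (
        after.get("pswpin", 0) - before.get("pswpin", 0),
        after.get("pswpout", 0) - before.get("pswpout", 0),
    )


# Hypothetical samples taken one second apart.
SAMPLE_T0 = "pswpin 100\npswpout 250\n"
SAMPLE_T1 = "pswpin 180\npswpout 900\n"

pages_in, pages_out = swap_activity(parse_vmstat(SAMPLE_T0), parse_vmstat(SAMPLE_T1))
print(f"swapped in {pages_in} pages, swapped out {pages_out} pages")
```

On a real machine, read /proc/vmstat twice with a short sleep in between. Non-zero deltas mean the machine is swapping right now.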
Exhibit #3

Try uptime, free -m, df -h, top again

Also, try: iostat -kx 1

Watch the %util column as this test runs

Also the bi/bo columns in vmstat

What is going on here?
Results #3

Disk usage is high

RAM is not full

CPU is not pegged

Machine responds well

Disk utilization at 100%
What does it mean?

Based on the various data you’ve
gathered, is the machine healthy and
happy with this program running on it?

Why or why not?

Discussion.
Solutions?
This program is using more of a resource
(RAM, CPU, or disk I/O) than the machine
has available

Program can be optimized to use less

Machine can be upgraded to have more

Simple problem, straightforward solutions

(Straightforward does not always mean
easy)
Programs
Programs

Software that runs on a machine

Has traits such as single- or
multi-threaded, compiled or interpreted, etc

Requires certain resources and inputs

Makes certain outputs
More Constraints

Programs have more constraints to
consider

Open files and sockets (file descriptors)

Permissions (depend on user/group)

CPU limits (depends on threads)
Exhibit #4

There’s a program running now, but
something is wrong with it

Use the usual tools (uptime, free -m,
df -h, top)

System looks OK...
File Limits

Programs have certain limits

Get the PID of the 4-files.pl program

ps aufx | grep 4-files

cat /proc/PID/limits
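
The line to look for in that output is "Max open files". A small sketch of pulling the soft and hard values out of the text; the sample below is a hypothetical excerpt in the usual /proc/PID/limits layout, and `max_open_files` is an illustrative name.

```python
# Hypothetical excerpt of /proc/PID/limits; the real values will differ.
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
"""


def max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files' as strings, or None if absent.

    Values stay as strings because either one can be the word 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


print(max_open_files(SAMPLE_LIMITS))
```

The soft limit is the one the program actually hits; the hard limit is the ceiling the soft limit can be raised to without extra privileges.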
lsof

See what files a program has open

lsof -np PID

Whoa, lots! At the limit? Count them:

lsof -np PID | wc -l
But... a problem?

But is this a problem? Well, it is if the
program is trying to open more files

How do we tell?

Software calls open, which is a system
call
System Calls

The kernel provides certain services

Almost all I/O goes through the kernel

Current time, fork, chdir (cd), exec, etc

Requires a small context switch

Can lead to “sys” CPU usage
strace

System calls made by a process can be
traced

Let’s look at 4-files again:

sudo strace -p PID

Look at the “open” line, is it OK?
Results #4

Clearly this program is broken

Several fixes... open fewer files, raise your
limits, etc

(We won’t cover the specifics of raising
limits, you can search Google if you need
it)
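
To see what hitting the limit looks like from inside a program, the sketch below lowers its own soft limit and opens files until the kernel refuses. It is not the 4-files.pl script from the exhibit. It assumes a Unix system and that the refusal arrives as EMFILE ("Too many open files"), which is the usual error for a per-process descriptor limit.

```python
import errno
import os
import resource


def open_until_failure(limit=64):
    """Lower the soft file limit, open files until open() fails.

    Returns (number of files opened, errno of the failure). The original
    limit is restored and every opened descriptor is closed afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = limit if hard == resource.RLIM_INFINITY else min(limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    opened = []
    try:
        while True:
            opened.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return len(opened), exc.errno
    finally:
        for fd in opened:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


count, err = open_until_failure()
print(f"opened {count} files, then: {errno.errorcode[err]} ({os.strerror(err)})")
```

The count comes out a little below the limit because descriptors that are already open (stdin, stdout, stderr and so on) use up part of the same budget.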
It’s all turtles.

Linux uses “files” and “filesystems” a lot

Sockets are just “files”, they use the same
file descriptor number space

Result: “Max open files” includes sockets

They show up in lsof, too!
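
A short illustration of the shared number space: open an ordinary file and a TCP socket in the same process and compare their descriptor numbers. `descriptor_numbers` is a hypothetical helper written for this sketch.

```python
import os
import socket


def descriptor_numbers():
    """Open one file and one TCP socket; return both descriptor numbers."""
    file_fd = os.open(os.devnull, os.O_RDONLY)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        return file_fd, sock.fileno()
    finally:
        sock.close()
        os.close(file_fd)


file_fd, sock_fd = descriptor_numbers()
print(f"file descriptor: {file_fd}, socket descriptor: {sock_fd}")
```

Both values are small integers drawn from the same per-process table, which is why a busy network server can run out of "files" without touching the disk.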
Exhibit #5

Let me give us a new program

Get the PID, remember how?

ps aufx | grep 5-network

Look at the files: lsof -np PID

Note the “TCP” file!
Test the Server

telnet 182.255.123.52 7000

(This server is slow, it might take a bit)

A very simple timeserver

Now: sudo strace -p PID
The Trace
accept(3, {sa_family=AF_INET, sin_port=htons(39474),
           sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
lseek(4, 0, SEEK_CUR)                    = -1 ESPIPE (Illegal seek)
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
lseek(4, 0, SEEK_CUR)                    = -1 ESPIPE (Illegal seek)
fcntl(4, F_SETFD, FD_CLOEXEC)            = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "The time is: Wed Feb   6 13:34:22"..., 38) = 38
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7fff73f28880)        = 0
write(4, "Thank you for visiting!\n", 24) = 24
close(4)                                 = 0
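
The trace above maps onto a very small program: accept a connection, write the time, sleep for a second, write a goodbye, close. The following is a Python sketch of that sequence, not the 5-network.pl script used in the talk. The function names, the loopback address and the ephemeral port are choices made for this example.

```python
import socket
import threading
import time


def serve_once(listener, delay=1.0):
    """Handle a single client the way the trace shows."""
    conn, _addr = listener.accept()                    # accept(3, ...) = 4
    with conn:                                         # close(4) on exit
        message = f"The time is: {time.ctime()}\n"
        conn.sendall(message.encode())                 # first write(4, ...)
        time.sleep(delay)                              # nanosleep({1, 0}, ...)
        conn.sendall(b"Thank you for visiting!\n")     # second write(4, ...)


def demo(delay=0.1):
    """Start the server on a free local port, connect, return what it sent."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))                    # port 0: kernel picks one
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(listener, delay), daemon=True)
    worker.start()
    chunks = []
    with socket.create_connection(("127.0.0.1", port)) as client:
        while True:
            data = client.recv(4096)
            if not data:                               # server closed the socket
                break
            chunks.append(data)
    worker.join()
    listener.close()
    return b"".join(chunks).decode()


print(demo(), end="")
```

The one-second sleep in the trace is the `delay` parameter here, and it is the reason the server feels slow when you telnet to it.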
Results #5

Tracing shows you data, too

Can be very valuable for finding moving
parts that aren’t moving well

Combined with the other tools you can
really see what is going on in your system
Kernel
Invisible Glue

Kernel issues are fairly rare, but usually
frustrating if they show up

Usually the result of some sort of limit hit

Tons of caches, buckets, and limits

Be suspicious of “powers of two” numbers
Common Checks


Try: sudo dmesg

Kernel message log shows many problems

Look for suspicious messages
“Suspicious”
Out of memory: Kill process
19393 (2-memory.pl) score 90 or
sacrifice child

nf_conntrack: Table full,
dropping packet

ata7.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x6
frozen
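
Scanning for messages like these is easy to script. The sketch below searches log text for the three patterns from this slide. The sample log joins the slide's wrapped lines back together and adds one invented harmless line for contrast; the pattern list is a starting point, not a complete catalogue.

```python
import re

# Patterns taken from the three examples on the slide.
SUSPICIOUS_PATTERNS = [
    re.compile(r"Out of memory: Kill process"),
    re.compile(r"nf_conntrack: Table full"),
    re.compile(r"ata\d+\.\d+: exception"),
]


def suspicious_lines(log_text):
    """Return the lines of log_text that match any suspicious pattern."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS)
    ]


# The first line is a made-up harmless message; the rest come from the slide.
SAMPLE_LOG = """\
eth0: link up
Out of memory: Kill process 19393 (2-memory.pl) score 90 or sacrifice child
nf_conntrack: Table full, dropping packet
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
"""

for line in suspicious_lines(SAMPLE_LOG):
    print(line)
```

To use it on a live machine, feed it the saved output of `sudo dmesg`.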
More Places to Look

The /var/log directory holds a lot of data

When something is wrong, look for
recently updated files: ls -lart
(newest files are listed last)

Loud logs are often unhappy logs

Hardware failure is often noted in one of
the log files
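
The "recently updated files" check can also be done from a script. A minimal sketch: `recently_modified` is a hypothetical helper, and the demo builds its own temporary directory with invented file names instead of reading /var/log.

```python
import os
import tempfile


def recently_modified(directory, count=5):
    """Return the names of the most recently modified files, newest first."""
    with os.scandir(directory) as entries:
        files = [entry for entry in entries if entry.is_file()]
    files.sort(key=lambda entry: entry.stat().st_mtime, reverse=True)
    return [entry.name for entry in files[:count]]


def demo():
    """Create three files with known modification times and rank them."""
    with tempfile.TemporaryDirectory() as directory:
        for age, name in enumerate(["old.log", "mid.log", "new.log"]):
            path = os.path.join(directory, name)
            open(path, "w").close()
            os.utime(path, (1000 + age, 1000 + age))   # set atime and mtime
        return recently_modified(directory, count=2)


print(demo())
```

Unlike `ls -lart`, which puts the newest file at the bottom, this returns the newest file first. Pointing it at /var/log may need root for some entries.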
Summary
Process

Check the components: CPU, RAM, disks

Find what limits are being hit and by what

If the system is fine, it’s probably software

Trace the program, check the logs

Analyze well before you fix
Familiarity!

Systems administration done only as an
afterthought will be painful and hard

Be familiar with your servers and your
software

Keep a shell open, watch top throughout
the day, watch the disks, etc
Next Steps
Certain tools make life easier

Nagios for monitoring (e.g., alert you when
CPU exceeds 90%)

Cacti/Ganglia/OpenTSDB for trending

Fabric for multiple machine operations

Puppet/Chef for configuration
management
Thanks!
 Questions?

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Servers and Processes: Behavior and Analysis

  • 1. Servers and Processes Behavior and Analysis
  • 2. The Next 90 Minutes Introduction Servers, a mental model Getting hands on Processes Wrapping it up
  • 3. Caveats Tutorial aimed at people barely familiar with Linux consoles Little server knowledge is assumed Many advanced things are glossed over ...but feel free to ask! The slides will be available online
  • 4. Your Presenter Mark Smith <mark@dreamwidth.org> Co-founded Dreamwidth Studios, but works at Bump Technologies (http://bu.mp/) Spent time at Google, Mozilla, others Sysadmin, MySQL DBA, engineer, ...
  • 6. Servers Machines that take input and make output Made up of components: RAM, CPU, I/O Each component has various capacities Systems Administration: the understanding, care, and feeding of all these disparate components (among other things)
  • 8. RAM Capacity measured in bytes (GB usually) Latency measured in nanoseconds Throughput measured in bytes/second Full state: can’t add more, but no real loss of performance Thrash state: not very relevant
  • 9. Disk (Rotational) Capacity measured in bytes (GB or TB) Latency measured in milliseconds Throughput measured in bytes/second Full state: can’t add more, but otherwise fine Thrash state: server and process starvation, performance drops drastically
  • 10. Disk (SSD) Capacity measured in bytes (GB or TB) Latency measured in microseconds (roughly 60x to 100x faster than rotational disks) Throughput measured in bytes/second Full state: can’t add more, but otherwise fine Thrash state: obviated by lack of rotation
  • 11. CPU Capacity measured in clock cycles per second, also known as hertz (MHz, GHz, etc) Throughput and latency of a CPU are very advanced things most sysadmins don’t need to worry about (e.g., optimizing for L1 cache and local RAM in NUMA systems) Full/thrash state: system/process starvation
  • 12. Network Capacity not relevant Latency measured in milliseconds (usually) Throughput measured in bits/second and usually 1 Gbps (10 Gbps becoming common) Full state: dropped packets, behavior depends on protocol (e.g., TCP or UDP) Thrash state: not relevant
  • 13. Timing Comparisons 1 second - tick, tock, tick, tock, ... 1,000 milliseconds (ms) per second 1,000,000 microseconds (µs) per second 1,000,000,000 nanoseconds (ns) per second
  • 14. Timing (Part 2) One seek on a rotational disk is ~6ms SSD seeks are about 100µs: 60x faster than a rotational seek RAM seeks are about 60ns: 1,666x faster than an SSD seek (100,000x faster than a rotational seek!)
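
The ratios on the timing slide can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the three latencies quoted above; the constant and function names are made up for illustration.

```python
# Latencies quoted on the timing slide, converted to nanoseconds.
ROTATIONAL_SEEK_NS = 6_000_000  # ~6 ms
SSD_SEEK_NS = 100_000           # ~100 µs
RAM_SEEK_NS = 60                # ~60 ns


def speedup(slower_ns, faster_ns):
    """Return how many times shorter the second latency is than the first."""
    return slower_ns / faster_ns


print(f"SSD vs rotational: {speedup(ROTATIONAL_SEEK_NS, SSD_SEEK_NS):,.1f}x")
print(f"RAM vs SSD:        {speedup(SSD_SEEK_NS, RAM_SEEK_NS):,.1f}x")
print(f"RAM vs rotational: {speedup(ROTATIONAL_SEEK_NS, RAM_SEEK_NS):,.1f}x")
```

The RAM-versus-SSD ratio comes out slightly above 1,666; the slide truncates it.
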
  • 16. SSH to the VM Open your local terminal (PuTTY in Windows, iTerm/Terminal/etc in Mac OS X, whatever you like in Linux) ssh -p 2222 demo@182.255.123.52 Password is “demo” Please be nice :)
  • 17. It’s dark in here. Heartbeat the machine: uptime (how’s it doing?), free -m (how’s the RAM?), df -h (how’re the disks?)
  • 18. Load Average It’s a seat-of-the-pants number Rule of thumb: low is good, high might be bad You have to learn how your machines work for this number to mean much
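
On Linux, the numbers uptime prints are also available in /proc/loadavg. The sketch below parses that format and divides by a CPU count. Scaling by core count is a common rule of thumb rather than something the slides prescribe, and the sample line and function names are invented for illustration.

```python
def parse_loadavg(text):
    """Parse /proc/loadavg contents into (1min, 5min, 15min) floats."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def load_per_cpu(load, cpu_count):
    """Scale a load average by the number of CPUs (a rough heuristic only)."""
    return load / max(cpu_count, 1)


# Sample line in the /proc/loadavg format; the values are made up.
SAMPLE = "2.50 1.20 0.75 3/412 20391\n"

one, five, fifteen = parse_loadavg(SAMPLE)
print(f"1-minute load {one}; per CPU on a hypothetical 4-core box: {load_per_cpu(one, 4)}")
```

On a live machine, the text could come from open("/proc/loadavg").read() and the core count from os.cpu_count(). Whether a given value is "high" still depends on what is normal for that machine.
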
  • 19. Top of the World Easy way to see what’s running and what is consuming the most resources top Press “P” to sort by Processor usage Press “M” to sort by Memory usage
  • 20. Exhibit #1 Now I will do something on the machine Run through your heartbeat steps again: uptime, free -m, df -h, top Remember to sort top by P and M What has changed? What is going on?
  • 21. Results #1 You probably noticed 1-cpu.pl It’s pushing the CPU to 100% Is it broken? Is this bad? Know your software and systems (very important to know what normal is)
  • 22. Exhibit #2 Now I will do something else Run through your heartbeat steps again: uptime, free -m, df -h, top Remember to sort top by P and M What has changed? What is going on?
  • 23. Results #2 Lots of memory is being consumed It’s some 2-memory.pl command Does the machine feel sluggish? Each command takes a second to start and stop? What is going on here?
  • 24. vmstat The vmstat tool tells us useful things about the state of the kernel and resource usage Try: vmstat -SM 1 Watch while I run the test again Note the si/so and bi/bo columns Now notice the CPU columns on the right
  • 25. Swap RAM is a finite resource Not all RAM is used equally Kernel tracks usage of pages Kernel can write RAM to disk and free it up This is called swapping: you store RAM on disk. Remember the timing slide!
  • 26. Swap (Part 2) Swap is useful mostly on consumer machines In most server environments, swap is death Disks are hundreds to thousands of times (or more!) slower than RAM Generally, any active swapping is bad
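
The si/so columns in vmstat are derived from kernel counters that Linux also exposes in /proc/vmstat as pswpin and pswpout (cumulative pages swapped in and out). As a hedged sketch, "is this machine actively swapping?" can be answered by comparing two snapshots; the snapshot text below is fabricated sample data and the function names are hypothetical.

```python
def swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def is_swapping(before_text, after_text):
    """Return True if either swap counter grew between the two snapshots."""
    before = swap_counters(before_text)
    after = swap_counters(after_text)
    return any(after.get(key, 0) > before.get(key, 0) for key in ("pswpin", "pswpout"))


# Fabricated snapshots in the /proc/vmstat "name value" format.
BEFORE = "nr_free_pages 1000\npswpin 120\npswpout 340\n"
AFTER_QUIET = "nr_free_pages 900\npswpin 120\npswpout 340\n"
AFTER_BUSY = "nr_free_pages 50\npswpin 180\npswpout 910\n"

print("quiet interval swapping?", is_swapping(BEFORE, AFTER_QUIET))
print("busy interval swapping? ", is_swapping(BEFORE, AFTER_BUSY))
```
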
  • 27. Exhibit #3 Try uptime, free -m, df -h, top again Also, try: iostat -kx 1 Watch the %util column as this test runs Also the bi/bo columns in vmstat What is going on here?
  • 28. Results #3 Disk usage is high RAM is not full CPU is not pegged Machine responds well Disk utilization at 100%
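
The %util figure can be approximated from /proc/diskstats, where the tenth statistic after the device name is the number of milliseconds the device has spent doing I/O. The sketch below assumes that field layout and uses fabricated sample lines; it illustrates the idea and is not a description of how iostat itself is implemented.

```python
def io_ticks_ms(diskstats_text, device):
    """Return the 'milliseconds spent doing I/O' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, then the per-device statistics.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def utilization_percent(before_text, after_text, device, elapsed_ms):
    """Share of the interval during which the device was busy, capped at 100."""
    busy_ms = io_ticks_ms(after_text, device) - io_ticks_ms(before_text, device)
    return min(100.0, 100.0 * busy_ms / elapsed_ms)


# Fabricated /proc/diskstats lines for a device named sda, taken 1000 ms apart.
BEFORE = "   8       0 sda 1000 10 80000 500 2000 20 160000 900 0 1200 1400\n"
AFTER_HALF = "   8       0 sda 1500 10 90000 700 2600 20 170000 1100 0 1700 1900\n"
AFTER_FULL = "   8       0 sda 3000 10 99000 900 4000 20 190000 1500 1 2200 2600\n"

print("half-busy interval:", utilization_percent(BEFORE, AFTER_HALF, "sda", 1000))
print("saturated interval:", utilization_percent(BEFORE, AFTER_FULL, "sda", 1000))
```
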
  • 29. What does it mean? Based on the various data you’ve gathered, is the machine healthy and happy with this program running on it? Why or why not? Discussion.
  • 30. Solutions? This program is using more RAM or CPU than the machine has available Program can be optimized to use less Machine can be upgraded to have more Simple problem, straightforward solutions (Straightforward does not always mean easy)
  • 32. Programs Software that runs on a machine Has traits such as single- or multi- threaded, compiled or interpreted, etc Requires certain resources and inputs Makes certain outputs
  • 33. More Constraints Programs have more constraints to consider Open files and sockets (file descriptors) Permissions (depend on user/group) CPU limits (depends on threads)
  • 34. Exhibit #4 There’s a program running now, but something is wrong with it Use the usual tools (uptime, free -m, df -h, top) System looks OK...
  • 35. File Limits Programs have certain limits Get the PID of the 4-files.pl program: ps aufx | grep 4-files Then read its limits: cat /proc/PID/limits
  • 36. lsof See what files a program has open: lsof -np PID Whoa, lots! At the limit? Count them: lsof -np PID | wc -l
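
The comparison between open files and the limit can also be made from /proc directly. Note that the lsof count includes a header line plus entries that are not descriptors (such as cwd, txt and mem rows), so it runs a little higher than the true descriptor count. The sketch below is Linux-only, the sample limits text is illustrative, and the function names are hypothetical.

```python
import os


def soft_open_file_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/PID/limits text.

    Returns None if the limit is reported as 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns: soft limit, hard limit, units.
            soft = line[len(label):].split()[0]
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def open_fd_count(pid="self"):
    """Count open descriptors by listing /proc/PID/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


# Illustrative excerpt in the /proc/PID/limits layout.
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max cpu time              unlimited            unlimited            seconds\n"
    "Max open files            1024                 4096                 files\n"
)

print("soft limit:", soft_open_file_limit(SAMPLE))
```

For a real process, soft_open_file_limit(open(f"/proc/{pid}/limits").read()) and open_fd_count(pid) could be compared; reading another user's process usually needs root.
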
  • 37. But... a problem? But is this a problem? Well, it is if the program is trying to open more files How do we tell? Software calls open, which is a system call
  • 38. System Calls The kernel provides certain services Almost all I/O goes through the kernel Current time, fork, chdir (what cd calls), exec, etc. Requires a small context switch Can lead to “sys” CPU usage
  • 39. strace System calls made by a process can be traced Let’s look at 4-files again: sudo strace -p PID Look at the “open” line, is it OK?
  • 40. Results #4 Clearly this program is broken Several fixes... open fewer files, raise your limits, etc (We won’t cover the specifics of raising limits, you can search Google if you need it)
  • 41. It’s all turtles. Linux uses “files” and “filesystems” a lot Sockets are just “files”, they use the same file descriptor number space Result: “Max open files” includes sockets They show up in lsof, too!
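
A quick way to see the shared descriptor space is to open one regular file and one socket in the same process and look at the numbers the kernel hands back. This is a small sketch for Unix-like systems; the function name is hypothetical and the exact descriptor numbers will vary from run to run.

```python
import os
import socket
import stat
import tempfile


def describe_descriptors():
    """Open a temporary file and a TCP socket, then report on both descriptors."""
    with tempfile.TemporaryFile() as f, socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        file_fd, sock_fd = f.fileno(), s.fileno()
        return {
            "file_fd": file_fd,
            "socket_fd": sock_fd,
            # fstat works on both because both are ordinary descriptors.
            "file_is_regular": stat.S_ISREG(os.fstat(file_fd).st_mode),
            "socket_is_socket": stat.S_ISSOCK(os.fstat(sock_fd).st_mode),
        }


info = describe_descriptors()
print(info)
```

Both values are small integers drawn from the same per-process table, which is why every open socket counts against “Max open files”.
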
  • 42. Exhibit #5 Let me give us a new program Get the PID, remember how? ps aufx | grep 5-network Look at the files: lsof -np PID Note the “TCP” file!
  • 43. Test the Server telnet 182.255.123.52 7000 (This server is slow, it might take a bit) A very simple timeserver Now: strace -p PID
  • 44. The Trace
      accept(3, {sa_family=AF_INET, sin_port=htons(39474), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
      ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
      lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
      ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
      lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
      fcntl(4, F_SETFD, FD_CLOEXEC) = 0
      stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
      write(4, "The time is: Wed Feb 6 13:34:22"..., 38) = 38
      rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
      rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
      rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
      nanosleep({1, 0}, 0x7fff73f28880) = 0
      write(4, "Thank you for visiting!\n", 24) = 24
      close(4) = 0
  • 53. Results #5 Tracing shows you data, too Can be very valuable for finding moving parts that aren’t moving well Combined with the other tools you can really see what is going on in your system
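
When a trace gets long, it can help to tally which system calls returned an error. The sketch below assumes the simple "name(args) = value ERRNO" line shape seen in the trace above and ignores other strace output forms (unfinished calls, signals, and so on). The regular expression and function name are illustrative, and the sample reuses a few lines from the trace.

```python
import re
from collections import Counter

# Matches lines shaped like: name(args) = return_value [ERRNO ...]
STRACE_LINE = re.compile(
    r"^(?P<name>\w+)\(.*\)\s+=\s+(?P<ret>-?\d+)(?:\s+(?P<errno>E[A-Z0-9]+))?"
)


def failed_calls(trace_text):
    """Count (syscall, errno) pairs for calls that returned a negative value."""
    failures = Counter()
    for line in trace_text.splitlines():
        match = STRACE_LINE.match(line.strip())
        if match and int(match.group("ret")) < 0:
            failures[(match.group("name"), match.group("errno"))] += 1
    return failures


SAMPLE = """\
accept(3, {sa_family=AF_INET, sin_port=htons(39474), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY
lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
write(4, "Thank you for visiting!\\n", 24) = 24
close(4) = 0
"""

for (name, errno), count in failed_calls(SAMPLE).items():
    print(f"{name} failed with {errno} x{count}")
```

A failed call is not automatically a bug. The ENOTTY and ESPIPE results here are most likely the language runtime probing what kind of descriptor it was given, so the tally is a starting point for questions rather than a verdict.
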
  • 55. Invisible Glue Kernel issues are fairly rare, but usually frustrating if they show up Usually the result of some sort of limit hit Tons of caches, buckets, and limits Be suspicious of “powers of two” numbers
  • 56. Common Checks Try: sudo dmesg Kernel message log shows many problems Look for suspicious messages
  • 57. “Suspicious” Out of memory: Kill process 19393 (2-memory.pl) score 90 or sacrifice child nf_conntrack: Table full, dropping packet ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
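
Scanning the kernel log for messages like these is easy to automate. The sketch below flags the three kinds of line shown on the slide. The pattern list is a hypothetical starting point and far from complete, and kernel message wording differs between versions, so treat the expressions as assumptions to adapt.

```python
import re

# Illustrative patterns based on the three messages on the slide.
SUSPICIOUS_PATTERNS = {
    "oom_kill": re.compile(r"Out of memory: Kill(ed)? process"),
    "conntrack_full": re.compile(r"nf_conntrack: [Tt]able full"),
    "disk_error": re.compile(r"ata\d+(\.\d+)?: exception"),
}


def flag_suspicious(log_text):
    """Return (label, line) pairs for log lines matching a known pattern."""
    hits = []
    for line in log_text.splitlines():
        for label, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
    return hits


# The first three lines come from the slide; the last is a made-up benign line.
SAMPLE = """\
Out of memory: Kill process 19393 (2-memory.pl) score 90 or sacrifice child
nf_conntrack: Table full, dropping packet
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
eth0: link up
"""

for label, line in flag_suspicious(SAMPLE):
    print(f"[{label}] {line}")
```
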
  • 58. More Places to Look The /var/log directory has much data Generally in a problem state, look for recently updated files: ls -lart Loud logs are often unhappy logs Hardware failure is often noted in one of the log files
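
The "look for recently updated files" step can be sketched as a short function that sorts a directory by modification time. Unlike ls -lart, which prints the newest entries last, this hypothetical helper returns the newest names first and skips subdirectories.

```python
import os


def recently_modified(directory, limit=5):
    """Return up to `limit` file names in `directory`, newest first."""
    entries = []
    with os.scandir(directory) as listing:
        for entry in listing:
            if entry.is_file(follow_symlinks=False):
                mtime = entry.stat(follow_symlinks=False).st_mtime
                entries.append((mtime, entry.name))
    entries.sort(reverse=True)
    return [name for _, name in entries[:limit]]
```

Calling recently_modified("/var/log") on a troubled machine gives a short list of logs worth opening first.
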
  • 60. Process Check the components: CPU, RAM, disks Find what limits are being hit and by what If the system is fine, it’s probably software Trace the program, check the logs Analyze well before you fix
  • 61. Familiarity! Systems administration done only as an afterthought will be painful and hard Be familiar with your servers and your software Keep a shell open, watch top throughout the day, watch the disks, etc
  • 62. Next Steps Certain tools make life easier Nagios for monitoring (e.g., alert you when CPU exceeds 90%) Cacti/Ganglia/OpenTSDB for trending Fabric for multiple machine operations Puppet/Chef for configuration management