SlideShare a Scribd company logo
1 of 135
Download to read offline
Performance
Visualizations
Brendan Gregg
Software Engineer
brendan.gregg@joyent.com

USENIX/LISA’10
November, 2010
• G’Day, I’m Brendan
• ... also known as “shouting guy”

2
硬

也會鬧情緒

•

3
• I do performance analysis
• and I’m a DTrace addict

4
•

5
Agenda
•

Performance

•
•

•

Workload Analysis and Resource Monitoring
Understanding available and ideal metrics before plotting

Visualizations

•

Current examples

•
•

•

Latency
Utilization

Future opportunities

•

6

Cloud Computing
Visualizations like these
•

The “rainbow pterodactyl”

Latency

Time

•

7

... which needs quite a bit of explanation
Primary Objectives
•

•

See the value of visualizations

•

8

Consider performance metrics before plotting

Remember key examples
Secondary Objectives
•

Consider performance metrics before plotting

•
•

•

Why studying latency is good
... and studying IOPS can be bad

See the value of visualizations

•
•

•

Why heat maps are needed
... and line graphs can be bad

Remember key examples

•
•

9

I/O latency, as a heat map
CPU utilization by CPU, as a heat map
Content based on
•
•

10

“Visualizing System Latency”, Communications of the ACM
July 2010, by Brendan Gregg
and more
Performance
Understanding the metrics before we
visualize them

11
Performance Activities
•

Workload analysis

•
•

Where is the issue?

•

•

Is there an issue? Is an issue real?

Will the proposed fix work? Did it work?

Resource monitoring

•
•

12

How utilized are the environment components?
Important activity for capacity planning
Workload Analysis
•

Applied during:

•
•

proof of concept testing

•

regression testing

•

benchmarking

•

13

software and hardware development

monitoring
Workload Performance Issues
•

•

14

Load

Architecture
Workload Performance Issues
•

Load

•
•

Too much for the system?

•

•

Workload applied

Poorly constructed?

Architecture

•
•

15

System configuration
Software and hardware bugs
Workload Analysis Steps
•

Identify or confirm if a workload has a performance issue

•

•

Locate issue

•

•

Quantify

Determine, apply and verify solution

•

16

Quantify

Quantify
Quantify

•

17

Finding a performance issue isn’t the problem ... it’s finding
the issue that matters
bugs.mysql.com “performance”

•

18
bugs.opensolaris.org “performance”

•

19
bugs.mozilla.org: “performance”

•

20
“performance” bugs

•
•

21

... and those are just the known performance bugs
... and usually only of a certain type (architecture)
How to Quantify
•

Observation based

•
•

•

Choose a reliable metric
Estimate performance gain from resolving issue

Experimentation based

•
•

22

Apply fix
Quantify before vs. after using a reliable metric
Observation based
•

For example:

•
•

Suggestion: replace disks with flash-memory based SSDs, with
an expected latency of ~100 us

•

23

Observed: 9 ms of which is disk I/O

•

•

Observed: application I/O takes 10 ms

Estimated gain: 10 ms -> 1.1 ms (10 ms - 9 ms + 0.1 ms)
=~ 9x gain

Very useful - but not possible without accurate quantification
Experimentation based
•

For example:

•
•

Observed: Application transaction latency average 2 ms

•

24

Experiment: Added more DRAM to increase cache hits and
reduce average latency

•

•

Observed: Application transaction latency average 10 ms

Gain: 10 ms -> 2 ms = 5x

Also very useful - but risky without accurate quantification
Metrics to Quantify Performance
•

Choose reliable metrics to quantify performance:

•
•

transactions/second

•

throughput

•

utilization

•

•

IOPS

latency

Ideally

•
•

25

interpretation is straightforward
reliable
Metrics to Quantify Performance
•

Choose reliable metrics to quantify performance:

•
•

transactions/second

•

throughput

•

utilization

•

•

IOPS

latency

Ideally

•
•

26

interpretation is straightforward
reliable

generally better suited for:

Capacity Planning

Workload Analysis
Metrics Availability
•

Ideally (given the luxury of time):

•
•

then see if they exist, or,

•

•

design the desired metrics

implement them (eg, DTrace)

Non-ideally

•
•

27

see what already exists
make-do (eg, vmstat -> gnuplot)
Assumptions to avoid
•

Available metrics are implemented correctly

•

all software has bugs
•

•

•

sometimes added by the programmer to only debug their code

Available metrics are complete

•

28

trust no metric without double checking from other sources

Available metrics are designed by performance experts

•

•

eg, CR: 6687884 nxge rbytes and obytes kstat are wrong

you won’t always find what you really need
Getting technical
•

This will be explained using two examples:

•
•

29

Workload Analysis
Capacity Planning
Example: Workload Analysis
•

Quantifying performance issues with IOPS vs latency

•
•

30

IOPS is commonly presented by performance analysis tools
eg: disk IOPS via kstat, SNMP, iostat, ...
IOPS
•

Depends on where the I/O is measured

•

•

app -> library -> syscall -> VFS -> filesystem -> RAID -> device

Depends on what the I/O is

•
•

random or sequential

•

•

synchronous or asynchronous

size

Interpretation difficult

•
•

31

what value is good or bad?
is there a max?
Some disk IOPS problems
•

IOPS Inflation

•
•

Filesystem metadata

•

•

Library or Filesystem prefetch/read-ahead

RAID stripes

IOPS Deflation

•
•

32

Write cancellation

•

•

Read caching

Filesystem I/O aggregation

IOPS aren’t created equal
IOPS example: iostat -xnz 1
•

Consider this disk: 86 IOPS == 99% busy
r/s
86.6

•

%w
0

%b device
99 c1d0

Versus this disk: 21,284 IOPS == 99% busy
r/s
21284.7

33

w/s
0.0

extended device statistics
kr/s
kw/s wait actv wsvc_t asvc_t
655.5
0.0 0.0 1.0
0.0
11.5

extended device statistics
w/s
kr/s
kw/s wait actv wsvc_t asvc_t %w %b device
0.0 10642.4
0.0 0.0 1.8
0.0
0.1
2 99 c1d0
IOPS example: iostat -xnz 1
•

Consider this disk: 86 IOPS == 99% busy
r/s
86.6

•

%b device
99 c1d0

extended device statistics
w/s
kr/s
kw/s wait actv wsvc_t asvc_t %w %b device
0.0 10642.4
0.0 0.0 1.8
0.0
0.1
2 99 c1d0

... they are the same disk, different I/O types

•

1) 8 Kbyte random

•
34

%w
0

Versus this disk: 21,284 IOPS == 99% busy
r/s
21284.7

•

w/s
0.0

extended device statistics
kr/s
kw/s wait actv wsvc_t asvc_t
655.5
0.0 0.0 1.0
0.0
11.5

2) 512 byte sequential (on-disk DRAM cache)
Using IOPS to quantify issues
•

to identify

•

•

to locate

•

•

90% of IOPS are random. Is that the problem?

to verify

•

35

is 100 IOPS an problem? Per disk?

A filesystem tunable caused IOPS to reduce.
Has this fixed the issue?
Using IOPS to quantify issues
•

to identify

•

•

to locate

•

•

36

90% of IOPS are random. Is that the problem? (depends...)

to verify

•
•

is 100 IOPS an problem? Per disk? (depends...)

A filesystem tunable caused IOPS to reduce.
Has this fixed the issue? (probably, assuming...)

We can introduce more metrics to understand these, but standalone
IOPS is tricky to interpret
Using latency to quantify issues
•

to identify

•

•

to locate

•

•

90ms of the 100ms is lock contention. Is that the problem?

to verify

•

37

is a 100ms I/O a problem?

A filesystem tunable caused the I/O latency to reduce to 1ms.
Has this fixed the issue?
Using latency to quantify issues
•

to identify

•

•

to locate

•

•

38

90ms of the 100ms is lock contention. Is that the problem? (yes)

to verify

•

•

is a 100ms I/O a problem? (probably - if synchronous to the app.)

A filesystem tunable caused the I/O latency to reduce to 1ms.
Has this fixed the issue? (probably - if 1ms is acceptable)

Latency is much more reliable, easier to interpret
Latency
•

Time from I/O or transaction request to completion

•

Synchronous latency has a direct impact on performance

•
•

•

Application is waiting
higher latency == worse performance

Not all latency is synchronous:

•
•

Filesystem prefetch
no one is waiting at this point

•

39

Asynchronous filesystem threads flushing dirty buffers to disk
eg, zfs TXG synchronous thread

TCP buffer and congestion window: individual packet latency may
be high, but pipe is kept full for good throughput performance
Turning other metrics into latency
•

Currency converter (* -> ms):

•
•

CPU utilization == code path execution latency

•

CPU saturation == dispatcher queue latency

•

40

disk saturation == I/O wait queue latency

•

•

random disk IOPS == I/O service latency

...

Quantifying as latency allows different components to be
compared, ratios examined, improvements estimated.
Example: Resource Monitoring
•

Different performance activity

•
•

Information is used for capacity planning

•

41

incl. CPUs, disks, network interfaces, memory, I/O bus, memory
bus, CPU interconnect, I/O cards, network switches, etc.

•

•

Focus is environment components, not specific issues

Identifying future issues before they happen

Quantifying resource monitoring with IOPS vs utilization
IOPS vs Utilization
•

Another look at this disk:
r/s
86.6
[...]
r/s
21284.7

•

42

w/s
0.0

extended device statistics
kr/s
kw/s wait actv wsvc_t asvc_t
655.5
0.0 0.0 1.0
0.0
11.5

%w
0

%b device
99 c1d0

extended device statistics
w/s
kr/s
kw/s wait actv wsvc_t asvc_t %w %b device
0.0 10642.4
0.0 0.0 1.8
0.0
0.1
2 99 c1d0

Q. does this system need more spindles for IOPS capacity?
IOPS vs Utilization
•

Another look at this disk:
r/s
86.6
[...]
r/s
21284.7

•

w/s
0.0

extended device statistics
kr/s
kw/s wait actv wsvc_t asvc_t
655.5
0.0 0.0 1.0
0.0
11.5

%b device
99 c1d0

extended device statistics
w/s
kr/s
kw/s wait actv wsvc_t asvc_t %w %b device
0.0 10642.4
0.0 0.0 1.8
0.0
0.1
2 99 c1d0

Q. does this system need more spindles for IOPS capacity?

•

IOPS (r/s + w/s): ???

•

43

%w
0

Utilization (%b): yes (even considering NCQ)
IOPS vs Utilization
•

Another look at this disk:
r/s
86.6
[...]
r/s
21284.7

•

w/s
0.0

extended device statistics
kr/s
kw/s wait actv wsvc_t asvc_t
655.5
0.0 0.0 1.0
0.0
11.5

Q. does this system need more spindles for IOPS capacity?
IOPS (r/s + w/s): ???

•

Utilization (%b): yes (even considering NCQ)

•

44

%b device
99 c1d0

extended device statistics
w/s
kr/s
kw/s wait actv wsvc_t asvc_t %w %b device
0.0 10642.4
0.0 0.0 1.8
0.0
0.1
2 99 c1d0

•

•

%w
0

Latency (wsvc_t): no

Latency will identify the issue once it is an issue; utilization
will forecast the issue - capacity planning
Performance Summary
•

Metrics matter - need to reliably quantify performance

•
•

•

to identify, locate, verify
try to think, design

Workload Analysis

•

•

Resource Monitoring

•

•

45

latency

utilization

Other metrics are useful to further understand the nature of
the workload and resource behavior
Objectives
•

Consider performance metrics before plotting

•
•

•

Why latency is good
... and IOPS can be bad

See the value of visualizations

•
•

•

Why heat maps are needed
... and line graphs can be bad

Remember key examples

•
•

46

I/O latency, as a heat map
CPU utilization by CPU, as a heat map
Visualizations
Current Examples
Latency

47
Visualizations
•

So far we’ve picked:

•

Latency

•

•

Utilization

•

48

for workload analysis

for resource monitoring
Latency
•

For example, disk I/O

•

Raw data looks like this:
# iosnoop -o
DTIME
UID
PID
125
100
337
138
100
337
127
100
337
135
100
337
118
100
337
108
100
337
87
100
337
9148
100
337
8806
100
337
2262
100
337
76
100
337
[...many pages of output...]

•

BLOCK
72608
72624
72640
72656
72672
72688
72696
113408
104738
13600
13616

SIZE
8192
8192
8192
8192
8192
4096
3072
8192
7168
1024
1024

COMM
bash
bash
bash
bash
bash
bash
bash
tar
tar
tar
tar

iosnoop is DTrace based

•
49

D
R
R
R
R
R
R
R
R
R
R
R

examines latency for every disk (back end) I/O

PATHNAME
/usr/sbin/tar
/usr/sbin/tar
/usr/sbin/tar
/usr/sbin/tar
/usr/sbin/tar
/usr/sbin/tar
/usr/sbin/tar
/etc/default/lu
/etc/default/lu
/etc/default/cron
/etc/default/devfsadm
Latency Data
•

tuples

•
•

•

50

I/O completion time
I/O latency

can be 1,000s of these per second
Summarizing Latency
•

iostat(1M) can show per second average:
$ iostat -xnz 1
[...]
r/s
471.0
r/s
631.0

w/s
0.0

r/s
472.0
[...]

51

w/s
7.0

w/s
0.0

extended
kr/s
kw/s
786.1
12.0
extended
kr/s
kw/s
1063.1
0.0
extended
kr/s
kw/s
529.0
0.0

device statistics
wait actv wsvc_t asvc_t
0.1 1.2
0.2
2.5
device statistics
wait actv wsvc_t asvc_t
0.2 1.0
0.3
1.6
device statistics
wait actv wsvc_t asvc_t
0.0 1.0
0.0
2.1

%w
4

%b device
90 c1d0

%w
9

%b device
92 c1d0

%w
0

%b device
94 c1d0
Per second
•

Condenses I/O completion time

•

Almost always a sufficient resolution

•

52

(So far I’ve only had one case where examining raw completion
time data was crucial: an interrupt coalescing bug)
Average/second
•

Average loses latency outliers

•

Average loses latency distribution

•

... but not disk distribution:

$ iostat -xnz 1
[...]
r/s
43.9
47.6
42.7
41.4
45.6
45.0
42.9
44.9
[...]

•

%w
0
0
0
0
0
0
0
0

%b
34
36
35
32
34
34
33
35

device
c0t5000CCA215C46459d0
c0t5000CCA215C4521Dd0
c0t5000CCA215C45F89d0
c0t5000CCA215C42A4Cd0
c0t5000CCA215C45541d0
c0t5000CCA215C458F1d0
c0t5000CCA215C450E3d0
c0t5000CCA215C45323d0

only because iostat(1M) prints this per-disk

•
53

w/s
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

extended device statistics
kr/s
kw/s wait actv wsvc_t asvc_t
351.5
0.0 0.0 0.4
0.0
10.0
381.1
0.0 0.0 0.5
0.0
9.8
349.9
0.0 0.0 0.4
0.0
10.1
331.5
0.0 0.0 0.4
0.0
9.6
365.1
0.0 0.0 0.4
0.0
9.2
360.3
0.0 0.0 0.4
0.0
9.4
343.5
0.0 0.0 0.4
0.0
9.9
359.5
0.0 0.0 0.4
0.0
9.3

but that gets hard to read for 100s of disks, per second!
Latency outliers
•

Occasional high-latency I/O

•

Can be the sole reason for performance issues

•

Can be lost in an average

•
•

54

1 slow I/O @ 500ms

•

•

10,000 fast I/O @ 1ms

average = 1.05 ms

Can be seen using max instead of (or as well as) average
Maximum/second
•
•

can be visualized along with average/second

•

does identify outliers

•

55

iostat(1M) doesn’t show this, however DTrace can

doesn’t identify latency distribution details
Latency distribution
•

Apart from outliers and average, it can be useful to examine
the full profile of latency - all the data.

•

•

For such a crucial metric, keep as much details as possible

For latency, distributions we’d expect to see include:

•
•

tri-modal: multiple cache layers

•

56

bi-modal: cache hit vs cache miss

flat: random disk I/O
Latency Distribution Example
•

Using DTrace:

# ./disklatency.d
Tracing... Hit Ctrl-C to end.
^C
sd4 (28,256), us:
value
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
131072

57

------------- Distribution ------------- count
|
0
|
82
|@@@
621
|@@@@@
833
|@@@@
641
|@@@
615
|@@@@@@@
1239
|@@@@@@@@@
1615
|@@@@@@@@
1483
|
76
|
1
|
0
|
2
|
0
disklatency.d
•

not why we are here, but before someone asks...
#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
printf("Tracing... Hit Ctrl-C to end.n");
}
io:::start
{
start_time[arg0] = timestamp;
}
io:::done
/this->start = start_time[arg0]/
{
this->delta = (timestamp - this->start) / 1000;
@[args[1]->dev_statname, args[1]->dev_major, args[1]->dev_minor] =
quantize(this->delta);
start_time[arg0] = 0;
}
dtrace:::END
{
printa("
}

58

%s (%d,%d), us:n%@dn", @);
Latency Distribution Example
# ./disklatency.d
Tracing... Hit Ctrl-C to end.
^C
sd4 (28,256), us:
value
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
131072

------------- Distribution ------------- count
|
0
|
82
|@@@
621
|@@@@@
833
|@@@@
641
|@@@
615
|@@@@@@@
1239
|@@@@@@@@@
1615
|@@@@@@@@
1483
|
76
|
1
|
0
|
2
|
0

•
•
59

... but can we see this distribution per second?
... how do we visualize a 3rd dimension?

65 - 131 ms
outliers
Column Quantized Visualization
aka “heat map”
•

For example:

•

60
Heat Map: Offset Distribution

•

x-axis: time

•

y-axis: offset

•

z-axis (color scale): I/O count for that time/offset range

•
•
61

Identified random vs. sequential very well
Similar heat maps have been used before by defrag tools
Heat Map: Latency Distribution
•

For example:

•
•

y-axis: latency

•

62

x-axis: time

z-axis (color saturation): I/O count for that time/latency range
Heat Map: Latency Distribution
•

... in fact, this is a great example:

•

reads
served
from:

DRAM
disk

DRAM
flash-memory based SSD
disk
ZFS “L2ARC” enabled

63
Heat Map: Latency Distribution
•

... in fact, this is a great example:

•

reads
served
from:

DRAM
disk

DRAM
flash-memory based SSD
disk
ZFS “L2ARC” enabled

64
Latency Heat Map
•

A color shaded matrix of pixels

•

Each pixel is a time and latency range

•

Color shade picked based on number of I/O in that range

•

Adjusting saturation seems to work better than color hue.
Eg:

•
•

65

darker == more I/O
lighter == less I/O
Pixel Size
•

Large pixels (and corresponding time/latency ranges)

•

•

•

increases likelyhood that adjacent pixels
include I/O, have color, and combine to
form patterns
allows color to be more easily seen

Smaller pixels (and time/latency ranges)

•
•

66

can make heat map look like a scatter plot
of the same color - if ranges are so small
only one I/O is typically included
Color Palette
•

Linear scale can make subtle details (outliers) difficult to see

•
•

outliers are usually < 1% of the I/O

•

•

observing latency outliers is usually of high importance

assigning < 1% of the color scale to them will washout patterns

False color palette can be used to emphasize these details

•

67

although color comparisons become more confusing - non-linear
Outliers
•

Heat maps show these very well

•

However, latency outliers can
compress the bulk of the heat
map data

•

•

68

eg, 1 second outlier while most
I/O is < 10 ms

Users should have some control
to be able to zoom/truncate details

•

outlier

both x and y axis

data bulk
Data Storage
•

Since heat-maps are three dimensions, storing this data can
become costly (volume)

•

Most of the data points are zero

•

and you can prevent storing zero’s by only storing populated
elements: associative array

•

You can reduce to a sufficiently high resolution, and
resample lower as needed

•

You can also be aggressive at reducing resolution at higher
latencies

•
•
69

10 us granularity not as interesting for I/O > 1 second
non-linear resolution
Other Interesting Latency Heat Maps
•
•

The “Rainbow Pterodactyl”

•

70

The “Icy Lake”

Latency Levels
The “Icy Lake” Workload
•

About as simple as it gets:

•
•

•

71

Single client, single thread, sequential synchronous 8 Kbyte
writes to an NFS share
NFS server: 22 x 7,200 RPM disks, striped pool

The resulting latency heat map was unexpected
The “Icy Lake”

•

72
“Icy Lake” Analysis: Observation
•

Examining single disk latency:

•

Pattern match with NFS latency: similar lines

•
73

each disk contributing some lines to the overall pattern
Pattern Match?
•

We just associated NFS latency with disk device latency,
using our eyeballs

•

see the titles on the previous heat maps

•

You can programmatically do this (DTrace), but that can get
difficult to associate context across software stack layers
(but not impossible!)

•

Heat Maps allow this part of the problem to be offloaded to
your brain

•

74

and we are very good at pattern matching
“Icy Lake” Analysis: Experimentation
•

Same workload, single disk pool:

•

No diagonal lines

•

75

but more questions - see the line (false color palette enhanced) at
9.29 ms? this is < 1% of the I/O. (I’m told, and I believe, that this
is due to adjacent track seek latency.)
“Icy Lake” Analysis: Experimentation
•

Same workload, two disk striped pool:

•

Ah-hah! Diagonal lines.

•

76

... but still more questions: why does the angle sometimes
change? why do some lines slope upwards and some down?
“Icy Lake” Analysis: Experimentation
•

... each disk from that pool:

•

77
“Icy Lake” Analysis: Questions
•

Remaining Questions:

•
•

78

Why does the slope sometimes change?
What exactly seeds the slope in the first place?
“Icy Lake” Analysis: Mirroring
•

79

Trying mirroring the pool disks instead of striping:
Another Example: “X marks the spot”

80
The “Rainbow Pterodactyl” Workload
•

48 x 7,200 RPM disks, 2 disk enclosures

•

Sequential 128 Kbyte reads to each disk (raw device),
adding disks every 2 seconds

•

Goal: Performance analysis of system architecture

•

81

identifying I/O throughput limits by driving I/O subsystem to
saturation, one disk at a time (finds knee points)
The “Rainbow Pterodactyl”

82
The “Rainbow Pterodactyl”

83
The “Rainbow Pterodactyl”
Buldge

Beak
84

Head

Wing

Neck

Shoulders

Body
The “Rainbow Pterodactyl”: Analysis
•

Hasn’t been understood in detail

•

•

85

Would never be understood (or even known) without heat maps

It is repeatable
The “Rainbow Pterodactyl”: Theories
•
•

“Head”: 9th disk, contention on the 2 x4 SAS ports

•

“Buldge”: ?

•

“Neck”: ?

•

“Wing”: contention?

•

“Shoulders”: ?

•

86

“Beak”: disk cache hit vs disk cache miss -> bimodal

“Body”: PCI-gen1 bus contention
Latency Levels Workload
•
•

87

Same as “Rainbow Pterodactyl”, stepping disks
Instead of sequential reads, this is repeated 128 Kbyte
reads (read -> seek 0 -> read -> ...), to deliberately hit from
the disk DRAM cache to improve test throughput
Latency Levels

•

88
Latency Levels Theories
•

89

???
Bonus Latency Heat Map

•

90

This time we do know the source of the latency...
硬

也會鬧情緒

•

91
Latency Heat Maps: Summary
•

Shows latency distribution over time

•

Shows outliers (maximums)

•

Indirectly shows average

•

Shows patterns

•

92

allows correlation with other software stack layers
Similar Heat Map Uses
•

These all have a dynamic y-axis scale:

•
•

•

93

I/O size
I/O offset

These aren’t a primary measure of performance (like
latency); they provide secondary information to understand
the workload
Heat Map: I/O Offset

•

94

y-axis: I/O offset (in this case, NFSv3 file location)
Heat Map: I/O Size

•

95

y-axis: I/O size (bytes)
Heat Map Abuse
•

96

What can we ‘paint’ by adjusting the workload?
I/O Size

•

97

How was this done?
I/O Offset

•

98

How was this done?
I/O Latency

•

99

How was this done?
Visualizations
Current Examples
Utilization

100
CPU Utilization
•

Commonly used indicator of CPU performance

•

eg, vmstat(1M)

$ vmstat 1 5
kthr
memory
r b w
swap free re
0 0 0 95125264 28022732
0 0 0 91512024 25075924
0 0 0 91511864 25075796
0 0 0 91511228 25075164
0 0 0 91510824 25074940

101

page
disk
faults
cpu
mf pi po fr de sr s0 s1 s2 s3
in
sy
cs us sy id
301 1742 1 17 17 0 0 -0 -0 -0 6 5008 21927 3886 4 1 94
6 55 0 0 0 0 0 0 0 0 0 4665 18228 4299 10 1 89
9 24 0 0 0 0 0 0 0 0 0 3504 12757 3158 8 0 92
3 163 0 0 0 0 0 0 0 0 0 4104 15375 3611 9 5 86
5 66 0 0 0 0 0 0 0 0 0 4607 19492 4394 10 1 89
CPU Utilization: Line Graph
•

102

Easy to plot:
CPU Utilization: Line Graph
•

Easy to plot:

•

Average across all CPUs:

•

103

Identifies how utilized all CPUs are, indicating remaining
headroom - provided sufficient threads to use CPUs
CPU Utilization by CPU
•

mpstat(1M) can show utilization by-CPU:

$ mpstat 1
[...]
CPU minf mjf xcal
0
0
0
2
1
0
0
0
2
0
0
0
3
0
0
0
4
0
0
0
5
0
0
0
6
2
0
0
7
0
0
0
8
0
0
0
9
0
0
30
10
0
0
0
11
0
0
0
12
0
0
0
13
38
0
15
14
0
0
0
15
0
0
0

•

csw icsw migr smtx
315
0
24
4
190
0
12
4
152
0
12
1
274
1
21
3
229
0
9
2
138
0
7
3
296
0
8
2
0
9
0
1
311
0
22
2
274
0
16
4
445
0
13
7
303
0
7
8
312
0
10
1
336
2
10
1
209
0
10
38
10
0
4
2

can identify a single hot CPU (thread)

•
104

intr ithr
313 105
65
28
64
20
127
74
32
5
46
19
109
32
30
8
169
68
111
54
69
29
78
36
74
34
456 285
2620 2497
20
8

and un-balanced configurations

srw syscl
0 1331
0
576
0
438
0
537
0
902
0
521
0 1266
0
0
0
847
0
868
0 2559
0 1041
0 1250
0 1408
0
259
0
2

usr sys
5
1
1
1
0
1
1
1
1
1
1
0
4
0
100
0
2
1
2
0
7
1
2
0
7
1
5
2
1
3
0
0

wt idl
0 94
0 98
0 99
0 98
0 98
0 99
0 96
0
0
0 97
0 98
0 92
0 98
0 92
0 93
0 96
0 100
CPU Resource Monitoring
•

Monitor overall utilization for capacity planning

•

Also valuable to monitor individual CPUs

•
•

•

can identify un-balanced configurations
such as a single hot CPU (thread)

The virtual CPUs on a single host can now reach the 100s

•
•

105

its own dimension
how can we display this 3rd dimension?
Heat Map: CPU Utilization

100%

0%
60s

•
•

y-axis: percent utilization

•

106

x-axis: time

z-axis (color saturation): # of CPUs in that time/utilization range
Heat Map: CPU Utilization
idle CPUs

single ‘hot’ CPU
100%

0%
60s

•
•

107

Single ‘hot’ CPUs are a common problem due to application
scaleability issues (single threaded)
This makes identification easy, without reading pages of
mpstat(1M) output
Heat Map: Disk Utilization
•

Ditto for disks

•

Disk Utilization heat map can identify:

•
•

unbalanced configurations

•

•

overall utilization

single hot disks (versus all disks busy)

Ideally, the disk utilization heat map is tight (y-axis) and
below 70%, indicating a well balanced config with headroom

•

108

which can’t be visualized with line graphs
Back to Line Graphs...
•

Are typically used to visualize performance, be it IOPS or
utilization

•

Show patterns over time more clearly than text (higher
resolution)

•

But graphical environments can do much more

•

•

109

As shown by the heat maps (to start with); which convey details
line graphs cannot

Ask: what “value add” does the GUI bring to the data?
Resource Utilization Heat Map Summary
•

Can exist for any resource with multiple components:

•

CPUs

•

Disks

•

Network interfaces

•

I/O busses

•

...

•

Quickly identifies single hot component versus all
components

•

Best suited for physical hardware resources

•
110

difficult to express ‘utilization’ for a software resource
Visualizations
Future Opportunities

111
Cloud Computing
So far analysis has been for a single server

What about the cloud?

112
From one to thousands of servers
Workload Analysis:
latency I/O x cloud
Resource Monitoring:
# of CPUs x cloud
# of disks x cloud
etc.

113
Heat Maps for the Cloud
•

Heat Maps are promising for cloud computing observability:

•

•

Find outliers regardless of node

•

•

similar to CPU and disk utilization heat maps

mpstat and iostat’s output are already getting too long

•

114

cloud-wide latency heat map just has more I/O

Examine how applications are load balanced across nodes

•

•

additional dimension accommodates the scale of the cloud

multiply by 1000x for the number of possible hosts in a large
cloud application
Proposed Visualizations
•

Include:

•
•

Latency heat maps for cloud application components

•

CPU utilization by cloud node

•

CPU utilization by CPU

•

Thread/process utilization across entire cloud

•

Network interface utilization by cloud node

•

Network interface utilization by port

•

115

Latency heat map across entire cloud

lots, lots more
Cloud Latency Heat Map
•

Latency at different layers:

•
•

PHP/Ruby/...

•

MySQL

•

DNS

•

Disk I/O

•

CPU dispatcher queue latency

•

116

Apache

and pattern match to quickly identify and locate latency
Latency Example: MySQL
•

Query latency (DTrace):
query time (ns)
value ------------- Distribution ------------- count
1024 |
0
2048 |
2
4096 |@
99
8192 |
20
16384 |@
114
32768 |@
105
65536 |@
123
131072 |@@@@@@@@@@@@@
1726
262144 |@@@@@@@@@@@
1515
524288 |@@@@
601
1048576 |@@
282
2097152 |@
114
4194304 |
61
8388608 |@@@@@
660
16777216 |
67
33554432 |
12
67108864 |
7
134217728 |
4
268435456 |
5
536870912 |
0

117
Latency Example: MySQL
•

Query latency (DTrace):
query time (ns)
value ------------- Distribution ------------1024 |
2048 |
4096 |@
8192 |
16384 |@
32768 |@
65536 |@
131072 |@@@@@@@@@@@@@
262144 |@@@@@@@@@@@
524288 |@@@@
1048576 |@@
2097152 |@
4194304 |
8388608 |@@@@@
16777216 |
What is this?
33554432 |
67108864 |
(8-16 ms latency)
134217728 |
268435456 |
536870912 |

118

count
0
2
99
20
114
105
123
1726
1515
601
282
114
61
660
67
12
7
4
5
0
Latency Example: MySQL
query time (ns)
value ------------- Distribution ------------- count
1024 |
0
2048 |
2
4096 |@
99
8192 |
20
16384 |@
114
32768 |@
105
65536 |@
123
131072 |@@@@@@@@@@@@@
1726
262144 |@@@@@@@@@@@
1515
524288 |@@@@
601
1048576 |@@
282
2097152 |@
114
4194304 |
61
8388608 |@@@@@
660
16777216 |
67
33554432 |
12
67108864 |
7
134217728 |
4
oh...
268435456 |
5
536870912 |
0
innodb srv sleep (ns)
value ------------- Distribution ------------4194304 |
8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
16777216 |
119

count
0
841
0
Latency Example: MySQL
•
•

innodb thread concurrency back-off sleep latency: 8 - 16 ms

•

Both have a similar magnitude (see “count” column)

•

Add the dimension of time as a heat map, for more
characteristics to compare

•

120

Spike of MySQL query latency: 8 - 16 ms

... quickly compare heat maps from different components of
the cloud to pattern match and locate latency
Cloud Latency Heat Map
•

Identify latency outliers, distributions, patterns

•

Can add more functionality to identify these by:

•

cloud node

•

application, cloud-wide

•

I/O type (eg, query type)

•

Targeted observability (DTrace) can be used to fetch this

•

Or, we could collect it for everything

•

121

... do we need a 4th dimension?
4th Dimension!
•

Bryan Cantrill @Joyent coded this 11 hours ago

•
•

122

assuming it’s now about 10:30am during this talk
... and I added these slides about 7 hours ago
4th Dimension Example: Thread Runtime
100 ms

0 ms

•
•

y-axis: thread runtime

•

123

x-axis: time

z-axis (color saturation): count at that time/runtime range
4th Dimension Example: Thread Runtime
100 ms

0 ms

•
•

z-axis (color saturation): count at that time/runtime range

•

124

y-axis: thread runtime

•

•

x-axis: time

omega-axis (color hue): application

blue == “coreaudiod”
4th Dimension Example: Thread Runtime
100 ms

0 ms

•
•

z-axis (color saturation): count at that time/runtime range

•

125

y-axis: thread runtime

•

•

x-axis: time

omega-axis (color hue): application

green == “iChat”
4th Dimension Example: Thread Runtime
100 ms

0 ms

•
•

z-axis (color saturation): count at that time/runtime range

•

126

y-axis: thread runtime

•

•

x-axis: time

omega-axis (color hue): application

violet == “Chrome”
4th Dimension Example: Thread Runtime
100 ms

0 ms

•
•

z-axis (color saturation): count at that time/runtime range

•

127

y-axis: thread runtime

•

•

x-axis: time

omega-axis (color hue): application

All colors
“Dimensionality”
•

While the data supports the 4th dimension, visualizing this
properly may become difficult (we are eager to find out)

•

•

The image itself is still only 2 dimensional

May be best used to view a limited set, to limit the number of
different hues; uses can include:

•
•

•

128

Highlighting different cloud application types: DB, web server, etc.
Highlighting one from many components: single node, CPU, disk,
etc.

Limiting the set also helps storage of data
More Visualizations
•

We plan much more new stuff

•
•

129

We are building a team of engineers to work on it; including Bryan
Cantrill, Dave Pacheo, and mysqlf
Dave and I have only been at Joyent for 2 1/2 weeks
Beyond Performance Analysis
•

Visualizations such as heat maps could also be applied to:

•

Security, with pattern matching for:

•
•

inter-keystroke-latency analysis

•

•

robot identification based on think-time latency analysis

brute force username latency attacks?

System Administration

•

•

130

monitoring quota usage by user, filesystem, disk

Other multi-dimensional datasets
Objectives
•

Consider performance metrics before plotting

•
•

•

Why latency is good
... and IOPS can be bad

See the value of visualizations

•
•

•

Why heat maps are needed
... and line graphs can be bad

Remember key examples

•
•

131

I/O latency, as a heat map
CPU utilization by CPU, as a heat map
Heat Map: I/O Latency

•

Latency matters

•

•

Heat map shows

•
132

synchronous latency has a direct impact on performance

outliers, balance, cache layers, patterns
Heat Map: CPU Utilization
100%

0%
60s

•

Identify single threaded issues

•

•

Heat map shows

•
133

single CPU hitting 100%

fully utilized components, balance, overall headroom, patterns
Tools Demonstrated
•

For Reference:

•

DTraceTazTool

•

•

Analytics

•

•

2008; Oracle Sun ZFS Storage Appliance

“new stuff” (not named yet)

•

134

2006; based on TazTool by Richard McDougall 1995. Open
source, unsupported, and probably no longer works (sorry).

2010; Joyent; Bryan Cantrill, Dave Pacheco, Brendan Gregg
Question Time
•

Thank you!

•

How to find me on the web:

•
•

http://blogs.sun.com/brendan <-- is my old blog

•
135

http://dtrace.org/blogs/brendan

twitter @brendangregg

More Related Content

What's hot

[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기NAVER D2
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Databricks
 
The Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainThe Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Taking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureTaking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureSplunk
 
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCSpotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCJosh Baer
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
軟體開發程序
軟體開發程序軟體開發程序
軟體開發程序Steven liaw
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Willy Lulciuc
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsdatamantra
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Yoshinori Matsunobu
 
Pegasus KV Storage, Let the Users focus on their work (2018/07)
Pegasus KV Storage, Let the Users focus on their work (2018/07)Pegasus KV Storage, Let the Users focus on their work (2018/07)
Pegasus KV Storage, Let the Users focus on their work (2018/07)涛 吴
 
Linux Performance Tunning Memory
Linux Performance Tunning MemoryLinux Performance Tunning Memory
Linux Performance Tunning MemoryShay Cohen
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
Cassandra serving netflix @ scale
Cassandra serving netflix @ scaleCassandra serving netflix @ scale
Cassandra serving netflix @ scaleVinay Kumar Chella
 
Image Processing on Delta Lake
Image Processing on Delta LakeImage Processing on Delta Lake
Image Processing on Delta LakeDatabricks
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Alexey Lesovsky
 

What's hot (20)

[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
 
The Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainThe Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and Pain
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Taking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureTaking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - Architecture
 
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCSpotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
軟體開發程序
軟體開發程序軟體開發程序
軟體開發程序
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)
 
Pegasus KV Storage, Let the Users focus on their work (2018/07)
Pegasus KV Storage, Let the Users focus on their work (2018/07)Pegasus KV Storage, Let the Users focus on their work (2018/07)
Pegasus KV Storage, Let the Users focus on their work (2018/07)
 
Mips1
Mips1Mips1
Mips1
 
Linux Performance Tunning Memory
Linux Performance Tunning MemoryLinux Performance Tunning Memory
Linux Performance Tunning Memory
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Cassandra serving netflix @ scale
Cassandra serving netflix @ scaleCassandra serving netflix @ scale
Cassandra serving netflix @ scale
 
Image Processing on Delta Lake
Image Processing on Delta LakeImage Processing on Delta Lake
Image Processing on Delta Lake
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
 

Viewers also liked

DTrace Topics: Introduction
DTrace Topics: IntroductionDTrace Topics: Introduction
DTrace Topics: IntroductionBrendan Gregg
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsBrendan Gregg
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Brendan Gregg
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at NetflixBrendan Gregg
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems PerformanceBrendan Gregg
 
Linux Performance Tools 2014
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014Brendan Gregg
 
FreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsFreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsBrendan Gregg
 
From DTrace to Linux
From DTrace to LinuxFrom DTrace to Linux
From DTrace to LinuxBrendan Gregg
 
Lisa12 methodologies
Lisa12 methodologiesLisa12 methodologies
Lisa12 methodologiesBrendan Gregg
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudBrendan Gregg
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems PerformanceBrendan Gregg
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudBrendan Gregg
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsBrendan Gregg
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsBrendan Gregg
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisBrendan Gregg
 
JavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesJavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesCharles Nutter
 

Viewers also liked (20)

DTraceCloud2012
DTraceCloud2012DTraceCloud2012
DTraceCloud2012
 
DTrace Topics: Introduction
DTrace Topics: IntroductionDTrace Topics: Introduction
DTrace Topics: Introduction
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems Performance
 
Linux Performance Tools 2014
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014
 
FreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsFreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame Graphs
 
From DTrace to Linux
From DTrace to LinuxFrom DTrace to Linux
From DTrace to Linux
 
Lisa12 methodologies
Lisa12 methodologiesLisa12 methodologies
Lisa12 methodologies
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloud
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems Performance
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the Cloud
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production Systems
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance Analysis
 
JavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesJavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for Dummies
 

Similar to LISA2010 visualizations

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE MethodBrendan Gregg
 
Progress OE performance management
Progress OE performance managementProgress OE performance management
Progress OE performance managementYassine MOALLA
 
Progress Openedge performance management
Progress Openedge performance managementProgress Openedge performance management
Progress Openedge performance managementYassine MOALLA
 
Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Nikolay Savvinov
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelDaniel Coupal
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1sqlserver.co.il
 
UKOUG, Lies, Damn Lies and I/O Statistics
UKOUG, Lies, Damn Lies and I/O StatisticsUKOUG, Lies, Damn Lies and I/O Statistics
UKOUG, Lies, Damn Lies and I/O StatisticsKyle Hailey
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profitRodrigo Campos
 
End-to-end Troubleshooting Checklist for Microsoft SQL Server
End-to-end Troubleshooting Checklist for Microsoft SQL ServerEnd-to-end Troubleshooting Checklist for Microsoft SQL Server
End-to-end Troubleshooting Checklist for Microsoft SQL ServerKevin Kline
 
Real World Performance - Data Warehouses
Real World Performance - Data WarehousesReal World Performance - Data Warehouses
Real World Performance - Data WarehousesConnor McDonald
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchJoe Alex
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksSenturus
 
Oracle Database : Addressing a performance issue the drilldown approach
Oracle Database : Addressing a performance issue the drilldown approachOracle Database : Addressing a performance issue the drilldown approach
Oracle Database : Addressing a performance issue the drilldown approachLaurent Leturgez
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceEnkitec
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
 
Swift at Scale: The IBM SoftLayer Story
Swift at Scale: The IBM SoftLayer StorySwift at Scale: The IBM SoftLayer Story
Swift at Scale: The IBM SoftLayer StoryBrian Cline
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity PlanningMongoDB
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)MongoDB
 

Similar to LISA2010 visualizations (20)

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE Method
 
Progress OE performance management
Progress OE performance managementProgress OE performance management
Progress OE performance management
 
Progress Openedge performance management
Progress Openedge performance managementProgress Openedge performance management
Progress Openedge performance management
 
Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
 
UKOUG, Lies, Damn Lies and I/O Statistics
UKOUG, Lies, Damn Lies and I/O StatisticsUKOUG, Lies, Damn Lies and I/O Statistics
UKOUG, Lies, Damn Lies and I/O Statistics
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
 
End-to-end Troubleshooting Checklist for Microsoft SQL Server
End-to-end Troubleshooting Checklist for Microsoft SQL ServerEnd-to-end Troubleshooting Checklist for Microsoft SQL Server
End-to-end Troubleshooting Checklist for Microsoft SQL Server
 
Real World Performance - Data Warehouses
Real World Performance - Data WarehousesReal World Performance - Data Warehouses
Real World Performance - Data Warehouses
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & Tricks
 
Oracle Database : Addressing a performance issue the drilldown approach
Oracle Database : Addressing a performance issue the drilldown approachOracle Database : Addressing a performance issue the drilldown approach
Oracle Database : Addressing a performance issue the drilldown approach
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture Performance
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Swift at Scale: The IBM SoftLayer Story
Swift at Scale: The IBM SoftLayer StorySwift at Scale: The IBM SoftLayer Story
Swift at Scale: The IBM SoftLayer Story
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity Planning
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)
 
Performance tuning in sql server
Performance tuning in sql serverPerformance tuning in sql server
Performance tuning in sql server
 

More from Brendan Gregg

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixBrendan Gregg
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 

More from Brendan Gregg (20)

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 

Recently uploaded

Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGDSC PJATK
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty SecureFemke de Vroome
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...marcuskenyatta275
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimaginedpanagenda
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 

Recently uploaded (20)

Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 

LISA2010 visualizations

  • 2. • G’Day, I’m Brendan • ... also known as “shouting guy” 2
  • 4. • I do performance analysis • and I’m a DTrace addict 4
  • 6. Agenda • Performance • • • Workload Analysis and Resource Monitoring Understanding available and ideal metrics before plotting Visualizations • Current examples • • • Latency Utilization Future opportunities • 6 Cloud Computing
  • 7. Visualizations like these • The “rainbow pterodactyl” Latency Time • 7 ... which needs quite a bit of explanation
  • 8. Primary Objectives • • See the value of visualizations • 8 Consider performance metrics before plotting Remember key examples
  • 9. Secondary Objectives • Consider performance metrics before plotting • • • Why studying latency is good ... and studying IOPS can be bad See the value of visualizations • • • Why heat maps are needed ... and line graphs can be bad Remember key examples • • 9 I/O latency, as a heat map CPU utilization by CPU, as a heat map
  • 10. Content based on • • 10 “Visualizing System Latency”, Communications of the ACM July 2010, by Brendan Gregg and more
  • 11. Performance Understanding the metrics before we visualize them 11
  • 12. Performance Activities • Workload analysis • • Where is the issue? • • Is there an issue? Is an issue real? Will the proposed fix work? Did it work? Resource monitoring • • 12 How utilized are the environment components? Important activity for capacity planning
  • 13. Workload Analysis • Applied during: • • proof of concept testing • regression testing • benchmarking • 13 software and hardware development monitoring
  • 15. Workload Performance Issues • Load • • Too much for the system? • • Workload applied Poorly constructed? Architecture • • 15 System configuration Software and hardware bugs
  • 16. Workload Analysis Steps • Identify or confirm if a workload has a performance issue • • Locate issue • • Quantify Determine, apply and verify solution • 16 Quantify Quantify
  • 17. Quantify • 17 Finding a performance issue isn’t the problem ... it’s finding the issue that matters
  • 21. “performance” bugs • • 21 ... and those are just the known performance bugs ... and usually only of a certain type (architecture)
  • 22. How to Quantify • Observation based • • • Choose a reliable metric Estimate performance gain from resolving issue Experimentation based • • 22 Apply fix Quantify before vs. after using a reliable metric
  • 23. Observation based • For example: • • Suggestion: replace disks with flash-memory based SSDs, with an expected latency of ~100 us • 23 Observed: 9 ms of which is disk I/O • • Observed: application I/O takes 10 ms Estimated gain: 10 ms -> 1.1 ms (10 ms - 9 ms + 0.1 ms) =~ 9x gain Very useful - but not possible without accurate quantification
  • 24. Experimentation based • For example: • • Observed: Application transaction latency average 2 ms • 24 Experiment: Added more DRAM to increase cache hits and reduce average latency • • Observed: Application transaction latency average 10 ms Gain: 10 ms -> 2 ms = 5x Also very useful - but risky without accurate quantification
  • 25. Metrics to Quantify Performance • Choose reliable metrics to quantify performance: • • transactions/second • throughput • utilization • • IOPS latency Ideally • • 25 interpretation is straightforward reliable
  • 26. Metrics to Quantify Performance • Choose reliable metrics to quantify performance: • • transactions/second • throughput • utilization • • IOPS latency Ideally • • 26 interpretation is straightforward reliable generally better suited for: Capacity Planning Workload Analysis
  • 27. Metrics Availability • Ideally (given the luxury of time): • • then see if they exist, or, • • design the desired metrics implement them (eg, DTrace) Non-ideally • • 27 see what already exists make-do (eg, vmstat -> gnuplot)
  • 28. Assumptions to avoid • Available metrics are implemented correctly • all software has bugs • • • sometimes added by the programmer to only debug their code Available metrics are complete • 28 trust no metric without double checking from other sources Available metrics are designed by performance experts • • eg, CR: 6687884 nxge rbytes and obytes kstat are wrong you won’t always find what you really need
  • 29. Getting technical • This will be explained using two examples: • • 29 Workload Analysis Capacity Planning
  • 30. Example: Workload Analysis • Quantifying performance issues with IOPS vs latency • • 30 IOPS is commonly presented by performance analysis tools eg: disk IOPS via kstat, SNMP, iostat, ...
  • 31. IOPS • Depends on where the I/O is measured • • app -> library -> syscall -> VFS -> filesystem -> RAID -> device Depends on what the I/O is • • random or sequential • • synchronous or asynchronous size Interpretation difficult • • 31 what value is good or bad? is there a max?
  • 32. Some disk IOPS problems • IOPS Inflation • • Filesystem metadata • • Library or Filesystem prefetch/read-ahead RAID stripes IOPS Deflation • • 32 Write cancellation • • Read caching Filesystem I/O aggregation IOPS aren’t created equal
  • 33. IOPS example: iostat -xnz 1 • Consider this disk: 86 IOPS == 99% busy r/s 86.6 • %w 0 %b device 99 c1d0 Versus this disk: 21,284 IOPS == 99% busy r/s 21284.7 33 w/s 0.0 extended device statistics kr/s kw/s wait actv wsvc_t asvc_t 655.5 0.0 0.0 1.0 0.0 11.5 extended device statistics w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 10642.4 0.0 0.0 1.8 0.0 0.1 2 99 c1d0
  • 34. IOPS example: iostat -xnz 1 • Consider this disk: 86 IOPS == 99% busy r/s 86.6 • %b device 99 c1d0 extended device statistics w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 10642.4 0.0 0.0 1.8 0.0 0.1 2 99 c1d0 ... they are the same disk, different I/O types • 1) 8 Kbyte random • 34 %w 0 Versus this disk: 21,284 IOPS == 99% busy r/s 21284.7 • w/s 0.0 extended device statistics kr/s kw/s wait actv wsvc_t asvc_t 655.5 0.0 0.0 1.0 0.0 11.5 2) 512 byte sequential (on-disk DRAM cache)
  • 35. Using IOPS to quantify issues • to identify • • to locate • • 90% of IOPS are random. Is that the problem? to verify • 35 is 100 IOPS an problem? Per disk? A filesystem tunable caused IOPS to reduce. Has this fixed the issue?
  • 36. Using IOPS to quantify issues • to identify • • to locate • • 36 90% of IOPS are random. Is that the problem? (depends...) to verify • • is 100 IOPS an problem? Per disk? (depends...) A filesystem tunable caused IOPS to reduce. Has this fixed the issue? (probably, assuming...) We can introduce more metrics to understand these, but standalone IOPS is tricky to interpret
  • 37. Using latency to quantify issues • to identify • • to locate • • 90ms of the 100ms is lock contention. Is that the problem? to verify • 37 is a 100ms I/O a problem? A filesystem tunable caused the I/O latency to reduce to 1ms. Has this fixed the issue?
  • 38. Using latency to quantify issues • to identify • • to locate • • 38 90ms of the 100ms is lock contention. Is that the problem? (yes) to verify • • is a 100ms I/O a problem? (probably - if synchronous to the app.) A filesystem tunable caused the I/O latency to reduce to 1ms. Has this fixed the issue? (probably - if 1ms is acceptable) Latency is much more reliable, easier to interpret
  • 39. Latency • Time from I/O or transaction request to completion • Synchronous latency has a direct impact on performance • • • Application is waiting higher latency == worse performance Not all latency is synchronous: • • Filesystem prefetch no one is waiting at this point • 39 Asynchronous filesystem threads flushing dirty buffers to disk eg, zfs TXG synchronous thread TCP buffer and congestion window: individual packet latency may be high, but pipe is kept full for good throughput performance
  • 40. Turning other metrics into latency • Currency converter (* -> ms): • • CPU utilization == code path execution latency • CPU saturation == dispatcher queue latency • 40 disk saturation == I/O wait queue latency • • random disk IOPS == I/O service latency ... Quantifying as latency allows different components to be compared, ratios examined, improvements estimated.
  • 41. Example: Resource Monitoring • Different performance activity • • Information is used for capacity planning • 41 incl. CPUs, disks, network interfaces, memory, I/O bus, memory bus, CPU interconnect, I/O cards, network switches, etc. • • Focus is environment components, not specific issues Identifying future issues before they happen Quantifying resource monitoring with IOPS vs utilization
  • 42. IOPS vs Utilization • Another look at this disk: r/s 86.6 [...] r/s 21284.7 • 42 w/s 0.0 extended device statistics kr/s kw/s wait actv wsvc_t asvc_t 655.5 0.0 0.0 1.0 0.0 11.5 %w 0 %b device 99 c1d0 extended device statistics w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 10642.4 0.0 0.0 1.8 0.0 0.1 2 99 c1d0 Q. does this system need more spindles for IOPS capacity?
  • 43. IOPS vs Utilization • Another look at this disk: r/s 86.6 [...] r/s 21284.7 • w/s 0.0 extended device statistics kr/s kw/s wait actv wsvc_t asvc_t 655.5 0.0 0.0 1.0 0.0 11.5 %b device 99 c1d0 extended device statistics w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 10642.4 0.0 0.0 1.8 0.0 0.1 2 99 c1d0 Q. does this system need more spindles for IOPS capacity? • IOPS (r/s + w/s): ??? • 43 %w 0 Utilization (%b): yes (even considering NCQ)
  • 44. IOPS vs Utilization • Another look at this disk: r/s 86.6 [...] r/s 21284.7 • w/s 0.0 extended device statistics kr/s kw/s wait actv wsvc_t asvc_t 655.5 0.0 0.0 1.0 0.0 11.5 Q. does this system need more spindles for IOPS capacity? IOPS (r/s + w/s): ??? • Utilization (%b): yes (even considering NCQ) • 44 %b device 99 c1d0 extended device statistics w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 10642.4 0.0 0.0 1.8 0.0 0.1 2 99 c1d0 • • %w 0 Latency (wsvc_t): no Latency will identify the issue once it is an issue; utilization will forecast the issue - capacity planning
  • 45. Performance Summary • Metrics matter - need to reliably quantify performance • • • to identify, locate, verify try to think, design Workload Analysis • • Resource Monitoring • • 45 latency utilization Other metrics are useful to further understand the nature of the workload and resource behavior
  • 46. Objectives • Consider performance metrics before plotting • • • Why latency is good ... and IOPS can be bad See the value of visualizations • • • Why heat maps are needed ... and line graphs can be bad Remember key examples • • 46 I/O latency, as a heat map CPU utilization by CPU, as a heat map
  • 48. Visualizations • So far we’ve picked: • Latency • • Utilization • 48 for workload analysis for resource monitoring
  • 49. Latency • For example, disk I/O • Raw data looks like this: # iosnoop -o DTIME UID PID 125 100 337 138 100 337 127 100 337 135 100 337 118 100 337 108 100 337 87 100 337 9148 100 337 8806 100 337 2262 100 337 76 100 337 [...many pages of output...] • BLOCK 72608 72624 72640 72656 72672 72688 72696 113408 104738 13600 13616 SIZE 8192 8192 8192 8192 8192 4096 3072 8192 7168 1024 1024 COMM bash bash bash bash bash bash bash tar tar tar tar iosnoop is DTrace based • 49 D R R R R R R R R R R R examines latency for every disk (back end) I/O PATHNAME /usr/sbin/tar /usr/sbin/tar /usr/sbin/tar /usr/sbin/tar /usr/sbin/tar /usr/sbin/tar /usr/sbin/tar /etc/default/lu /etc/default/lu /etc/default/cron /etc/default/devfsadm
  • 50. Latency Data • tuples • • • 50 I/O completion time I/O latency can be 1,000s of these per second
  • 51. Summarizing Latency • iostat(1M) can show per second average: $ iostat -xnz 1 [...] r/s 471.0 r/s 631.0 w/s 0.0 r/s 472.0 [...] 51 w/s 7.0 w/s 0.0 extended kr/s kw/s 786.1 12.0 extended kr/s kw/s 1063.1 0.0 extended kr/s kw/s 529.0 0.0 device statistics wait actv wsvc_t asvc_t 0.1 1.2 0.2 2.5 device statistics wait actv wsvc_t asvc_t 0.2 1.0 0.3 1.6 device statistics wait actv wsvc_t asvc_t 0.0 1.0 0.0 2.1 %w 4 %b device 90 c1d0 %w 9 %b device 92 c1d0 %w 0 %b device 94 c1d0
  • 52. Per second • Condenses I/O completion time • Almost always a sufficient resolution • 52 (So far I’ve only had one case where examining raw completion time data was crucial: an interrupt coalescing bug)
  • 53. Average/second • Average loses latency outliers • Average loses latency distribution • ... but not disk distribution: $ iostat -xnz 1 [...] r/s 43.9 47.6 42.7 41.4 45.6 45.0 42.9 44.9 [...] • %w 0 0 0 0 0 0 0 0 %b 34 36 35 32 34 34 33 35 device c0t5000CCA215C46459d0 c0t5000CCA215C4521Dd0 c0t5000CCA215C45F89d0 c0t5000CCA215C42A4Cd0 c0t5000CCA215C45541d0 c0t5000CCA215C458F1d0 c0t5000CCA215C450E3d0 c0t5000CCA215C45323d0 only because iostat(1M) prints this per-disk • 53 w/s 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 extended device statistics kr/s kw/s wait actv wsvc_t asvc_t 351.5 0.0 0.0 0.4 0.0 10.0 381.1 0.0 0.0 0.5 0.0 9.8 349.9 0.0 0.0 0.4 0.0 10.1 331.5 0.0 0.0 0.4 0.0 9.6 365.1 0.0 0.0 0.4 0.0 9.2 360.3 0.0 0.0 0.4 0.0 9.4 343.5 0.0 0.0 0.4 0.0 9.9 359.5 0.0 0.0 0.4 0.0 9.3 but that gets hard to read for 100s of disks, per second!
  • 54. Latency outliers • Occasional high-latency I/O • Can be the sole reason for performance issues • Can be lost in an average • • 54 1 slow I/O @ 500ms • • 10,000 fast I/O @ 1ms average = 1.05 ms Can be seen using max instead of (or as well as) average
  • 55. Maximum/second • • can be visualized along with average/second • does identify outliers • 55 iostat(1M) doesn’t show this, however DTrace can doesn’t identify latency distribution details
  • 56. Latency distribution • Apart from outliers and average, it can be useful to examine the full profile of latency - all the data. • • For such a crucial metric, keep as much details as possible For latency, distributions we’d expect to see include: • • tri-modal: multiple cache layers • 56 bi-modal: cache hit vs cache miss flat: random disk I/O
  • 57. Latency Distribution Example • Using DTrace: # ./disklatency.d Tracing... Hit Ctrl-C to end. ^C sd4 (28,256), us: value 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 57 ------------- Distribution ------------- count | 0 | 82 |@@@ 621 |@@@@@ 833 |@@@@ 641 |@@@ 615 |@@@@@@@ 1239 |@@@@@@@@@ 1615 |@@@@@@@@ 1483 | 76 | 1 | 0 | 2 | 0
  • 58. disklatency.d • not why we are here, but before someone asks... #!/usr/sbin/dtrace -s #pragma D option quiet dtrace:::BEGIN { printf("Tracing... Hit Ctrl-C to end.n"); } io:::start { start_time[arg0] = timestamp; } io:::done /this->start = start_time[arg0]/ { this->delta = (timestamp - this->start) / 1000; @[args[1]->dev_statname, args[1]->dev_major, args[1]->dev_minor] = quantize(this->delta); start_time[arg0] = 0; } dtrace:::END { printa(" } 58 %s (%d,%d), us:n%@dn", @);
  • 59. Latency Distribution Example # ./disklatency.d Tracing... Hit Ctrl-C to end. ^C sd4 (28,256), us: value 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 ------------- Distribution ------------- count | 0 | 82 |@@@ 621 |@@@@@ 833 |@@@@ 641 |@@@ 615 |@@@@@@@ 1239 |@@@@@@@@@ 1615 |@@@@@@@@ 1483 | 76 | 1 | 0 | 2 | 0 • • 59 ... but can we see this distribution per second? ... how do we visualize a 3rd dimension? 65 - 131 ms outliers
  • 60. Column Quantized Visualization aka “heat map” • For example: • 60
  • 61. Heat Map: Offset Distribution • x-axis: time • y-axis: offset • z-axis (color scale): I/O count for that time/offset range • • 61 Identified random vs. sequential very well Similar heat maps have been used before by defrag tools
  • 62. Heat Map: Latency Distribution • For example: • • y-axis: latency • 62 x-axis: time z-axis (color saturation): I/O count for that time/latency range
  • 63. Heat Map: Latency Distribution • ... in fact, this is a great example: • reads served from: DRAM disk DRAM flash-memory based SSD disk ZFS “L2ARC” enabled 63
  • 64. Heat Map: Latency Distribution • ... in fact, this is a great example: • reads served from: DRAM disk DRAM flash-memory based SSD disk ZFS “L2ARC” enabled 64
  • 65. Latency Heat Map • A color shaded matrix of pixels • Each pixel is a time and latency range • Color shade picked based on number of I/O in that range • Adjusting saturation seems to work better than color hue. Eg: • • 65 darker == more I/O lighter == less I/O
  • 66. Pixel Size • Large pixels (and corresponding time/latency ranges) • • • increases likelyhood that adjacent pixels include I/O, have color, and combine to form patterns allows color to be more easily seen Smaller pixels (and time/latency ranges) • • 66 can make heat map look like a scatter plot of the same color - if ranges are so small only one I/O is typically included
  • 67. Color Palette • Linear scale can make subtle details (outliers) difficult to see • • outliers are usually < 1% of the I/O • • observing latency outliers is usually of high importance assigning < 1% of the color scale to them will washout patterns False color palette can be used to emphasize these details • 67 although color comparisons become more confusing - non-linear
  • 68. Outliers • Heat maps show these very well • However, latency outliers can compress the bulk of the heat map data • • 68 eg, 1 second outlier while most I/O is < 10 ms Users should have some control to be able to zoom/truncate details • outlier both x and y axis data bulk
  • 69. Data Storage • Since heat-maps are three dimensions, storing this data can become costly (volume) • Most of the data points are zero • and you can prevent storing zero’s by only storing populated elements: associative array • You can reduce to a sufficiently high resolution, and resample lower as needed • You can also be aggressive at reducing resolution at higher latencies • • 69 10 us granularity not as interesting for I/O > 1 second non-linear resolution
  • 70. Other Interesting Latency Heat Maps • • The “Rainbow Pterodactyl” • 70 The “Icy Lake” Latency Levels
  • 71. The “Icy Lake” Workload • About as simple as it gets: • • • 71 Single client, single thread, sequential synchronous 8 Kbyte writes to an NFS share NFS server: 22 x 7,200 RPM disks, striped pool The resulting latency heat map was unexpected
  • 73. “Icy Lake” Analysis: Observation • Examining single disk latency: • Pattern match with NFS latency: similar lines • 73 each disk contributing some lines to the overall pattern
  • 74. Pattern Match? • We just associated NFS latency with disk device latency, using our eyeballs • see the titles on the previous heat maps • You can programmatically do this (DTrace), but that can get difficult to associate context across software stack layers (but not impossible!) • Heat Maps allow this part of the problem to be offloaded to your brain • 74 and we are very good at pattern matching
  • 75. “Icy Lake” Analysis: Experimentation • Same workload, single disk pool: • No diagonal lines • 75 but more questions - see the line (false color palette enhanced) at 9.29 ms? this is < 1% of the I/O. (I’m told, and I believe, that this is due to adjacent track seek latency.)
  • 76. “Icy Lake” Analysis: Experimentation • Same workload, two disk striped pool: • Ah-hah! Diagonal lines. • 76 ... but still more questions: why does the angle sometimes change? why do some lines slope upwards and some down?
  • 77. “Icy Lake” Analysis: Experimentation • ... each disk from that pool: • 77
  • 78. “Icy Lake” Analysis: Questions • Remaining Questions: • • 78 Why does the slope sometimes change? What exactly seeds the slope in the first place?
  • 79. “Icy Lake” Analysis: Mirroring • 79 Trying mirroring the pool disks instead of striping:
  • 80. Another Example: “X marks the spot” 80
  • 81. The “Rainbow Pterodactyl” Workload • 48 x 7,200 RPM disks, 2 disk enclosures • Sequential 128 Kbyte reads to each disk (raw device), adding disks every 2 seconds • Goal: Performance analysis of system architecture • 81 identifying I/O throughput limits by driving I/O subsystem to saturation, one disk at a time (finds knee points)
  • 85. The “Rainbow Pterodactyl”: Analysis • Hasn’t been understood in detail • • 85 Would never be understood (or even known) without heat maps It is repeatable
  • 86. The “Rainbow Pterodactyl”: Theories • • “Head”: 9th disk, contention on the 2 x4 SAS ports • “Buldge”: ? • “Neck”: ? • “Wing”: contention? • “Shoulders”: ? • 86 “Beak”: disk cache hit vs disk cache miss -> bimodal “Body”: PCI-gen1 bus contention
  • 87. Latency Levels Workload • • 87 Same as “Rainbow Pterodactyl”, stepping disks Instead of sequential reads, this is repeated 128 Kbyte reads (read -> seek 0 -> read -> ...), to deliberately hit from the disk DRAM cache to improve test throughput
  • 90. Bonus Latency Heat Map • 90 This time we do know the source of the latency...
  • 92. Latency Heat Maps: Summary • Shows latency distribution over time • Shows outliers (maximums) • Indirectly shows average • Shows patterns • 92 allows correlation with other software stack layers
  • 93. Similar Heat Map Uses • These all have a dynamic y-axis scale: • • • 93 I/O size I/O offset These aren’t a primary measure of performance (like latency); they provide secondary information to understand the workload
  • 94. Heat Map: I/O Offset • 94 y-axis: I/O offset (in this case, NFSv3 file location)
  • 95. Heat Map: I/O Size • 95 y-axis: I/O size (bytes)
  • 96. Heat Map Abuse • 96 What can we ‘paint’ by adjusting the workload?
  • 101. CPU Utilization • Commonly used indicator of CPU performance • eg, vmstat(1M) $ vmstat 1 5 kthr memory r b w swap free re 0 0 0 95125264 28022732 0 0 0 91512024 25075924 0 0 0 91511864 25075796 0 0 0 91511228 25075164 0 0 0 91510824 25074940 101 page disk faults cpu mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id 301 1742 1 17 17 0 0 -0 -0 -0 6 5008 21927 3886 4 1 94 6 55 0 0 0 0 0 0 0 0 0 4665 18228 4299 10 1 89 9 24 0 0 0 0 0 0 0 0 0 3504 12757 3158 8 0 92 3 163 0 0 0 0 0 0 0 0 0 4104 15375 3611 9 5 86 5 66 0 0 0 0 0 0 0 0 0 4607 19492 4394 10 1 89
  • 102. CPU Utilization: Line Graph • 102 Easy to plot:
  • 103. CPU Utilization: Line Graph • Easy to plot: • Average across all CPUs: • 103 Identifies how utilized all CPUs are, indicating remaining headroom - provided sufficient threads to use CPUs
  • 104. CPU Utilization by CPU • mpstat(1M) can show utilization by-CPU: $ mpstat 1 [...] CPU minf mjf xcal 0 0 0 2 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6 2 0 0 7 0 0 0 8 0 0 0 9 0 0 30 10 0 0 0 11 0 0 0 12 0 0 0 13 38 0 15 14 0 0 0 15 0 0 0 • csw icsw migr smtx 315 0 24 4 190 0 12 4 152 0 12 1 274 1 21 3 229 0 9 2 138 0 7 3 296 0 8 2 0 9 0 1 311 0 22 2 274 0 16 4 445 0 13 7 303 0 7 8 312 0 10 1 336 2 10 1 209 0 10 38 10 0 4 2 can identify a single hot CPU (thread) • 104 intr ithr 313 105 65 28 64 20 127 74 32 5 46 19 109 32 30 8 169 68 111 54 69 29 78 36 74 34 456 285 2620 2497 20 8 and un-balanced configurations srw syscl 0 1331 0 576 0 438 0 537 0 902 0 521 0 1266 0 0 0 847 0 868 0 2559 0 1041 0 1250 0 1408 0 259 0 2 usr sys 5 1 1 1 0 1 1 1 1 1 1 0 4 0 100 0 2 1 2 0 7 1 2 0 7 1 5 2 1 3 0 0 wt idl 0 94 0 98 0 99 0 98 0 98 0 99 0 96 0 0 0 97 0 98 0 92 0 98 0 92 0 93 0 96 0 100
  • 105. CPU Resource Monitoring • Monitor overall utilization for capacity planning • Also valuable to monitor individual CPUs • • • can identify un-balanced configurations such as a single hot CPU (thread) The virtual CPUs on a single host can now reach the 100s • • 105 its own dimension how can we display this 3rd dimension?
  • 106. Heat Map: CPU Utilization 100% 0% 60s • • y-axis: percent utilization • 106 x-axis: time z-axis (color saturation): # of CPUs in that time/utilization range
  • 107. Heat Map: CPU Utilization idle CPUs single ‘hot’ CPU 100% 0% 60s • • 107 Single ‘hot’ CPUs are a common problem due to application scaleability issues (single threaded) This makes identification easy, without reading pages of mpstat(1M) output
  • 108. Heat Map: Disk Utilization • Ditto for disks • Disk Utilization heat map can identify: • • unbalanced configurations • • overall utilization single hot disks (versus all disks busy) Ideally, the disk utilization heat map is tight (y-axis) and below 70%, indicating a well balanced config with headroom • 108 which can’t be visualized with line graphs
  • 109. Back to Line Graphs... • Are typically used to visualize performance, be it IOPS or utilization • Show patterns over time more clearly than text (higher resolution) • But graphical environments can do much more • • 109 As shown by the heat maps (to start with); which convey details line graphs cannot Ask: what “value add” does the GUI bring to the data?
  • 110. Resource Utilization Heat Map Summary • Can exist for any resource with multiple components: • CPUs • Disks • Network interfaces • I/O busses • ... • Quickly identifies single hot component versus all components • Best suited for physical hardware resources • 110 difficult to express ‘utilization’ for a software resource
  • 112. Cloud Computing So far analysis has been for a single server What about the cloud? 112
  • 113. From one to thousands of servers Workload Analysis: latency I/O x cloud Resource Monitoring: # of CPUs x cloud # of disks x cloud etc. 113
  • 114. Heat Maps for the Cloud • Heat Maps are promising for cloud computing observability: • • Find outliers regardless of node • • similar to CPU and disk utilization heat maps mpstat and iostat’s output are already getting too long • 114 cloud-wide latency heat map just has more I/O Examine how applications are load balanced across nodes • • additional dimension accommodates the scale of the cloud multiply by 1000x for the number of possible hosts in a large cloud application
  • 115. Proposed Visualizations • Include: • • Latency heat maps for cloud application components • CPU utilization by cloud node • CPU utilization by CPU • Thread/process utilization across entire cloud • Network interface utilization by cloud node • Network interface utilization by port • 115 Latency heat map across entire cloud lots, lots more
  • 116. Cloud Latency Heat Map • Latency at different layers: • • PHP/Ruby/... • MySQL • DNS • Disk I/O • CPU dispatcher queue latency • 116 Apache and pattern match to quickly identify and locate latency
  • 117. Latency Example: MySQL • Query latency (DTrace): query time (ns) value ------------- Distribution ------------- count 1024 | 0 2048 | 2 4096 |@ 99 8192 | 20 16384 |@ 114 32768 |@ 105 65536 |@ 123 131072 |@@@@@@@@@@@@@ 1726 262144 |@@@@@@@@@@@ 1515 524288 |@@@@ 601 1048576 |@@ 282 2097152 |@ 114 4194304 | 61 8388608 |@@@@@ 660 16777216 | 67 33554432 | 12 67108864 | 7 134217728 | 4 268435456 | 5 536870912 | 0 117
  • 118. Latency Example: MySQL • Query latency (DTrace): query time (ns) value ------------- Distribution ------------1024 | 2048 | 4096 |@ 8192 | 16384 |@ 32768 |@ 65536 |@ 131072 |@@@@@@@@@@@@@ 262144 |@@@@@@@@@@@ 524288 |@@@@ 1048576 |@@ 2097152 |@ 4194304 | 8388608 |@@@@@ 16777216 | What is this? 33554432 | 67108864 | (8-16 ms latency) 134217728 | 268435456 | 536870912 | 118 count 0 2 99 20 114 105 123 1726 1515 601 282 114 61 660 67 12 7 4 5 0
  • 119. Latency Example: MySQL query time (ns) value ------------- Distribution ------------- count 1024 | 0 2048 | 2 4096 |@ 99 8192 | 20 16384 |@ 114 32768 |@ 105 65536 |@ 123 131072 |@@@@@@@@@@@@@ 1726 262144 |@@@@@@@@@@@ 1515 524288 |@@@@ 601 1048576 |@@ 282 2097152 |@ 114 4194304 | 61 8388608 |@@@@@ 660 16777216 | 67 33554432 | 12 67108864 | 7 134217728 | 4 oh... 268435456 | 5 536870912 | 0 innodb srv sleep (ns) value ------------- Distribution ------------4194304 | 8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16777216 | 119 count 0 841 0
  • 120. Latency Example: MySQL • • innodb thread concurrency back-off sleep latency: 8 - 16 ms • Both have a similar magnitude (see “count” column) • Add the dimension of time as a heat map, for more characteristics to compare • 120 Spike of MySQL query latency: 8 - 16 ms ... quickly compare heat maps from different components of the cloud to pattern match and locate latency
  • 121. Cloud Latency Heat Map • Identify latency outliers, distributions, patterns • Can add more functionality to identify these by: • cloud node • application, cloud-wide • I/O type (eg, query type) • Targeted observability (DTrace) can be used to fetch this • Or, we could collect it for everything • 121 ... do we need a 4th dimension?
  • 122. 4th Dimension! • Bryan Cantrill @Joyent coded this 11 hours ago • • 122 assuming it’s now about 10:30am during this talk ... and I added these slides about 7 hours ago
  • 123. 4th Dimension Example: Thread Runtime 100 ms 0 ms • • y-axis: thread runtime • 123 x-axis: time z-axis (color saturation): count at that time/runtime range
  • 124. 4th Dimension Example: Thread Runtime 100 ms 0 ms • • z-axis (color saturation): count at that time/runtime range • 124 y-axis: thread runtime • • x-axis: time omega-axis (color hue): application blue == “coreaudiod”
  • 125. 4th Dimension Example: Thread Runtime 100 ms 0 ms • • z-axis (color saturation): count at that time/runtime range • 125 y-axis: thread runtime • • x-axis: time omega-axis (color hue): application green == “iChat”
  • 126. 4th Dimension Example: Thread Runtime 100 ms 0 ms • • z-axis (color saturation): count at that time/runtime range • 126 y-axis: thread runtime • • x-axis: time omega-axis (color hue): application violet == “Chrome”
  • 127. 4th Dimension Example: Thread Runtime 100 ms 0 ms • • z-axis (color saturation): count at that time/runtime range • 127 y-axis: thread runtime • • x-axis: time omega-axis (color hue): application All colors
  • 128. “Dimensionality” • While the data supports the 4th dimension, visualizing this properly may become difficult (we are eager to find out) • • The image itself is still only 2 dimensional May be best used to view a limited set, to limit the number of different hues; uses can include: • • • 128 Highlighting different cloud application types: DB, web server, etc. Highlighting one from many components: single node, CPU, disk, etc. Limiting the set also helps storage of data
  • 129. More Visualizations • We plan much more new stuff • • 129 We are building a team of engineers to work on it; including Bryan Cantrill, Dave Pacheo, and mysqlf Dave and I have only been at Joyent for 2 1/2 weeks
  • 130. Beyond Performance Analysis • Visualizations such as heat maps could also be applied to: • Security, with pattern matching for: • • inter-keystroke-latency analysis • • robot identification based on think-time latency analysis brute force username latency attacks? System Administration • • 130 monitoring quota usage by user, filesystem, disk Other multi-dimensional datasets
  • 131. Objectives • Consider performance metrics before plotting • • • Why latency is good ... and IOPS can be bad See the value of visualizations • • • Why heat maps are needed ... and line graphs can be bad Remember key examples • • 131 I/O latency, as a heat map CPU utilization by CPU, as a heat map
  • 132. Heat Map: I/O Latency • Latency matters • • Heat map shows • 132 synchronous latency has a direct impact on performance outliers, balance, cache layers, patterns
  • 133. Heat Map: CPU Utilization 100% 0% 60s • Identify single threaded issues • • Heat map shows • 133 single CPU hitting 100% fully utilized components, balance, overall headroom, patterns
  • 134. Tools Demonstrated • For Reference: • DTraceTazTool • • Analytics • • 2008; Oracle Sun ZFS Storage Appliance “new stuff” (not named yet) • 134 2006; based on TazTool by Richard McDougall 1995. Open source, unsupported, and probably no longer works (sorry). 2010; Joyent; Bryan Cantrill, Dave Pacheco, Brendan Gregg
  • 135. Question Time • Thank you! • How to find me on the web: • • http://blogs.sun.com/brendan <-- is my old blog • 135 http://dtrace.org/blogs/brendan twitter @brendangregg