The Cloud Story, or Less is More…

by Slava Vladyshevsky
slava[at]verizon.com

Dedicated to Lee, Sarah, David, Andy and Jeff, as well as many others, who went above and beyond to make this possible.

“Cache is evil. Full stop.”
— Jeff

Table of Contents

PART I – BUILDING TESTBED
PART II – FIRST TEST
PART III – STORAGE STACK PERFORMANCE
PART IV – DATABASE OPTIMIZATION
PART V – PEELING THE ONION
PART VI – PFSENSE
PART VII – JMETER
PART VIII – ALMOST THERE
PART IX – CASSANDRA
PART X – HAPROXY
PART XI – TOMCAT
PART XII – JAVA
PART XIII – OS OPTIMIZATION
PART XIV – NETWORK STACK

Figure Register

AWS Application Deployment
Initial VCC Application Deployment
First Test Results – Comparison Chart
First Test – High CPU Load on DB Server
First Test – High CPU %iowait on DB Server
First Test – Disk I/O Skew on DB Server
Optimized Storage Subsystem Throughput
AWS i2.8xlarge CPU load – Sysbench Test Completed in 64.42 sec
VCC 4C-28G CPU load – Sysbench Test Completed in 283.51 sec
InnoDB Engine Internals
Optimized MySQL DB – QPS Graph
Optimized MySQL DB – TPS and RT Graph
Optimized MySQL DB – RAID Stripe I/O Metrics
Optimized MySQL DB – CPU Metrics
Optimized MySQL DB – Network Metrics
Jennifer APM Console
Initial Application Deployment – Network Diagram
Jennifer XView – Transaction Response Time Scatter Graph
Jennifer APM – Transaction Introspection
Iterative Optimization Progress Chart
Jennifer XView – Transaction Response Time Surges
VCC Cassandra Cluster CPU Usage During the Test
AWS Cassandra Cluster CPU Usage During the Test
High-Level Cassandra Architecture
Jennifer APM – Concurrent Connections and Per-server Arrival Rate
Jennifer APM – Connection Statistics After Optimization
Jennifer APM – DB Connection Pool Usage
JVM Garbage Collection Analysis
JVM Garbage Collection Analysis – Optimized Run
XEN PV Driver and Network Device Architecture
Recommended Network Optimizations
Last Performance Test Results

Table Register

Major Infrastructure Limits
AWS Infrastructure Mapping and Sizing
VCC Infrastructure Mapping and Sizing
Optimized MySQL DB – Recommended Settings
Optimized Cassandra – Recommended Settings
Network Parameter Comparison

PREFACE

One of the market-leading enterprises, hereinafter called the Customer, has multiple business units working in various areas, ranging from consumer electronics to mobile communications and cloud services. One of their strategic initiatives is to expand software capabilities to get on top of the competition.

The Customer started to use the AWS platform for development purposes and as the main hosting platform for their cloud services. Over the past years the usage of AWS grew significantly, with over 30 production applications currently hosted on AWS infrastructure.

While the Customer's reliance on AWS increased, the number of pain points grew as well. They experienced multiple outages and had to bear unnecessarily high costs to scale application performance and to accommodate unbalanced CPU/memory hardware profiles. Although the achieved application performance was generally satisfactory, several major challenges and trends emerged over time:
- Scalability and growth issues
- Very high overall infrastructure and support costs
- Single service provider lock-in

Verizon proposed to trial the Verizon Cloud Compute (VCC) beta product as an alternative hosting platform, with the goal of demonstrating that on-par application performance can be achieved at a much lower cost, effectively addressing one of the biggest challenges. An alternative hosting platform would give the Customer freedom of choice, thus addressing another issue. Last, but not least, the unique VCC platform architecture and infrastructure stack, built for low-latency and high-performance workloads, would help to address yet another pain point – application performance and scalability.

Senior executives from both companies supported this initiative and one of the Customer's applications was selected for the proof-of-concept project. The objective was to compare the AWS and VCC deployments side by side from both capability and performance perspectives, execute performance tests and deliver a report to senior management.

The proof-of-concept project was successfully executed in close collaboration between various Verizon teams as well as the Customer's SMEs. It was demonstrated that the application hosted on the VCC platform, given appropriate tuning, is capable of delivering better performance than when hosted on a more powerful AWS-based footprint.

PART I – BUILDING TESTBED

The agreed high-level plan was clear and straightforward:
• (Verizon) Mirror the AWS hosting infrastructure using the VCC platform
• (Verizon) Set up infrastructure, OS and applications per the specification sheet
• (Customer) Adjust necessary configurations and settings on the VCC platform
• (Customer) Upload test data – 10 million users, 100 million contacts
• (Customer) Execute smoke, performance and aging tests in the AWS environment
• (Customer) Execute smoke, performance and aging tests in the VCC environment
• (Customer) Compare AWS and VCC results and captured metrics
• (Customer) Deliver report to senior management

The high-level diagram below depicts the application infrastructure hosted on the AWS platform.

Figure 1: AWS Application Deployment

Although both the AWS and VCC platforms use the XEN hypervisor at their core, the initial step – mirroring the AWS hosting environment by provisioning equally sized VMs in VCC – raised the first challenge. The Verizon Cloud Compute platform, in its early beta stage, imposed a number of limitations. To be fair, those limitations were neither by design nor hardware limits, but rather software or configuration settings pertinent to the corresponding product release.

The table below summarizes the most important infrastructure limits for both cloud platforms as of February 2014:

Resource Limit             VCC         AWS
VPUs per VM                8           32
RAM per VM                 28 GB       244 GB
Volumes per VM             5           20+
IOPS per Volume (SSD)      3000        4000
Max Volume Size            1 TB        1 TB
Guaranteed IOPS per VM     15K         40K
Throughput per vNIC        500 Mbps    10 Gbps

Table 1: Major Infrastructure Limits

Besides the obvious points, like the number of CPUs or the huge difference in network throughput, it is also worth mentioning that the CPU/RAM ratio – processor count to memory size – is quite different as well: 1:4.5 for VCC and 1:7.625 for AWS respectively. This ratio is crucial for certain types of applications, specifically for databases.

Despite the aforementioned differences, it was jointly decided with the Customer to move forward with the smaller VCC VMs and to take the sizing ratio into account when comparing performance and test results. This already set the expectation that VCC results might be lower compared to AWS, assuming linear application scalability and a 4-8x difference in hardware footprint.

The tables below summarize the infrastructure sizing and mapping for the corresponding service layers hosted on both cloud platforms. Resources sized differently on the corresponding platforms are highlighted.

AWS VM Profile
VM Role      Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat       2       4      34.2      -      1000
MySQL        1       32     244       10K    10000
Cassandra    8       8      68.4      5K     1000
HA Proxy     4       2      7.5       -      1000
DB Cache     2       4      34.2      -      1000

Table 2: AWS Infrastructure Mapping and Sizing

VCC VM Profile
VM Role      Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat       2       4      28        -      500
MySQL        1       4      28        9K     500
Cassandra    12      4      28        5K     500
HA Proxy     4       2      4         -      500
DB Cache     2       4      28        -      500

Table 3: VCC Infrastructure Mapping and Sizing

The initial setup of the disk volumes required special creativity in order to get as close as possible to the required number of IOPS. In addition to the per-disk storage limits mentioned above, there was initially another VCC limitation in place, luckily addressed later: all disks connected to a particular VM had to be provisioned with the exact same IOPS rate.

The most common setup used was based on LVM2, with a linear extension for the boot disk volume group and either two or three additional disks aggregated into an LVM stripe set. This setup allowed creating disk volumes of up to 3 TB and 9000 IOPS, getting close enough to the required 10K IOPS for the database VMs.
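For illustration only, a minimal sketch of such a striped data volume (device names, the three-disk stripe count and the 256 KB stripe size are assumptions for the example, not the exact build commands used):

# Aggregate three 1 TB / 3000 IOPS disks into one striped LVM volume (~3 TB, ~9000 IOPS)
pvcreate /dev/xvdb /dev/xvdc /dev/xvdd
vgcreate vg_data /dev/xvdb /dev/xvdc /dev/xvdd
# -i 3: stripe across all three physical volumes; -I 256: 256 KB stripe size (example value)
lvcreate -i 3 -I 256 -l 100%FREE -n lv_data vg_data
mkfs.ext4 /dev/vg_data/lv_data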
  
	
  
Besides the technical limitations, the sheer volume of provisioning and configuration work presented a challenge in itself. The hosting platform requirements were captured in a spreadsheet listing system parameters for every VM. Following this spreadsheet manually and building out the environment sequentially would have required significant time and tremendous manual effort. Additionally, it could have resulted in a number of human errors and omissions. Automating and scripting major parts of the installation and setup process addressed this.
  
	
  
The automation suite, implemented on top of the vzDeploymentFramework shell library (a Verizon internal development), made it possible in a matter of minutes to:
- Parse the specification spreadsheet for inputs and updates
- Generate updated OS and application configurations
- Create LVM volumes or software RAID arrays
- Roll out updated settings to multiple systems based on their functional role
- Change Linux iptables-based firewall configurations across the board
- Validate required connectivity between hosts
- Install required software packages
A simplified sketch of this kind of role-based rollout follows the list.
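The framework itself is Verizon-internal, so the following is purely an illustration of the idea, not its actual code; the CSV layout, template paths and host naming are invented for the example:

#!/bin/sh
# Push per-role sysctl and iptables templates to every host of a given role listed in spec.csv
# Assumed spec.csv layout: hostname,role,ip
ROLE="$1"
grep ",${ROLE}," spec.csv | cut -d, -f1 | while read HOST; do
    scp "templates/${ROLE}/sysctl.conf" "root@${HOST}:/etc/sysctl.d/90-${ROLE}.conf"
    scp "templates/${ROLE}/iptables"    "root@${HOST}:/etc/sysconfig/iptables"
    ssh "root@${HOST}" "sysctl -p /etc/sysctl.d/90-${ROLE}.conf && service iptables restart"
done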
  
	
  
Having all configurations in a version-controlled repository allowed auditing and comparing configurations between the master and the on-host deployed versions, providing rudimentary configuration management capabilities.
  
	
  
Below is the high-level architecture of the originally implemented test environment.

Figure 2: Initial VCC Application Deployment

The test load was initiated by a JMeter Master (Test Controller and Management GUI) and generated by several JMeter Slaves (Load Generators or Test Agents). The generated virtual user (VU) requests were load-balanced between two Tomcat application servers, each running a single application instance.
  
	
  
Since F5 LTM instances were not available at build time, the proposed design utilized pfSense appliances as routers, load-balancers or firewalls for the corresponding VLANs.

The Tomcat servers communicated, via another pair of HAProxy load-balancers, with two persistent storage back-ends – MySQL (SQL DB) and Cassandra (NoSQL DB) – employing Couchbase (DB Cache) as a caching layer.
  
	
  
	
  
Most systems were additionally instrumented with NMON collectors for gathering key performance metrics. The Jennifer APM application was deployed to perform real-time transaction monitoring and code introspection.

Following the initial plan, the hosting environment was handed over to the Customer on time for adjusting configurations and uploading test data.
  
	
  
PART II – FIRST TEST

The first test was conducted on both the AWS and VCC platforms and the Customer shared the test results. During the test the load was ramped up in 100 VU increments for each subsequent 10-minute test run. During each run the corresponding number of virtual users performed various API calls emulating human behavior, using patterns observed and measured on the production application.
  
	
  
The chart below depicts the number of application transactions successfully processed by each platform during the 10-minute test runs.

Figure 3: First Test Results – Comparison Chart

It was obvious that the AWS infrastructure is more powerful, processing more than two times higher throughput, which did not come as a big surprise. However, the Customer expressed several concerns about overall VCC platform stability, low MySQL DB server performance and uneven load distribution between the striped data volumes, dubbed I/O skews.
Data from Figure 3 – TPS per VU count:

VU count      200   300   400   500   600   700   800   900
AWS TPS       321   462   539   627   637   645   651   654
Verizon TPS   203   256   269   257   275   249   268   247
 
Indeed, the application “Transactions per Second” (TPS) measurements did not correlate well with the generated application load, and even with a growing number of users something prevented the application from taking off. After short increases the overall throughput consistently dropped again, clearly pointing to a bottleneck limiting the transaction stream.
  
	
  
According to the Jennifer APM monitors, the increase in application transaction times was caused by slow DB responses, taking 5 seconds and more per single DB operation. At the same time the DB server was showing very high CPU %iowait, fluctuating around 85-90%.

Figure 4: First Test – High CPU Load on DB Server

Figure 5: First Test – High CPU %iowait on DB Server
  
Furthermore, out of the three stripes making up the data volume, one volume constantly reported significantly higher device wait times and utilization percentage, effectively causing disk I/O skews.

Figure 6: First Test – Disk I/O Skew on DB Server
  
Obviously, these test results were not acceptable. Investigating and identifying the bottlenecks and performance-limiting factors required good knowledge of the application architecture and its internals, as well as deep VCC product and storage stack knowledge, since the latter two issues seemed to be platform and infrastructure related. To address this, a dedicated cross-team taskforce was established.
  
	
  
PART III – STORAGE STACK PERFORMANCE

The VCC storage stack was validated once more and it was reconfirmed that there are no limiting factors or shortcomings on the layers below the block device. The resulting conclusion was that the limitations had to be at the hypervisor, OS, or application layer.
  
	
  
On the other hand, the Customer confirmed that the AWS deployment uses exactly the same configuration and application versions as VCC. The only possible logical conclusion was that a setup and configuration optimal for AWS does not perform the same way on VCC. In other words, the VCC platform required its own optimal configuration.
  
	
  
Further efforts were aligned with the following objectives:
- Improve storage throughput and address the I/O skews
- Identify the root cause of the low DB server performance
- Improve DB server performance and scalability
- Work with the Customer on improving overall VCC deployment performance
- Re-run the performance tests and demonstrate improved throughput and predictable performance levels
  
	
  
Originally the storage volumes were set up using the Customer's specifications and OS defaults for the other parameters. After performing research and a number of component performance tests, several interesting discoveries were made, in particular:
- Different Linux distributions (Ubuntu and CentOS) use a different approach to disk partitioning: Ubuntu aligned partitions for 4K block sizes, while CentOS did not
- The default block device scheduler, CFQ, is not a good choice in environments using virtualized storage
- The MDADM and LVM volume managers use quite different algorithms for I/O batching and compaction
- The XFS and EXT4 file-systems yield very different results depending on the number of concurrent threads performing I/O
- Due to all the Linux optimizations and multiple caching levels, it is hard enough to measure net storage throughput from within a VM, let alone through the entire application stack (see the fio sketch after this list)
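As an illustration of that last point, one way to take the page cache out of the picture is to benchmark the raw device with direct I/O. A hedged fio sketch follows; the device name, run time and job parameters are example values only, and fio will overwrite the target, so it must never be pointed at a device holding data:

# Random 4K writes straight to the block device, bypassing the file-system and page cache
fio --name=rand-write --filename=/dev/xvdb --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based \
    --group_reporting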
  
	
  
After a number of trials and studying the platform behavior, the following was suggested for achieving optimal I/O performance on the VCC storage stack (a consolidated example is sketched after the list):
- Use raw block devices instead of partitions for RAID stripes, to circumvent any partition block alignment issues
- Use MDADM software RAID instead of LVM (the latter is more flexible and may be used in combination with MDADM; however, it performs a certain amount of “optimization” assuming spindle-based storage, which may interfere with performance in VCC)
- Use proper stripe settings and block sizes for software RAID (don't let the system guess – specify!)
- Use the EXT4 file-system instead of XFS. EXT4 provides journaling for both meta-data and data, instead of meta-data only, with negligible performance overhead for the load observed
- Use optimal (and safe) settings for EXT4 file-system creation and mounts
- Ensure the NOOP block device scheduler is used (which lets the underlying storage stack, from the hypervisor down, optimize block I/O more effectively)
- Separate different I/O profiles, e.g. sequential I/O (redo/bin-log files) and random I/O (data files) for the DB server, by writing the corresponding data to separate logical disks
- Use DIRECT_IO wherever possible and avoid OS/file-system caching (in certain situations the cache may give a false impression of high performance, which is then abruptly interrupted by the flushing of massive caches, during which the entire VM gets blocked)
- Avoid I/O bursts due to cache flushing and keep the device queue length close to 8. This corresponds to a hardware limitation on the chassis NPU. In VCC, storage is very low-latency and quick, but if the storage queue locks up, the entire VM gets blocked. Writing early and often at a consistent rate performs dramatically better under load than caching in RAM as long as possible and then flooding the I/O queue when the cache has been exhausted
- Make sure the network device driver is not competing with the block device drivers and the application for CPU time, by relocating the associated interrupts to different vCPU cores inside the VM
- Use 4K blocks for I/O operations wherever possible, for more optimal storage stack operation
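A consolidated sketch of the points above, assuming three 5K-IOPS data disks exposed as /dev/xvdb-/dev/xvdd (device names, chunk size and mount point are example values, not the exact production commands):

# Software RAID0 over raw block devices with an explicit chunk size - don't let the system guess
mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=256 /dev/xvdb /dev/xvdc /dev/xvdd

# EXT4 aligned to the stripe geometry: stride = chunk / 4K block, stripe-width = stride * 3 disks
mkfs.ext4 -b 4096 -E stride=64,stripe-width=192 /dev/md0
mount -o noatime,data=ordered /dev/md0 /data

# NOOP scheduler on the member devices lets the hypervisor storage stack do the reordering
for d in xvdb xvdc xvdd; do echo noop > /sys/block/$d/queue/scheduler; done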
  
	
  
After implementing these suggestions on a DB server, the storage subsystem yielded predictable and consistent performance. For example, data volumes set up with 10K IOPS have been reporting ~39 MB/s throughput, which is the expected maximum assuming a 4K I/O block size:

(4 KB × 10,000 IOPS) / 1024 = 39.06 MB/s – the maximum possible throughput at 10K IOPS
(4 KB × 15,000 IOPS) / 1024 = 58.59 MB/s – the maximum possible throughput at 15K IOPS

With a 15K IOPS setup using 3 stripes (5K IOPS each), ~55-56 MB/s throughput was achieved, as shown on the screenshot below:

Figure 7: Optimized Storage Subsystem Throughput

Although some minor deviation in the I/O figures (+/- 5%) was still observed, this is typically considered acceptable and within the normal range.
  
	
  
While performing additional tests on the optimized systems, it was observed that all block device interrupts were being served by CPU0, which was becoming a hot spot even with the netdev interrupts moved off to different CPUs. The following method may be used to spread block device interrupts evenly across the devices implementing RAID stripes:
  
# distribute block device interrupts between CPU4-CPU7
# smp_affinity takes a hexadecimal CPU bitmask: 0x10=CPU4, 0x20=CPU5, 0x40=CPU6, 0x80=CPU7
cat /proc/interrupts
cat /proc/irq/183[3-6]/smp_affinity*
echo 80 > /proc/irq/1836/smp_affinity
echo 40 > /proc/irq/1835/smp_affinity
echo 20 > /proc/irq/1834/smp_affinity
echo 10 > /proc/irq/1833/smp_affinity
echo 8 > /proc/irq/1838/smp_affinity    # mask 0x08 = CPU3
  	
  
Please note that IRQ numbers and assignments may differ on your system. You have to consult the /proc/interrupts table for the specific assignments pertinent to your system.
  
	
  
For additional details and theory, please refer to the following online materials:
http://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
https://www.kernel.org/doc/ols/2009/ols2009-pages-235-238.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt
  
	
  
PART IV – DATABASE OPTIMIZATION

Since the Customer had not yet shared the application and testing know-how, the only way to reproduce the abnormal DB behavior seen during the test was to replay the DB transaction log against a DB snapshot recovered from backup. This was a slow, cumbersome and not fully repeatable process. The Percona tools were really instrumental for this task, allowing a multithreaded transaction replay that inserts delays between transactions as recorded. A plain SQL script import would have been processed by a single thread only, and all requests would be processed as one stream.
  
	
  
Although the transaction replay did create some DB server load, the load type and its I/O patterns were quite different from the I/O patterns observed during the test. Transaction logs include only DML statements (insert, update, delete), but no data read (select) requests. Knowing that those “select” requests represented 75% of all requests, it quickly became apparent that such a testing approach is flawed and would not be able to recreate real-life conditions.
  
	
  
We came to a point where more advanced tools and techniques were required for iterating over various DB parameters in a repeatable fashion while measuring their impact on DB performance and the underlying subsystems. Moreover, it was not clear whether the unexpected DB behavior and performance issues were caused by the virtualization infrastructure, the DB engine settings, or the way the DB was used, i.e. the combination of application logic and the data stored in the DB tables.
  
	
  
To separate those concerns it was proposed to perform load tests using synthetic OLTP transactions generated by sysbench, a well-known load-testing toolkit. Such tests were executed on both the VCC and AWS platforms (an example invocation is sketched below). The results spoke for themselves.
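For reference, a hedged example of such a synthetic OLTP round (sysbench 0.4.x syntax; host, table size, thread count and credentials are placeholders rather than the values actually used):

# Prepare the test table, then run a mixed read/write OLTP workload against it
sysbench --test=oltp --mysql-host=10.0.0.10 --mysql-user=sbtest --mysql-password=sbtest \
         --oltp-table-size=10000000 prepare
sysbench --test=oltp --mysql-host=10.0.0.10 --mysql-user=sbtest --mysql-password=sbtest \
         --oltp-table-size=10000000 --oltp-test-mode=complex \
         --num-threads=64 --max-requests=100000 run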
  
	
  
 
Figure 8: AWS i2.8xlarge CPU load – Sysbench Test Completed in 64.42 sec

Figure 9: VCC 4C-28G CPU load – Sysbench Test Completed in 283.51 sec
  
At this point it was clear that the DB server's performance issues have nothing to do with the application logic and are not specific to the SQL workload, but are rather related to configuration and infrastructure. The OLTP test provided the capability to stress test the DB engine and optimize it independently, without having to rely on the Customer's application know-how and the solution-wide test harness.
  
	
  
A thorough research and study of the InnoDB engine began… Studying the source code, as well as consulting the following online resources, was key to a clear understanding of the DB engine internals and its behavior:

- http://www.mysqlperformanceblog.com
- http://www.percona.com
- http://dimitrik.free.fr/blog/
- https://blog.mariadb.org
  
	
  
	
  
The drawing below, published by Percona engineers, shows the key factors and settings impacting DB engine throughput and performance.

Figure 10: InnoDB Engine Internals
  
Obviously, there is no quick win and no single dial to turn in order to achieve the optimal result. It is easy to explain the main factors impacting InnoDB engine performance, though optimizing those factors in practice is quite a challenging task.
  
	
  
 
InnoDB Performance – Theory and Practice
  
	
  
The two most important parameters for InnoDB performance are innodb_buffer_pool_size and innodb_log_file_size. InnoDB works with data in memory, and all changes to data are performed in memory. In order to survive a crash or system failure, InnoDB logs changes into the InnoDB transaction logs. The size of the InnoDB transaction log defines how many changed blocks are tolerated in memory at any given point in time. The obvious question is: “why can't we simply use a gigantic InnoDB transaction log?” The answer is that the size of the transaction log affects recovery time after a crash. The rule of thumb (until recently) was: the bigger the log, the longer the recovery time. Okay, so we have another variable, innodb_log_file_size. Let's imagine it as some distance on an imaginary axis:
  
	
  
Our current state is the checkpoint age, which is the age of the oldest modified non-flushed page. The checkpoint age is located somewhere between 0 and innodb_log_file_size. Point 0 means there are no modified pages. The checkpoint age can't grow past innodb_log_file_size, as that would mean we would not be able to recover after a crash.
  
	
  
In fact, InnoDB has two safety nets, or protection points: “async” and “sync”. When the checkpoint age reaches the “async” point, InnoDB tries to flush as many pages as possible while still allowing other queries; however, throughput drops through the floor. The “sync” stage is even worse. When we reach the “sync” point, InnoDB blocks other queries while trying to flush pages and return the checkpoint age to a point before “async”. This is done to prevent the checkpoint age from exceeding innodb_log_file_size. These are both abnormal operational stages for InnoDB and should be avoided at all cost. In current versions of InnoDB, the “sync” point is at about 7/8 of innodb_log_file_size, and the “async” point is at about 6/8 = 3/4 of innodb_log_file_size.
  
	
  
So, there is one critically important balancing act: on the one hand we want the “checkpoint age” to be as large as possible, as it defines performance and throughput; but, on the other hand, we should never reach the “async” point.
  
 
The idea is to define another point T (target), located before “async” in order to have a gap for flexibility, and to try at all cost to keep the checkpoint age from going past T. We assume that if we can keep the “checkpoint age” in the range 0 – T, we will achieve stable throughput even for a more or less unpredictable workload.
  
	
  
Now, which factors affect the checkpoint age? When we execute DML queries that change data (insert/update/delete), we perform writes to the log, we change pages, and the checkpoint age grows. When we flush changed pages, the checkpoint age goes down again. That means the main way to keep the checkpoint age around point “T” is to change the number of pages flushed per second, or to make this number variable and suited to the specific workload. That way, we can keep the checkpoint age down. If this doesn't help and the checkpoint age keeps growing beyond “T” towards “async”, we have a second control mechanism: we can add a delay into insert/update/delete operations. This way we prevent the checkpoint age from growing and reaching “async”.
  
	
  
To summarize, the idea behind the optimization algorithm is: under load we must keep the checkpoint age around point “T” by increasing or decreasing the number of pages flushed per second. If the checkpoint age continues to grow, we need to throttle throughput to prevent further growth. The throttling depends on the position of the checkpoint age – as the checkpoint age gets closer to “async”, higher levels of throttling are needed. (A hedged sketch for observing the checkpoint age on a live server follows.)
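As a hedged illustration (not part of the original test harness), the checkpoint age can be observed on a 5.5/5.6-era server by comparing the log sequence number with the last checkpoint reported by SHOW ENGINE INNODB STATUS; credentials are assumed to live in ~/.my.cnf:

#!/bin/sh
# Total redo-log capacity in bytes - the ceiling the checkpoint age must never reach
CAP=$(mysql -N -e "SELECT @@innodb_log_file_size * @@innodb_log_files_in_group;")
while sleep 10; do
    S=$(mysql -e "SHOW ENGINE INNODB STATUS\G")
    LSN=$(echo "$S" | awk '/Log sequence number/ {print $NF}')
    CKPT=$(echo "$S" | awk '/Last checkpoint at/ {print $NF}')
    AGE=$((LSN - CKPT))
    # "async" flushing starts around 6/8 and "sync" around 7/8 of the log capacity
    echo "$(date '+%T') checkpoint_age=${AGE}B ($((AGE * 100 / CAP))% of ${CAP}B)"
done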
  
	
  
From Theory to Practice – Test Framework

There is a saying: in theory, there is no difference between theory and practice, but in practice there is…
  
	
  
In practice, there are a lot more variables to bear in mind. Factors such as I/O limits, thread contention and locking also come into play, and improving performance becomes more like solving an equation with a number of interdependent variables…
  
	
  
Obviously, to be able to iterate over various parameter and setting combinations, the DB tests need to be executed in a repeatable and well-defined (read: automated) manner, while capturing test results for correlation and further analysis. Quick research showed that although there are many load-testing frameworks available, with some specifically tailored for testing MySQL DB performance, unfortunately none of them would cover all the requirements and provide the needed tools and automation.
  
	
  
Eventually, we developed our own fully automated and flexible load-testing framework. This framework was mainly used to stress test and analyze MySQL and InnoDB behavior; nonetheless, it is open enough to plug in any other tools or to be used for testing different applications. The developed toolkit includes the following components (a simplified sketch of the runner loop follows the list):
- Test Runner
- Remote Test Agent (load generator)
- Data Collector (sampler)
- Data Processor
- Graphing facility
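To give a flavour of the approach, here is a simplified sketch, not the framework itself; the parameter being swept, the host names, file paths and sysbench options are examples only:

#!/bin/sh
# Test runner: sweep one InnoDB parameter, drive the load remotely, sample iostat for later graphing
RESULTS=./results; mkdir -p "$RESULTS"
for IOCAP in 2000 4000 8000; do
    TAG="io_capacity_${IOCAP}"
    mysql -h db01 -e "SET GLOBAL innodb_io_capacity = ${IOCAP};"
    ssh db01 "iostat -xm 10" > "$RESULTS/$TAG.iostat" &        # data collector (sampler)
    SAMPLER=$!
    sysbench --test=oltp --mysql-host=db01 --mysql-user=sbtest --mysql-password=sbtest \
             --oltp-table-size=10000000 --oltp-test-mode=complex \
             --num-threads=128 --max-time=600 --max-requests=0 run > "$RESULTS/$TAG.sysbench"
    kill "$SAMPLER"                                            # stop sampling for this run
done
# the data processor and graphing steps then parse the *.iostat and *.sysbench files into charts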
  
	
  
Using this framework it was possible to identify the optimal MySQL and InnoDB engine configuration. The goal was to deliver the best possible InnoDB engine performance in terms of transactions and queries served per second (TPS and QPS), while eliminating I/O spikes and achieving consistent and predictable system load – in other words, fulfilling the critically important balancing act mentioned above: keeping the “checkpoint age” as large as possible while trying not to reach the “async” (or, even worse, the “sync”) point.
  
	
  
The graphs below show that an optimally configured DB server can easily deliver 1000+ OLTP transactions per second, translating to 20K+ queries per second, generated by 500 concurrent DB connections during a 6-hour-long test.

Figure 11: Optimized MySQL DB – QPS Graph (queries per second – green)
  
After a warm-up phase the system consistently delivered about 22K queries per second.

Figure 12: Optimized MySQL DB – TPS and RT Graph (transactions per second – green, response time – blue)
  
 
After ramping the load up to 500 concurrent users, the system consistently delivered 1200 TPS on average. The average response time of 1600 ms is measured end to end and includes both network and communication overhead (~1000 ms) and SQL processing time (~600 ms).

Figure 13: Optimized MySQL DB – RAID Stripe I/O Metrics (%util – red, await – green, avgqu-sz – blue)
  
It is easy to see that after the warm-up and stabilization phases the disk stripe performed consistently, with an average disk queue size of ~8, which was suggested by the storage team as the optimum value for the VCC storage stack. The “await” iostat metric – the average time for I/O requests to be issued to the device and served – stays constantly below 20 ms. Device utilization is below 25% on average, showing that there is still plenty of spare capacity to serve I/O requests.

Figure 14: Optimized MySQL DB – CPU Metrics (%idle – red, %user – green, %system – blue, %iowait – yellow)
  
	
  
The CPU metrics show that, on average, 55% of CPU capacity was idle, 35% was spent in user space (i.e. executing applications), 5% was spent on kernel (system) tasks including interrupt processing, and just 5% was spent waiting for device I/O.

Figure 15: Optimized MySQL DB – Network Metrics (bytes sent – green, bytes received – blue)
  
The network traffic measurements suggest that the network capacity is fully consumed – in other words, the network is saturated – with ~48 MB/s sent and ~2 MB/s received. These 50 MB/s of accumulated traffic come very close to the practical maximum throughput that can be achieved on a 500 Mbps network interface (500 Mbps ÷ 8 = 62.5 MB/s before protocol overhead).
  
	
  
In plain English this means that the network is the limiting factor here: with other resources still available, the DB server could deliver much higher TPS and QPS figures if additional network capacity were provisioned. The ultimate system capacity limit was not established, due to time constraints and the fact that the Customer application did not utilize more than 300 concurrent DB connections.
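A quick way to confirm NIC saturation from inside the VM is to watch the per-interface throughput counters; a hedged example (the interface name eth0 and the sampling interval are assumptions):

# 500 Mbps / 8 = 62.5 MB/s theoretical ceiling; sustained rx+tx throughput around ~50 MB/s
# on a 500 Mbps vNIC points at the link, not the DB engine, as the bottleneck
sar -n DEV 1 10 | grep eth0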
  
	
  
Optimal DB Configuration

Below is a summary of the major changes between the MySQL database configurations on the AWS and VCC platforms. As with the file-system configuration, the objective was to achieve consistent and predictable performance by avoiding resource usage surges and stalls.
  	
  
	
  
The proposed optimizations may have a positive effect in general; however, they are specific to a certain workload and use-case. Therefore, these optimizations cannot be considered universally applicable in VCC environments and must be tailored to a specific workload. Settings marked with an asterisk (*) are defaults for the DB version used.

< … removed … >

Table 4: Optimized MySQL DB – Recommended Settings
  
	
  
Besides the parameter changes listed above, the binary logs (also known as transaction logs) have been moved to a separate volume, where the Ext4 file-system has been set up with the following parameters:

< … removed … >
  
Further areas for DB improvement:
- Consider using the latest stable Percona XtraDB version, which is based on the MariaDB codebase and provides many improvements, including patches from Google and Facebook:
  o Redesign of the locking subsystem, with no reliance on kernel mutexes
  o Latest versions have removed a number of known contention points, resulting in fewer spins and lock waits and, eventually, better overall performance
  o Dump and pre-load buffer pool features, allowing much quicker startup and warm-up phases
  o Online DDL – changing the schema does not require downtime
  o Better query analyzer and overall query performance
  o Better page compression support and performance
  o Better monitoring and integration with the performance schema
  o A more intelligent flushing algorithm that takes into consideration page change rates, I/O rates, system load and capabilities, thus providing better out-of-the-box performance adjusted to the workload
  o Better suited for fast SSD-based storage (no added cost for random I/O), with adaptive algorithms that do not attempt to accommodate the shortcomings of spinning disks
  o Scales better on SMP (multi-core) systems and better utilizes a higher number of CPU threads
  o Provides fast checksums (hardware-assisted CRC32), lessening CPU overhead while retaining data consistency and security
  o New configuration options allowing the InnoDB engine to be tailored even better to a specific workload
- Consider using a more efficient memory allocator, e.g. jemalloc or tcmalloc (a hedged example of swapping the allocator appears after this list):
  o The memory allocator provided as part of GLIBC is known to fall short under high concurrency
  o GLIBC malloc wasn't designed for multithreaded workloads and has a number of internal contention points
  o Using modern memory allocators suited for high concurrency can significantly improve throughput by reducing internal locking and contention
- Perform DB optimization. While optimizing the infrastructure may result in significant improvement, even better results may be achieved by tailoring the DB structure itself:
  o Consider clustered indexes to avoid locking and contention
  o Consider page compression. Besides a slight CPU penalty, this may significantly improve throughput while reducing on-disk storage several times, resulting in turn in quicker replication and backups
  o Monitor the performance schema to learn more about in-flight DB engine performance and adjust the required parameters
  o Monitor the performance and information schemas to find more details about index effectiveness and build better, more effective indexes
- Perform SQL optimization. No infrastructure optimization can compensate for badly written SQL requests. Caching and other optimization techniques often mask bad code. SQL queries joining multi-million record tables may work just fine in development and completely break down on a production DB. Continuously analyze the most expensive SQL queries to avoid full table scans and on-disk temporary tables.
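As a hedged example of the allocator swap mentioned above (the library path and MySQL packaging differ per distribution):

# Option 1: let mysqld_safe preload the allocator (the --malloc-lib option exists since MySQL 5.5)
mysqld_safe --malloc-lib=/usr/lib64/libjemalloc.so.1 &

# Option 2: preload it explicitly when starting mysqld directly
# LD_PRELOAD=/usr/lib64/libjemalloc.so.1 /usr/sbin/mysqld --defaults-file=/etc/my.cnf &

# Verify the allocator is actually mapped into the running process
grep -i jemalloc /proc/$(pidof mysqld)/maps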
  
	
  
PART V – PEELING THE ONION

It is a common saying that performance improvement is like peeling an onion: after addressing one issue, the next one, previously masked, is uncovered, and so on… Likewise, in our case, after addressing the storage and DB layers and improving overall application throughput, it became apparent that something else was holding the application back from delivering the best possible performance. By this time the DB layer was studied very well; however, the overall application stack and the associated connection flows were not yet completely understood.
  
	
  
The	
  Customer	
  demonstrated	
  willingness	
  to	
  cooperate	
  and	
  assisted	
  by	
  providing	
  
instructions	
  for	
  reproducing	
  JMeter	
  load	
  tests	
  as	
  well	
  as	
  on-­‐site	
  resources	
  for	
  an	
  
architecture	
  workshop.	
  
	
  
From	
  this	
  point	
  on,	
  the	
  optimization	
  project	
  speed	
  up	
  tremendously.	
  Not	
  only	
  was	
  it	
  
possible	
  to	
  iterate	
  reliably	
  and	
  perform	
  load-­‐test	
  against	
  the	
  complete	
  application	
  stack,	
  
the	
  understanding	
  of	
  the	
  application	
  architecture	
  and	
  access	
  to	
  Application	
  
Performance	
  Management	
  (APM)	
  tool	
  Jennifer	
  made	
  a	
  huge	
  difference	
  in	
  terms	
  of	
  
visibility	
  into	
  internal	
  application	
  operation	
  and	
  major	
  performance	
  metrics.	
  	
  
	
  
	
  
	
  
Figure	
  16:	
  Jennifer	
  APM	
  Console	
  
Besides	
  providing	
  visual	
  feedback	
  and	
  displaying	
  a	
  number	
  of	
  metrics,	
  Jennifer	
  revealed	
  
the	
  next	
  bottleneck	
  –	
  the	
  network.	
  	
  
	
  
PART VI – PFSENSE

The original network design, replicating the network structure in AWS, was proposed and agreed with the Customer. Separate networks were created to replicate the functionality of AWS VPC, and pfSense appliances were used to provide network segmentation, routing and load balancing.

< … removed … >

Figure 17: Initial Application Deployment - Network Diagram

pfSense is an open source firewall/router software distribution based on FreeBSD. It is installed on a VM and turns it into a dedicated firewall/router for a network. It also provides additional important functions such as load balancing, VPN and DHCP. It is easy to manage through a web-based UI, even for users with little knowledge of the underlying FreeBSD system.

The FreeBSD network stack is known for its exceptional stability and performance. The pfSense appliances had been used many times before and since, so nobody expected issues coming from that side…

Watching the Jennifer XView chart closely in real time is fun in itself, like watching fire. It is also a powerful analysis tool that helps to understand the behavior of application components.

Figure 18: Jennifer XView - Transaction Response Time Scatter Graph

On the graph above, the distance between layers is exactly 10000ms, pointing to the fact that one of the application services was timing out at a 10-second interval and repeating connection attempts several times.

Figure 19: Jennifer APM - Transaction Introspection

Network socket operations were taking a significant amount of time, resulting in multiple repeated attempts at 10-second intervals.

Following the old sysadmin adage – "…always blame the network…" – the application flows were analyzed again and pfSense was suspected of losing or delaying packets. Interestingly enough, the web UI reported low to moderate VM load and didn't show any reason for concern.

Nonetheless, console access revealed the truth – the load created by a number of short thread spins was not properly reported in the web UI and was hidden by averaging. A closer look using advanced CPU and system metrics (a few console-level checks of the kind sketched below) confirmed that the appliance was experiencing unexpectedly high CPU load, adding latency and dropping network packets.
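
For reference, a minimal set of FreeBSD console checks that expose this kind of hidden load; the exact commands are an assumption of what one would run on the appliance, not a transcript of the session:

# per-CPU utilization - the web UI averages this away
top -P

# interrupt rates per device; a busy NIC shows up here long before the load average does
vmstat -i

# live traffic and error/drop counters, refreshed every second
netstat -w 1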
  
	
  
Adding more CPUs to the pfSense appliances doubled the network traffic they could pass. However, even with the maximum CPU count the network was still not saturated, suggesting that the pfSense appliances might still be limiting application performance.

Since the pfSense appliances were not an essential requirement and were only used to provide routing and load-balancing capability, it was decided to remove them from the application network flow and to reach the subnets directly by adding additional network cards to the VMs, with each NIC connected to the corresponding subnet.

To summarize - it would be wrong to conclude that pfSense does not fit the purpose or is not a viable option for building virtual network deployments. Most definitely, additional research and tuning would help to overcome the observed issues. Due to time constraints this area was not fully researched and is still pending thorough investigation.
PART VII – JMETER

With pfSense removed and HAProxy used for load balancing, overall application throughput definitely improved. Increasing the number of CPUs on the DB servers and the Cassandra nodes seemed to help as well. The collaborative effort with the Customer yielded great results and we were definitely on the right track.

With the floodgates wide open we were able to push over 1000 concurrent users during our tests. Around the same time we started seeing another anomaly – one out of three JMeter load agents (generators) was behaving quite strangely. After reaching the end of the 3600-second test window, the java threads belonging to two of the JMeter servers shut down quickly, while the third instance took a while to shut down, effectively extending the test window and, as a result, negatively skewing the averaged test metrics.

All three JMeter servers were reconfigured to use the same settings. For some reason they had been using slightly different configurations and were logging data to different paths. That didn't resolve the underlying issue, though. Due to time constraints it was decided to build a replacement VM rather than troubleshoot the misbehaving one.

Eventually, a fourth JMeter server was deployed. Besides fixing the issue with java thread startup and shutdown, it allowed us to generate higher loads and provided additional flexibility in defining load patterns.

Lesson learned: for low to moderate loads JMeter works just fine. For high loads, JMeter may become the breaking point itself. In this case it is recommended to use a scale-out approach rather than scale-up, keeping the number of java threads per server below a certain threshold (a minimal distributed-mode sketch follows).
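
For illustration only – a bare-bones way to run JMeter scaled out across several load generators in non-GUI mode. The host names and the test plan file are hypothetical placeholders, not the actual test assets used in this project:

# on each load generator VM - start the JMeter remote (RMI) server process
jmeter-server &

# on the controller - drive the same test plan from all generators at once,
# keeping the per-server thread count moderate and letting the fleet provide the volume
jmeter -n -t loadtest.jmx -R jmeter01,jmeter02,jmeter03,jmeter04 -l results.jtl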
  
	
  
PART VIII – ALMOST THERE

Although the AWS performance measurements were still better, we had already significantly improved performance compared to the figures captured during the first round of performance tests.

After removing pfSense, an average of 587 TPS at 800 VU was achieved. In this test the load was spread statically rather than balanced, by manually specifying different target application server IP addresses in the JMeter configuration files. With a HAProxy load balancer put in place, the TPS figure initially dropped to 544 and, after some optimizations (connection tracking and netfilter disabled), increased to 607 TPS at 800 VU – the maximum we had seen to date. This represents a 22% increase over the best previous result (498 TPS/800 VU, still with pfSense in place) and a 100% increase over the initial performance test. Overall the results were looking more than promising.

Figure 20: Iterative Optimization Progress Chart

Despite the good progress, the following points still required further investigation:
- Disk I/O skew issues still remained
- Cassandra servers' disk I/O was uneven and quite high

Our enthusiasm rose more and more as we discovered that the VCC platform could serve more users than AWS. The AWS test results showed performance starting to decline past 600 VU, while we were able to push as high as 1600 VU with the application sustaining the load and showing higher throughput numbers (~760-780 TPS), until …

The next day something happened which became another turning point in this project. The application became unstable and the throughput we had seen just a couple of hours earlier decreased significantly. More importantly, it started to fluctuate, with the application freezing at random times. The TPS scatter landscape in Jennifer was showing a new anomaly…

Figure 21: Jennifer XView - Transaction Response Time Surges

Since the other known bottlenecks had been removed and the MySQL DB was no longer the weak link in the chain, basically being bored during the performance test, the Cassandra cluster became the next suspect.
PART IX – CASSANDRA

The tomcat logs were pointing to Cassandra as well. There were numerous warning messages about excluding one node or another from the connection pool due to connectivity timeouts.

After taking a closer look at the Cassandra nodes, several points drew our attention (the first two are easy to spot with a nodetool check like the one sketched after this list):
- There was no consistency in the Cassandra ring load
- The amounts of data stored on the Cassandra nodes varied significantly
- Memory usage and I/O profiles were different across the board.
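
Purely as an illustration of how such an imbalance shows up, the following nodetool calls print per-node load, ownership and memory figures. The host names and JMX port mirror the ones used later in this document and are otherwise assumptions:

# the per-node "Load" and "Owns" columns make uneven data distribution obvious at a glance
./nodetool -h node01 -p 9199 ring

# per-node heap usage, key cache size and uptime for a quick memory comparison
for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 info ; done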
  	
  
	
  
As a common trend, after a short period of normal operation the average system load on several random Cassandra nodes started growing exponentially, eventually making those nodes unresponsive. During this time the I/O subsystem was over-utilized as well, yielding very high CPU %wait and long queues on the block devices.

Everything was pointing to the fact that certain Cassandra nodes initiated compaction (an internal data structure optimization) right during the load test, spiraling down in a deadly loop. Another quick conversation with the Customer's architect confirmed the same – it was most likely the SSTable compaction causing the issue.

Figure 22: VCC Cassandra Cluster CPU Usage During the Test

As seen on the graph above, during the various test runs one or another Cassandra node maxed out its CPU utilization. The same configuration in AWS had been working just fine, with a not perfect but still quite even load and no continuous load spikes.

Figure 23: AWS Cassandra Cluster CPU Usage During the Test

Comparing the VCC and AWS Cassandra deployments led to quite contradictory conclusions:
- VCC had more nodes – 12 vs. 8 in AWS – which should have improved performance, right?
- AWS was using spinning disks for the Cassandra VMs while the VCC storage stack is SSD-based, which should have improved performance too…

As with MySQL, it was clear that the optimal, or even "good enough", settings taken from AWS were not good, and at times even harmful, on the VCC platform.

For historical reasons the Customer's application uses both SQL and NoSQL databases. When mapping the AWS infrastructure to VCC, it was decided to build the Cassandra ring using 12 nodes in VCC instead of the 8 nodes in AWS, since the latter were a lot more powerful in terms of individual node specifications. As further tests revealed, the better approach would have been just the opposite – to use a larger number of smaller VMs for the Cassandra cluster. It is also worth mentioning that Cassandra was originally designed to run on a number of low-end systems with slow spinning disks.

Over the past couple of years, SSDs have appeared more and more often in data centers. While not yet a commodity, SSDs became a heavily used component in modern infrastructures, and the Cassandra codebase was adjusted to make its internal decisions and algorithms more suitable for use with SSDs, not only spinning disks. Therefore, deploying the latest stable Cassandra version could have provided additional benefits right away. Unfortunately, the specification required a specific version, and therefore all optimizations were performed against the older version.
Let's have a quick look at Cassandra's architecture and some key definitions.

Figure 24: High-Level Cassandra Architecture

Cassandra is a distributed key-value store initially developed at Facebook. It was designed to handle large amounts of data spread across many commodity servers. Cassandra provides high availability through a symmetric architecture that contains no single point of failure and replicates data across nodes.

Cassandra's architecture is a combination of Google's BigTable and Amazon's Dynamo. As in Dynamo's architecture, all Cassandra nodes form a ring that partitions the key space using consistent hashing (see the figure above). This is known as a distributed hash table (DHT). The data model and single-node architecture, as well as the terminology, are mainly based on BigTable. Cassandra can be classified as an extensible row store since it can store a variable number of attributes per row. Each row is accessible through a globally unique key. Although columns can differ per row, columns are grouped into more static column families. These are treated like tables in a relational database. Each column family is stored in separate files. In order to allow the flexibility of a different schema per row, Cassandra stores metadata with each value. The metadata contains the column name as well as a timestamp for versioning.

Like BigTable, Cassandra has an in-memory storage structure called the Memtable, one instance per column family. The Memtable acts as a write cache that allows for fast sequential writes to disk. Data on disk is stored in immutable Sorted String Tables (SSTables). SSTables consist of three structures: a key index, a bloom filter and a data file. The key index points to the rows in the SSTable; the bloom filter enables a quick check for the existence of keys in the table. Due to its limited size, the bloom filter is also cached in memory. The data file is ordered for faster scanning and merging.

For consistency and fault tolerance, all updates are first written to a sequential log (the Commit Log), after which they can be confirmed. In addition to the Memtable, Cassandra provides an optional row cache and key cache. The row cache stores a consolidated, up-to-date version of a row, while the key cache acts as an index into the SSTables. If these are used, write operations have to keep them updated. It is worth mentioning that only previously accessed rows are cached in Cassandra, in both caches. As a result, new rows are only written to the Memtable, not to the caches.

In order to deliver the lowest possible latency and the best performance on low-end hardware, Cassandra writes data in a multi-step process: requests are first written to the commit log, then to a Memtable, and eventually, when flushed, they are appended to disk as immutable SSTable files. Over time, as the number of SSTables grows, the data becomes fragmented, which hurts read performance.

To put it simply, flushing and compaction operations are vitally important for Cassandra. However, if set up incorrectly or executed at the "wrong" time, they can decrease performance significantly, at times making an entire Cassandra node unresponsive. This is exactly what was happening during the test, when several nodes stopped responding, showed very high system load and performed huge amounts of I/O. Obviously, Cassandra's configuration had been tuned for spinning disks on AWS, resulting in unexpected behavior on the SSD-based VCC storage stack.
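
A couple of nodetool commands make this kind of compaction stall easy to confirm. This is a minimal sketch; the host name and JMX port follow the convention used elsewhere in this document and are otherwise assumptions:

# is a compaction currently running on this node, and how far along is it?
./nodetool -h node01 -p 9199 compactionstats

# thread-pool statistics - a growing "Pending" count for the FlushWriter or
# CompactionExecutor pools means the node cannot keep up with flushing/compaction
./nodetool -h node01 -p 9199 tpstats

# per-column-family SSTable counts and read/write latencies
./nodetool -h node01 -p 9199 cfstats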
  
	
  
As a first measure to gain better visibility into Cassandra's operation, the DataStax OpsCenter application was deployed. It allowed iterating over various parameters and executing a number of tests against the Cassandra cluster while measuring their impact, and it helped to observe overall cluster behavior.

Applying all the lessons learned earlier and working with the VCC storage team, the following configuration changes were applied:

< … removed … >

Table 5: Optimized Cassandra - Recommended Settings

Similar to the MySQL optimization, the basic idea is to issue smaller, more frequent I/O so that the block device queues are saturated less, making better use of the storage stack resources.

Besides the recommended option changes, the commit log was moved to a separate volume. These changes led to predictable and consistent Cassandra performance, flushing in-memory data to disk steadily, avoiding I/O spikes and minimizing stalls due to compaction. Below is a summary of the volumes created for the Cassandra nodes:
xvda   600 IOPS – boot and root
xvdb   600 IOPS – lvm2 root extension
xvdc  4600 IOPS – data mdadm stripe disk 1 – no partitioning
xvde  4600 IOPS – data mdadm stripe disk 2 – no partitioning
xvdf  4600 IOPS – data mdadm stripe disk 3 – no partitioning
xvdg  5000 IOPS – commit log disk – no partitioning
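
For illustration, assembling the three data disks into a striped array of this kind would look roughly as follows. The device names match the list above; the filesystem, mount points and options are assumptions rather than the exact values used:

# stripe the three 4600 IOPS data disks into a single RAID-0 md device
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdc /dev/xvde /dev/xvdf

# filesystem for the Cassandra data directory; noatime avoids extra metadata writes
mkfs.ext4 /dev/md0
mkdir -p /var/lib/cassandra/data
mount -o noatime /dev/md0 /var/lib/cassandra/data

# commit log on its own dedicated volume
mkfs.ext4 /dev/xvdg
mkdir -p /var/lib/cassandra/commitlog
mount -o noatime /dev/xvdg /var/lib/cassandra/commitlog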
  
	
  
There are two more parameters worth mentioning, which control the streaming and compaction throughput limits within the Cassandra cluster. Both values were set to 50MB/s, which is sufficient for normal cluster operation and in line with the storage sub-system throughput configured on the Cassandra nodes. However, sometimes those thresholds need to be changed. For cluster rebalancing, maintenance and similar operations, the following handy shortcuts may be used to raise the thresholds cluster-wide.

# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 150 ; done
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 150 ; done

Obviously, after maintenance has completed, those thresholds should be set back to appropriate values for normal production use.
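
For completeness, reverting to the normal production baseline uses the same loop; 50 here is simply the 50MB/s value mentioned above:

# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 50 ; done
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 50 ; done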
  
	
  
PART X – HAPROXY

With the DB layer fixed, application performance became stable across tests, although two points were still raising some concerns:
- After an initial spike at the beginning of a load test, the number of concurrent connections abruptly dropped by almost a factor of two
- The number of Virtual User requests reaching each application server differed considerably, sometimes reaching a 1:2 ratio

Figure 25: Jennifer APM - Concurrent Connections and Per-server Arrival Rate

It was time to take a closer look at the software load balancers based on HAProxy. This application is known to be able to serve 100K+ concurrent connections, so just one thousand concurrent connections should not get anywhere close to the limit.

Additional research showed that the round-robin load-balancing scheme was not performing as expected and was concentrating requests on one system or another in an unpredictable manner. The most even request distribution was achieved by using the least-connection algorithm. After implementing this change, the load eventually spread evenly across all systems (a quick way to verify the per-server distribution is sketched below).
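
For reference, the per-server distribution can also be watched from HAProxy itself. The configuration and socket paths below are assumptions, and the admin "stats socket" must be enabled in haproxy.cfg for the second command to work:

# confirm which balancing algorithm each backend is using
grep -n 'balance' /etc/haproxy/haproxy.cfg

# dump live statistics as CSV; the per-server scur (current sessions) and
# stot (total sessions) columns show whether the load is actually spread evenly
echo "show stat" | socat stdio /var/run/haproxy.sock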
  
	
  
	
  
Figure 26: Jennifer APM - Connection Statistics After Optimization

Furthermore, a number of SYN flood kernel warnings in the log files, as well as nf_conntrack complaints (the Linux connection tracking facility used by iptables) about overrun buffers and dropped connections, pointed to the next optimization steps.
Initially, it was decided to increase the size of the connection tracking tables and internal structures and to disable the SYN flood protection mechanisms.

< … removed … >

This did show some improvement; however, eventually it was decided to turn iptables off completely to remove any possible obstacles and latency introduced by this facility.
During the subsequent tests, when the generated load was increased further, HAProxy hit another issue often referred to as "TCP socket exhaustion".

A quick reminder – there were two layers of HAProxies deployed. The first layer load-balanced the incoming http requests originating from the application clients between the java application server (tomcat) instances, while the second layer passed requests from the java application servers to the primary and stand-by MySQL DB servers.

HAProxy works as a reverse proxy and so uses its own IP address to establish connections to the server. Most operating systems implementing a TCP stack typically have around 64K (or fewer) TCP source ports available for connections to a remote IP:port. Once a combination of "source IP:port => destination IP:port" is in use, it cannot be re-used.

As a consequence, there cannot be more than 64K open connections from a HAProxy box to a single remote IP:port couple (a quick way to check the available range and current usage is sketched below).
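
For illustration, the ephemeral port range and the number of ports currently tied up toward a backend can be checked like this on the HAProxy hosts; the MySQL port is used as an example backend:

# ephemeral source-port range the kernel will hand out for outgoing connections
sysctl net.ipv4.ip_local_port_range

# how many source ports are currently consumed toward the backend (example: MySQL on 3306),
# including connections parked in TIME_WAIT
ss -tan | grep ':3306' | wc -l
ss -tan state time-wait | grep ':3306' | wc -l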
  
	
  
On the front layer the http request rate was a few hundred per second, so we never came anywhere near the limit of 64K simultaneously open connections to the remote service. On the backend layer there should not have been more than a couple of hundred persistent connections during peak time, since connection pooling was used on the application server. So this was not the problem either.
It turned out that there was an issue with the MySQL client implementation. When a client sends its "QUIT" sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump revealed this behavior (an illustrative capture filter is sketched at the end of this section). Note that this issue cannot be reproduced on a loopback interface or on the same system, because the server answers fast enough. But over a LAN connection between two different machines the latency rises past the threshold where the issue becomes apparent. Basically, here is the sequence performed by a MySQL client:

MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client ==>       FIN       ==> MySQL Server
MySQL Client <==     FIN ACK     <== MySQL Server
MySQL Client ==>       ACK       ==> MySQL Server

This results in the client-side connection remaining unavailable (in TIME_WAIT) for twice the MSL (Maximum Segment Lifetime), which defaults to 2 minutes. Note that this type of close has no negative impact when the MySQL connection is established over a UNIX socket.
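
As an illustration of the capture mentioned above, a filter of roughly this shape makes the early client-side FIN and the resulting TIME_WAIT pile-up visible; the interface name and port are assumptions:

# watch who closes the MySQL connections first: FIN/RST segments on port 3306
tcpdump -nn -i eth0 'tcp port 3306 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'

# count client-side sockets stuck in TIME_WAIT toward the MySQL servers
ss -tan state time-wait | grep ':3306' | wc -l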
  
	
  
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...

More Related Content

What's hot

Book hudson
Book hudsonBook hudson
Book hudson
Suresh Kumar
 
Citrix virtual desktop handbook (5 x)
Citrix virtual desktop handbook (5 x)Citrix virtual desktop handbook (5 x)
Citrix virtual desktop handbook (5 x)
Nuno Alves
 
What's New in VMware Virtual SAN
What's New in VMware Virtual SANWhat's New in VMware Virtual SAN
What's New in VMware Virtual SAN
EMC
 
Installing sql server 2012 failover cluster instance
Installing sql server 2012 failover cluster instanceInstalling sql server 2012 failover cluster instance
Installing sql server 2012 failover cluster instance
David Muise
 
Deploying the XenMobile 8.5 Solution
Deploying the XenMobile 8.5 SolutionDeploying the XenMobile 8.5 Solution
Deploying the XenMobile 8.5 Solution
Nuno Alves
 
D space manual 1.5.2
D space manual 1.5.2D space manual 1.5.2
D space manual 1.5.2tvcumet
 
Cisco Virtualization Experience Infrastructure
Cisco Virtualization Experience InfrastructureCisco Virtualization Experience Infrastructure
Cisco Virtualization Experience Infrastructure
ogrossma
 
Perceptive nolij web installation and upgrade guide 6.8.x
Perceptive nolij web installation and upgrade guide 6.8.xPerceptive nolij web installation and upgrade guide 6.8.x
Perceptive nolij web installation and upgrade guide 6.8.x
Kumaran Balachandran
 
Web securith cws getting started
Web securith cws getting startedWeb securith cws getting started
Web securith cws getting started
Harissa Maria
 
Getting Started with OpenStack and VMware vSphere
Getting Started with OpenStack and VMware vSphereGetting Started with OpenStack and VMware vSphere
Getting Started with OpenStack and VMware vSphere
EMC
 
Citrix virtual desktop handbook (7x)
Citrix virtual desktop handbook (7x)Citrix virtual desktop handbook (7x)
Citrix virtual desktop handbook (7x)
Nuno Alves
 
Whats-New-VMware-vCloud-Director-15-Technical-Whitepaper
Whats-New-VMware-vCloud-Director-15-Technical-WhitepaperWhats-New-VMware-vCloud-Director-15-Technical-Whitepaper
Whats-New-VMware-vCloud-Director-15-Technical-WhitepaperDjbilly Mixe Pour Toi
 
Book VMWARE VMware ESXServer Advanced Technical Design Guide
Book VMWARE VMware ESXServer  Advanced Technical Design Guide Book VMWARE VMware ESXServer  Advanced Technical Design Guide
Book VMWARE VMware ESXServer Advanced Technical Design Guide
aktivfinger
 
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Advantec Distribution
 
Akka java
Akka javaAkka java
Akka java
dlvladescu
 
NP problems
NP problemsNP problems
NP problems
Lien Tran
 
Dell Data Migration A Technical White Paper
Dell Data Migration  A Technical White PaperDell Data Migration  A Technical White Paper
Dell Data Migration A Technical White Paper
nomanc
 
Db2 virtualization
Db2 virtualizationDb2 virtualization
Db2 virtualization
bupbechanhgmail
 

What's hot (20)

Book hudson
Book hudsonBook hudson
Book hudson
 
Citrix virtual desktop handbook (5 x)
Citrix virtual desktop handbook (5 x)Citrix virtual desktop handbook (5 x)
Citrix virtual desktop handbook (5 x)
 
What's New in VMware Virtual SAN
What's New in VMware Virtual SANWhat's New in VMware Virtual SAN
What's New in VMware Virtual SAN
 
Installing sql server 2012 failover cluster instance
Installing sql server 2012 failover cluster instanceInstalling sql server 2012 failover cluster instance
Installing sql server 2012 failover cluster instance
 
Deploying the XenMobile 8.5 Solution
Deploying the XenMobile 8.5 SolutionDeploying the XenMobile 8.5 Solution
Deploying the XenMobile 8.5 Solution
 
D space manual 1.5.2
D space manual 1.5.2D space manual 1.5.2
D space manual 1.5.2
 
Cisco Virtualization Experience Infrastructure
Cisco Virtualization Experience InfrastructureCisco Virtualization Experience Infrastructure
Cisco Virtualization Experience Infrastructure
 
Perceptive nolij web installation and upgrade guide 6.8.x
Perceptive nolij web installation and upgrade guide 6.8.xPerceptive nolij web installation and upgrade guide 6.8.x
Perceptive nolij web installation and upgrade guide 6.8.x
 
Web securith cws getting started
Web securith cws getting startedWeb securith cws getting started
Web securith cws getting started
 
Getting Started with OpenStack and VMware vSphere
Getting Started with OpenStack and VMware vSphereGetting Started with OpenStack and VMware vSphere
Getting Started with OpenStack and VMware vSphere
 
Citrix virtual desktop handbook (7x)
Citrix virtual desktop handbook (7x)Citrix virtual desktop handbook (7x)
Citrix virtual desktop handbook (7x)
 
Whats-New-VMware-vCloud-Director-15-Technical-Whitepaper
Whats-New-VMware-vCloud-Director-15-Technical-WhitepaperWhats-New-VMware-vCloud-Director-15-Technical-Whitepaper
Whats-New-VMware-vCloud-Director-15-Technical-Whitepaper
 
Powershell selflearn
Powershell selflearnPowershell selflearn
Powershell selflearn
 
Book VMWARE VMware ESXServer Advanced Technical Design Guide
Book VMWARE VMware ESXServer  Advanced Technical Design Guide Book VMWARE VMware ESXServer  Advanced Technical Design Guide
Book VMWARE VMware ESXServer Advanced Technical Design Guide
 
8000 guide
8000 guide8000 guide
8000 guide
 
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
 
Akka java
Akka javaAkka java
Akka java
 
NP problems
NP problemsNP problems
NP problems
 
Dell Data Migration A Technical White Paper
Dell Data Migration  A Technical White PaperDell Data Migration  A Technical White Paper
Dell Data Migration A Technical White Paper
 
Db2 virtualization
Db2 virtualizationDb2 virtualization
Db2 virtualization
 

Viewers also liked

CE-Article_JulAug13_FINAL
CE-Article_JulAug13_FINALCE-Article_JulAug13_FINAL
CE-Article_JulAug13_FINALJeannie Counce
 
Công ty luật Wikilaw
Công ty luật WikilawCông ty luật Wikilaw
Công ty luật Wikilaw
Su Le Van
 
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZSISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
Fernando Perez
 
Hamlet
HamletHamlet
Hamlet
zeidina
 
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
savoytwaddle3112
 
Akshay Pawar Experiance
Akshay Pawar ExperianceAkshay Pawar Experiance
Akshay Pawar ExperianceAkshay Pawar
 
S first year orientation history of university of san carlos 1 - copy
S first year orientation history of university of san carlos 1 - copyS first year orientation history of university of san carlos 1 - copy
S first year orientation history of university of san carlos 1 - copySis Mmfe Navarro
 
Aula 2 final 3 tópicos avançados
Aula 2 final 3   tópicos avançadosAula 2 final 3   tópicos avançados
Aula 2 final 3 tópicos avançadosAngelo Peres
 
la informatica como practica social para la satisfaccion de necesidades.
la informatica como practica social para la satisfaccion de necesidades.la informatica como practica social para la satisfaccion de necesidades.
la informatica como practica social para la satisfaccion de necesidades.
itziadaniraparracervantes
 
Aon Food & Drink Inperspective Winter 2015
Aon Food & Drink Inperspective Winter 2015Aon Food & Drink Inperspective Winter 2015
Aon Food & Drink Inperspective Winter 2015
Graeme Cross
 
Hamlet
HamletHamlet
Hamlet
zeidina
 
South Korean ICT Development: Key Lessons for the Emerging Economies
South Korean ICT Development: Key Lessons for the Emerging EconomiesSouth Korean ICT Development: Key Lessons for the Emerging Economies
South Korean ICT Development: Key Lessons for the Emerging Economies
Faheem Hussain
 

Viewers also liked (12)

CE-Article_JulAug13_FINAL
CE-Article_JulAug13_FINALCE-Article_JulAug13_FINAL
CE-Article_JulAug13_FINAL
 
Công ty luật Wikilaw
Công ty luật WikilawCông ty luật Wikilaw
Công ty luật Wikilaw
 
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZSISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
 
Hamlet
HamletHamlet
Hamlet
 
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
 
Akshay Pawar Experiance
Akshay Pawar ExperianceAkshay Pawar Experiance
Akshay Pawar Experiance
 
S first year orientation history of university of san carlos 1 - copy
S first year orientation history of university of san carlos 1 - copyS first year orientation history of university of san carlos 1 - copy
S first year orientation history of university of san carlos 1 - copy
 
Aula 2 final 3 tópicos avançados
Aula 2 final 3   tópicos avançadosAula 2 final 3   tópicos avançados
Aula 2 final 3 tópicos avançados
 
la informatica como practica social para la satisfaccion de necesidades.
la informatica como practica social para la satisfaccion de necesidades.la informatica como practica social para la satisfaccion de necesidades.
la informatica como practica social para la satisfaccion de necesidades.
 
Aon Food & Drink Inperspective Winter 2015
Aon Food & Drink Inperspective Winter 2015Aon Food & Drink Inperspective Winter 2015
Aon Food & Drink Inperspective Winter 2015
 
Hamlet
HamletHamlet
Hamlet
 
South Korean ICT Development: Key Lessons for the Emerging Economies
South Korean ICT Development: Key Lessons for the Emerging EconomiesSouth Korean ICT Development: Key Lessons for the Emerging Economies
South Korean ICT Development: Key Lessons for the Emerging Economies
 

Similar to The Cloud Story or Less is More...

pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdfpdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
Jalal Neshat
 
hci10_help_sap_en.pdf
hci10_help_sap_en.pdfhci10_help_sap_en.pdf
hci10_help_sap_en.pdf
JagadishBabuParri
 
Ngen mvpn with pim implementation guide 8010027-002-en
Ngen mvpn with pim implementation guide   8010027-002-enNgen mvpn with pim implementation guide   8010027-002-en
Ngen mvpn with pim implementation guide 8010027-002-en
Ngoc Nguyen Dang
 
Jdbc
JdbcJdbc
SAP CPI-DS.pdf
SAP CPI-DS.pdfSAP CPI-DS.pdf
SAP CPI-DS.pdf
JagadishBabuParri
 
Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2
Sal Marcus
 
Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Sal Marcus
 
Lte demonstration network test plan phase 3 part_1-v2_4_05072013
Lte demonstration network test plan phase 3 part_1-v2_4_05072013Lte demonstration network test plan phase 3 part_1-v2_4_05072013
Lte demonstration network test plan phase 3 part_1-v2_4_05072013Pfedya
 
Hp 40gs user's guide english
Hp 40gs user's guide englishHp 40gs user's guide english
Hp 40gs user's guide englishdanilegg17
 
ProxySG_ProxyAV_Integration_Guide.pdf
ProxySG_ProxyAV_Integration_Guide.pdfProxySG_ProxyAV_Integration_Guide.pdf
ProxySG_ProxyAV_Integration_Guide.pdf
PCCW GLOBAL
 
plsqladvanced.pdf
plsqladvanced.pdfplsqladvanced.pdf
plsqladvanced.pdf
Thirupathis9
 
Mastering Oracle PL/SQL: Practical Solutions
Mastering Oracle PL/SQL: Practical SolutionsMastering Oracle PL/SQL: Practical Solutions
Mastering Oracle PL/SQL: Practical Solutions
MURTHYVENKAT2
 
R data
R dataR data
Gdfs sg246374
Gdfs sg246374Gdfs sg246374
Gdfs sg246374Accenture
 
Optimizing oracle-on-sun-cmt-platform
Optimizing oracle-on-sun-cmt-platformOptimizing oracle-on-sun-cmt-platform
Optimizing oracle-on-sun-cmt-platform
Sal Marcus
 
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Advantec Distribution
 

Similar to The Cloud Story or Less is More... (20)

pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdfpdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
 
C++programming howto
C++programming howtoC++programming howto
C++programming howto
 
hci10_help_sap_en.pdf
hci10_help_sap_en.pdfhci10_help_sap_en.pdf
hci10_help_sap_en.pdf
 
Ngen mvpn with pim implementation guide 8010027-002-en
Ngen mvpn with pim implementation guide   8010027-002-enNgen mvpn with pim implementation guide   8010027-002-en
Ngen mvpn with pim implementation guide 8010027-002-en
 
Jdbc
JdbcJdbc
Jdbc
 
SAP CPI-DS.pdf
SAP CPI-DS.pdfSAP CPI-DS.pdf
SAP CPI-DS.pdf
 
Lfa
LfaLfa
Lfa
 
Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2
 
Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2
 
Lte demonstration network test plan phase 3 part_1-v2_4_05072013
Lte demonstration network test plan phase 3 part_1-v2_4_05072013Lte demonstration network test plan phase 3 part_1-v2_4_05072013
Lte demonstration network test plan phase 3 part_1-v2_4_05072013
 
Hp 40gs user's guide english
Hp 40gs user's guide englishHp 40gs user's guide english
Hp 40gs user's guide english
 
ProxySG_ProxyAV_Integration_Guide.pdf
ProxySG_ProxyAV_Integration_Guide.pdfProxySG_ProxyAV_Integration_Guide.pdf
ProxySG_ProxyAV_Integration_Guide.pdf
 
R Data
R DataR Data
R Data
 
plsqladvanced.pdf
plsqladvanced.pdfplsqladvanced.pdf
plsqladvanced.pdf
 
Mastering Oracle PL/SQL: Practical Solutions
Mastering Oracle PL/SQL: Practical SolutionsMastering Oracle PL/SQL: Practical Solutions
Mastering Oracle PL/SQL: Practical Solutions
 
R data
R dataR data
R data
 
Gdfs sg246374
Gdfs sg246374Gdfs sg246374
Gdfs sg246374
 
Optimizing oracle-on-sun-cmt-platform
Optimizing oracle-on-sun-cmt-platformOptimizing oracle-on-sun-cmt-platform
Optimizing oracle-on-sun-cmt-platform
 
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
 
usersguide.pdf
usersguide.pdfusersguide.pdf
usersguide.pdf
 

Recently uploaded

Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
The Cloud Story or Less is More...

PREFACE

One of the market-leading enterprises, hereinafter called the Customer, has multiple business units working in areas ranging from consumer electronics to mobile communications and cloud services. One of its strategic initiatives is to expand software capabilities to get ahead of the competition.

The Customer started using the AWS platform for development purposes and as the main hosting platform for its cloud services. Over the past years the usage of AWS grew significantly, with over 30 production applications currently hosted on AWS infrastructure.

As the Customer's reliance on AWS increased, the number of pain points grew as well. They experienced multiple outages and incurred unnecessarily high costs to scale application performance and to accommodate unbalanced CPU/memory hardware profiles. Although the achieved application performance was generally satisfactory, several major challenges and trends emerged over time:
- Scalability and growth issues
- Very high overall infrastructure and support costs
- Single service provider lock-in.

Verizon proposed to trial the Verizon Cloud Compute (VCC) beta product as an alternative hosting platform, with the goal of demonstrating that on-par application performance can be achieved at a much lower cost, effectively addressing one of the biggest challenges. An alternative hosting platform would also give the Customer freedom of choice, addressing another issue. Last but not least, the unique VCC platform architecture and infrastructure stack, built for low-latency and high-performance workloads, would help address the remaining pain point: application performance and scalability.

Senior executives from both companies supported this initiative and one of the Customer's applications was selected for the proof-of-concept project. The objective was to compare the AWS and VCC deployments side by side from both capability and performance perspectives, execute performance tests and deliver a report to senior management.

The proof-of-concept project was successfully executed in close collaboration between various Verizon teams as well as the Customer's SMEs. It was demonstrated that the application hosted on the VCC platform, given appropriate tuning, is capable of delivering better performance than when hosted on a more powerful AWS-based footprint.
PART I – BUILDING TESTBED

The agreed high-level plan was clear and straightforward:
- (Verizon) Mirror the AWS hosting infrastructure using the VCC platform
- (Verizon) Set up infrastructure, OS and applications per specification sheet
- (Customer) Adjust necessary configurations and settings on the VCC platform
- (Customer) Upload test data – 10 million users, 100 million contacts
- (Customer) Execute smoke, performance and aging tests in the AWS environment
- (Customer) Execute smoke, performance and aging tests in the VCC environment
- (Customer) Compare AWS and VCC results and captured metrics
- (Customer) Deliver report to senior management

The high-level diagram below depicts the application infrastructure hosted on the AWS platform.

Figure 1: AWS Application Deployment

Although both the AWS and VCC platforms use the XEN hypervisor at their core, the initial step – mirroring the AWS hosting environment by provisioning equally sized VMs in VCC – raised the first challenge. The Verizon Cloud Compute platform in its early beta stage imposed a number of limitations. To be fair, those limitations were neither by design nor hardware limits, but rather software or configuration settings pertinent to the corresponding product release.

The table below summarizes the most important infrastructure limits for both cloud platforms as of February 2014:

  Resource Limit             VCC        AWS
  VPUs per VM                8          32
  RAM per VM                 28 GB      244 GB
  Volumes per VM             5          20+
  IOPS per Volume (SSD)      3000       4000
  Max Volume Size            1 TB       1 TB
  Guaranteed IOPS per VM     15K        40K
  Throughput per vNIC        500 Mbps   10 Gbps

Table 1: Major Infrastructure Limits

Besides the obvious points, such as the number of CPUs or the huge difference in network throughput, it is also worth mentioning that the CPU-to-RAM ratio (processor count to memory size) differs considerably as well: 1:4.5 for VCC versus 1:7.625 for AWS. This ratio is crucial for certain types of applications, specifically for databases.

Despite the aforementioned differences, it was jointly decided with the Customer to move forward with smaller VCC VMs and to take the sizing ratio into account when comparing performance and test results. This already set the expectation that VCC results might be lower compared to AWS, assuming linear application scalability and a 4-8x difference in hardware footprint.

The tables below summarize infrastructure sizing and mapping for the corresponding service layers hosted on both cloud platforms. Resources sized differently on the two platforms can be seen by comparing the tables.

  VM Role      Count   VPUs   RAM, GB   IOPS   Net, Mbps
  Tomcat       2       4      34.2      -      1000
  MySQL        1       32     244       10K    10000
  Cassandra    8       8      68.4      5K     1000
  HA Proxy     4       2      7.5       -      1000
  DB Cache     2       4      34.2      -      1000

Table 2: AWS Infrastructure Mapping and Sizing

  VM Role      Count   VPUs   RAM, GB   IOPS   Net, Mbps
  Tomcat       2       4      28        -      500
  MySQL        1       4      28        9K     500
  Cassandra    12      4      28        5K     500
  HA Proxy     4       2      4         -      500
  DB Cache     2       4      28        -      500

Table 3: VCC Infrastructure Mapping and Sizing

The initial setup of the disk volumes required special creativity in order to get as close as possible to the required number of IOPS. In addition to the per-disk storage limits mentioned above, there was initially another VCC limitation, luckily addressed later: all disks attached to a particular VM had to be provisioned with the exact same IOPS rate.

The most common setup was based on LVM2, with a linear extension for the boot disk volume group and either two or three additional disks aggregated into an LVM stripe set. This setup allowed building disk volumes of up to 3 TB and 9000 IOPS, getting close enough to the 10K IOPS required for the database VMs.

Besides the technical limitations, the sheer volume of provisioning and configuration work presented a challenge in itself. The hosting platform requirements were captured in a spreadsheet listing system parameters for every VM. Following this spreadsheet manually and building out the environment sequentially would have required significant time and tremendous manual effort. Additionally, it would likely have resulted in a number of human errors and omissions. Automating and scripting the major parts of the installation and setup process addressed this.

The automation suite, implemented on top of the vzDeploymentFramework shell library (a Verizon internal development), made it possible in a matter of minutes to:
- Parse the specification spreadsheet for inputs and updates
- Generate updated OS and application configurations
- Create LVM volumes or software RAID arrays
- Roll out updated settings to multiple systems based on their functional role
- Change Linux iptables-based firewall configurations across the board
- Validate required connectivity between hosts
- Install required software packages

Having all configurations in a version-controlled repository allowed auditing and comparing configurations between the master and on-host deployed versions, providing rudimentary configuration management capabilities.

Below is a high-level architecture diagram for the originally implemented test environment.
Figure 2: Initial VCC Application Deployment

The test load was initiated by a JMeter master (test controller and management GUI) and generated by several JMeter slaves (load generators, or test agents). The generated virtual user (VU) requests were load-balanced between two Tomcat application servers, each running a single application instance.

Since F5 LTM instances were not available at build time, the proposed design used pfSense appliances as routers, load-balancers and firewalls for the corresponding VLANs.

The Tomcat servers communicated, via another pair of HAProxy load-balancers, with two persistent storage back-ends – MySQL (SQL DB) and Cassandra (NoSQL DB) – employing Couchbase (DB Cache) as a caching layer.
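For reference, a JMeter master/slave arrangement of the kind described above is typically driven from the master in non-GUI mode. A minimal sketch follows; the host names and file names (slave1..slave3, contacts_testplan.jmx, results.jtl) are placeholders for illustration, not the project's actual artifacts:

  # on each load generator: start the JMeter remote agent
  jmeter-server &

  # on the master: run the test plan in non-GUI mode against all remote agents
  # and collect the results into a single result file
  jmeter -n -t contacts_testplan.jmx -R slave1,slave2,slave3 -l results.jtl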
Most systems were additionally instrumented with NMON collectors for gathering key performance metrics. A Jennifer APM application was deployed to perform real-time transaction monitoring and code introspection.

Following the initial plan, the hosting environment was handed over to the Customer on time for adjusting configurations and uploading the test data.

PART II – FIRST TEST

The first test was conducted on both the AWS and VCC platforms and the Customer shared the test results. During the test the load was ramped up in 100 VU increments for each subsequent 10-minute test run. During each run the corresponding number of virtual users performed various API calls, emulating human behavior using patterns observed and measured on the production application.

The chart below depicts the number of application transactions successfully processed by each platform during the 10-minute test runs (TPS per VU count):

  VU count       200   300   400   500   600   700   800   900
  AWS TPS        321   462   539   627   637   645   651   654
  Verizon TPS    203   256   269   257   275   249   268   247

Figure 3: First Test Results – Comparison Chart

It was obvious that the AWS infrastructure was more powerful, processing more than twice the throughput, which did not come as a big surprise. However, the Customer expressed several concerns about overall VCC platform stability, low MySQL DB server performance and uneven load distribution between the striped data volumes, dubbed I/O skews.
Indeed, the application "Transactions per Second" (TPS) measurements did not correlate well with the generated application load, and even with a growing number of users something prevented the application from taking off. After short increases the overall throughput consistently dropped again, clearly pointing to a bottleneck limiting the transaction stream.

According to the Jennifer APM monitors, the increase in application transaction times was caused by slow DB responses, taking 5 seconds and more per single DB operation. At the same time the DB server was showing very high CPU %iowait, fluctuating around 85-90%.

Figure 4: First Test – High CPU Load on DB Server

Figure 5: First Test – High CPU %iowait on DB Server

Furthermore, out of the three stripes making up the data volume, one constantly reported significantly higher device wait times and utilization percentage, effectively causing disk I/O skew.

Figure 6: First Test – Disk I/O Skew on DB Server
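Metrics of this kind (device utilization, await, queue size) can be sampled directly on the VM alongside the NMON collectors; a minimal sketch of the sort of commands used for such checks (intervals and counts are illustrative):

  # extended per-device statistics (%util, await, avgqu-sz) every 10 seconds
  iostat -dxm 10

  # capture system metrics to a file for later graphing,
  # e.g. one sample every 10 seconds for one hour
  nmon -f -s 10 -c 360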
Obviously, these test results were not acceptable. Investigating and identifying the bottlenecks and performance-limiting factors required good knowledge of the application architecture and its internals, as well as deep VCC product and storage stack knowledge, since the latter two issues appeared to be platform- and infrastructure-related. To address this, a dedicated cross-team taskforce was established.

PART III – STORAGE STACK PERFORMANCE

The VCC storage stack was validated once more, and it was reconfirmed that there are no limiting factors or shortcomings on the layers below the block device. The resulting conclusion was that the limitations had to be at the hypervisor, OS or application layer.

On the other hand, the Customer confirmed that the AWS deployment used exactly the same configuration and application versions as VCC. The only possible logical conclusion was that a setup and configuration optimal for AWS does not perform the same way on VCC. In other words, the VCC platform required its own optimal configuration.

Further efforts were aligned with the following objectives:
- Improve storage throughput and address I/O skews
- Identify the root cause of the low DB server performance
- Improve DB server performance and scalability
- Work with the Customer on improving overall VCC deployment performance
- Re-run the performance tests and demonstrate improved throughput and predictable performance levels

Originally the storage volumes were set up using the Customer's specifications and OS defaults for all other parameters. After performing research and a number of component performance tests, several interesting discoveries were made, in particular:
- Different Linux distributions (Ubuntu and CentOS) use different approaches to disk partitioning: Ubuntu aligned partitions to 4K block boundaries, while CentOS did not
- The default block device scheduler, CFQ, is not a good choice in environments using virtualized storage
- The MDADM and LVM volume managers use quite different algorithms for I/O batching and compaction
- The XFS and EXT4 file systems yield very different results depending on the number of concurrent threads performing I/O
- Due to all the Linux optimizations and multiple caching levels, it is hard enough to measure net storage throughput from within a VM, let alone through the entire application stack

After a number of trials and studying the platform behavior, the following was suggested for achieving optimal I/O performance on the VCC storage stack:
- Use raw block devices instead of partitions for RAID stripes, to circumvent any partition block alignment issues
- Use MDADM software RAID instead of LVM (the latter is more flexible and may be used in combination with MDADM; however, it performs a certain amount of "optimization" assuming spindle-based storage that may interfere with performance in VCC)
- Use proper stripe settings and block sizes for software RAID (don't let the system guess – specify!)
- Use the EXT4 file system instead of XFS. EXT4 provides journaling for metadata and data instead of metadata only, with negligible performance overhead for the load observed
- Use optimal (and safe) settings for EXT4 file system creation and mounts
- Ensure the NOOP block device scheduler is used (which lets the underlying storage stack, from the hypervisor down, optimize block I/O more effectively)
- Separate the different I/O profiles, e.g. sequential I/O (redo/bin-log files) and random I/O (data files) on the DB server, by writing the corresponding data to separate logical disks
- Use DIRECT_IO wherever possible and avoid OS/file-system caching (in certain situations the cache may give a false impression of high performance, which is then abruptly interrupted by the flushing of massive caches, during which the entire VM gets blocked)
- Avoid I/O bursts due to cache flushing and keep the device queue length close to 8. This corresponds to a hardware limitation on the chassis NPU. VCC storage is very low latency and quick, but if the storage queue locks up, the entire VM gets blocked. Writing early and often at a consistent rate performs dramatically better under load than caching in RAM as long as possible and then flooding the I/O queue once the cache has been exhausted
- Make sure the network device driver is not competing with the block device drivers and the application for CPU time, by relocating the associated interrupts to different vCPU cores inside the VM
- Use 4K blocks for I/O operations wherever possible for more optimal storage stack operation

After implementing these suggestions on a DB server, the storage subsystem yielded predictable and consistent performance. For example, data volumes set up with 10K IOPS reported ~39 MB/s throughput, which is the expected maximum assuming a 4K I/O block size:

  (4K * 10000 IOPS) / 1024 = 39.06 MB/s, the maximum possible throughput
  (4K * 15000 IOPS) / 1024 = 58.59 MB/s, the maximum possible throughput

With a 15K IOPS setup using 3 stripes (5K IOPS each), ~55-56 MB/s throughput was achieved, as shown on the screenshot below:
Figure 7: Optimized Storage Subsystem Throughput

Although some minor deviation in the I/O figures (+/- 5%) was still observed, this is typically considered acceptable and within the normal range.

While performing additional tests on the optimized systems, it was observed that all block device interrupts were being served by CPU0, which was becoming a hot spot even with the netdev interrupts moved off to different CPUs. The following method may be used to spread block device interrupts evenly across CPUs for the devices implementing the RAID stripes:

  # inspect the current interrupt distribution and affinity masks
  cat /proc/interrupts
  cat /proc/irq/183[3-6]/smp_affinity*

  # distribute block device interrupts between CPU4-CPU7
  # (smp_affinity is a hexadecimal CPU bitmask: 0x10=CPU4, 0x20=CPU5, 0x40=CPU6, 0x80=CPU7)
  echo 80 > /proc/irq/1836/smp_affinity
  echo 40 > /proc/irq/1835/smp_affinity
  echo 20 > /proc/irq/1834/smp_affinity
  echo 10 > /proc/irq/1833/smp_affinity
  # 0x08 = CPU3
  echo 8 > /proc/irq/1838/smp_affinity
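A quick way to confirm that the per-CPU interrupt counters actually move as intended is sketched below; the IRQ numbers are the example ones from above and will differ on other systems, and note that values written under /proc do not survive a reboot, so they are typically reapplied from an init script:

  # watch the per-CPU counters for the block device IRQs grow on the intended cores
  watch -n 1 "egrep 'CPU|183[3-8]' /proc/interrupts"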
Please note that the IRQ numbers and their assignment may differ on your system. Consult the /proc/interrupts table for the specific assignments pertinent to your system.

For additional details and theory, please refer to the following online materials:
http://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
https://www.kernel.org/doc/ols/2009/ols2009-pages-235-238.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt

PART IV – DATABASE OPTIMIZATION

Since the Customer had not yet shared the application and testing know-how, the only way to reproduce the abnormal DB behavior seen during the test was to replay the DB transaction log against a DB snapshot recovered from backup. This was a slow, cumbersome and not fully repeatable process. The Percona tools were really instrumental for this task, allowing a multithreaded transaction replay with delays inserted between transactions as recorded. A plain SQL script import would have been processed by a single thread only, with all requests processed as one stream.

Although the transaction replay did create some DB server load, the load type and its I/O patterns were quite different from the I/O patterns observed during the test. The transaction logs included only DML statements (insert, update, delete), but no data read (select) requests. Knowing that those "select" requests represented 75% of all requests, it quickly became apparent that such a testing approach was flawed and would not be able to recreate real-life conditions.

We reached a point where more advanced tools and techniques were required for iterating over various DB parameters in a repeatable fashion while measuring their impact on DB performance and the underlying subsystems. Moreover, it was not clear whether the unexpected DB behavior and performance issues were caused by the virtualization infrastructure, the DB engine settings, or the way the DB was used, i.e. the combination of application logic and the data stored in the DB tables.

To separate those concerns it was proposed to perform load tests using synthetic OLTP transactions generated by sysbench, a well-known load-testing toolkit. Such tests were executed on both the VCC and AWS platforms. The results spoke for themselves.
Figure 8: AWS i2.8xlarge CPU Load – Sysbench Test Completed in 64.42 sec

Figure 9: VCC 4C-28G CPU Load – Sysbench Test Completed in 283.51 sec

At this point it was clear that the DB server's performance issues had nothing to do with application logic and were not specific to the SQL workload, but were rather related to configuration and infrastructure. The OLTP test provided the capability to stress test the DB engine and optimize it independently, without having to rely on the Customer's application know-how and the solution-wide test harness.
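For reference, a synthetic OLTP run of the kind shown above looks roughly as follows with the legacy sysbench 0.4/0.5 command line; the host, credentials, table size and duration are placeholders for illustration, not the values used in this project:

  # create and populate the test table
  sysbench --test=oltp --mysql-host=db1 --mysql-user=sbtest --mysql-password=secret \
           --oltp-table-size=10000000 prepare

  # run a mixed read/write OLTP workload with 500 client threads for one hour
  sysbench --test=oltp --mysql-host=db1 --mysql-user=sbtest --mysql-password=secret \
           --oltp-table-size=10000000 --num-threads=500 --max-time=3600 \
           --max-requests=0 run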
Thorough research and study of the InnoDB engine began. Studying the source code, as well as consulting the following online resources, was key to a clear understanding of the DB engine internals and its behavior:

- http://www.mysqlperformanceblog.com
- http://www.percona.com
- http://dimitrik.free.fr/blog/
- https://blog.mariadb.org

The drawing below, published by Percona engineers, shows the key factors and settings impacting DB engine throughput and performance.

Figure 10: InnoDB Engine Internals

Obviously, there is no quick win and no single dial to turn in order to achieve the optimal result. It is easy to explain the main factors impacting InnoDB engine performance, though optimizing those factors in practice is quite a challenging task.
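The current values of these dials, and the counters showing how the engine reacts to them, can be pulled from a running server at any time; a minimal sketch (connection options omitted):

  # current InnoDB configuration variables (buffer pool, log file size, flushing, ...)
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb%'"

  # runtime counters, e.g. buffer pool usage and dirty pages
  mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%'"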
InnoDB Performance – Theory and Practice

The two most important parameters for InnoDB performance are innodb_buffer_pool_size and innodb_log_file_size. InnoDB works with data in memory, and all changes to data are performed in memory. In order to survive a crash or system failure, InnoDB logs changes into the InnoDB transaction logs. The size of the InnoDB transaction log defines how many changed blocks can be tolerated in memory at any given point in time. The obvious question is: "why can't we simply use a gigantic InnoDB transaction log?" The answer is that the size of the transaction log affects recovery time after a crash. The rule of thumb (until recently) was: the bigger the log, the longer the recovery time. So we have the innodb_log_file_size variable; let's imagine it as a distance on an imaginary axis.

Our current state is the checkpoint age, which is the age of the oldest modified non-flushed page. The checkpoint age lies somewhere between 0 and innodb_log_file_size. Point 0 means there are no modified pages. The checkpoint age cannot grow past innodb_log_file_size, as that would mean we would not be able to recover after a crash.

In fact, InnoDB has two safety nets, or protection points: "async" and "sync". When the checkpoint age reaches the "async" point, InnoDB tries to flush as many pages as possible while still allowing other queries; however, throughput drops through the floor. The "sync" stage is even worse: when we reach the "sync" point, InnoDB blocks other queries while trying to flush pages and return the checkpoint age to a point before "async". This is done to prevent the checkpoint age from exceeding innodb_log_file_size. Both are abnormal operational stages for InnoDB and should be avoided at all cost. In current versions of InnoDB, the "sync" point is at about 7/8 of innodb_log_file_size, and the "async" point is at about 6/8 = 3/4 of innodb_log_file_size.

So there is one critically important balancing act: on the one hand we want the checkpoint age to be as large as possible, as it defines performance and throughput; on the other hand, we should never reach the "async" point.
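The checkpoint age itself is easy to keep an eye on during a test; a minimal sketch, assuming a MySQL 5.5/5.6-style InnoDB status output (credentials omitted):

  # checkpoint age = current log sequence number minus the last checkpoint LSN,
  # both printed in the LOG section of the InnoDB status output
  mysql -e 'SHOW ENGINE INNODB STATUS\G' \
    | awk '/Log sequence number/ {lsn=$NF} /Last checkpoint at/ {cp=$NF}
           END {print "checkpoint age, bytes:", lsn - cp}'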
The idea is to define another point, T (target), located before "async", in order to leave some room for flexibility, and to try at all cost to keep the checkpoint age from going past T. We assume that if we can keep the checkpoint age in the range 0 – T, we will achieve stable throughput even for a more or less unpredictable workload.

Now, which factors affect the checkpoint age? When we execute DML queries that change data (insert/update/delete), we write to the log, we change pages, and the checkpoint age grows. When we flush changed pages, the checkpoint age goes down again. So the main way to keep the checkpoint age around point T is to change the number of pages flushed per second, or rather to make this number variable and suited to the specific workload. That way we can keep the checkpoint age down. If this does not help and the checkpoint age keeps growing beyond T towards "async", we have a second control mechanism: we can add a delay into insert/update/delete operations. This way we prevent the checkpoint age from growing and reaching "async".

To summarize, the idea behind the optimization algorithm is: under load we must keep the checkpoint age around point T by increasing or decreasing the number of pages flushed per second. If the checkpoint age continues to grow, we need to throttle throughput to prevent further growth. The amount of throttling depends on the position of the checkpoint age: the closer it gets to "async", the higher the level of throttling required.

From Theory to Practice – Test Framework

There is a saying: in theory, there is no difference between theory and practice, but in practice there is.

In practice, there are a lot more variables to bear in mind. Factors such as I/O limits, thread contention and locking come into play, and improving performance becomes more like solving an equation with a number of interdependent variables.

Obviously, to be able to iterate over various parameter and setting combinations, DB tests need to be executed in a repeatable and well-defined (read: automated) manner, while capturing test results for correlation and further analysis. Quick research showed that although many load-testing frameworks are available, some specifically tailored for testing MySQL DB performance, unfortunately none of them covered all requirements or provided the needed tools and automation.

Eventually, we developed our own fully automated and flexible load-testing framework. This framework was mainly used to stress test and analyze MySQL and InnoDB behavior; nonetheless, it is open enough to plug in other tools or to be used for testing different applications. The developed toolkit includes the following components:
- Test Runner
- Remote Test Agent (load generator)
- Data Collector (sampler)
- Data Processor
- Graphing facility

Using this framework it was possible to identify the optimal MySQL and InnoDB engine configuration. The goal was to deliver the best possible InnoDB engine performance in terms of transactions and queries served per second (TPS and QPS), while eliminating I/O spikes and achieving a consistent and predictable system load; in other words, fulfilling the critically important balancing act mentioned above: keeping the checkpoint age as large as possible while trying not to reach the "async" (or, even worse, "sync") point.

The graphs below show that an optimally configured DB server can easily deliver 1000+ OLTP transactions per second, translating to 20K+ queries per second, generated by 500 concurrent DB connections during a 6-hour test.

Figure 11: Optimized MySQL DB – QPS Graph (queries per second in green)

After a warm-up phase the system consistently delivered about 22K queries per second.

Figure 12: Optimized MySQL DB – TPS and RT Graph (transactions per second in green, response time in blue)
After ramping the load up to 500 concurrent users, the system consistently delivered 1200 TPS on average. The average response time of 1600 ms is measured end to end and includes both network and communication overhead (~1000 ms) and SQL processing time (~600 ms).

Figure 13: Optimized MySQL DB – RAID Stripe I/O Metrics (%util in red, await in green, avgqu-sz in blue)

It is easy to see that after the warm-up and stabilization phases the disk stripe performed consistently, with an average disk queue size of ~8, which was suggested by the storage team as the optimum value for the VCC storage stack. The "await" iostat metric, the average time for I/O requests to be issued to the device and served, stays constantly below 20 ms. Device utilization averages below 25%, showing that there is still plenty of spare capacity to serve I/O requests.

Figure 14: Optimized MySQL DB – CPU Metrics (%idle in red, %user in green, %system in blue, %iowait in yellow)

The CPU metrics show that on average 55% of CPU time was idle, 35% was spent in user space, i.e. executing applications, 5% was spent on kernel (system) tasks including interrupt processing, and just 5% was spent waiting for device I/O.
Figure 15: Optimized MySQL DB – Network Metrics (bytes sent in green, bytes received in blue)

The network traffic measurements suggest that the network capacity is fully consumed, or in other words the network is saturated, with ~48 MB/s sent and ~2 MB/s received. These 50 MB/s of cumulative traffic come very close to the practical maximum throughput achievable on a 500 Mbps network interface.

In plain English this means that the network is the limiting factor here, and with other resources still available the DB server could deliver much higher TPS and QPS figures if additional network capacity were provisioned. The ultimate system capacity limit was not established, due to time constraints and the fact that the Customer's application did not utilize more than 300 concurrent DB connections.

Optimal DB Configuration

Below is a summary of the major changes between the MySQL database configurations on the AWS and VCC platforms. As with the file-system configuration, the objective was to achieve consistent and predictable performance by avoiding resource usage surges and stalls.

The proposed optimizations may have a positive effect in general; however, they are specific to a certain workload and use case. Therefore these optimizations cannot be considered universally applicable in VCC environments and must be tailored to the specific workload. Settings marked with an asterisk (*) are defaults for the DB version used.

< … removed … >

Table 4: Optimized MySQL DB – Recommended Settings

Besides the parameter changes listed above, the binary logs (also known as transaction logs) were moved to a separate volume, where the EXT4 file system was set up with the following parameters:

< … removed … >
Further areas for DB improvement:
- Consider using the latest stable Percona XtraDB version, which is based on the MariaDB codebase and provides many improvements, including patches from Google and Facebook:
  o Redesigned locking subsystem with no reliance on kernel mutexes
  o Latest versions have removed a number of known contention points, resulting in fewer spins and lock waits and eventually in better overall performance
  o Buffer pool dump and pre-load features, allowing much quicker startup and warm-up phases
  o Online DDL – changing the schema does not require downtime
  o Better query analyzer and overall query performance
  o Better page compression support and performance
  o Better monitoring and integration with the performance schema
  o A more intelligent flushing algorithm that takes into consideration page change rates, I/O rates, and system load and capabilities, thus providing better out-of-the-box performance adjusted to the workload
  o Better suited for fast SSD-based storage (no added cost for random I/O), with adaptive algorithms that do not attempt to accommodate the shortcomings of spinning disks
  o Scales better on SMP (multi-core) systems and better utilizes a higher number of CPU threads
  o Provides fast checksums (hardware-assisted CRC32), lessening CPU overhead while retaining data consistency and security
  o New configuration options allowing the InnoDB engine to be tailored even better to a specific workload
- Consider using a more efficient memory allocator, e.g. jemalloc or tcmalloc:
  o The memory allocator provided as part of GLIBC is known to fall short under high concurrency
  o GLIBC malloc was not designed for multithreaded workloads and has a number of internal contention points
  o Using modern memory allocators suited for high concurrency can significantly improve throughput by reducing internal locking and contention
- Perform DB optimization. While optimizing the infrastructure may result in significant improvement, even better results may be achieved by tailoring the DB structure itself:
  o Consider clustered indexes to avoid locking and contention
  o Consider page compression. Besides a slight CPU penalty, this may significantly improve throughput while reducing on-disk storage several times, in turn resulting in quicker replication and backups
  o Monitor the performance schema to learn more about in-flight DB engine performance and adjust the required parameters
  o Monitor the performance and information schemas to learn more about index effectiveness and build better, more effective indexes
- Perform SQL optimization. No infrastructure optimization can compensate for badly written SQL requests. Caching and other optimization techniques often mask bad code. SQL queries joining multi-million-record tables may work just fine in development and completely break down on a production DB. Continuously analyze the most expensive SQL queries to avoid full table scans and on-disk temporary tables.

PART V – PEELING THE ONION

It is a common saying that performance improvement is like peeling an onion: after addressing one issue, the next one, previously masked, is uncovered, and so on. Likewise, in our case, after addressing the storage and DB layers and improving overall application throughput, it became apparent that something else was holding the application back from delivering the best possible performance. By this time the DB layer was understood very well; however, the overall application stack and the associated connection flows were not yet completely understood.

The Customer demonstrated willingness to cooperate and assisted by providing instructions for reproducing the JMeter load tests as well as on-site resources for an architecture workshop.

From this point on, the optimization project sped up tremendously. Not only was it possible to iterate reliably and run load tests against the complete application stack, but the understanding of the application architecture and access to the Application Performance Management (APM) tool Jennifer made a huge difference in terms of visibility into the application's internal operation and its major performance metrics.
Figure 16: Jennifer APM Console

Besides providing visual feedback and displaying a number of metrics, Jennifer revealed the next bottleneck: the network.

PART VI – PFSENSE

The original network design, replicating the network structure in AWS, was proposed and agreed with the Customer. Separate networks were created to replicate the functionality of an AWS VPC, and pfSense appliances were used to provide network segmentation, routing and load balancing.

< … removed … >

Figure 17: Initial Application Deployment – Network Diagram

pfSense is an open-source firewall/router software distribution based on FreeBSD. It is installed on a VM and turns that VM into a dedicated firewall/router for a network. It also provides additional important functions such as load balancing, VPN and DHCP. It is easy to manage through its web-based UI, even for users with little knowledge of the underlying FreeBSD system.

The FreeBSD network stack is known for its exceptional stability and performance. The pfSense appliances had been used many times before (and have been since), so nobody expected issues from that side.

Watching the Jennifer XView chart closely in real time is fun in itself, like watching a fire. It is also a powerful analysis tool that helps to understand the behavior of the application components.
Figure 18: Jennifer XView – Transaction Response Time Scatter Graph

On the graph above, the distance between the layers is exactly 10000 ms, pointing to the fact that one of the application services was timing out at a 10-second interval and repeating connection attempts several times.

Figure 19: Jennifer APM – Transaction Introspection

Network socket operations were taking a significant amount of time, resulting in multiple repeated attempts at 10-second intervals.

Following the old sysadmin adage, "...always blame the network...", the application flows were analyzed again and pfSense was suspected of losing or delaying packets. Interestingly enough, the web UI reported low to moderate VM load and did not show any reason for concern.
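Figures like these from the web UI can be cross-checked from the appliance shell; a minimal sketch of the kind of FreeBSD commands used for such a check, run on the pfSense console or over SSH:

  # per-thread CPU usage, including kernel threads that averaged UI graphs tend to hide
  top -SH

  # cumulative interrupt counts per device and mbuf (network buffer) usage
  vmstat -i
  netstat -m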
Nonetheless, console access revealed the truth: the load created by a number of short thread spins was not properly reported in the web UI, hidden by the averaging calculations. A closer look using advanced CPU and system metrics confirmed that the appliance was experiencing unexpectedly high CPU load, adding to latency and dropping network packets.

Adding more CPUs to the pfSense appliances doubled the network traffic they passed. However, even with the maximum CPU count the network was not yet saturated, suggesting that the pfSense appliances might still be limiting application performance.

Since the pfSense appliances were not an essential requirement, being used only to provide routing and load-balancing capability, it was decided to remove them from the application network flow and to access the subnets by adding additional network cards to the VMs, with each NIC connected to the corresponding subnet.

To summarize: it would be wrong to conclude that pfSense does not fit the purpose and is not a viable option for building virtual network deployments. Most definitely, additional research and tuning would help to overcome the observed issues. Due to time constraints this area was not fully researched and is still pending thorough investigation.

PART VII – JMETER

With pfSense removed and HAProxy used for load balancing, the overall application throughput definitely improved. Increasing the number of CPUs on the DB servers and the Cassandra nodes seemed to help as well. The collaborative effort with the Customer yielded great results and we were definitely on the right track.

With the floodgates wide open we were able to push more than 1000 concurrent users during our tests. About the same time we started seeing another anomaly: one out of three JMeter load agents (generators) was behaving quite strangely. After reaching the end of the test at the 3600-second mark, the Java threads belonging to two of the JMeter servers shut down quickly, while the third instance took a long time to shut down, effectively increasing the test window duration and, as a result, negatively impacting the average test metrics.
All three JMeter servers were reconfigured to use the same settings; for some reason they had been using slightly different configurations and were logging data to different paths. This did not resolve the underlying issue, though. Due to time constraints it was decided to build a replacement VM rather than troubleshoot the issues with one of the existing VMs.

Eventually, a fourth JMeter server was deployed. Besides fixing the issue with Java thread startup and shutdown, it allowed us to generate higher loads and provided additional flexibility in defining load patterns.

Lesson learned: for low to moderate loads JMeter works just fine. For high loads, JMeter may become the breaking point itself. In this case it is recommended to scale out rather than scale up, keeping the number of Java threads per server below a certain threshold.

PART VIII – ALMOST THERE

Although the AWS performance measurements were still better, we had already significantly improved performance compared to the figures captured during the first round of performance tests.

With pfSense removed, an average of 587 TPS at 800 VU was achieved. In this test the load was spread statically rather than balanced, by manually specifying different target application server IP addresses in the JMeter configuration files. With a HAProxy load-balancer put in place, the TPS figure initially went down to 544, and after some optimizations (disabling connection tracking in netfilter) it increased to 607 TPS at 800 VU – the maximum we had seen to date. This represents a 22% increase over the best previous result (498 TPS at 800 VU, still with pfSense) and a 100% increase over the initial performance test. Overall, the results were looking more than promising.

Figure 20: Iterative Optimization Progress Chart
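The connection-tracking change mentioned above was of the following kind; a minimal sketch with a placeholder subnet, not the project's actual rule set:

  # exempt the load-balanced traffic from netfilter connection tracking
  iptables -t raw -A PREROUTING -s 10.0.0.0/16 -j NOTRACK
  iptables -t raw -A OUTPUT -d 10.0.0.0/16 -j NOTRACK

  # confirm the conntrack table is no longer close to its limit
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max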
Despite the good progress, the following points still required further investigation:
- The disk I/O skew issues still remained
- Cassandra server disk I/O was uneven and quite high

Our enthusiasm rose more and more as we discovered that the VCC platform could serve more users than AWS. The AWS test results showed that past 600 VU performance started to decline, while we were able to push as high as 1600 VU with the application supporting the load and showing higher throughput numbers (~760-780 TPS), until...

The next day something happened which became another turning point in this project. The application became unstable and the application throughput we had seen just a couple of hours earlier decreased significantly. More importantly, it started to fluctuate, with the application freezing at random times. The TPS scatter landscape in Jennifer was showing a new anomaly.

Figure 21: Jennifer XView – Transaction Response Time Surges

Since the other known bottlenecks had been removed and the MySQL DB was no longer the weak link in the chain, basically sitting bored during the performance test, the Cassandra cluster became the next suspect.

PART IX – CASSANDRA
• 30. The tomcat logs were pointing to Cassandra as well: there were numerous warnings about one or another node being excluded from the connection pool due to connectivity timeouts.

A closer look at the Cassandra nodes drew our attention to several points:
- There was no consistency in the Cassandra ring load
- The amount of data stored on each Cassandra node varied significantly
- Memory usage and I/O profiles differed across the board

The common pattern was that, after a short period of normal operation, the average system load on a few seemingly random Cassandra nodes started growing exponentially, eventually making those nodes unresponsive. During this time the I/O subsystem was over-utilized as well, producing very high CPU %wait figures and long block-device queues.

Everything pointed to certain Cassandra nodes initiating compaction (an internal data-structure optimization) right in the middle of the load test and spiraling down in a deadly loop. Another quick conversation with the Customer's architect confirmed the same suspicion – it was most likely SSTable compaction causing the issue.

Figure 22: VCC Cassandra Cluster CPU Usage During the Test

As seen on the graph above, during the various test runs one or another Cassandra node maxed out its CPU. The same configuration in AWS had been working just fine – not perfectly even, but with a reasonably balanced load and no sustained spikes.
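For reference, this kind of ring imbalance and mid-test compaction activity can be confirmed directly from the command line with stock Cassandra and sysstat tooling; a quick sketch (the host name and JMX port follow the nodetool shortcuts shown later in this part, and exact output varies between Cassandra versions):

# ./nodetool -h node01 -p 9199 ring
# ./nodetool -h node01 -p 9199 compactionstats
# ./nodetool -h node01 -p 9199 tpstats
# iostat -dmx 5

The first command shows token ownership and the data load per node, the second lists active and pending compactions on a suspect node, the third exposes thread-pool backlogs (pending flush or mutation tasks point to flush/compaction pressure), and iostat confirms block-device saturation during a spike.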
• 31. Figure 23: AWS Cassandra Cluster CPU Usage During the Test

Comparing the VCC and AWS Cassandra deployments led to rather contradictory observations:
- VCC has more nodes – 12 vs. 8 in AWS – which should improve performance, right?
- AWS uses spinning disks for the Cassandra VMs while the VCC storage stack is SSD-based, which should improve performance too…

As with MySQL, it was clear that settings which are optimal, or at least "good enough", on AWS are not necessarily good – and at times are outright harmful – on the VCC platform.

For historical reasons the Customer's application uses both SQL and NoSQL databases. When mapping the AWS infrastructure to VCC, it was decided to build the Cassandra ring from 12 nodes instead of the 8 used in AWS, since the latter were a lot more powerful in terms of individual node specifications. As further tests revealed, the better approach would have been just the opposite – a smaller number of more powerful VMs for the Cassandra cluster. It is also worth mentioning that Cassandra was originally designed to run on a number of low-end systems with slow spinning disks.

Over the past couple of years SSDs have started to appear more and more often in data centers. While not yet a commodity, SSDs have become a heavily used component of modern infrastructures, and the Cassandra codebase has been adjusted to make its internal decisions and algorithms suitable for SSDs as well as spinning disks. Deploying the latest stable Cassandra version could therefore have provided additional benefits right away. Unfortunately, the specification required a specific version, so all optimizations were performed against the older release.

Let's have a quick look at Cassandra's architecture and some key definitions.
• 32. Figure 24: High-Level Cassandra Architecture

Cassandra is a distributed key-value store initially developed at Facebook. It was designed to handle large amounts of data spread across many commodity servers. Cassandra provides high availability through a symmetric architecture that contains no single point of failure and replicates data across nodes.

Cassandra's architecture is a combination of Google's BigTable and Amazon's Dynamo. As in Dynamo, all Cassandra nodes form a ring that partitions the key space using consistent hashing (see the figure above), also known as a distributed hash table (DHT). The data model and single-node architecture, including the terminology, are mainly based on BigTable. Cassandra can be classified as an extensible row store, since it can store a variable number of attributes per row. Each row is accessible through a globally unique key. Although columns can differ per row, they are grouped into more static column families, which are treated like tables in a relational database. Each column family is stored in separate files. To allow the flexibility of a different schema per row, Cassandra stores metadata with each value; the metadata contains the column name as well as a timestamp used for versioning.

Like BigTable, Cassandra has an in-memory storage structure called the Memtable, with one instance per column family. The Memtable acts as a write cache that allows for fast sequential writes to disk. Data on disk is stored in immutable Sorted String Tables (SSTables). An SSTable consists of three structures: a key index, a bloom filter and a data file. The key index points to the rows in the SSTable, while the bloom filter enables checking for the existence of keys in the table; thanks to its limited size, the bloom filter is also cached in memory. The data file is ordered for faster scanning and merging.

For consistency and fault tolerance, all updates are first written to a sequential log (the Commit Log), after which they can be acknowledged. In addition to the Memtable, Cassandra provides an optional row cache and key cache. The row cache stores a consolidated, up-to-date version of a row, while the key cache acts as an index into the SSTables. If these caches are used, write operations have to keep them updated.
• 33. It is worth mentioning that only previously accessed rows are cached in Cassandra, in both caches. As a result, new rows are written only to the Memtable, not to the caches.

In order to deliver the lowest possible latency and the best performance on low-end hardware, Cassandra writes data in a multi-step process: requests are first written to the commit log, then to a Memtable structure and eventually, when flushed, they are appended to disk as immutable SSTable files. Over time, as the number of SSTables grows, the data becomes fragmented, which hurts read performance.

Put simply, flushing and compaction are vitally important operations for Cassandra. However, if configured incorrectly or executed at the "wrong" time, they can decrease performance significantly, at times making an entire Cassandra node unresponsive. This is exactly what was happening during the test, when several nodes stopped responding, showed very high system load and performed huge amounts of I/O. Evidently, Cassandra's configuration had been tuned for the spinning disks in AWS, resulting in unexpected behavior on the SSD-based VCC storage stack.

As a first measure to gain better visibility into Cassandra's operation, the DataStax OpsCenter application was deployed. It allowed iterating over various parameters and executing a number of tests against the Cassandra cluster while measuring their impact and observing overall cluster behavior.

Applying all the lessons learned earlier and working with the VCC storage team, the following configuration changes were applied:

< … removed … >

Table 5: Optimized Cassandra - Recommended Settings

Similar to the MySQL optimization, the basic idea is to perform smaller, more frequent I/O so that the block-device queues saturate less and the storage stack resources are utilized more evenly.

Besides the recommended option changes, the commit log was moved to a separate volume. These changes led to predictable and consistent Cassandra performance, forcing in-memory data to disk evenly and continuously, avoiding I/O spikes and minimizing stalls due to compaction. Below is a summary of the volumes created for the Cassandra nodes:

xvda  600 IOPS – boot and root
xvdb  600 IOPS – lvm2 root extension
xvdc 4600 IOPS – data mdadm stripe disk 1 – no partitioning
xvde 4600 IOPS – data mdadm stripe disk 2 – no partitioning
xvdf 4600 IOPS – data mdadm stripe disk 3 – no partitioning
xvdg 5000 IOPS – commit log disk – no partitioning
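For completeness, here is a sketch of how such a layout might be assembled and wired into cassandra.yaml. The device names follow the table above, while the filesystem type, mount points and RAID parameters are illustrative assumptions rather than the exact commands used:

# mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdc /dev/xvde /dev/xvdf
# mkfs.ext4 /dev/md0 && mkfs.ext4 /dev/xvdg
# mkdir -p /var/lib/cassandra/data /var/lib/cassandra/commitlog
# mount /dev/md0 /var/lib/cassandra/data
# mount /dev/xvdg /var/lib/cassandra/commitlog

The corresponding cassandra.yaml entries would then point the data files at the stripe and the commit log at its dedicated volume:

data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog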
• 34. There are two more parameters worth mentioning, which control the streaming and compaction throughput limits within the Cassandra cluster. Both values were set to 50 MB/s, which is sufficient for normal cluster operation and in line with the storage sub-system throughput configured on the Cassandra nodes. Sometimes, however, those thresholds need to be changed. During cluster rebalancing, maintenance and similar operations, the following handy shortcuts may be used to control the thresholds cluster-wide:

# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 150 ; done
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 150 ; done

Obviously, once maintenance has completed, those thresholds should be set back to values appropriate for normal production use.

PART X – HAPROXY

With the DB layer fixed, application performance became stable across tests, although two points were still raising concerns:
- After an initial spike at the beginning of a load test, the number of concurrent connections abruptly dropped by almost a factor of two
- The number of Virtual User requests reaching each application server differed noticeably, at times approaching a 1:2 ratio
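As a side note, per-server distribution can also be checked directly on the load-balancer, independently of the APM, via HAProxy's built-in statistics page. A minimal sketch of enabling it – the listener port, URI and refresh interval are illustrative:

listen stats
    bind 0.0.0.0:8404
    mode http
    stats enable
    stats uri /haproxy?stats
    stats refresh 10s

The resulting page shows current and cumulative sessions per backend server, which makes uneven request distribution immediately visible.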
• 35. Figure 25: Jennifer APM - Concurrent Connections and Per-server Arrival Rate

It was time to take a closer look at the software load-balancers based on HAProxy. This application is known to be able to serve 100K+ concurrent connections, so a mere one thousand concurrent connections should not get anywhere close to the limit.

Additional research showed that the round-robin load-balancing scheme was not performing as expected and was concentrating requests on one or another system in an unpredictable manner. The most even request distribution was achieved with the least-connections algorithm (see the configuration sketch below). After implementing this change, the load eventually spread evenly across all systems.

Figure 26: Jennifer APM - Connection Statistics After Optimization

Furthermore, a number of SYN flood kernel warnings in the log files, as well as nf_conntrack complaints (the Linux connection tracking facility used by iptables) about overrun buffers and dropped connections, pointed to the next optimization steps.

Initially, it was decided to increase the size of the connection tracking tables and internal structures and to disable the SYN flood protection mechanisms.

< … removed … >

This did show some improvement; however, eventually it was decided to turn iptables off completely to remove any possible obstacles and latency introduced by this facility.
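Coming back to the balancing algorithm, the switch amounts to a single directive in the backend definition. A minimal sketch – the backend name, server names and addresses are illustrative:

backend tomcat_servers
    mode http
    balance leastconn
    server app01 10.0.1.11:8080 check
    server app02 10.0.1.12:8080 check

With leastconn, each new request goes to the server with the fewest established connections, which is what eventually evened out the load shown in Figure 26.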
• 36. During subsequent tests, when the generated load was increased further, HAProxy hit another issue, often referred to as "TCP socket exhaustion".

A quick reminder – there were two layers of HAProxies deployed. The first layer load-balanced the incoming http requests originating from the application clients across the java application server (tomcat) instances, and the second layer passed requests from the java application servers to the primary and stand-by MySQL DB servers.

HAProxy works as a reverse proxy and therefore uses its own IP address to establish connections to the server. Most operating systems implementing a TCP stack have around 64K (or fewer) TCP source ports available for connections to a remote IP:port. Once a combination of "source IP:port => destination IP:port" is in use, it cannot be re-used. As a consequence, there cannot be more than 64K open connections from a HAProxy box to a single remote IP:port pair.

On the front layer the http request rate was a few hundred per second, so we would never get anywhere near the limit of 64K simultaneously open connections to the remote service. On the backend layer there should not have been more than a couple of hundred persistent connections at peak time, since connection pooling was used on the application server. So this was not the problem either.

It turned out that there was an issue with the MySQL client implementation. When a client sends its "QUIT" sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump revealed this behavior. Note that this issue cannot be reproduced on a loopback interface or on the same system, because the server answers fast enough; but over a LAN connection between two different machines the latency rises past the threshold where the issue becomes apparent. Basically, here is the sequence performed by a MySQL client:

MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client ==>       FIN       ==> MySQL Server
MySQL Client <==     FIN ACK     <== MySQL Server
MySQL Client ==>       ACK       ==> MySQL Server

Since the client closes first, its side of the connection remains in TIME_WAIT and the source port stays unavailable for twice the MSL (Maximum Segment Lifetime), which defaults to 2 minutes. Note that this type of close has no negative impact when the MySQL connection is established over a UNIX socket.
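The pattern is easy to confirm on the load-balancer host itself; a sketch of the kind of checks involved, where the interface name and the default MySQL port 3306 are illustrative assumptions:

# tcpdump -nn -i eth0 'tcp port 3306 and (tcp[tcpflags] & (tcp-fin|tcp-rst)) != 0'
# ss -tn state time-wait '( dport = :3306 )' | wc -l

The first command shows which side sends the FIN first on the MySQL leg, and the second counts local sockets stuck in TIME_WAIT towards the DB backend – a steadily growing count during a test is the tell-tale sign of the source-port exhaustion described above.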