This paper is dedicated to the epic effort undertaken to improve application performance severalfold through systematic analysis and tuning of each application component.
The ultimate question: which platform is the best for your application? It can be answered quite simply: the one that helps you realize all the performance your workload requires and that you are paying for.
The Cloud Story or Less is More…
by Slava Vladyshevsky
slava[at]verizon.com

Dedicated to Lee, Sarah, David, Andy and Jeff, as well as many others, who went above and beyond to make this possible.

“Cache is evil. Full stop.” – Jeff
Table of Content
PART I – BUILDING TESTBED
PART II – FIRST TEST
PART III – STORAGE STACK PERFORMANCE
PART IV – DATABASE OPTIMIZATION
PART V – PEELING THE ONION
PART VI – PFSENSE
PART VII – JMETER
PART VIII – ALMOST THERE
PART IX – CASSANDRA
PART X – HAPROXY
PART XI – TOMCAT
PART XII – JAVA
PART XIII – OS OPTIMIZATION
PART XIV – NETWORK STACK
Figure Register
AWS Application Deployment
Initial VCC Application Deployment
First Test Results – Comparison Chart
First Test – High CPU Load on DB Server
First Test – High CPU %iowait on DB Server
First Test – Disk I/O Skew on DB Server
Optimized Storage Subsystem Throughput
AWS i2.8xlarge CPU Load – Sysbench Test Completed in 64.42 sec
VCC 4C-28G CPU Load – Sysbench Test Completed in 283.51 sec
InnoDB Engine Internals
Optimized MySQL DB – QPS Graph
Optimized MySQL DB – TPS and RT Graph
Optimized MySQL DB – RAID Stripe I/O Metrics
Optimized MySQL DB – CPU Metrics
Optimized MySQL DB – Network Metrics
Jennifer APM Console
Initial Application Deployment – Network Diagram
Jennifer XView – Transaction Response Time Scatter Graph
Jennifer APM – Transaction Introspection
Iterative Optimization Progress Chart
Jennifer XView – Transaction Response Time Surges
VCC Cassandra Cluster CPU Usage During the Test
AWS Cassandra Cluster CPU Usage During the Test
High-Level Cassandra Architecture
Jennifer APM – Concurrent Connections and Per-server Arrival Rate
Jennifer APM – Connection Statistics After Optimization
Jennifer APM – DB Connection Pool Usage
JVM Garbage Collection Analysis
JVM Garbage Collection Analysis – Optimized Run
XEN PV Driver and Network Device Architecture
Recommended Network Optimizations
Last Performance Test Results
Table Register
Major Infrastructure Limits
AWS Infrastructure Mapping and Sizing
VCC Infrastructure Mapping and Sizing
Optimized MySQL DB – Recommended Settings
Optimized Cassandra – Recommended Settings
Network Parameter Comparison
PREFACE

One of the market-leading enterprises, hereinafter called Customer, has multiple business units working in various areas, ranging from consumer electronics to mobile communications and cloud services. One of their strategic initiatives is to expand software capabilities to get on top of the competition.

The Customer started to use the AWS platform for development purposes and as the main hosting platform for their cloud services. Over the past years the usage of AWS grew significantly, with over 30 production applications currently hosted on AWS infrastructure. While the Customer's reliance on AWS increased, the number of pain points grew as well. They experienced multiple outages and had to bear unnecessarily high costs to grow application performance and to accommodate unbalanced CPU/memory hardware profiles. Although the achieved application performance was satisfactory in general, several major challenges and trends emerged over time:
- Scalability and growth issues
- Very high overall infrastructure and support costs
- Single service provider lock-in

Verizon proposed to trial the Verizon Cloud Compute (VCC) beta product as an alternative hosting platform, with the goal of demonstrating that on-par application performance can be achieved at a much lower cost, effectively addressing one of the biggest challenges. An alternative hosting platform would give the Customer freedom of choice, thus addressing another issue. Last, but not least, the unique VCC platform architecture and infrastructure stack, built for low-latency and high-performance workloads, would definitely help to address another pain point – application performance and scalability.

Senior executives from both companies supported this initiative and one of the Customer's applications was selected for the proof of concept project. The objective was to compare the AWS and VCC deployments side by side from both capability and performance perspectives, execute performance tests and deliver a report to senior management.

The proof of concept project has been successfully executed in close collaboration between various Verizon teams as well as the Customer's SMEs. It was demonstrated that the application hosted on the VCC platform, given appropriate tuning, is capable of delivering better performance than when hosted on a more powerful AWS-based footprint.
PART I – BUILDING TESTBED

The agreed high-level plan was clear and straightforward:
• (Verizon) Mirror the AWS hosting infrastructure using the VCC platform
• (Verizon) Set up infrastructure, OS and applications per the specification sheet
• (Customer) Adjust necessary configurations and settings on the VCC platform
• (Customer) Upload test data – 10 million users, 100 million contacts
• (Customer) Execute smoke, performance and aging tests in the AWS environment
• (Customer) Execute smoke, performance and aging tests in the VCC environment
• (Customer) Compare AWS and VCC results and captured metrics
• (Customer) Deliver a report to senior management

The high-level diagram below depicts the application infrastructure hosted on the AWS platform.

Figure 1: AWS Application Deployment
Although both the AWS and VCC platforms use the XEN hypervisor at their core, the initial step – mirroring the AWS hosting environment by provisioning equally sized VMs in VCC – raised the first challenge. The Verizon Cloud Compute platform, in its early beta stage, imposed a number of limitations. To be fair, those limitations were neither by design nor hardware limits, but rather software or configuration settings pertinent to the corresponding product release. The table below summarizes the most important infrastructure limits for both cloud platforms as of February 2014:

Resource Limit            VCC        AWS
VPUs per VM               8          32
RAM per VM                28 GB      244 GB
Volumes per VM            5          20+
IOPS per Volume (SSD)     3000       4000
Max Volume Size           1 TB       1 TB
Guaranteed IOPS per VM    15K        40K
Throughput per vNIC       500 Mbps   10 Gbps

Table 1: Major Infrastructure Limits

Besides the obvious points, like the number of CPUs or the huge difference in network throughput, it is also worth mentioning that the CPU/RAM ratio – processor count to memory size – is quite different as well: 1:4.5 for VCC and 1:7.625 for AWS, respectively. This ratio is crucial for certain types of applications, specifically for databases.

Despite the aforementioned differences, it was jointly decided with the Customer to move forward with smaller VCC VMs and to take the sizing ratio into account while comparing performance and test results. This already set the expectation that VCC results might be lower compared to AWS, assuming linear application scalability and a 4-8x hardware footprint difference.

The tables below summarize infrastructure sizing and mapping for the corresponding service layers hosted on both cloud platforms. Resources sized differently on the corresponding platforms are highlighted.
AWS VM Profile
VM Role     Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat      2       4      34.2      -      1000
MySQL       1       32     244       10K    10000
Cassandra   8       8      68.4      5K     1000
HA Proxy    4       2      7.5       -      1000
DB Cache    2       4      34.2      -      1000

Table 2: AWS Infrastructure Mapping and Sizing

VCC VM Profile
VM Role     Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat      2       4      28        -      500
MySQL       1       4      28        9K     500
Cassandra   12      4      28        5K     500
HA Proxy    4       2      4         -      500
DB Cache    2       4      28        -      500

Table 3: VCC Infrastructure Mapping and Sizing
The initial setup of the disk volumes required special creativity in order to get as close as possible to the required number of IOPS. In addition to the per-disk storage limits mentioned above, there was initially another VCC limitation in place, luckily addressed later: all disks connected to a particular VM had to be provisioned with the exact same IOPS rate.

The most common setup used was based on LVM2, with a linear extension for the boot disk volume group and either two or three additional disks aggregated into an LVM stripe set. This setup allowed building disk volumes of up to 3 TB in size and 9000 IOPS, getting close enough to the required 10K IOPS for the database VMs.
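
For illustration, a minimal sketch of such a stripe set, assuming three hypothetical 3000-IOPS data disks exposed as /dev/xvdb, /dev/xvdc and /dev/xvdd (device names, stripe size and disk count will differ per VM):

# aggregate three data disks into one striped logical volume
pvcreate /dev/xvdb /dev/xvdc /dev/xvdd
vgcreate vg_data /dev/xvdb /dev/xvdc /dev/xvdd
# -i 3: stripe across all three PVs, -I 64: 64K stripe size
lvcreate -i 3 -I 64 -l 100%FREE -n lv_data vg_data
mkfs.ext4 /dev/vg_data/lv_data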
Besides the technical limitations, the sheer volume of provisioning and configuration work presented a challenge in itself. The hosting platform requirements were captured in a spreadsheet listing system parameters for every VM. Following this spreadsheet manually and building out the environment sequentially would have required significant time and tremendous manual effort. Additionally, it may have resulted in a number of human errors and omissions. Automating and scripting major parts of the installation and setup process addressed this. The automation suite, implemented on top of the vzDeploymentFramework shell library (a Verizon internal development), made it possible in a matter of minutes to (a simplified sketch of the roll-out-by-role idea follows this list):
- Parse the specification spreadsheet for inputs and updates
- Generate updated OS and application configurations
- Create LVM volumes or software RAID arrays
- Roll out updated settings to multiple systems based on their functional role
- Change Linux iptables-based firewall configurations across the board
- Validate required connectivity between hosts
- Install required software packages
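
The following is a hypothetical, stripped-down illustration of the roll-out-by-role idea, not the actual vzDeploymentFramework code; the host inventory file, bundle paths and service handling are invented for the example:

# push a prepared config bundle to every host carrying a given role
ROLE=$1                       # e.g. "cassandra"
HOSTS=$(awk -F, -v r="$ROLE" '$2 == r {print $1}' hosts.csv)
for h in $HOSTS; do
    scp "conf/${ROLE}.tar.gz" "root@${h}:/tmp/" &&
    ssh "root@${h}" "tar -C / -xzf /tmp/${ROLE}.tar.gz && service ${ROLE} restart"
done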
Having all configurations in a version-controlled repository allowed auditing and comparing configurations between the master and the on-host deployed versions, providing rudimentary configuration management capabilities.

Below is the high-level architecture of the originally implemented test environment.
Figure 2: Initial VCC Application Deployment

The test load was initiated by a JMeter Master (test controller and management GUI) and generated by several JMeter Slaves (load generators, or test agents). The generated virtual user (VU) requests were load-balanced between two Tomcat application servers, each running a single application instance. Since F5 LTM instances were not available during the build time, the proposed design utilized pfSense appliances as routers, load-balancers or firewalls for the corresponding VLANs. The Tomcat servers communicated via another pair of HAProxy load-balancers with two persistent storage back-ends – MySQL (SQL DB) and Cassandra (NoSQL DB) – employing Couchbase (DB Cache) as a caching layer.

Most systems were additionally instrumented with NMON collectors for gathering key performance metrics. A Jennifer APM application was deployed to perform real-time transaction monitoring and code introspection.

Following the initial plan, the hosting environment was handed over on time to the Customer for adjusting configurations and uploading test data.
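
As a side note, a distributed JMeter run of this shape is typically kicked off from the master in non-GUI mode; a minimal sketch, where the slave host names and test plan file are placeholders:

# run plan.jmx on both remote load generators, collect results locally
jmeter -n -t plan.jmx -R jmeter-slave1,jmeter-slave2 -l results.jtl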
PART II – FIRST TEST

The first test was conducted on both the AWS and VCC platforms, and the Customer shared the test results. During the test the load was ramped up using 100 VU increments for each subsequent 10-minute test run. During each run the corresponding number of virtual users performed various API calls, emulating human behavior using patterns observed and measured on the production application. The chart below depicts the number of application transactions successfully processed by each platform during the 10-minute test runs.

Figure 3: First Test Results – Comparison Chart (TPS per VU count)

VU count      200   300   400   500   600   700   800   900
AWS TPS       321   462   539   627   637   645   651   654
Verizon TPS   203   256   269   257   275   249   268   247

It was obvious that the AWS infrastructure is more powerful, processing more than two times higher throughput, which did not come as a big surprise. However, the Customer expressed several concerns about overall VCC platform stability, low MySQL DB server performance and uneven load distribution between the striped data volumes, dubbed I/O skews.
Indeed, the application "Transactions per Second" (TPS) measurements did not correlate well with the generated application load, and even with a growing number of users something prevented the application from taking off. After short increases the overall throughput consistently dropped again, clearly pointing to a bottleneck limiting the transaction stream. According to the Jennifer APM monitors, the increase in application transaction times was caused by slow DB responses, taking 5 seconds and more per single DB operation. At the same time the DB server was showing very high CPU %iowait, fluctuating around 85-90%.

Figure 4: First Test – High CPU Load on DB Server

Figure 5: First Test – High CPU %iowait on DB Server

Furthermore, out of the three stripes forming the data volume, one volume constantly reported significantly higher device wait times and utilization percentage, effectively causing disk I/O skews.

Figure 6: First Test – Disk I/O Skew on DB Server

Obviously, these test results were not acceptable. Investigating and identifying the bottlenecks and performance-limiting factors required good knowledge of the application architecture and its internals, as well as deep VCC product and storage stack knowledge, since the latter two issues seemed to be platform- and infrastructure-related. To address this, a dedicated cross-team taskforce was established.
PART III – STORAGE STACK PERFORMANCE

The VCC storage stack was validated once more and it was reconfirmed that there are no limiting factors or shortcomings on the layers below the block device. The resulting conclusion was that the limitations had to be on the hypervisor, OS or application layer. On the other hand, the Customer confirmed that the AWS deployment used exactly the same configuration and application versions as VCC. The only possible logical conclusion was that the setup and configuration optimal for AWS does not perform the same way on VCC. In other words, the VCC platform required its own optimal configuration. Further efforts were aligned with the following objectives:
- Improve storage throughput and address I/O skews
- Identify the root cause of the low DB server performance
- Improve DB server performance and scalability
- Work with the Customer on improving overall VCC deployment performance
- Re-run the performance tests and demonstrate improved throughput and predictable performance levels
Originally the storage volumes were set up using the Customer's specifications and OS defaults for the other parameters. After performing research and a number of component performance tests, several interesting discoveries were made, in particular:
- Different Linux distributions (Ubuntu and CentOS) use a different approach to disk partitioning: Ubuntu aligned partitions for 4k block sizes, while CentOS did not
- The default block device scheduler, CFQ, is not a good choice in environments using virtualized storage
- The MDADM and LVM volume managers use quite different algorithms for I/O batching and compaction
- The XFS and EXT4 file-systems yield very different results depending on the number of concurrent threads performing I/O
- Due to all the Linux optimizations and multiple caching levels, it is hard enough to measure net storage throughput from within a VM, let alone through the entire application stack

After a number of trials and studying platform behavior, the following was suggested for achieving optimal I/O performance on the VCC storage stack (a setup sketch follows the list):
- Use raw block devices instead of partitions for RAID stripes, to circumvent any partition block alignment issues
- Use MDADM software RAID instead of LVM (the latter is more flexible and may be used in combination with MDADM; however, it performs a certain amount of "optimization" assuming spindle-based storage, which may interfere with performance in VCC)
- Use proper stripe settings and block sizes for software RAID (don't let the system guess – specify!)
- Use the EXT4 file-system instead of XFS. EXT4 provides journaling for meta-data and data instead of meta-data only, with negligible performance overhead for the load observed
- Use optimal (and safe) settings for EXT4 file-system creation and mounts
- Ensure the NOOP block device scheduler is used (which lets the underlying storage stack, from the hypervisor down, optimize block I/O more effectively)
- Separate different I/O profiles, e.g. sequential I/O (redo/bin-log files) and random I/O (data files) for the DB server, by writing the corresponding data to separate logical disks
- Use DIRECT_IO wherever possible and avoid OS/file-system caching (in certain situations the cache may give a false impression of high performance, which is then abruptly interrupted by the flushing of massive caches, during which the entire VM gets blocked)
- Avoid I/O bursts due to cache flushing and keep the device queue length close to 8. This corresponds to a hardware limitation on the chassis NPU. In VCC, storage is very low-latency and quick, but if the storage queue locks up, the entire VM gets blocked. Writing early and often at a consistent rate performs dramatically better under load than caching in RAM as long as possible and then flooding the I/O queue when the cache has been exhausted
- Make sure the network device driver is not competing with the block device drivers and the application for CPU time, by relocating the associated interrupts to different vCPU cores inside the VM
- Use 4K blocks for I/O operations wherever possible for more optimal storage stack operation
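
To make several of these points concrete, below is a minimal sketch of the MDADM/EXT4/NOOP combination, again assuming three hypothetical raw data disks /dev/xvdb, /dev/xvdc and /dev/xvdd; the chunk size, mount options and device names are illustrative, not the exact values used in the project:

# RAID0 stripe over raw block devices (no partitions), explicit chunk size
mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=64 \
      /dev/xvdb /dev/xvdc /dev/xvdd
# EXT4 with the stripe geometry spelled out:
# stride = chunk/block = 64K/4K = 16; stripe-width = stride * 3 disks = 48
mkfs.ext4 -b 4096 -E stride=16,stripe-width=48 /dev/md0
# NOOP scheduler on the member devices, letting the hypervisor stack optimize I/O
for d in xvdb xvdc xvdd; do echo noop > /sys/block/$d/queue/scheduler; done
mount -o noatime /dev/md0 /data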
After implementing these suggestions on a DB server, the storage subsystem yielded predictable and consistent performance. For example, data volumes set up with 10K IOPS reported ~39 MB/s throughput, which is the expected maximum assuming a 4K I/O block size:

(4K * 10000 IOPS) / 1024 = 39.06 MB/s, the maximum possible throughput
(4K * 15000 IOPS) / 1024 = 58.59 MB/s, the maximum possible throughput

With a 15K IOPS setup using 3 stripes (5K IOPS each), ~55-56 MB/s throughput was achieved, as shown on the screenshot below:
Figure 7: Optimized Storage Subsystem Throughput

Although some minor deviation in the I/O figures (+/- 5%) was still observed, this is typically considered acceptable and within the normal range.

While performing additional tests on the optimized systems, it was observed that all block device interrupts were being served by CPU0, which was becoming a hot spot even with the netdev interrupts moved off to different CPUs. The following method may be used to spread block device interrupts evenly for the devices implementing RAID stripes:
# distribute block device interrupts between CPU4-CPU7
# inspect the current IRQ assignments and affinity masks first
cat /proc/interrupts
cat /proc/irq/183[3-6]/smp_affinity*
# smp_affinity is a CPU bitmask: 0x80=CPU7, 0x40=CPU6, 0x20=CPU5, 0x10=CPU4, 0x8=CPU3
echo 80 > /proc/irq/1836/smp_affinity
echo 40 > /proc/irq/1835/smp_affinity
echo 20 > /proc/irq/1834/smp_affinity
echo 10 > /proc/irq/1833/smp_affinity
echo 8 > /proc/irq/1838/smp_affinity
Please note that IRQ numbers and assignments may differ on your system. You have to consult the /proc/interrupts table for the specific assignments pertinent to your system.

For additional details and theory, please refer to the following online materials:
http://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
https://www.kernel.org/doc/ols/2009/ols2009-pages-235-238.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt
PART IV – DATABASE OPTIMIZATION

Since the Customer hadn't yet shared the application and testing know-how, the only way to reproduce the abnormal DB behavior seen during the test was to replay the DB transaction log against a DB snapshot recovered from backup. This was a slow, cumbersome and not fully repeatable process. The Percona tools were really instrumental for this task, allowing multithreaded transaction replay while inserting delays between transactions as recorded. A plain SQL script import would have been processed by a single thread only, and all requests would have been processed as one stream.

Although the transaction replay did create some DB server load, the load type and its I/O patterns were quite different compared to the I/O patterns observed during the test. The transaction logs included only DML statements (insert, update, delete), but no data read (select) requests. Knowing that those "select" requests represented 75% of all requests, it quickly became apparent that such a testing approach is flawed and would not be able to recreate real-life conditions.

We came to a point where more advanced tools and techniques were required for iterating over various DB parameters in a repeatable fashion while measuring their impact on DB performance and the underlying subsystems. Moreover, it was not clear whether the unexpected DB behavior and performance issues were caused by the virtualization infrastructure, the DB engine settings, or the way the DB was used, i.e. the combination of application logic and the data stored in the DB tables. To separate those concerns, it was proposed to perform load tests using synthetic OLTP transactions generated by sysbench, a well-known load-testing toolkit. Such tests were executed on both the VCC and AWS platforms. The results were speaking for themselves.
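
For reference, a sysbench OLTP comparison run of this kind looks roughly like the sketch below (sysbench 0.4-era syntax; the host, credentials, table size and thread count are placeholders, not the project's actual values):

# populate the test table, then run a timed OLTP workload against it
sysbench --test=oltp --mysql-host=db1 --mysql-user=sbtest --mysql-password=secret \
         --oltp-table-size=10000000 prepare
sysbench --test=oltp --mysql-host=db1 --mysql-user=sbtest --mysql-password=secret \
         --oltp-table-size=10000000 --num-threads=64 --max-time=300 --max-requests=0 run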
Figure 8: AWS i2.8xlarge CPU Load – Sysbench Test Completed in 64.42 sec

Figure 9: VCC 4C-28G CPU Load – Sysbench Test Completed in 283.51 sec
At this point it was clear that the DB server's performance issues had nothing to do with the application logic and were not specific to the SQL workload, but rather related to configuration and infrastructure. The OLTP test provided the capability to stress test the DB engine and optimize it independently, without having to rely on the Customer's application know-how and the solution-wide test harness.

Thorough research and study of the InnoDB engine began…
Studying the source code, as well as consulting the following online resources, was key to a clear understanding of the DB engine internals and its behavior:
- http://www.mysqlperformanceblog.com
- http://www.percona.com
- http://dimitrik.free.fr/blog/
- https://blog.mariadb.org

The drawing below, published by Percona engineers, shows the key factors and settings impacting DB engine throughput and performance.

Figure 10: InnoDB Engine Internals
Obviously, there is no quick win and no single dial to turn in order to achieve the optimal result. It is easy to explain the main factors impacting InnoDB engine performance, though optimizing those factors in practice is a quite challenging task.
InnoDB Performance – Theory and Practice

The two most important parameters for InnoDB performance are innodb_buffer_pool_size and innodb_log_file_size. InnoDB works with data in memory, and all changes to data are performed in memory. In order to survive a crash or system failure, InnoDB logs changes into the InnoDB transaction logs. The size of the InnoDB transaction log defines how many changed blocks are tolerated in memory at any given point in time. The obvious question is: why can't we simply use a gigantic InnoDB transaction log? The answer is that the size of the transaction log affects recovery time after a crash. The rule of thumb (until recently) was: the bigger the log, the longer the recovery time.

So, we have the innodb_log_file_size variable. Let's imagine it as some distance on an imaginary axis. Our current state is the checkpoint age, which is the age of the oldest modified non-flushed page. The checkpoint age is located somewhere between 0 and innodb_log_file_size. Point 0 means there are no modified pages. The checkpoint age can't grow past innodb_log_file_size, as that would mean we would not be able to recover after a crash.

In fact, InnoDB has two safety nets, or protection points: "async" and "sync". When the checkpoint age reaches the "async" point, InnoDB tries to flush as many pages as possible while still allowing other queries; however, throughput drops to the floor. The "sync" stage is even worse: when we reach the "sync" point, InnoDB blocks other queries while trying to flush pages and return the checkpoint age to a point before "async". This is done to prevent the checkpoint age from exceeding innodb_log_file_size. Both are abnormal operational stages for InnoDB and should be avoided at all cost. In current versions of InnoDB, the "sync" point is at about 7/8 of innodb_log_file_size, and the "async" point is at about 6/8 = 3/4 of innodb_log_file_size.

So, there is one critically important balancing act: on the one hand we want the checkpoint age to be as large as possible, as it defines performance and throughput; but on the other hand, we should never reach the "async" point.
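
As a quick way to watch this balancing act on a live server, the checkpoint age can be derived from SHOW ENGINE INNODB STATUS; a minimal sketch, assuming a local mysql client with credentials in ~/.my.cnf (field positions in the status output may vary slightly across MySQL versions):

# checkpoint age = current LSN minus the last checkpoint LSN
STATUS=$(mysql -e 'SHOW ENGINE INNODB STATUS\G')
LSN=$(echo "$STATUS"  | awk '/Log sequence number/ {print $4}')
CKPT=$(echo "$STATUS" | awk '/Last checkpoint at/  {print $4}')
LOG=$(mysql -N -e 'SELECT @@innodb_log_file_size * @@innodb_log_files_in_group')
echo "checkpoint age : $((LSN - CKPT)) bytes"
echo "async watermark: $((LOG * 6 / 8)) bytes (~6/8 of log capacity)"
echo "sync watermark : $((LOG * 7 / 8)) bytes (~7/8 of log capacity)"

Note that the watermarks here are computed over the combined log group capacity (file size times number of files).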
The idea is to define another point T (target), located before "async", in order to have a gap for flexibility, and to try at all cost to keep the checkpoint age from going past T. We assume that if we can keep the checkpoint age in the range 0 – T, we will achieve stable throughput even for a more or less unpredictable workload.

Now, which factors affect the checkpoint age? When we execute DML queries that change data (insert/update/delete), we perform writes to the log, we change pages, and the checkpoint age grows. When we flush changed pages, the checkpoint age goes down again. That means the main lever we have for keeping the checkpoint age around point T is to change the number of pages flushed per second, or to make this number variable and suited to the specific workload. That way, we can keep the checkpoint age down. If this doesn't help and the checkpoint age keeps growing beyond T towards "async", we have a second control mechanism: we can add a delay to the insert/update/delete operations. This way we prevent the checkpoint age from growing and reaching "async".

To summarize, the idea of the optimization algorithm is: under load we must keep the checkpoint age around point T by increasing or decreasing the number of pages flushed per second. If the checkpoint age continues to grow, we need to throttle throughput to prevent further growth. The throttling depends on the position of the checkpoint age: as it gets closer to "async", we need higher levels of throttling.
From Theory to Practice – Test Framework

There is a saying: in theory, there is no difference between theory and practice, but in practice there is… In practice, there are a lot more variables to bear in mind. Factors such as I/O limits, thread contention and locking also come into play, and improving performance becomes more like solving an equation with a number of variables that all depend on each other.

Obviously, to be able to iterate over various parameter and setting combinations, there is a need to execute DB tests in a repeatable and well-defined (read: automated) manner, while capturing test results for correlation and further analysis. Quick research showed that although there are many load-testing frameworks available, with some specifically tailored for testing MySQL DB performance, unfortunately none of them would cover all requirements and provide the needed tools and automation. Eventually, we developed our own fully automated and flexible load-testing framework. This framework was mainly used to stress test and analyze MySQL and InnoDB behavior; nonetheless, it is open enough to plug in any other tools or to be used for testing different applications. The developed toolkit includes the following components (a simplified sketch of the core loop follows this list):
- Test Runner
- Remote Test Agent (load generator)
- Data Collector (sampler)
- Data Processor
- Graphing facility
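
For flavor, the core of such a test runner can be imagined as a parameter sweep; this is a hypothetical, heavily simplified illustration of the idea, not the actual toolkit (the swept variables, values and paths are invented):

# sweep two dynamic InnoDB settings, run a fixed workload for each combination,
# and capture iostat samples alongside the sysbench output
for IO_CAP in 2000 5000 10000; do
  for DIRTY_PCT in 50 75 90; do
    TAG="cap${IO_CAP}_dirty${DIRTY_PCT}"
    mysql -e "SET GLOBAL innodb_io_capacity=${IO_CAP};
              SET GLOBAL innodb_max_dirty_pages_pct=${DIRTY_PCT};"
    iostat -x 5 > "samples/${TAG}.iostat" &    # background sampler
    SAMPLER=$!
    sysbench --test=oltp --num-threads=64 --max-time=600 --max-requests=0 run \
        > "results/${TAG}.out"
    kill $SAMPLER                              # stop sampling for this run
  done
done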
Using this framework it was possible to identify the optimal MySQL and InnoDB engine configuration. The goal was to deliver the best possible InnoDB engine performance in terms of transactions and queries served per second (TPS and QPS) while eliminating I/O spikes and achieving consistent and predictable system load; in other words, fulfilling the critically important balancing act mentioned above: keeping the checkpoint age as large as possible while trying not to reach the "async" (or even worse, "sync") point.

The graphs below show that an optimally configured DB server can easily deliver 1000+ OLTP transactions per second, translating to 20+K queries per second, generated by 500 concurrent DB connections during a 6-hour-long test.
Figure 11: Optimized MySQL DB – QPS Graph (queries per second in green)

After a warm-up phase the system consistently delivered about 22K queries per second.

Figure 12: Optimized MySQL DB – TPS and RT Graph (transactions per second in green, response time in blue)

After ramping the load up to 500 concurrent users, the system consistently delivered 1200 TPS on average. The average response time of 1600ms is measured end to end and includes both network and communication overhead (~1000ms) and SQL processing time (~600ms).
Figure 13: Optimized MySQL DB – RAID Stripe I/O Metrics (%util in red, await in green, avgqu-sz in blue)

It is easy to see that after the warm-up and stabilization phases the disk stripe performed consistently, with an average disk queue size of ~8, which was suggested by the storage team as the optimum value for the VCC storage stack. The "await" iostat metric, the average time for I/O requests to be issued to the device and served, stays constantly below 20ms. Device utilization is below 25% on average, showing that there is still plenty of spare capacity to serve I/O requests.
Figure 14: Optimized MySQL DB – CPU Metrics (%idle in red, %user in green, %system in blue, %iowait in yellow)

The CPU metrics show that on average 55% of the CPUs were idle, 35% were spent in user space, i.e. executing applications, 5% were spent on kernel (or system) tasks including interrupt processing, and just 5% were spent waiting for device I/O.
bytes sent - green bytes received - blue
Figure
15:
Optimized
MySQL
DB
-‐
Network
Metrics
The
network
traffic
measurement
suggests
that
network
capacity
is
fully
consumed,
or
using
other
words
–
network
is
saturated
with
~48
MB/s
sent
and
~2
MB/s
received.
These
50
MB/s
of
accumulative
traffic
getting
very
close
to
a
practical
maximum
throughput
that
can
be
achieved
on
the
500
Mbps
network
interface.
In
plain
English
this
means
that
network
is
the
limiting
factor
here
and
having
other
resources
available,
DB
server
could
deliver
much
higher
TPS
and
QPS
figures,
if
additional
network
capacity
can
be
provisioned.
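
For reference, per-interface throughput of this kind can be sampled with sysstat's sar (the interface name is an assumption):

# Report network interface throughput every 10 seconds (eth0 assumed);
# the rxkB/s and txkB/s columns correspond to the graph above.
sar -n DEV 10 | grep eth0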
The ultimate system capacity limit was not established, due to time constraints and the fact that the Customer's application did not utilize more than 300 concurrent DB connections.

Optimal DB Configuration

Below is a summary of the major changes between the MySQL database configurations on the AWS and VCC platforms. As with the file-system configuration, the objective was to achieve consistent and predictable performance by avoiding resource usage surges and stalls. The proposed optimizations may have a positive effect in general; however, they are specific to a certain workload and use-case. Therefore, these optimizations cannot be considered universally applicable in VCC environments and must be tailored to a specific workload. Settings marked with an asterisk (*) are defaults for the DB version used.

< … removed … >

Table 4: Optimized MySQL DB – Recommended Settings

Besides the parameter changes listed above, the binary logs (also known as transaction logs) have been moved to a separate volume, where an Ext4 file-system has been set up with the following parameters:

< … removed … >
Further areas for DB improvement:
- Consider using the latest stable Percona XtraDB release, an enhanced InnoDB fork (also used by MariaDB) that provides many improvements, including patches from Google and Facebook:
o Redesign of the locking subsystem, with no reliance on kernel mutexes
o The latest versions have removed a number of known contention points, resulting in fewer spins and lock waits and, eventually, better overall performance
o Buffer pool dump and pre-load features, allowing much quicker startup and warm-up phases
o Online DDL: changing the schema does not require downtime
o Better query analyzer and overall query performance
o Better page compression support and performance
o Better monitoring and integration with the performance schema
o A more intelligent flushing algorithm that takes into consideration page change rates, I/O rates, and system load and capabilities, thus providing performance better adjusted to the workload out of the box
o Better suited for fast SSD-based storage (no added cost for random I/O), with adaptive algorithms that do not attempt to accommodate the shortcomings of spinning disks
o Scales better on SMP (multi-core) systems and better utilizes a higher number of CPU threads
o Provides fast checksums (hardware-assisted CRC32), lessening CPU overhead while retaining data consistency and safety
o New configuration options allowing the InnoDB engine to be tailored even better to a specific workload
- Consider using a more efficient memory allocator, e.g. jemalloc or tcmalloc (see the preload sketch after this list):
o The memory allocator provided as part of GLIBC is known to fall short under high concurrency.
o GLIBC malloc wasn't designed for multithreaded workloads and has a number of internal contention points.
o Using modern memory allocators suited for high concurrency can significantly improve throughput by reducing internal locking and contention.
- Perform DB optimization. While optimizing the infrastructure may result in significant improvement, even better results may be achieved by tailoring the DB structure itself:
o Consider clustered indexes to avoid locking and contention
o Consider page compression. Besides a slight CPU penalty, this may significantly improve throughput while reducing on-disk storage several times, resulting in turn in quicker replication and backups
o Monitor the performance schema to find out more about in-flight DB engine performance and adjust the required parameters
o Monitor the performance and information schemas to find more details about index effectiveness and build better, more effective indexes
- Perform SQL optimization. No infrastructure optimization can compensate for badly written SQL requests. Caching and other optimization techniques often mask bad code. SQL queries joining multi-million-record tables may work just fine in development and completely break down on a production DB. Continuously analyze the most expensive SQL queries to avoid full table scans and on-disk temporary tables (one simple way to surface these is sketched below).
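
As an illustration of the allocator suggestion, jemalloc can be preloaded into mysqld without recompiling; the library path below is an assumption and varies by distribution:

# Preload jemalloc for mysqld (the library path is distribution-specific).
# Option 1: via mysqld_safe, which sets the environment for the server.
mysqld_safe --malloc-lib=/usr/lib64/libjemalloc.so.1 &

# Option 2: explicit LD_PRELOAD when starting the server manually.
LD_PRELOAD=/usr/lib64/libjemalloc.so.1 mysqld --user=mysql &

For the SQL analysis point, the slow query log together with the bundled mysqldumpslow utility is one simple way to surface the most expensive statements (paths and thresholds are examples):

# Enable the slow query log at runtime; the values are examples only.
mysql -e "SET GLOBAL slow_query_log = ON;
          SET GLOBAL long_query_time = 0.5;
          SET GLOBAL slow_query_log_file = '/var/log/mysql-slow.log';"

# Show the ten slowest query patterns, sorted by total execution time.
mysqldumpslow -s t -t 10 /var/log/mysql-slow.log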
PART V – PEELING THE ONION

It is a common saying that performance improvement is like peeling an onion: after addressing one issue, the next one, previously masked, is uncovered, and so on… Likewise, in our case, after addressing the storage and DB layers and improving overall application throughput, it became apparent that something else was holding the application back from delivering the best possible performance. By this time the DB layer had been studied very well; however, the overall application stack and the associated connection flows were not yet completely understood.
The Customer demonstrated willingness to cooperate and assisted by providing instructions for reproducing the JMeter load tests as well as on-site resources for an architecture workshop. From this point on, the optimization project sped up tremendously. Not only was it possible to iterate reliably and run load tests against the complete application stack; the understanding of the application architecture, together with access to the Application Performance Management (APM) tool Jennifer, made a huge difference in terms of visibility into internal application operation and major performance metrics.
Figure 16: Jennifer APM Console

Besides providing visual feedback and displaying a number of metrics, Jennifer revealed the next bottleneck – the network.
PART VI – PFSENSE

The original network design, replicating the network structure in AWS, was proposed and agreed upon with the Customer. Separate networks were created to replicate the functionality of AWS VPC, and pfSense appliances were used to provide network segmentation, routing and load balancing.

< … removed … >

Figure 17: Initial Application Deployment – Network Diagram
pfSense is an open source firewall/router software distribution based on FreeBSD. It is installed on a VM and turns that VM into a dedicated firewall/router for a network. It also provides additional important functions such as load balancing, VPN and DHCP. It is easy to manage through the web-based UI, even for users with little knowledge of the underlying FreeBSD system. The FreeBSD network stack is known for its exceptional stability and performance. The pfSense appliances had been used many times before and after, so nobody expected issues coming from that side…
Watching the Jennifer XView chart closely in real time is fun in itself, like watching a fire. It is also a powerful analysis tool that helps in understanding the behavior of application components.

Figure 18: Jennifer XView – Transaction Response Time Scatter Graph
On the graph above, the distance between the layers is exactly 10,000 ms, pointing to the fact that one of the application services was timing out at a 10-second interval and repeating its connection attempts several times.

Figure 19: Jennifer APM – Transaction Introspection
Network socket operations were taking a significant amount of time, resulting in multiple repeated attempts at 10-second intervals. Following the old sysadmin adage, "…always blame the network…", the application flows were analyzed again, and pfSense was suspected of losing or delaying packets. Interestingly enough, the web UI reported low to moderate VM load and didn't show any reason for concern.
Nonetheless, console access revealed the truth: the load created by a large number of short thread spins was not properly reported in the web UI and was hidden by averaging calculations. A closer look using advanced CPU and system metrics confirmed that the appliance was experiencing unexpectedly high CPU load, adding latency and dropping network packets.
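
For reference, this kind of load can be made visible on a FreeBSD-based appliance with the standard console tools (a sketch, not the exact commands used during the investigation):

# Show per-CPU states and individual kernel/user threads, which the
# averaged web UI graphs can hide (-S system processes, -H threads,
# -P per-CPU display).
top -SHP

# Show interrupt rates per device - another place where packet load
# becomes visible.
vmstat -i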
Adding more CPUs to the pfSense appliances doubled the network traffic they passed. However, even with the maximum CPU count the network was still not saturated, suggesting that the pfSense appliances might still be limiting application performance. Since the pfSense appliances were not an essential requirement, being used only to provide routing and load-balancing capability, it was decided to remove them from the application network flow and to access the subnets by adding additional network cards to the VMs, with each NIC connected to the corresponding subnet.

To summarize: it would be wrong to conclude that pfSense does not fit the purpose or is not a viable option for building virtual network deployments. Most definitely, additional research and tuning would help to overcome the observed issues. Due to time constraints, this area was not fully researched and is still pending thorough investigation.
PART VII – JMETER

With pfSense removed and HAProxy used for load balancing, overall application throughput definitely improved. Increasing the number of CPUs on the DB servers and the Cassandra nodes seemed to help as well. The collaborative effort with the Customer yielded great results, and we were definitely on the right track. With the floodgates wide open, we were able to push more than 1,000 concurrent users during our tests.
About the same time we started seeing another anomaly: one out of the three JMeter load agents (generators) was behaving quite strangely. After reaching the end of the test at the 3,600-second mark, the Java threads belonging to two of the JMeter servers shut down quickly, while the shutdown of the third instance took a while, effectively increasing the test window duration and, as a result, negatively impacting the average test metrics. All three JMeter servers were reconfigured to use the same settings; for some reason they had been using slightly different configurations and were logging data to different paths. That did not resolve the underlying issue, though. Due to time constraints, it was decided to build a replacement VM rather than troubleshoot the issues with one of the existing VMs. Eventually, a fourth JMeter server was deployed. Besides fixing the issue with Java thread startup and shutdown, it allowed us to generate higher loads and provided additional flexibility in defining load patterns.
Lesson learned: for low to moderate loads JMeter works just fine. For high loads, JMeter may become a breaking point itself. In this case, it is recommended to use a scale-out approach rather than scale-up, keeping the number of Java threads per server below a certain threshold (see the sketch below).
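
A minimal sketch of the scale-out approach using JMeter's own distributed mode (host names are examples; each host runs the stock jmeter-server agent):

# Run the test plan in non-GUI mode, fanning the load out to three remote
# JMeter servers instead of piling all Java threads onto one generator.
jmeter -n -t loadtest.jmx -R jmeter01,jmeter02,jmeter03 -l results.jtl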
PART VIII – ALMOST THERE

Although the AWS performance measurements were still better, we had already significantly improved performance compared to the figures captured during the first round of performance tests. After removing pfSense, an average of 587 TPS at 800 VU was achieved. In this test the load was spread statically rather than balanced, by manually specifying different target application server IP addresses in the JMeter configuration files. With a HAProxy load-balancer put in place, the TPS figure initially went down to 544, and after some optimizations (connection tracking and netfilter disabled) it increased to 607 TPS at 800 VU – the maximum we had seen to date. This represents a 22% increase over the best previous result (498 TPS at 800 VU, with pfSense still in place) and a 100% increase over the initial performance test. Overall, the results were looking more than promising.

Figure 20: Iterative Optimization Progress Chart
Despite the good progress, the following points still required further investigation:
- Disk I/O skew issues still remained
- Cassandra servers' disk I/O was uneven and quite high

Our enthusiasm rose more and more as we discovered that the VCC platform could serve more users than AWS. The AWS test results showed that past 600 VU performance started to decline, whereas on VCC we were able to push as high as 1,600 VU, with the application supporting the load and showing higher throughput numbers (~760-780 TPS), until…
The next day something happened which became another turning point in this project. The application became unstable, and the application throughput we had seen just a couple of hours earlier decreased significantly. More importantly, it started to fluctuate, with the application freezing at random times. The TPS-scatter landscape in Jennifer was showing a new anomaly…

Figure 21: Jennifer XView – Transaction Response Time Surges

Since the other known bottlenecks had been removed, and the MySQL DB was no longer a weak link in the chain (it was basically bored during the performance test), the Cassandra cluster became the next suspect.
PART IX – CASSANDRA

The Tomcat logs were pointing to Cassandra as well: there were numerous warning messages about excluding one node or another from the connection pool due to connectivity timeouts. After taking a closer look at the Cassandra nodes, several points drew our attention (the ring can be inspected with the command sketched after this list):
- There was no consistency in the Cassandra ring load
- The amounts of data stored on the Cassandra nodes varied significantly
- Memory usage and I/O profiles were different across the board
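
For example, ring balance can be checked with nodetool (host name as an example; the JMX port matches the one used elsewhere in this paper):

# Show token ownership and per-node data load across the ring; uneven
# "Load" and "Owns" columns point to ring imbalance.
./nodetool -h node01 -p 9199 ring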
As a common trend, after a short period of normal operation the average system load on several random Cassandra nodes started growing exponentially, eventually making those nodes unresponsive. During this time the I/O subsystem was over-utilized as well, yielding very high CPU %wait and long queues on the block devices. Everything pointed to the fact that certain Cassandra nodes had initiated compaction (an internal data structure optimization) right in the middle of the load test, spiraling down in a deadly loop. Another quick conversation with the Customer's architect confirmed the same: it was most likely the SSTable compaction causing the issue.

Figure 22: VCC Cassandra Cluster CPU Usage During the Test
As seen on the graph above, during the various test runs one or another Cassandra node maxed out its CPU utilization. The same configuration in AWS had been working just fine, with an imperfect but still fairly even load and no continuous load spikes.

Figure 23: AWS Cassandra Cluster CPU Usage During the Test

Comparing the VCC and AWS Cassandra deployments led to quite contradictory conclusions:
- VCC has more nodes – 12 vs. 8 in AWS – but that should improve performance, right?
- AWS is using spinning disks for the Cassandra VMs while the VCC storage stack is SSD-based, which should improve performance too…

Like with MySQL, it was clear that the optimal, or even "good enough", settings taken from AWS are not good, and at times even bad, on the VCC platform.
For historical reasons the Customer's application utilizes both SQL and NoSQL databases. When mapping the AWS infrastructure to VCC, it was decided to build the Cassandra ring using 12 nodes in VCC instead of the 8 nodes used in AWS, since the latter were a lot more powerful in terms of individual node specifications. As further tests revealed, the better approach would have been just the opposite: to use a larger number of smaller VMs for the Cassandra cluster.

It is also worth mentioning that Cassandra was originally designed to run on a number of low-end systems with slow spinning disks. Over the past couple of years, SSDs have started to appear more and more often in data centers. While not yet a commodity, SSDs have become a heavily used component of modern infrastructures, and the Cassandra codebase was adjusted to make its internal decisions and algorithms more suitable for use with SSDs, not only spinning disks. Therefore, deploying the latest stable Cassandra version could have provided additional benefits right away. Unfortunately, the specification required a specific version, and therefore all optimizations had to be performed against the older version.

Let's have a quick look at Cassandra's architecture and some key definitions.

Figure 24: High-Level Cassandra Architecture
Cassandra is a distributed key-value store initially developed at Facebook. It was designed to handle large amounts of data spread across many commodity servers. Cassandra provides high availability through a symmetric architecture that contains no single point of failure and replicates data across nodes. Cassandra's architecture is a combination of Google's BigTable and Amazon's Dynamo. As in Dynamo's architecture, all Cassandra nodes form a ring that partitions the key space using consistent hashing (see the figure above); this is known as a distributed hash table (DHT). The data model and single-node architecture are mainly based on BigTable and its terminology.

Cassandra can be classified as an extensible row store, since it can store a variable number of attributes per row. Each row is accessible through a globally unique key. Although columns can differ per row, columns are grouped into more static column families, which are treated like tables in a relational database. Each column family is stored in separate files. In order to allow this level of flexibility, with a different schema per row, Cassandra stores metadata with each value; the metadata contains the column name as well as a timestamp for versioning.

Like BigTable, Cassandra has an in-memory storage structure called a Memtable, with one instance per column family. The Memtable acts as a write cache that allows for fast sequential writes to disk. Data on disk is stored in immutable Sorted String Tables (SSTables). SSTables consist of three structures: a key index, a bloom filter and a data file. The key index points to the rows in the SSTable, while the bloom filter enables checking for the existence of keys in the table; due to its limited size, the bloom filter is also cached in memory. The data file is ordered for faster scanning and merging.

For consistency and fault tolerance, all updates are first written to a sequential log (the Commit Log), after which they can be confirmed. In addition to the Memtable, Cassandra provides an optional row cache and key cache. The row cache stores a consolidated, up-to-date version of a row, while the key cache acts as an index into the SSTables. If these are used, write operations have to keep them updated.
It is worth mentioning that only previously accessed rows are cached in Cassandra, in both caches. As a result, new rows are only written to the Memtable, not to the caches.

In order to deliver the lowest possible latency and the best performance on low-end hardware, data writes in Cassandra use a multi-step process: requests are first written to the commit log, then to a Memtable structure, and eventually, when flushed, they are appended to disk as immutable SSTables. Over time, as the number of SSTables grows, the data becomes fragmented, which impacts the performance of read operations. To put it simply, flushing and compaction operations are vitally important for Cassandra. However, if set up incorrectly or executed at the "wrong" time, they can decrease performance significantly, at times making an entire Cassandra node unresponsive. This is exactly what was happening during the test, when several nodes stopped responding, showed very high system load and performed huge amounts of I/O.
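
Compaction activity of this kind is easy to confirm from the command line (host name as an example; the JMX port matches the one used elsewhere in this paper):

# List the compactions currently in progress with total/completed bytes;
# a long-running entry in the middle of a load test confirms the suspicion.
./nodetool -h node01 -p 9199 compactionstats

# Per-column-family statistics, including SSTable counts; a steadily
# growing count means flushing is outpacing compaction.
./nodetool -h node01 -p 9199 cfstats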
Obviously, Cassandra's configuration had been tuned for the spinning disks on AWS, resulting in unexpected behavior on the SSD-based VCC storage stack. As a first measure, to gain better visibility into Cassandra's operation, the DataStax OpsCenter application was deployed. It allowed iterating over various parameters and executing a number of tests against the Cassandra cluster while measuring their impact, and it helped in observing overall cluster behavior. Applying all the lessons learned earlier, and working with the VCC storage team, the following configuration changes were applied:

< … removed … >

Table 5: Optimized Cassandra – Recommended Settings
Similar to the MySQL optimization, the basic idea is to use smaller but more frequent I/O, saturating the block device queues less and, as a result, utilizing the storage stack resources more optimally. Besides the recommended option changes, the commit log was moved to a separate volume. These changes led to predictable and consistent Cassandra performance, evenly and continuously forcing in-memory data to disk, avoiding I/O spikes and minimizing stalls due to compaction.

Below is a summary of the volumes created for the Cassandra nodes (a sketch of assembling the data stripe follows the list):
- xvda, 600 IOPS – boot and root
- xvdb, 600 IOPS – lvm2 root extension
- xvdc, 4600 IOPS – data mdadm stripe disk 1 – no partitioning
- xvde, 4600 IOPS – data mdadm stripe disk 2 – no partitioning
- xvdf, 4600 IOPS – data mdadm stripe disk 3 – no partitioning
- xvdg, 5000 IOPS – commit log disk – no partitioning
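
As an illustration, the data stripe and the dedicated commit-log volume can be assembled along the following lines (a sketch; the mount points are Cassandra defaults, and stripe and file-system parameters must be matched to the storage stack as discussed above):

# Build a striped (RAID-0) md device from the three data disks.
mdadm --create /dev/md0 --level=0 --raid-devices=3 \
      /dev/xvdc /dev/xvde /dev/xvdf

# File systems for the data stripe and the commit-log disk; noatime
# avoids a metadata write on every read.
mkfs.ext4 /dev/md0
mkfs.ext4 /dev/xvdg
mount -o noatime /dev/md0  /var/lib/cassandra/data
mount -o noatime /dev/xvdg /var/lib/cassandra/commitlog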
There are two more parameters worth mentioning, which control the streaming and compaction throughput limits within the Cassandra cluster. Both values were set to 50 MB/s, which is sufficient for normal cluster operation and in line with the storage sub-system throughput configured on the Cassandra nodes. However, sometimes those thresholds may need to be changed: in case of cluster rebalancing, maintenance and similar operations, the following handy shortcuts may be used to control the thresholds cluster-wide.
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 150 ; done

# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 150 ; done
Obviously, after the maintenance has completed, those thresholds should be set back to values appropriate for normal production use.
PART X – HAPROXY

With the DB layer fixed, application performance became stable across tests, although two points still raised some concerns:
- After an initial spike at the beginning of a load test, the number of concurrent connections abruptly dropped by almost half
- The number of Virtual User requests reaching each application server differed significantly, sometimes reaching a 1:2 ratio

Figure 25: Jennifer APM – Concurrent Connections and Per-server Arrival Rate
It was time to take a closer look at the software load-balancers based on HAProxy. This application is known to be able to serve 100K+ concurrent connections, so just one thousand concurrent connections should not come anywhere close to its limit. Additional research showed that the round-robin load-balancing scheme was not performing as expected and was concentrating requests on one system or another in an unpredictable manner. The most even request distribution was achieved by using the least-connection algorithm (the corresponding configuration change is sketched below). After implementing this change, the load eventually spread evenly across all systems.
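
In HAProxy terms this is a one-line change in the backend definition; the fragment below is a sketch with hypothetical backend and server names:

# Append a least-connection backend definition to the HAProxy config
# (names and addresses are examples only).
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
backend app_servers
    balance leastconn    # route each request to the server with the
                         # fewest active connections
    server tomcat1 10.0.1.11:8080 check
    server tomcat2 10.0.1.12:8080 check
EOF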
Figure 26: Jennifer APM – Connection Statistics After Optimization

Furthermore, a number of SYN flood kernel warnings in the log files, as well as nf_conntrack complaints (the Linux connection tracking facility used by iptables) about overrun buffers and dropped connections, pointed to the next optimization steps. Initially, it was decided to increase the size of the connection tracking tables and internal structures and to disable the SYN flood protection mechanisms.

< … removed … >
This did show some improvement; however, it was eventually decided to turn iptables off completely, removing any possible obstacles and latency introduced by this facility.

During the subsequent tests, when the generated load was increased further, HAProxy hit another issue, often referred to as "TCP socket exhaustion". A quick reminder: there were two layers of HAProxies deployed. The first layer was load-balancing the incoming HTTP requests originating from the application clients between the Java application server (Tomcat) instances, and the second layer was passing requests from the Java application servers to the primary and stand-by MySQL DB servers.
HAProxy works as a reverse proxy and therefore uses its own IP address to establish connections to the server. Most operating systems implementing a TCP stack typically have around 64K (or fewer) TCP source ports available for connections to a remote IP:port. Once a combination of "source IP:port => destination IP:port" is in use, it cannot be reused; as a consequence, there cannot be more than 64K open connections from a HAProxy box to a single remote IP:port pair.

On the front layer, the HTTP request rate was a few hundred per second, so we never came anywhere near the limit of 64K simultaneous open connections to the remote service. On the backend layer, there should not have been more than a couple of hundred persistent connections during peak time, since connection pooling was used on the application server. So this was not the problem either.
It turned out that there was an issue with the MySQL client implementation. When a client sends its "QUIT" sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump revealed this behavior.
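
A capture of the following shape (interface name assumed) is enough to see that the client sends its FIN first:

# Watch connection teardown between the application server and MySQL:
# only FIN/RST-flagged segments are shown, so the side that closes
# first is immediately visible.
tcpdump -nn -i eth0 'port 3306 and tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'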
Note that this issue cannot be reproduced on a loopback interface or on the same system, because the server answers fast enough. But over a LAN connection between two different machines, the latency rises past the threshold where the issue becomes apparent. Basically, here is the sequence performed by a MySQL client:

MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client ==> FIN ==> MySQL Server
MySQL Client <== FIN ACK <== MySQL Server
MySQL Client ==> ACK ==> MySQL Server
Because the client closes first, the client-side connection remains stuck in the TIME_WAIT state, unavailable for twice the MSL (Maximum Segment Lifetime) time, which defaults to 2 minutes.
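
The effect is easy to observe on the HAProxy machine while a test is running (a sketch):

# Count the client-side sockets stuck in TIME_WAIT; at a high connection
# rate this number climbs toward the ~64K source-port limit per target.
ss -tan state time-wait | wc -l

# Equivalent check with netstat:
netstat -tan | grep -c TIME_WAIT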
Note that this type of close has no negative impact when the MySQL connection is established using a UNIX socket.