The Cloud Story, or Less is More…

by Slava Vladyshevsky
slava[at]verizon.com

Dedicated to Lee, Sarah, David, Andy and Jeff, as well as many others, who went above and beyond to make this possible.

“Cache is evil. Full stop.”
— Jeff

Table of Contents

PART I – BUILDING TESTBED
PART II – FIRST TEST
PART III – STORAGE STACK PERFORMANCE
PART IV – DATABASE OPTIMIZATION
PART V – PEELING THE ONION
PART VI – PFSENSE
PART VII – JMETER
PART VIII – ALMOST THERE
PART IX – CASSANDRA
PART X – HAPROXY
PART XI – TOMCAT
PART XII – JAVA
PART XIII – OS OPTIMIZATION
PART XIV – NETWORK STACK

Figure Register

AWS Application Deployment
Initial VCC Application Deployment
First Test Results – Comparison Chart
First Test – High CPU Load on DB Server
First Test – High CPU %iowait on DB Server
First Test – Disk I/O Skew on DB Server
Optimized Storage Subsystem Throughput
AWS i2.8xlarge CPU load – Sysbench Test Completed in 64.42 sec
VCC 4C-28G CPU load – Sysbench Test Completed in 283.51 sec
InnoDB Engine Internals
Optimized MySQL DB – QPS Graph
Optimized MySQL DB – TPS and RT Graph
Optimized MySQL DB – RAID Stripe I/O Metrics
Optimized MySQL DB – CPU Metrics
Optimized MySQL DB – Network Metrics
Jennifer APM Console
Initial Application Deployment – Network Diagram
Jennifer XView – Transaction Response Time Scatter Graph
Jennifer APM – Transaction Introspection
Iterative Optimization Progress Chart
Jennifer XView – Transaction Response Time Surges
VCC Cassandra Cluster CPU Usage During the Test
AWS Cassandra Cluster CPU Usage During the Test
High-Level Cassandra Architecture
Jennifer APM – Concurrent Connections and Per-server Arrival Rate
Jennifer APM – Connection Statistics After Optimization
Jennifer APM – DB Connection Pool Usage
JVM Garbage Collection Analysis
JVM Garbage Collection Analysis – Optimized Run
XEN PV Driver and Network Device Architecture
Recommended Network Optimizations
Last Performance Test Results

Table Register

Major Infrastructure Limits
AWS Infrastructure Mapping and Sizing
VCC Infrastructure Mapping and Sizing
Optimized MySQL DB – Recommended Settings
Optimized Cassandra – Recommended Settings
Network Parameter Comparison

PREFACE

One of the market-leading enterprises, hereinafter called the Customer, has multiple business units working in various areas, ranging from consumer electronics to mobile communications and cloud services. One of their strategic initiatives is to expand software capabilities to get on top of the competition.

The Customer started to use the AWS platform for development purposes and as the main hosting platform for their cloud services. Over the past years the usage of AWS grew significantly, with over 30 production applications currently hosted on AWS infrastructure.

While the Customer's reliance on AWS increased, the number of pain points grew as well. They experienced multiple outages and had to bear unnecessarily high costs to scale application performance and to accommodate unbalanced CPU/memory hardware profiles. Although the achieved application performance was generally satisfactory, several major challenges and trends emerged over time:
- Scalability and growth issues
- Very high overall infrastructure and support costs
- Single service provider lock-in

Verizon proposed to trial the Verizon Cloud Compute (VCC) beta product as an alternative hosting platform, with the goal of demonstrating that on-par application performance can be achieved at a much lower cost, effectively addressing one of the biggest challenges. An alternative hosting platform would give the Customer freedom of choice, thus addressing another issue. Last, but not least, the unique VCC platform architecture and infrastructure stack, built for low-latency and high-performance workloads, would help to address yet another pain point – application performance and scalability.

Senior executives from both companies supported this initiative and one of the Customer's applications was selected for the proof-of-concept project. The objective was to compare the AWS and VCC deployments side by side from both capability and performance perspectives, execute performance tests and deliver a report to senior management.

The proof-of-concept project was successfully executed in close collaboration between various Verizon teams as well as the Customer's SMEs. It was demonstrated that the application hosted on the VCC platform, given appropriate tuning, is capable of delivering better performance than when hosted on a more powerful AWS-based footprint.

PART I – BUILDING TESTBED

The agreed high-level plan was clear and straightforward:
• (Verizon) Mirror the AWS hosting infrastructure using the VCC platform
• (Verizon) Set up infrastructure, OS and applications per the specification sheet
• (Customer) Adjust necessary configurations and settings on the VCC platform
• (Customer) Upload test data – 10 million users, 100 million contacts
• (Customer) Execute smoke, performance and aging tests in the AWS environment
• (Customer) Execute smoke, performance and aging tests in the VCC environment
• (Customer) Compare AWS and VCC results and captured metrics
• (Customer) Deliver report to senior management

The high-level diagram below depicts the application infrastructure hosted on the AWS platform.

Figure 1: AWS Application Deployment

Although both the AWS and VCC platforms use the XEN hypervisor at their core, the initial step – mirroring the AWS hosting environment by provisioning equally sized VMs in VCC – raised the first challenge. The Verizon Cloud Compute platform, in its early beta stage, imposed a number of limitations. To be fair, those limitations were neither by design nor hardware limits, but rather software or configuration settings pertinent to the corresponding product release.

The table below summarizes the most important infrastructure limits for both cloud platforms as of February 2014:

Resource Limit             VCC         AWS
VPUs per VM                8           32
RAM per VM                 28 GB       244 GB
Volumes per VM             5           20+
IOPS per Volume (SSD)      3000        4000
Max Volume Size            1 TB        1 TB
Guaranteed IOPS per VM     15K         40K
Throughput per vNIC        500 Mbps    10 Gbps

Table 1: Major Infrastructure Limits

Besides the obvious points, like the number of CPUs or the huge difference in network throughput, it is also worth mentioning that the CPU/RAM ratio – processor count to memory size – is quite different as well: 1:4.5 for VCC and 1:7.625 for AWS respectively. This ratio is crucial for certain types of applications, specifically for databases.

Despite the aforementioned differences, it was jointly decided with the Customer to move forward with the smaller VCC VMs and to take the sizing ratio into account when comparing performance and test results. This already set the expectation that VCC results might be lower compared to AWS, assuming linear application scalability and a 4-8x difference in hardware footprint.

The tables below summarize the infrastructure sizing and mapping for the corresponding service layers hosted on both cloud platforms. Resources sized differently on the corresponding platforms are highlighted.

AWS VM Profile
VM Role      Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat       2       4      34.2      -      1000
MySQL        1       32     244       10K    10000
Cassandra    8       8      68.4      5K     1000
HA Proxy     4       2      7.5       -      1000
DB Cache     2       4      34.2      -      1000

Table 2: AWS Infrastructure Mapping and Sizing

VCC VM Profile
VM Role      Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat       2       4      28        -      500
MySQL        1       4      28        9K     500
Cassandra    12      4      28        5K     500
HA Proxy     4       2      4         -      500
DB Cache     2       4      28        -      500

Table 3: VCC Infrastructure Mapping and Sizing

The initial setup of the disk volumes required special creativity in order to get as close as possible to the required number of IOPS. In addition to the per-disk storage limits mentioned above, there was initially another VCC limitation in place, luckily addressed later: all disks connected to a particular VM had to be provisioned with the exact same IOPS rate.

The most common setup used was based on LVM2, with a linear extension for the boot disk volume group and either two or three additional disks aggregated into an LVM stripe set. This setup allowed creating disk volumes of up to 3 TB and 9000 IOPS, getting close enough to the required 10K IOPS for the database VMs.
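For illustration only, a minimal sketch of such a striped data volume (device names, the three-disk stripe count and the 256 KB stripe size are assumptions for the example, not the exact build commands used):

# Aggregate three 1 TB / 3000 IOPS disks into one striped LVM volume (~3 TB, ~9000 IOPS)
pvcreate /dev/xvdb /dev/xvdc /dev/xvdd
vgcreate vg_data /dev/xvdb /dev/xvdc /dev/xvdd
# -i 3: stripe across all three physical volumes; -I 256: 256 KB stripe size (example value)
lvcreate -i 3 -I 256 -l 100%FREE -n lv_data vg_data
mkfs.ext4 /dev/vg_data/lv_data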
  
	
  
Besides the technical limitations, the sheer volume of provisioning and configuration work presented a challenge in itself. The hosting platform requirements were captured in a spreadsheet listing system parameters for every VM. Following this spreadsheet manually and building out the environment sequentially would have required significant time and tremendous manual effort. Additionally, it could have resulted in a number of human errors and omissions. Automating and scripting major parts of the installation and setup process addressed this.
  
	
  
The automation suite, implemented on top of the vzDeploymentFramework shell library (a Verizon internal development), made it possible in a matter of minutes to:
- Parse the specification spreadsheet for inputs and updates
- Generate updated OS and application configurations
- Create LVM volumes or software RAID arrays
- Roll out updated settings to multiple systems based on their functional role
- Change Linux iptables-based firewall configurations across the board
- Validate required connectivity between hosts
- Install required software packages
A simplified sketch of this kind of role-based rollout follows the list.
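The framework itself is Verizon-internal, so the following is purely an illustration of the idea, not its actual code; the CSV layout, template paths and host naming are invented for the example:

#!/bin/sh
# Push per-role sysctl and iptables templates to every host of a given role listed in spec.csv
# Assumed spec.csv layout: hostname,role,ip
ROLE="$1"
grep ",${ROLE}," spec.csv | cut -d, -f1 | while read HOST; do
    scp "templates/${ROLE}/sysctl.conf" "root@${HOST}:/etc/sysctl.d/90-${ROLE}.conf"
    scp "templates/${ROLE}/iptables"    "root@${HOST}:/etc/sysconfig/iptables"
    ssh "root@${HOST}" "sysctl -p /etc/sysctl.d/90-${ROLE}.conf && service iptables restart"
done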
  
	
  
Having all configurations in a version-controlled repository allowed auditing and comparing configurations between the master and the on-host deployed versions, providing rudimentary configuration management capabilities.
  
	
  
Below is the high-level architecture of the originally implemented test environment.

Figure 2: Initial VCC Application Deployment

The test load was initiated by a JMeter Master (Test Controller and Management GUI) and generated by several JMeter Slaves (Load Generators or Test Agents). The generated virtual user (VU) requests were load-balanced between two Tomcat application servers, each running a single application instance.
  
	
  
Since F5 LTM instances were not available at build time, the proposed design utilized pfSense appliances as routers, load-balancers or firewalls for the corresponding VLANs.

The Tomcat servers communicated, via another pair of HAProxy load-balancers, with two persistent storage back-ends – MySQL (SQL DB) and Cassandra (NoSQL DB) – employing Couchbase (DB Cache) as a caching layer.
  
	
  
	
  
Most systems were additionally instrumented with NMON collectors for gathering key performance metrics. The Jennifer APM application was deployed to perform real-time transaction monitoring and code introspection.

Following the initial plan, the hosting environment was handed over to the Customer on time for adjusting configurations and uploading test data.
  
	
  
PART II – FIRST TEST

The first test was conducted on both the AWS and VCC platforms and the Customer shared the test results. During the test the load was ramped up in 100 VU increments for each subsequent 10-minute test run. During each run the corresponding number of virtual users performed various API calls emulating human behavior, using patterns observed and measured on the production application.
  
	
  
The chart below depicts the number of application transactions successfully processed by each platform during the 10-minute test runs.

Figure 3: First Test Results – Comparison Chart

It was obvious that the AWS infrastructure is more powerful, processing more than two times higher throughput, which did not come as a big surprise. However, the Customer expressed several concerns about overall VCC platform stability, low MySQL DB server performance and uneven load distribution between the striped data volumes, dubbed I/O skews.
Data from Figure 3 – TPS per VU count:

VU count      200   300   400   500   600   700   800   900
AWS TPS       321   462   539   627   637   645   651   654
Verizon TPS   203   256   269   257   275   249   268   247
 
Indeed, the application “Transactions per Second” (TPS) measurements did not correlate well with the generated application load, and even with a growing number of users something prevented the application from taking off. After short increases the overall throughput consistently dropped again, clearly pointing to a bottleneck limiting the transaction stream.
  
	
  
According to the Jennifer APM monitors, the increase in application transaction times was caused by slow DB responses, taking 5 seconds and more per single DB operation. At the same time the DB server was showing very high CPU %iowait, fluctuating around 85-90%.

Figure 4: First Test – High CPU Load on DB Server

Figure 5: First Test – High CPU %iowait on DB Server
  
Furthermore, out of the three stripes making up the data volume, one volume constantly reported significantly higher device wait times and utilization percentage, effectively causing disk I/O skews.

Figure 6: First Test – Disk I/O Skew on DB Server
  
Obviously, these test results were not acceptable. Investigating and identifying the bottlenecks and performance-limiting factors required good knowledge of the application architecture and its internals, as well as deep VCC product and storage stack knowledge, since the latter two issues seemed to be platform and infrastructure related. To address this, a dedicated cross-team taskforce was established.
  
	
  
PART III – STORAGE STACK PERFORMANCE

The VCC storage stack was validated once more and it was reconfirmed that there are no limiting factors or shortcomings on the layers below the block device. The resulting conclusion was that the limitations had to be at the hypervisor, OS, or application layer.
  
	
  
On the other hand, the Customer confirmed that the AWS deployment uses exactly the same configuration and application versions as VCC. The only possible logical conclusion was that a setup and configuration optimal for AWS does not perform the same way on VCC. In other words, the VCC platform required its own optimal configuration.
  
	
  
Further efforts were aligned with the following objectives:
- Improve storage throughput and address the I/O skews
- Identify the root cause of the low DB server performance
- Improve DB server performance and scalability
- Work with the Customer on improving overall VCC deployment performance
- Re-run the performance tests and demonstrate improved throughput and predictable performance levels
  
	
  
Originally the storage volumes were set up using the Customer's specifications and OS defaults for the other parameters. After performing research and a number of component performance tests, several interesting discoveries were made, in particular:
- Different Linux distributions (Ubuntu and CentOS) use a different approach to disk partitioning: Ubuntu aligned partitions for 4K block sizes, while CentOS did not
- The default block device scheduler, CFQ, is not a good choice in environments using virtualized storage
- The MDADM and LVM volume managers use quite different algorithms for I/O batching and compaction
- The XFS and EXT4 file-systems yield very different results depending on the number of concurrent threads performing I/O
- Due to all the Linux optimizations and multiple caching levels, it is hard enough to measure net storage throughput from within a VM, let alone through the entire application stack (see the fio sketch after this list)
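As an illustration of that last point, one way to take the page cache out of the picture is to benchmark the raw device with direct I/O. A hedged fio sketch follows; the device name, run time and job parameters are example values only, and fio will overwrite the target, so it must never be pointed at a device holding data:

# Random 4K writes straight to the block device, bypassing the file-system and page cache
fio --name=rand-write --filename=/dev/xvdb --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based \
    --group_reporting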
  
	
  
After a number of trials and studying the platform behavior, the following was suggested for achieving optimal I/O performance on the VCC storage stack (a consolidated example is sketched after the list):
- Use raw block devices instead of partitions for RAID stripes, to circumvent any partition block alignment issues
- Use MDADM software RAID instead of LVM (the latter is more flexible and may be used in combination with MDADM; however, it performs a certain amount of “optimization” assuming spindle-based storage, which may interfere with performance in VCC)
- Use proper stripe settings and block sizes for software RAID (don't let the system guess – specify!)
- Use the EXT4 file-system instead of XFS. EXT4 provides journaling for both meta-data and data, instead of meta-data only, with negligible performance overhead for the load observed
- Use optimal (and safe) settings for EXT4 file-system creation and mounts
- Ensure the NOOP block device scheduler is used (which lets the underlying storage stack, from the hypervisor down, optimize block I/O more effectively)
- Separate different I/O profiles, e.g. sequential I/O (redo/bin-log files) and random I/O (data files) for the DB server, by writing the corresponding data to separate logical disks
- Use DIRECT_IO wherever possible and avoid OS/file-system caching (in certain situations the cache may give a false impression of high performance, which is then abruptly interrupted by the flushing of massive caches, during which the entire VM gets blocked)
- Avoid I/O bursts due to cache flushing and keep the device queue length close to 8. This corresponds to a hardware limitation on the chassis NPU. In VCC, storage is very low-latency and quick, but if the storage queue locks up, the entire VM gets blocked. Writing early and often at a consistent rate performs dramatically better under load than caching in RAM as long as possible and then flooding the I/O queue when the cache has been exhausted
- Make sure the network device driver is not competing with the block device drivers and the application for CPU time, by relocating the associated interrupts to different vCPU cores inside the VM
- Use 4K blocks for I/O operations wherever possible, for more optimal storage stack operation
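A consolidated sketch of the points above, assuming three 5K-IOPS data disks exposed as /dev/xvdb-/dev/xvdd (device names, chunk size and mount point are example values, not the exact production commands):

# Software RAID0 over raw block devices with an explicit chunk size - don't let the system guess
mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=256 /dev/xvdb /dev/xvdc /dev/xvdd

# EXT4 aligned to the stripe geometry: stride = chunk / 4K block, stripe-width = stride * 3 disks
mkfs.ext4 -b 4096 -E stride=64,stripe-width=192 /dev/md0
mount -o noatime,data=ordered /dev/md0 /data

# NOOP scheduler on the member devices lets the hypervisor storage stack do the reordering
for d in xvdb xvdc xvdd; do echo noop > /sys/block/$d/queue/scheduler; done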
  
	
  
After implementing these suggestions on a DB server, the storage subsystem yielded predictable and consistent performance. For example, data volumes set up with 10K IOPS have been reporting ~39 MB/s throughput, which is the expected maximum assuming a 4K I/O block size:

(4 KB × 10,000 IOPS) / 1024 = 39.06 MB/s – the maximum possible throughput at 10K IOPS
(4 KB × 15,000 IOPS) / 1024 = 58.59 MB/s – the maximum possible throughput at 15K IOPS

With a 15K IOPS setup using 3 stripes (5K IOPS each), ~55-56 MB/s throughput was achieved, as shown on the screenshot below:

Figure 7: Optimized Storage Subsystem Throughput

Although some minor deviation in the I/O figures (+/- 5%) was still observed, this is typically considered acceptable and within the normal range.
  
	
  
While performing additional tests on the optimized systems, it was observed that all block device interrupts were being served by CPU0, which was becoming a hot spot even with the netdev interrupts moved off to different CPUs. The following method may be used to spread block device interrupts evenly across the devices implementing RAID stripes:
  
# distribute block device interrupts between CPU4-CPU7
# smp_affinity takes a hexadecimal CPU bitmask: 0x10=CPU4, 0x20=CPU5, 0x40=CPU6, 0x80=CPU7
cat /proc/interrupts
cat /proc/irq/183[3-6]/smp_affinity*
echo 80 > /proc/irq/1836/smp_affinity
echo 40 > /proc/irq/1835/smp_affinity
echo 20 > /proc/irq/1834/smp_affinity
echo 10 > /proc/irq/1833/smp_affinity
echo 8 > /proc/irq/1838/smp_affinity    # mask 0x08 = CPU3
  	
  
Please note that IRQ numbers and assignments may differ on your system. You have to consult the /proc/interrupts table for the specific assignments pertinent to your system.
  
	
  
For additional details and theory, please refer to the following online materials:
http://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
https://www.kernel.org/doc/ols/2009/ols2009-pages-235-238.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt
  
	
  
PART IV – DATABASE OPTIMIZATION

Since the Customer had not yet shared the application and testing know-how, the only way to reproduce the abnormal DB behavior seen during the test was to replay the DB transaction log against a DB snapshot recovered from backup. This was a slow, cumbersome and not fully repeatable process. The Percona tools were really instrumental for this task, allowing a multithreaded transaction replay that inserts delays between transactions as recorded. A plain SQL script import would have been processed by a single thread only, and all requests would be processed as one stream.
  
	
  
Although the transaction replay did create some DB server load, the load type and its I/O patterns were quite different from the I/O patterns observed during the test. Transaction logs include only DML statements (insert, update, delete), but no data read (select) requests. Knowing that those “select” requests represented 75% of all requests, it quickly became apparent that such a testing approach is flawed and would not be able to recreate real-life conditions.
  
	
  
We came to a point where more advanced tools and techniques were required for iterating over various DB parameters in a repeatable fashion while measuring their impact on DB performance and the underlying subsystems. Moreover, it was not clear whether the unexpected DB behavior and performance issues were caused by the virtualization infrastructure, the DB engine settings, or the way the DB was used, i.e. the combination of application logic and the data stored in the DB tables.
  
	
  
To separate those concerns it was proposed to perform load tests using synthetic OLTP transactions generated by sysbench, a well-known load-testing toolkit. Such tests were executed on both the VCC and AWS platforms (an example invocation is sketched below). The results spoke for themselves.
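For reference, a hedged example of such a synthetic OLTP round (sysbench 0.4.x syntax; host, table size, thread count and credentials are placeholders rather than the values actually used):

# Prepare the test table, then run a mixed read/write OLTP workload against it
sysbench --test=oltp --mysql-host=10.0.0.10 --mysql-user=sbtest --mysql-password=sbtest \
         --oltp-table-size=10000000 prepare
sysbench --test=oltp --mysql-host=10.0.0.10 --mysql-user=sbtest --mysql-password=sbtest \
         --oltp-table-size=10000000 --oltp-test-mode=complex \
         --num-threads=64 --max-requests=100000 run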
  
	
  
 
Figure 8: AWS i2.8xlarge CPU load – Sysbench Test Completed in 64.42 sec

Figure 9: VCC 4C-28G CPU load – Sysbench Test Completed in 283.51 sec
  
At this point it was clear that the DB server's performance issues have nothing to do with the application logic and are not specific to the SQL workload, but are rather related to configuration and infrastructure. The OLTP test provided the capability to stress test the DB engine and optimize it independently, without having to rely on the Customer's application know-how and the solution-wide test harness.
  
	
  
A thorough research and study of the InnoDB engine began… Studying the source code, as well as consulting the following online resources, was key to a clear understanding of the DB engine internals and its behavior:

- http://www.mysqlperformanceblog.com
- http://www.percona.com
- http://dimitrik.free.fr/blog/
- https://blog.mariadb.org
  
	
  
	
  
The drawing below, published by Percona engineers, shows the key factors and settings impacting DB engine throughput and performance.

Figure 10: InnoDB Engine Internals
  
Obviously, there is no quick win and no single dial to turn in order to achieve the optimal result. It is easy to explain the main factors impacting InnoDB engine performance, though optimizing those factors in practice is quite a challenging task.
  
	
  
 
InnoDB Performance – Theory and Practice
  
	
  
The two most important parameters for InnoDB performance are innodb_buffer_pool_size and innodb_log_file_size. InnoDB works with data in memory, and all changes to data are performed in memory. In order to survive a crash or system failure, InnoDB logs changes into the InnoDB transaction logs. The size of the InnoDB transaction log defines how many changed blocks are tolerated in memory at any given point in time. The obvious question is: “why can't we simply use a gigantic InnoDB transaction log?” The answer is that the size of the transaction log affects recovery time after a crash. The rule of thumb (until recently) was: the bigger the log, the longer the recovery time. Okay, so we have another variable, innodb_log_file_size. Let's imagine it as some distance on an imaginary axis:
  
	
  
Our current state is the checkpoint age, which is the age of the oldest modified non-flushed page. The checkpoint age is located somewhere between 0 and innodb_log_file_size. Point 0 means there are no modified pages. The checkpoint age can't grow past innodb_log_file_size, as that would mean we would not be able to recover after a crash.
  
	
  
In fact, InnoDB has two safety nets, or protection points: “async” and “sync”. When the checkpoint age reaches the “async” point, InnoDB tries to flush as many pages as possible while still allowing other queries; however, throughput drops through the floor. The “sync” stage is even worse. When we reach the “sync” point, InnoDB blocks other queries while trying to flush pages and return the checkpoint age to a point before “async”. This is done to prevent the checkpoint age from exceeding innodb_log_file_size. These are both abnormal operational stages for InnoDB and should be avoided at all cost. In current versions of InnoDB, the “sync” point is at about 7/8 of innodb_log_file_size, and the “async” point is at about 6/8 = 3/4 of innodb_log_file_size.
  
	
  
So, there is one critically important balancing act: on the one hand we want the “checkpoint age” to be as large as possible, as it defines performance and throughput; but, on the other hand, we should never reach the “async” point.
  
 
The idea is to define another point T (target), located before “async” in order to have a gap for flexibility, and to try at all cost to keep the checkpoint age from going past T. We assume that if we can keep the “checkpoint age” in the range 0 – T, we will achieve stable throughput even for a more or less unpredictable workload.
  
	
  
Now, which factors affect the checkpoint age? When we execute DML queries that change data (insert/update/delete), we perform writes to the log, we change pages, and the checkpoint age grows. When we flush changed pages, the checkpoint age goes down again. That means the main way to keep the checkpoint age around point “T” is to change the number of pages flushed per second, or to make this number variable and suited to the specific workload. That way, we can keep the checkpoint age down. If this doesn't help and the checkpoint age keeps growing beyond “T” towards “async”, we have a second control mechanism: we can add a delay into insert/update/delete operations. This way we prevent the checkpoint age from growing and reaching “async”.
  
	
  
To summarize, the idea behind the optimization algorithm is: under load we must keep the checkpoint age around point “T” by increasing or decreasing the number of pages flushed per second. If the checkpoint age continues to grow, we need to throttle throughput to prevent further growth. The throttling depends on the position of the checkpoint age – as the checkpoint age gets closer to “async”, higher levels of throttling are needed. (A hedged sketch for observing the checkpoint age on a live server follows.)
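As a hedged illustration (not part of the original test harness), the checkpoint age can be observed on a 5.5/5.6-era server by comparing the log sequence number with the last checkpoint reported by SHOW ENGINE INNODB STATUS; credentials are assumed to live in ~/.my.cnf:

#!/bin/sh
# Total redo-log capacity in bytes - the ceiling the checkpoint age must never reach
CAP=$(mysql -N -e "SELECT @@innodb_log_file_size * @@innodb_log_files_in_group;")
while sleep 10; do
    S=$(mysql -e "SHOW ENGINE INNODB STATUS\G")
    LSN=$(echo "$S" | awk '/Log sequence number/ {print $NF}')
    CKPT=$(echo "$S" | awk '/Last checkpoint at/ {print $NF}')
    AGE=$((LSN - CKPT))
    # "async" flushing starts around 6/8 and "sync" around 7/8 of the log capacity
    echo "$(date '+%T') checkpoint_age=${AGE}B ($((AGE * 100 / CAP))% of ${CAP}B)"
done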
  
	
  
From Theory to Practice – Test Framework

There is a saying: in theory, there is no difference between theory and practice, but in practice there is…
  
	
  
In practice, there are a lot more variables to bear in mind. Factors such as I/O limits, thread contention and locking also come into play, and improving performance becomes more like solving an equation with a number of interdependent variables…
  
	
  
Obviously, to be able to iterate over various parameter and setting combinations, the DB tests need to be executed in a repeatable and well-defined (read: automated) manner, while capturing test results for correlation and further analysis. Quick research showed that although there are many load-testing frameworks available, with some specifically tailored for testing MySQL DB performance, unfortunately none of them would cover all the requirements and provide the needed tools and automation.
  
	
  
Eventually, we developed our own fully automated and flexible load-testing framework. This framework was mainly used to stress test and analyze MySQL and InnoDB behavior; nonetheless, it is open enough to plug in any other tools or to be used for testing different applications. The developed toolkit includes the following components (a simplified sketch of the runner loop follows the list):
- Test Runner
- Remote Test Agent (load generator)
- Data Collector (sampler)
- Data Processor
- Graphing facility
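To give a flavour of the approach, here is a simplified sketch, not the framework itself; the parameter being swept, the host names, file paths and sysbench options are examples only:

#!/bin/sh
# Test runner: sweep one InnoDB parameter, drive the load remotely, sample iostat for later graphing
RESULTS=./results; mkdir -p "$RESULTS"
for IOCAP in 2000 4000 8000; do
    TAG="io_capacity_${IOCAP}"
    mysql -h db01 -e "SET GLOBAL innodb_io_capacity = ${IOCAP};"
    ssh db01 "iostat -xm 10" > "$RESULTS/$TAG.iostat" &        # data collector (sampler)
    SAMPLER=$!
    sysbench --test=oltp --mysql-host=db01 --mysql-user=sbtest --mysql-password=sbtest \
             --oltp-table-size=10000000 --oltp-test-mode=complex \
             --num-threads=128 --max-time=600 --max-requests=0 run > "$RESULTS/$TAG.sysbench"
    kill "$SAMPLER"                                            # stop sampling for this run
done
# the data processor and graphing steps then parse the *.iostat and *.sysbench files into charts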
  
	
  
Using this framework it was possible to identify the optimal MySQL and InnoDB engine configuration. The goal was to deliver the best possible InnoDB engine performance in terms of transactions and queries served per second (TPS and QPS), while eliminating I/O spikes and achieving consistent and predictable system load – in other words, fulfilling the critically important balancing act mentioned above: keeping the “checkpoint age” as large as possible while trying not to reach the “async” (or, even worse, the “sync”) point.
  
	
  
The graphs below show that an optimally configured DB server can easily deliver 1000+ OLTP transactions per second, translating to 20K+ queries per second, generated by 500 concurrent DB connections during a 6-hour-long test.

Figure 11: Optimized MySQL DB – QPS Graph (queries per second – green)
  
After a warm-up phase the system consistently delivered about 22K queries per second.

Figure 12: Optimized MySQL DB – TPS and RT Graph (transactions per second – green, response time – blue)
  
 
After ramping the load up to 500 concurrent users, the system consistently delivered 1200 TPS on average. The average response time of 1600 ms is measured end to end and includes both network and communication overhead (~1000 ms) and SQL processing time (~600 ms).

Figure 13: Optimized MySQL DB – RAID Stripe I/O Metrics (%util – red, await – green, avgqu-sz – blue)
  
It is easy to see that after the warm-up and stabilization phases the disk stripe performed consistently, with an average disk queue size of ~8, which was suggested by the storage team as the optimum value for the VCC storage stack. The “await” iostat metric – the average time for I/O requests to be issued to the device and served – stays constantly below 20 ms. Device utilization is below 25% on average, showing that there is still plenty of spare capacity to serve I/O requests.

Figure 14: Optimized MySQL DB – CPU Metrics (%idle – red, %user – green, %system – blue, %iowait – yellow)
  
	
  
The CPU metrics show that, on average, 55% of CPU capacity was idle, 35% was spent in user space (i.e. executing applications), 5% was spent on kernel (system) tasks including interrupt processing, and just 5% was spent waiting for device I/O.

Figure 15: Optimized MySQL DB – Network Metrics (bytes sent – green, bytes received – blue)
  
The network traffic measurements suggest that the network capacity is fully consumed – in other words, the network is saturated – with ~48 MB/s sent and ~2 MB/s received. These 50 MB/s of accumulated traffic come very close to the practical maximum throughput that can be achieved on a 500 Mbps network interface (500 Mbps ÷ 8 = 62.5 MB/s before protocol overhead).
  
	
  
In plain English this means that the network is the limiting factor here: with other resources still available, the DB server could deliver much higher TPS and QPS figures if additional network capacity were provisioned. The ultimate system capacity limit was not established, due to time constraints and the fact that the Customer application did not utilize more than 300 concurrent DB connections.
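A quick way to confirm NIC saturation from inside the VM is to watch the per-interface throughput counters; a hedged example (the interface name eth0 and the sampling interval are assumptions):

# 500 Mbps / 8 = 62.5 MB/s theoretical ceiling; sustained rx+tx throughput around ~50 MB/s
# on a 500 Mbps vNIC points at the link, not the DB engine, as the bottleneck
sar -n DEV 1 10 | grep eth0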
  
	
  
Optimal DB Configuration

Below is a summary of the major changes between the MySQL database configurations on the AWS and VCC platforms. As with the file-system configuration, the objective was to achieve consistent and predictable performance by avoiding resource usage surges and stalls.
  	
  
	
  
The proposed optimizations may have a positive effect in general; however, they are specific to a certain workload and use-case. Therefore, these optimizations cannot be considered universally applicable in VCC environments and must be tailored to a specific workload. Settings marked with an asterisk (*) are defaults for the DB version used.

< … removed … >

Table 4: Optimized MySQL DB – Recommended Settings
  
	
  
Besides the parameter changes listed above, the binary logs (also known as transaction logs) have been moved to a separate volume, where the Ext4 file-system has been set up with the following parameters:

< … removed … >
  
Further areas for DB improvement:
- Consider using the latest stable Percona XtraDB version, which is based on the MariaDB codebase and provides many improvements, including patches from Google and Facebook:
  o Redesign of the locking subsystem, with no reliance on kernel mutexes
  o Latest versions have removed a number of known contention points, resulting in fewer spins and lock waits and, eventually, better overall performance
  o Dump and pre-load buffer pool features, allowing much quicker startup and warm-up phases
  o Online DDL – changing the schema does not require downtime
  o Better query analyzer and overall query performance
  o Better page compression support and performance
  o Better monitoring and integration with the performance schema
  o A more intelligent flushing algorithm that takes into consideration page change rates, I/O rates, system load and capabilities, thus providing better out-of-the-box performance adjusted to the workload
  o Better suited for fast SSD-based storage (no added cost for random I/O), with adaptive algorithms that do not attempt to accommodate the shortcomings of spinning disks
  o Scales better on SMP (multi-core) systems and better utilizes a higher number of CPU threads
  o Provides fast checksums (hardware-assisted CRC32), lessening CPU overhead while retaining data consistency and security
  o New configuration options allowing the InnoDB engine to be tailored even better to a specific workload
- Consider using a more efficient memory allocator, e.g. jemalloc or tcmalloc (a hedged example of swapping the allocator appears after this list):
  o The memory allocator provided as part of GLIBC is known to fall short under high concurrency
  o GLIBC malloc wasn't designed for multithreaded workloads and has a number of internal contention points
  o Using modern memory allocators suited for high concurrency can significantly improve throughput by reducing internal locking and contention
- Perform DB optimization. While optimizing the infrastructure may result in significant improvement, even better results may be achieved by tailoring the DB structure itself:
  o Consider clustered indexes to avoid locking and contention
  o Consider page compression. Besides a slight CPU penalty, this may significantly improve throughput while reducing on-disk storage several times, resulting in turn in quicker replication and backups
  o Monitor the performance schema to learn more about in-flight DB engine performance and adjust the required parameters
  o Monitor the performance and information schemas to find more details about index effectiveness and build better, more effective indexes
- Perform SQL optimization. No infrastructure optimization can compensate for badly written SQL requests. Caching and other optimization techniques often mask bad code. SQL queries joining multi-million record tables may work just fine in development and completely break down on a production DB. Continuously analyze the most expensive SQL queries to avoid full table scans and on-disk temporary tables.
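As a hedged example of the allocator swap mentioned above (the library path and MySQL packaging differ per distribution):

# Option 1: let mysqld_safe preload the allocator (the --malloc-lib option exists since MySQL 5.5)
mysqld_safe --malloc-lib=/usr/lib64/libjemalloc.so.1 &

# Option 2: preload it explicitly when starting mysqld directly
# LD_PRELOAD=/usr/lib64/libjemalloc.so.1 /usr/sbin/mysqld --defaults-file=/etc/my.cnf &

# Verify the allocator is actually mapped into the running process
grep -i jemalloc /proc/$(pidof mysqld)/maps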
  
	
  
PART V – PEELING THE ONION

It is a common saying that performance improvement is like peeling an onion: after addressing one issue, the next one, previously masked, is uncovered, and so on… Likewise, in our case, after addressing the storage and DB layers and improving overall application throughput, it became apparent that something else was holding the application back from delivering the best possible performance. By this time the DB layer was studied very well; however, the overall application stack and the associated connection flows were not yet completely understood.
  
	
  
The	
  Customer	
  demonstrated	
  willingness	
  to	
  cooperate	
  and	
  assisted	
  by	
  providing	
  
instructions	
  for	
  reproducing	
  JMeter	
  load	
  tests	
  as	
  well	
  as	
  on-­‐site	
  resources	
  for	
  an	
  
architecture	
  workshop.	
  
	
  
From	
  this	
  point	
  on,	
  the	
  optimization	
  project	
  speed	
  up	
  tremendously.	
  Not	
  only	
  was	
  it	
  
possible	
  to	
  iterate	
  reliably	
  and	
  perform	
  load-­‐test	
  against	
  the	
  complete	
  application	
  stack,	
  
the	
  understanding	
  of	
  the	
  application	
  architecture	
  and	
  access	
  to	
  Application	
  
Performance	
  Management	
  (APM)	
  tool	
  Jennifer	
  made	
  a	
  huge	
  difference	
  in	
  terms	
  of	
  
visibility	
  into	
  internal	
  application	
  operation	
  and	
  major	
  performance	
  metrics.	
  	
  
	
  
	
  
	
  
Figure	
  16:	
  Jennifer	
  APM	
  Console	
  
Besides	
  providing	
  visual	
  feedback	
  and	
  displaying	
  a	
  number	
  of	
  metrics,	
  Jennifer	
  revealed	
  
the	
  next	
  bottleneck	
  –	
  the	
  network.	
  	
  
	
  
PART VI – PFSENSE

The original network design, replicating the network structure in AWS, was proposed and agreed with the Customer. Separate networks were created to replicate the functionality of AWS VPC, and pfSense appliances were used to provide network segmentation, routing and load balancing.

< … removed … >

Figure 17: Initial Application Deployment - Network Diagram

pfSense is an open source firewall/router software distribution based on FreeBSD. It is installed on a VM and turns it into a dedicated firewall/router for a network. It also provides additional important functions such as load balancing, VPN and DHCP. It is easy to manage through a web-based UI, even for users with little knowledge of the underlying FreeBSD system.

The FreeBSD network stack is known for its exceptional stability and performance. The pfSense appliances had been used many times before and since, so nobody expected issues coming from that side…

Watching the Jennifer XView chart closely in real time is fun in itself, like watching fire. It is also a powerful analysis tool that helps to understand the behavior of application components.

Figure 18: Jennifer XView - Transaction Response Time Scatter Graph

On the graph above, the distance between layers is exactly 10000ms, pointing to the fact that one of the application services was timing out at a 10-second interval and repeating connection attempts several times.

Figure 19: Jennifer APM - Transaction Introspection

Network socket operations were taking a significant amount of time, resulting in multiple repeated attempts at 10-second intervals.

Following the old sysadmin adage – "…always blame the network…" – the application flows were analyzed again and pfSense was suspected of losing or delaying packets. Interestingly enough, the web UI reported low to moderate VM load and didn't show any reason for concern.

Nonetheless, console access revealed the truth – the load created by a number of short thread spins was not properly reported in the web UI and was hidden by averaging. A closer look using advanced CPU and system metrics (a few console-level checks of the kind sketched below) confirmed that the appliance was experiencing unexpectedly high CPU load, adding latency and dropping network packets.
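
For reference, a minimal set of FreeBSD console checks that expose this kind of hidden load; the exact commands are an assumption of what one would run on the appliance, not a transcript of the session:

# per-CPU utilization - the web UI averages this away
top -P

# interrupt rates per device; a busy NIC shows up here long before the load average does
vmstat -i

# live traffic and error/drop counters, refreshed every second
netstat -w 1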
  
	
  
Adding more CPUs to the pfSense appliances doubled the network traffic they could pass. However, even with the maximum CPU count the network was still not saturated, suggesting that the pfSense appliances might still be limiting application performance.

Since the pfSense appliances were not an essential requirement and were only used to provide routing and load-balancing capability, it was decided to remove them from the application network flow and to reach the subnets directly by adding additional network cards to the VMs, with each NIC connected to the corresponding subnet.

To summarize - it would be wrong to conclude that pfSense does not fit the purpose or is not a viable option for building virtual network deployments. Most definitely, additional research and tuning would help to overcome the observed issues. Due to time constraints this area was not fully researched and is still pending thorough investigation.
PART VII – JMETER

With pfSense removed and HAProxy used for load balancing, overall application throughput definitely improved. Increasing the number of CPUs on the DB servers and the Cassandra nodes seemed to help as well. The collaborative effort with the Customer yielded great results and we were definitely on the right track.

With the floodgates wide open we were able to push over 1000 concurrent users during our tests. Around the same time we started seeing another anomaly – one out of three JMeter load agents (generators) was behaving quite strangely. After reaching the end of the 3600-second test window, the java threads belonging to two of the JMeter servers shut down quickly, while the third instance took a while to shut down, effectively extending the test window and, as a result, negatively skewing the averaged test metrics.

All three JMeter servers were reconfigured to use the same settings. For some reason they had been using slightly different configurations and were logging data to different paths. That didn't resolve the underlying issue, though. Due to time constraints it was decided to build a replacement VM rather than troubleshoot the misbehaving one.

Eventually, a fourth JMeter server was deployed. Besides fixing the issue with java thread startup and shutdown, it allowed us to generate higher loads and provided additional flexibility in defining load patterns.

Lesson learned: for low to moderate loads JMeter works just fine. For high loads, JMeter may become the breaking point itself. In this case it is recommended to use a scale-out approach rather than scale-up, keeping the number of java threads per server below a certain threshold (a minimal distributed-mode sketch follows).
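
For illustration only – a bare-bones way to run JMeter scaled out across several load generators in non-GUI mode. The host names and the test plan file are hypothetical placeholders, not the actual test assets used in this project:

# on each load generator VM - start the JMeter remote (RMI) server process
jmeter-server &

# on the controller - drive the same test plan from all generators at once,
# keeping the per-server thread count moderate and letting the fleet provide the volume
jmeter -n -t loadtest.jmx -R jmeter01,jmeter02,jmeter03,jmeter04 -l results.jtl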
  
	
  
PART VIII – ALMOST THERE

Although the AWS performance measurements were still better, we had already significantly improved performance compared to the figures captured during the first round of performance tests.

After removing pfSense, an average of 587 TPS at 800 VU was achieved. In this test the load was spread statically rather than balanced, by manually specifying different target application server IP addresses in the JMeter configuration files. With a HAProxy load balancer put in place, the TPS figure initially dropped to 544 and, after some optimizations (connection tracking and netfilter disabled), increased to 607 TPS at 800 VU – the maximum we had seen to date. This represents a 22% increase over the best previous result (498 TPS/800 VU, still with pfSense in place) and a 100% increase over the initial performance test. Overall the results were looking more than promising.

Figure 20: Iterative Optimization Progress Chart

Despite the good progress, the following points still required further investigation:
- Disk I/O skew issues still remained
- Cassandra servers' disk I/O was uneven and quite high

Our enthusiasm rose more and more as we discovered that the VCC platform could serve more users than AWS. The AWS test results showed performance starting to decline past 600 VU, while we were able to push as high as 1600 VU with the application sustaining the load and showing higher throughput numbers (~760-780 TPS), until …

The next day something happened which became another turning point in this project. The application became unstable and the throughput we had seen just a couple of hours earlier decreased significantly. More importantly, it started to fluctuate, with the application freezing at random times. The TPS scatter landscape in Jennifer was showing a new anomaly…

Figure 21: Jennifer XView - Transaction Response Time Surges

Since the other known bottlenecks had been removed and the MySQL DB was no longer the weak link in the chain, basically being bored during the performance test, the Cassandra cluster became the next suspect.
PART IX – CASSANDRA

The tomcat logs were pointing to Cassandra as well. There were numerous warning messages about excluding one node or another from the connection pool due to connectivity timeouts.

After taking a closer look at the Cassandra nodes, several points drew our attention (the first two are easy to spot with a nodetool check like the one sketched after this list):
- There was no consistency in the Cassandra ring load
- The amounts of data stored on the Cassandra nodes varied significantly
- Memory usage and I/O profiles were different across the board.
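
Purely as an illustration of how such an imbalance shows up, the following nodetool calls print per-node load, ownership and memory figures. The host names and JMX port mirror the ones used later in this document and are otherwise assumptions:

# the per-node "Load" and "Owns" columns make uneven data distribution obvious at a glance
./nodetool -h node01 -p 9199 ring

# per-node heap usage, key cache size and uptime for a quick memory comparison
for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 info ; done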
  	
  
	
  
As a common trend, after a short period of normal operation the average system load on several random Cassandra nodes started growing exponentially, eventually making those nodes unresponsive. During this time the I/O subsystem was over-utilized as well, yielding very high CPU %wait and long queues on the block devices.

Everything was pointing to the fact that certain Cassandra nodes initiated compaction (an internal data structure optimization) right during the load test, spiraling down in a deadly loop. Another quick conversation with the Customer's architect confirmed the same – it was most likely the SSTable compaction causing the issue.

Figure 22: VCC Cassandra Cluster CPU Usage During the Test

As seen on the graph above, during the various test runs one or another Cassandra node maxed out its CPU utilization. The same configuration in AWS had been working just fine, with a not perfect but still quite even load and no continuous load spikes.

Figure 23: AWS Cassandra Cluster CPU Usage During the Test

Comparing the VCC and AWS Cassandra deployments led to quite contradictory conclusions:
- VCC had more nodes – 12 vs. 8 in AWS – which should have improved performance, right?
- AWS was using spinning disks for the Cassandra VMs while the VCC storage stack is SSD-based, which should have improved performance too…

As with MySQL, it was clear that the optimal, or even "good enough", settings taken from AWS were not good, and at times even harmful, on the VCC platform.

For historical reasons the Customer's application uses both SQL and NoSQL databases. When mapping the AWS infrastructure to VCC, it was decided to build the Cassandra ring using 12 nodes in VCC instead of the 8 nodes in AWS, since the latter were a lot more powerful in terms of individual node specifications. As further tests revealed, the better approach would have been just the opposite – to use a larger number of smaller VMs for the Cassandra cluster. It is also worth mentioning that Cassandra was originally designed to run on a number of low-end systems with slow spinning disks.

Over the past couple of years, SSDs have appeared more and more often in data centers. While not yet a commodity, SSDs became a heavily used component in modern infrastructures, and the Cassandra codebase was adjusted to make its internal decisions and algorithms more suitable for use with SSDs, not only spinning disks. Therefore, deploying the latest stable Cassandra version could have provided additional benefits right away. Unfortunately, the specification required a specific version, and therefore all optimizations were performed against the older version.
Let's have a quick look at Cassandra's architecture and some key definitions.

Figure 24: High-Level Cassandra Architecture

Cassandra is a distributed key-value store initially developed at Facebook. It was designed to handle large amounts of data spread across many commodity servers. Cassandra provides high availability through a symmetric architecture that contains no single point of failure and replicates data across nodes.

Cassandra's architecture is a combination of Google's BigTable and Amazon's Dynamo. As in Dynamo's architecture, all Cassandra nodes form a ring that partitions the key space using consistent hashing (see the figure above). This is known as a distributed hash table (DHT). The data model and single-node architecture, as well as the terminology, are mainly based on BigTable. Cassandra can be classified as an extensible row store since it can store a variable number of attributes per row. Each row is accessible through a globally unique key. Although columns can differ per row, columns are grouped into more static column families. These are treated like tables in a relational database. Each column family is stored in separate files. In order to allow the flexibility of a different schema per row, Cassandra stores metadata with each value. The metadata contains the column name as well as a timestamp for versioning.

Like BigTable, Cassandra has an in-memory storage structure called the Memtable, one instance per column family. The Memtable acts as a write cache that allows for fast sequential writes to disk. Data on disk is stored in immutable Sorted String Tables (SSTables). SSTables consist of three structures: a key index, a bloom filter and a data file. The key index points to the rows in the SSTable; the bloom filter enables a quick check for the existence of keys in the table. Due to its limited size, the bloom filter is also cached in memory. The data file is ordered for faster scanning and merging.

For consistency and fault tolerance, all updates are first written to a sequential log (the Commit Log), after which they can be confirmed. In addition to the Memtable, Cassandra provides an optional row cache and key cache. The row cache stores a consolidated, up-to-date version of a row, while the key cache acts as an index into the SSTables. If these are used, write operations have to keep them updated. It is worth mentioning that only previously accessed rows are cached in Cassandra, in both caches. As a result, new rows are only written to the Memtable, not to the caches.

In order to deliver the lowest possible latency and the best performance on low-end hardware, Cassandra writes data in a multi-step process: requests are first written to the commit log, then to a Memtable, and eventually, when flushed, they are appended to disk as immutable SSTable files. Over time, as the number of SSTables grows, the data becomes fragmented, which hurts read performance.

To put it simply, flushing and compaction operations are vitally important for Cassandra. However, if set up incorrectly or executed at the "wrong" time, they can decrease performance significantly, at times making an entire Cassandra node unresponsive. This is exactly what was happening during the test, when several nodes stopped responding, showed very high system load and performed huge amounts of I/O. Obviously, Cassandra's configuration had been tuned for spinning disks on AWS, resulting in unexpected behavior on the SSD-based VCC storage stack.
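
A couple of nodetool commands make this kind of compaction stall easy to confirm. This is a minimal sketch; the host name and JMX port follow the convention used elsewhere in this document and are otherwise assumptions:

# is a compaction currently running on this node, and how far along is it?
./nodetool -h node01 -p 9199 compactionstats

# thread-pool statistics - a growing "Pending" count for the FlushWriter or
# CompactionExecutor pools means the node cannot keep up with flushing/compaction
./nodetool -h node01 -p 9199 tpstats

# per-column-family SSTable counts and read/write latencies
./nodetool -h node01 -p 9199 cfstats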
  
	
  
As a first measure to gain better visibility into Cassandra's operation, the DataStax OpsCenter application was deployed. It allowed iterating over various parameters and executing a number of tests against the Cassandra cluster while measuring their impact, and it helped to observe overall cluster behavior.

Applying all the lessons learned earlier and working with the VCC storage team, the following configuration changes were applied:

< … removed … >

Table 5: Optimized Cassandra - Recommended Settings

Similar to the MySQL optimization, the basic idea is to issue smaller, more frequent I/O so that the block device queues are saturated less, making better use of the storage stack resources.

Besides the recommended option changes, the commit log was moved to a separate volume. These changes led to predictable and consistent Cassandra performance, flushing in-memory data to disk steadily, avoiding I/O spikes and minimizing stalls due to compaction. Below is a summary of the volumes created for the Cassandra nodes:
xvda   600 IOPS – boot and root
xvdb   600 IOPS – lvm2 root extension
xvdc  4600 IOPS – data mdadm stripe disk 1 – no partitioning
xvde  4600 IOPS – data mdadm stripe disk 2 – no partitioning
xvdf  4600 IOPS – data mdadm stripe disk 3 – no partitioning
xvdg  5000 IOPS – commit log disk – no partitioning
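
For illustration, assembling the three data disks into a striped array of this kind would look roughly as follows. The device names match the list above; the filesystem, mount points and options are assumptions rather than the exact values used:

# stripe the three 4600 IOPS data disks into a single RAID-0 md device
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdc /dev/xvde /dev/xvdf

# filesystem for the Cassandra data directory; noatime avoids extra metadata writes
mkfs.ext4 /dev/md0
mkdir -p /var/lib/cassandra/data
mount -o noatime /dev/md0 /var/lib/cassandra/data

# commit log on its own dedicated volume
mkfs.ext4 /dev/xvdg
mkdir -p /var/lib/cassandra/commitlog
mount -o noatime /dev/xvdg /var/lib/cassandra/commitlog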
  
	
  
There are two more parameters worth mentioning, which control the streaming and compaction throughput limits within the Cassandra cluster. Both values were set to 50MB/s, which is sufficient for normal cluster operation and in line with the storage sub-system throughput configured on the Cassandra nodes. However, sometimes those thresholds need to be changed. For cluster rebalancing, maintenance and similar operations, the following handy shortcuts may be used to raise the thresholds cluster-wide.

# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 150 ; done
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 150 ; done

Obviously, after maintenance has completed, those thresholds should be set back to appropriate values for normal production use.
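
For completeness, reverting to the normal production baseline uses the same loop; 50 here is simply the 50MB/s value mentioned above:

# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 50 ; done
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 50 ; done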
  
	
  
PART X – HAPROXY

With the DB layer fixed, application performance became stable across tests, although two points were still raising some concerns:
- After an initial spike at the beginning of a load test, the number of concurrent connections abruptly dropped by almost a factor of two
- The number of Virtual User requests reaching each application server differed considerably, sometimes reaching a 1:2 ratio

Figure 25: Jennifer APM - Concurrent Connections and Per-server Arrival Rate

It was time to take a closer look at the software load balancers based on HAProxy. This application is known to be able to serve 100K+ concurrent connections, so just one thousand concurrent connections should not get anywhere close to the limit.

Additional research showed that the round-robin load-balancing scheme was not performing as expected and was concentrating requests on one system or another in an unpredictable manner. The most even request distribution was achieved by using the least-connection algorithm. After implementing this change, the load eventually spread evenly across all systems (a quick way to verify the per-server distribution is sketched below).
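
For reference, the per-server distribution can also be watched from HAProxy itself. The configuration and socket paths below are assumptions, and the admin "stats socket" must be enabled in haproxy.cfg for the second command to work:

# confirm which balancing algorithm each backend is using
grep -n 'balance' /etc/haproxy/haproxy.cfg

# dump live statistics as CSV; the per-server scur (current sessions) and
# stot (total sessions) columns show whether the load is actually spread evenly
echo "show stat" | socat stdio /var/run/haproxy.sock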
  
	
  
	
  
Figure 26: Jennifer APM - Connection Statistics After Optimization

Furthermore, a number of SYN flood kernel warnings in the log files, as well as nf_conntrack complaints (the Linux connection tracking facility used by iptables) about overrun buffers and dropped connections, pointed to the next optimization steps.
Initially, it was decided to increase the size of the connection tracking tables and internal structures and to disable the SYN flood protection mechanisms.

< … removed … >

This did show some improvement; however, eventually it was decided to turn iptables off completely to remove any possible obstacles and latency introduced by this facility.
During the subsequent tests, when the generated load was increased further, HAProxy hit another issue often referred to as "TCP socket exhaustion".

A quick reminder – there were two layers of HAProxies deployed. The first layer load-balanced the incoming http requests originating from the application clients between the java application server (tomcat) instances, while the second layer passed requests from the java application servers to the primary and stand-by MySQL DB servers.

HAProxy works as a reverse proxy and so uses its own IP address to establish connections to the server. Most operating systems implementing a TCP stack typically have around 64K (or fewer) TCP source ports available for connections to a remote IP:port. Once a combination of "source IP:port => destination IP:port" is in use, it cannot be re-used.

As a consequence, there cannot be more than 64K open connections from a HAProxy box to a single remote IP:port couple (a quick way to check the available range and current usage is sketched below).
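
For illustration, the ephemeral port range and the number of ports currently tied up toward a backend can be checked like this on the HAProxy hosts; the MySQL port is used as an example backend:

# ephemeral source-port range the kernel will hand out for outgoing connections
sysctl net.ipv4.ip_local_port_range

# how many source ports are currently consumed toward the backend (example: MySQL on 3306),
# including connections parked in TIME_WAIT
ss -tan | grep ':3306' | wc -l
ss -tan state time-wait | grep ':3306' | wc -l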
  
	
  
On the front layer the http request rate was a few hundred per second, so we never came anywhere near the limit of 64K simultaneously open connections to the remote service. On the backend layer there should not have been more than a couple of hundred persistent connections during peak time, since connection pooling was used on the application server. So this was not the problem either.
It turned out that there was an issue with the MySQL client implementation. When a client sends its "QUIT" sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump revealed this behavior (an illustrative capture filter is sketched at the end of this section). Note that this issue cannot be reproduced on a loopback interface or on the same system, because the server answers fast enough. But over a LAN connection between two different machines the latency rises past the threshold where the issue becomes apparent. Basically, here is the sequence performed by a MySQL client:

MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client ==>       FIN       ==> MySQL Server
MySQL Client <==     FIN ACK     <== MySQL Server
MySQL Client ==>       ACK       ==> MySQL Server

This results in the client-side connection remaining unavailable (in TIME_WAIT) for twice the MSL (Maximum Segment Lifetime), which defaults to 2 minutes. Note that this type of close has no negative impact when the MySQL connection is established over a UNIX socket.
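
As an illustration of the capture mentioned above, a filter of roughly this shape makes the early client-side FIN and the resulting TIME_WAIT pile-up visible; the interface name and port are assumptions:

# watch who closes the MySQL connections first: FIN/RST segments on port 3306
tcpdump -nn -i eth0 'tcp port 3306 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'

# count client-side sockets stuck in TIME_WAIT toward the MySQL servers
ss -tan state time-wait | grep ':3306' | wc -l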
  
	
  
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...
The Cloud Story or Less is More...

More Related Content

What's hot

Book hudson
Book hudsonBook hudson
Book hudson
Suresh Kumar
 
Citrix virtual desktop handbook (5 x)
Citrix virtual desktop handbook (5 x)Citrix virtual desktop handbook (5 x)
Citrix virtual desktop handbook (5 x)
Nuno Alves
 
What's New in VMware Virtual SAN
What's New in VMware Virtual SANWhat's New in VMware Virtual SAN
What's New in VMware Virtual SAN
EMC
 
Installing sql server 2012 failover cluster instance
Installing sql server 2012 failover cluster instanceInstalling sql server 2012 failover cluster instance
Installing sql server 2012 failover cluster instance
David Muise
 
Deploying the XenMobile 8.5 Solution
Deploying the XenMobile 8.5 SolutionDeploying the XenMobile 8.5 Solution
Deploying the XenMobile 8.5 Solution
Nuno Alves
 
D space manual 1.5.2
D space manual 1.5.2D space manual 1.5.2
D space manual 1.5.2tvcumet
 
Cisco Virtualization Experience Infrastructure
Cisco Virtualization Experience InfrastructureCisco Virtualization Experience Infrastructure
Cisco Virtualization Experience Infrastructure
ogrossma
 
Perceptive nolij web installation and upgrade guide 6.8.x
Perceptive nolij web installation and upgrade guide 6.8.xPerceptive nolij web installation and upgrade guide 6.8.x
Perceptive nolij web installation and upgrade guide 6.8.x
Kumaran Balachandran
 
Web securith cws getting started
Web securith cws getting startedWeb securith cws getting started
Web securith cws getting started
Harissa Maria
 
Getting Started with OpenStack and VMware vSphere
Getting Started with OpenStack and VMware vSphereGetting Started with OpenStack and VMware vSphere
Getting Started with OpenStack and VMware vSphere
EMC
 
Citrix virtual desktop handbook (7x)
Citrix virtual desktop handbook (7x)Citrix virtual desktop handbook (7x)
Citrix virtual desktop handbook (7x)
Nuno Alves
 
Whats-New-VMware-vCloud-Director-15-Technical-Whitepaper
Whats-New-VMware-vCloud-Director-15-Technical-WhitepaperWhats-New-VMware-vCloud-Director-15-Technical-Whitepaper
Whats-New-VMware-vCloud-Director-15-Technical-WhitepaperDjbilly Mixe Pour Toi
 
Book VMWARE VMware ESXServer Advanced Technical Design Guide
Book VMWARE VMware ESXServer  Advanced Technical Design Guide Book VMWARE VMware ESXServer  Advanced Technical Design Guide
Book VMWARE VMware ESXServer Advanced Technical Design Guide
aktivfinger
 
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Advantec Distribution
 
Akka java
Akka javaAkka java
Akka java
dlvladescu
 
NP problems
NP problemsNP problems
NP problems
Lien Tran
 
Dell Data Migration A Technical White Paper
Dell Data Migration  A Technical White PaperDell Data Migration  A Technical White Paper
Dell Data Migration A Technical White Paper
nomanc
 
Db2 virtualization
Db2 virtualizationDb2 virtualization
Db2 virtualization
bupbechanhgmail
 

What's hot (20)

Book hudson
Book hudsonBook hudson
Book hudson
 
Citrix virtual desktop handbook (5 x)
Citrix virtual desktop handbook (5 x)Citrix virtual desktop handbook (5 x)
Citrix virtual desktop handbook (5 x)
 
What's New in VMware Virtual SAN
What's New in VMware Virtual SANWhat's New in VMware Virtual SAN
What's New in VMware Virtual SAN
 
Installing sql server 2012 failover cluster instance
Installing sql server 2012 failover cluster instanceInstalling sql server 2012 failover cluster instance
Installing sql server 2012 failover cluster instance
 
Deploying the XenMobile 8.5 Solution
Deploying the XenMobile 8.5 SolutionDeploying the XenMobile 8.5 Solution
Deploying the XenMobile 8.5 Solution
 
D space manual 1.5.2
D space manual 1.5.2D space manual 1.5.2
D space manual 1.5.2
 
Cisco Virtualization Experience Infrastructure
Cisco Virtualization Experience InfrastructureCisco Virtualization Experience Infrastructure
Cisco Virtualization Experience Infrastructure
 
Perceptive nolij web installation and upgrade guide 6.8.x
Perceptive nolij web installation and upgrade guide 6.8.xPerceptive nolij web installation and upgrade guide 6.8.x
Perceptive nolij web installation and upgrade guide 6.8.x
 
Web securith cws getting started
Web securith cws getting startedWeb securith cws getting started
Web securith cws getting started
 
Getting Started with OpenStack and VMware vSphere
Getting Started with OpenStack and VMware vSphereGetting Started with OpenStack and VMware vSphere
Getting Started with OpenStack and VMware vSphere
 
Citrix virtual desktop handbook (7x)
Citrix virtual desktop handbook (7x)Citrix virtual desktop handbook (7x)
Citrix virtual desktop handbook (7x)
 
Whats-New-VMware-vCloud-Director-15-Technical-Whitepaper
Whats-New-VMware-vCloud-Director-15-Technical-WhitepaperWhats-New-VMware-vCloud-Director-15-Technical-Whitepaper
Whats-New-VMware-vCloud-Director-15-Technical-Whitepaper
 
Powershell selflearn
Powershell selflearnPowershell selflearn
Powershell selflearn
 
Book VMWARE VMware ESXServer Advanced Technical Design Guide
Book VMWARE VMware ESXServer  Advanced Technical Design Guide Book VMWARE VMware ESXServer  Advanced Technical Design Guide
Book VMWARE VMware ESXServer Advanced Technical Design Guide
 
8000 guide
8000 guide8000 guide
8000 guide
 
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
 
Akka java
Akka javaAkka java
Akka java
 
NP problems
NP problemsNP problems
NP problems
 
Dell Data Migration A Technical White Paper
Dell Data Migration  A Technical White PaperDell Data Migration  A Technical White Paper
Dell Data Migration A Technical White Paper
 
Db2 virtualization
Db2 virtualizationDb2 virtualization
Db2 virtualization
 

Viewers also liked

CE-Article_JulAug13_FINAL
CE-Article_JulAug13_FINALCE-Article_JulAug13_FINAL
CE-Article_JulAug13_FINALJeannie Counce
 
Công ty luật Wikilaw
Công ty luật WikilawCông ty luật Wikilaw
Công ty luật Wikilaw
Su Le Van
 
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZSISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
Fernando Perez
 
Hamlet
HamletHamlet
Hamlet
zeidina
 
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
savoytwaddle3112
 
Akshay Pawar Experiance
Akshay Pawar ExperianceAkshay Pawar Experiance
Akshay Pawar ExperianceAkshay Pawar
 
S first year orientation history of university of san carlos 1 - copy
S first year orientation history of university of san carlos 1 - copyS first year orientation history of university of san carlos 1 - copy
S first year orientation history of university of san carlos 1 - copySis Mmfe Navarro
 
Aula 2 final 3 tópicos avançados
Aula 2 final 3   tópicos avançadosAula 2 final 3   tópicos avançados
Aula 2 final 3 tópicos avançadosAngelo Peres
 
la informatica como practica social para la satisfaccion de necesidades.
la informatica como practica social para la satisfaccion de necesidades.la informatica como practica social para la satisfaccion de necesidades.
la informatica como practica social para la satisfaccion de necesidades.
itziadaniraparracervantes
 
Aon Food & Drink Inperspective Winter 2015
Aon Food & Drink Inperspective Winter 2015Aon Food & Drink Inperspective Winter 2015
Aon Food & Drink Inperspective Winter 2015
Graeme Cross
 
Hamlet
HamletHamlet
Hamlet
zeidina
 
South Korean ICT Development: Key Lessons for the Emerging Economies
South Korean ICT Development: Key Lessons for the Emerging EconomiesSouth Korean ICT Development: Key Lessons for the Emerging Economies
South Korean ICT Development: Key Lessons for the Emerging Economies
Faheem Hussain
 

Viewers also liked (12)

CE-Article_JulAug13_FINAL
CE-Article_JulAug13_FINALCE-Article_JulAug13_FINAL
CE-Article_JulAug13_FINAL
 
Công ty luật Wikilaw
Công ty luật WikilawCông ty luật Wikilaw
Công ty luật Wikilaw
 
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZSISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
SISTEMAS OPERATIVOS PERESENTACION FERNANDO PEREZ
 
Hamlet
HamletHamlet
Hamlet
 
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
UnA�Un�Un�UnA�Un�Un�UnA�Un�Un�Un�Un�Por QuAfA� Usted Necesita Un 'Apocalipsis...
 
Akshay Pawar Experiance
Akshay Pawar ExperianceAkshay Pawar Experiance
Akshay Pawar Experiance
 
S first year orientation history of university of san carlos 1 - copy
S first year orientation history of university of san carlos 1 - copyS first year orientation history of university of san carlos 1 - copy
S first year orientation history of university of san carlos 1 - copy
 
Aula 2 final 3 tópicos avançados
Aula 2 final 3   tópicos avançadosAula 2 final 3   tópicos avançados
Aula 2 final 3 tópicos avançados
 
la informatica como practica social para la satisfaccion de necesidades.
la informatica como practica social para la satisfaccion de necesidades.la informatica como practica social para la satisfaccion de necesidades.
la informatica como practica social para la satisfaccion de necesidades.
 
Aon Food & Drink Inperspective Winter 2015
Aon Food & Drink Inperspective Winter 2015Aon Food & Drink Inperspective Winter 2015
Aon Food & Drink Inperspective Winter 2015
 
Hamlet
HamletHamlet
Hamlet
 
South Korean ICT Development: Key Lessons for the Emerging Economies
South Korean ICT Development: Key Lessons for the Emerging EconomiesSouth Korean ICT Development: Key Lessons for the Emerging Economies
South Korean ICT Development: Key Lessons for the Emerging Economies
 

Similar to The Cloud Story or Less is More...

pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdfpdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
Jalal Neshat
 
hci10_help_sap_en.pdf
hci10_help_sap_en.pdfhci10_help_sap_en.pdf
hci10_help_sap_en.pdf
JagadishBabuParri
 
Ngen mvpn with pim implementation guide 8010027-002-en
Ngen mvpn with pim implementation guide   8010027-002-enNgen mvpn with pim implementation guide   8010027-002-en
Ngen mvpn with pim implementation guide 8010027-002-en
Ngoc Nguyen Dang
 
Jdbc
JdbcJdbc
SAP CPI-DS.pdf
SAP CPI-DS.pdfSAP CPI-DS.pdf
SAP CPI-DS.pdf
JagadishBabuParri
 
Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2
Sal Marcus
 
Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Sal Marcus
 
Lte demonstration network test plan phase 3 part_1-v2_4_05072013
Lte demonstration network test plan phase 3 part_1-v2_4_05072013Lte demonstration network test plan phase 3 part_1-v2_4_05072013
Lte demonstration network test plan phase 3 part_1-v2_4_05072013Pfedya
 
Hp 40gs user's guide english
Hp 40gs user's guide englishHp 40gs user's guide english
Hp 40gs user's guide englishdanilegg17
 
ProxySG_ProxyAV_Integration_Guide.pdf
ProxySG_ProxyAV_Integration_Guide.pdfProxySG_ProxyAV_Integration_Guide.pdf
ProxySG_ProxyAV_Integration_Guide.pdf
PCCW GLOBAL
 
plsqladvanced.pdf
plsqladvanced.pdfplsqladvanced.pdf
plsqladvanced.pdf
Thirupathis9
 
Mastering Oracle PL/SQL: Practical Solutions
Mastering Oracle PL/SQL: Practical SolutionsMastering Oracle PL/SQL: Practical Solutions
Mastering Oracle PL/SQL: Practical Solutions
MURTHYVENKAT2
 
R data
R dataR data
Gdfs sg246374
Gdfs sg246374Gdfs sg246374
Gdfs sg246374Accenture
 
Optimizing oracle-on-sun-cmt-platform
Optimizing oracle-on-sun-cmt-platformOptimizing oracle-on-sun-cmt-platform
Optimizing oracle-on-sun-cmt-platform
Sal Marcus
 
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Advantec Distribution
 

Similar to The Cloud Story or Less is More... (20)

pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdfpdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
pdfcoffee.com_i-openwells-basics-training-3-pdf-free.pdf
 
C++programming howto
C++programming howtoC++programming howto
C++programming howto
 
hci10_help_sap_en.pdf
hci10_help_sap_en.pdfhci10_help_sap_en.pdf
hci10_help_sap_en.pdf
 
Ngen mvpn with pim implementation guide 8010027-002-en
Ngen mvpn with pim implementation guide   8010027-002-enNgen mvpn with pim implementation guide   8010027-002-en
Ngen mvpn with pim implementation guide 8010027-002-en
 
Jdbc
JdbcJdbc
Jdbc
 
SAP CPI-DS.pdf
SAP CPI-DS.pdfSAP CPI-DS.pdf
SAP CPI-DS.pdf
 
Lfa
LfaLfa
Lfa
 
Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2
 
Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2Maa wp sun_apps11i_db10g_r2-2
Maa wp sun_apps11i_db10g_r2-2
 
Lte demonstration network test plan phase 3 part_1-v2_4_05072013
Lte demonstration network test plan phase 3 part_1-v2_4_05072013Lte demonstration network test plan phase 3 part_1-v2_4_05072013
Lte demonstration network test plan phase 3 part_1-v2_4_05072013
 
Hp 40gs user's guide english
Hp 40gs user's guide englishHp 40gs user's guide english
Hp 40gs user's guide english
 
ProxySG_ProxyAV_Integration_Guide.pdf
ProxySG_ProxyAV_Integration_Guide.pdfProxySG_ProxyAV_Integration_Guide.pdf
ProxySG_ProxyAV_Integration_Guide.pdf
 
R Data
R DataR Data
R Data
 
plsqladvanced.pdf
plsqladvanced.pdfplsqladvanced.pdf
plsqladvanced.pdf
 
Mastering Oracle PL/SQL: Practical Solutions
Mastering Oracle PL/SQL: Practical SolutionsMastering Oracle PL/SQL: Practical Solutions
Mastering Oracle PL/SQL: Practical Solutions
 
R data
R dataR data
R data
 
Gdfs sg246374
Gdfs sg246374Gdfs sg246374
Gdfs sg246374
 
Optimizing oracle-on-sun-cmt-platform
Optimizing oracle-on-sun-cmt-platformOptimizing oracle-on-sun-cmt-platform
Optimizing oracle-on-sun-cmt-platform
 
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
Ap 51xx access point product reference guide (part no. 72 e-113664-01 rev. b)
 
usersguide.pdf
usersguide.pdfusersguide.pdf
usersguide.pdf
 

Recently uploaded

Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
The Cloud Story or Less is More...

PREFACE

One of the market-leading enterprises, hereinafter called the Customer, has multiple business units working in areas ranging from consumer electronics to mobile communications and cloud services. One of its strategic initiatives is to expand software capabilities to get ahead of the competition.

The Customer started using the AWS platform for development purposes and as the main hosting platform for its cloud services. Over the past years the usage of AWS grew significantly, with over 30 production applications currently hosted on AWS infrastructure.

As the Customer's reliance on AWS increased, the number of pain points grew as well. They experienced multiple outages and incurred unnecessarily high costs to scale application performance and to accommodate unbalanced CPU/memory hardware profiles. Although the achieved application performance was generally satisfactory, several major challenges and trends emerged over time:
- Scalability and growth issues
- Very high overall infrastructure and support costs
- Single service provider lock-in.

Verizon proposed to trial the Verizon Cloud Compute (VCC) beta product as an alternative hosting platform, with the goal of demonstrating that on-par application performance can be achieved at a much lower cost, effectively addressing one of the biggest challenges. An alternative hosting platform would also give the Customer freedom of choice, addressing another issue. Last but not least, the unique VCC platform architecture and infrastructure stack, built for low-latency and high-performance workloads, would help address the remaining pain point: application performance and scalability.

Senior executives from both companies supported this initiative and one of the Customer's applications was selected for the proof-of-concept project. The objective was to compare the AWS and VCC deployments side by side from both capability and performance perspectives, execute performance tests and deliver a report to senior management.

The proof-of-concept project was successfully executed in close collaboration between various Verizon teams as well as the Customer's SMEs. It was demonstrated that the application hosted on the VCC platform, given appropriate tuning, is capable of delivering better performance than when hosted on a more powerful AWS-based footprint.
PART I – BUILDING TESTBED

The agreed high-level plan was clear and straightforward:
- (Verizon) Mirror the AWS hosting infrastructure using the VCC platform
- (Verizon) Set up infrastructure, OS and applications per specification sheet
- (Customer) Adjust necessary configurations and settings on the VCC platform
- (Customer) Upload test data – 10 million users, 100 million contacts
- (Customer) Execute smoke, performance and aging tests in the AWS environment
- (Customer) Execute smoke, performance and aging tests in the VCC environment
- (Customer) Compare AWS and VCC results and captured metrics
- (Customer) Deliver report to senior management

The high-level diagram below depicts the application infrastructure hosted on the AWS platform.

Figure 1: AWS Application Deployment

Although both the AWS and VCC platforms use the XEN hypervisor at their core, the initial step – mirroring the AWS hosting environment by provisioning equally sized VMs in VCC – raised the first challenge. The Verizon Cloud Compute platform in its early beta stage imposed a number of limitations. To be fair, those limitations were neither by design nor hardware limits, but rather software or configuration settings pertinent to the corresponding product release.

The table below summarizes the most important infrastructure limits for both cloud platforms as of February 2014:

  Resource Limit             VCC        AWS
  VPUs per VM                8          32
  RAM per VM                 28 GB      244 GB
  Volumes per VM             5          20+
  IOPS per Volume (SSD)      3000       4000
  Max Volume Size            1 TB       1 TB
  Guaranteed IOPS per VM     15K        40K
  Throughput per vNIC        500 Mbps   10 Gbps

Table 1: Major Infrastructure Limits

Besides the obvious points, such as the number of CPUs or the huge difference in network throughput, it is also worth mentioning that the CPU-to-RAM ratio (processor count to memory size) differs considerably as well: 1:4.5 for VCC versus 1:7.625 for AWS. This ratio is crucial for certain types of applications, specifically for databases.

Despite the aforementioned differences, it was jointly decided with the Customer to move forward with smaller VCC VMs and to take the sizing ratio into account when comparing performance and test results. This already set the expectation that VCC results might be lower compared to AWS, assuming linear application scalability and a 4-8x difference in hardware footprint.

The tables below summarize infrastructure sizing and mapping for the corresponding service layers hosted on both cloud platforms. Resources sized differently on the two platforms can be seen by comparing the tables.

  VM Role      Count   VPUs   RAM, GB   IOPS   Net, Mbps
  Tomcat       2       4      34.2      -      1000
  MySQL        1       32     244       10K    10000
  Cassandra    8       8      68.4      5K     1000
  HA Proxy     4       2      7.5       -      1000
  DB Cache     2       4      34.2      -      1000

Table 2: AWS Infrastructure Mapping and Sizing

  VM Role      Count   VPUs   RAM, GB   IOPS   Net, Mbps
  Tomcat       2       4      28        -      500
  MySQL        1       4      28        9K     500
  Cassandra    12      4      28        5K     500
  HA Proxy     4       2      4         -      500
  DB Cache     2       4      28        -      500

Table 3: VCC Infrastructure Mapping and Sizing

The initial setup of the disk volumes required special creativity in order to get as close as possible to the required number of IOPS. In addition to the per-disk storage limits mentioned above, there was initially another VCC limitation, luckily addressed later: all disks attached to a particular VM had to be provisioned with the exact same IOPS rate.

The most common setup was based on LVM2, with a linear extension for the boot disk volume group and either two or three additional disks aggregated into an LVM stripe set. This setup allowed building disk volumes of up to 3 TB and 9000 IOPS, getting close enough to the 10K IOPS required for the database VMs.

Besides the technical limitations, the sheer volume of provisioning and configuration work presented a challenge in itself. The hosting platform requirements were captured in a spreadsheet listing system parameters for every VM. Following this spreadsheet manually and building out the environment sequentially would have required significant time and tremendous manual effort. Additionally, it would likely have resulted in a number of human errors and omissions. Automating and scripting the major parts of the installation and setup process addressed this.

The automation suite, implemented on top of the vzDeploymentFramework shell library (a Verizon internal development), made it possible in a matter of minutes to:
- Parse the specification spreadsheet for inputs and updates
- Generate updated OS and application configurations
- Create LVM volumes or software RAID arrays
- Roll out updated settings to multiple systems based on their functional role
- Change Linux iptables-based firewall configurations across the board
- Validate required connectivity between hosts
- Install required software packages

Having all configurations in a version-controlled repository allowed auditing and comparing configurations between the master and on-host deployed versions, providing rudimentary configuration management capabilities.

Below is a high-level architecture diagram for the originally implemented test environment.
Figure 2: Initial VCC Application Deployment

The test load was initiated by a JMeter master (test controller and management GUI) and generated by several JMeter slaves (load generators, or test agents). The generated virtual user (VU) requests were load-balanced between two Tomcat application servers, each running a single application instance.

Since F5 LTM instances were not available at build time, the proposed design used pfSense appliances as routers, load-balancers and firewalls for the corresponding VLANs.

The Tomcat servers communicated, via another pair of HAProxy load-balancers, with two persistent storage back-ends – MySQL (SQL DB) and Cassandra (NoSQL DB) – employing Couchbase (DB Cache) as a caching layer.
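For reference, a JMeter master/slave arrangement of the kind described above is typically driven from the master in non-GUI mode. A minimal sketch follows; the host names and file names (slave1..slave3, contacts_testplan.jmx, results.jtl) are placeholders for illustration, not the project's actual artifacts:

  # on each load generator: start the JMeter remote agent
  jmeter-server &

  # on the master: run the test plan in non-GUI mode against all remote agents
  # and collect the results into a single result file
  jmeter -n -t contacts_testplan.jmx -R slave1,slave2,slave3 -l results.jtl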
Most systems were additionally instrumented with NMON collectors for gathering key performance metrics. A Jennifer APM application was deployed to perform real-time transaction monitoring and code introspection.

Following the initial plan, the hosting environment was handed over to the Customer on time for adjusting configurations and uploading the test data.

PART II – FIRST TEST

The first test was conducted on both the AWS and VCC platforms and the Customer shared the test results. During the test the load was ramped up in 100 VU increments for each subsequent 10-minute test run. During each run the corresponding number of virtual users performed various API calls, emulating human behavior using patterns observed and measured on the production application.

The chart below depicts the number of application transactions successfully processed by each platform during the 10-minute test runs (TPS per VU count):

  VU count       200   300   400   500   600   700   800   900
  AWS TPS        321   462   539   627   637   645   651   654
  Verizon TPS    203   256   269   257   275   249   268   247

Figure 3: First Test Results – Comparison Chart

It was obvious that the AWS infrastructure was more powerful, processing more than twice the throughput, which did not come as a big surprise. However, the Customer expressed several concerns about overall VCC platform stability, low MySQL DB server performance and uneven load distribution between the striped data volumes, dubbed I/O skews.
Indeed, the application "Transactions per Second" (TPS) measurements did not correlate well with the generated application load, and even with a growing number of users something prevented the application from taking off. After short increases the overall throughput consistently dropped again, clearly pointing to a bottleneck limiting the transaction stream.

According to the Jennifer APM monitors, the increase in application transaction times was caused by slow DB responses, taking 5 seconds and more per single DB operation. At the same time the DB server was showing very high CPU %iowait, fluctuating around 85-90%.

Figure 4: First Test – High CPU Load on DB Server

Figure 5: First Test – High CPU %iowait on DB Server

Furthermore, out of the three stripes making up the data volume, one constantly reported significantly higher device wait times and utilization percentage, effectively causing disk I/O skew.

Figure 6: First Test – Disk I/O Skew on DB Server
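Metrics of this kind (device utilization, await, queue size) can be sampled directly on the VM alongside the NMON collectors; a minimal sketch of the sort of commands used for such checks (intervals and counts are illustrative):

  # extended per-device statistics (%util, await, avgqu-sz) every 10 seconds
  iostat -dxm 10

  # capture system metrics to a file for later graphing,
  # e.g. one sample every 10 seconds for one hour
  nmon -f -s 10 -c 360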
Obviously, these test results were not acceptable. Investigating and identifying the bottlenecks and performance-limiting factors required good knowledge of the application architecture and its internals, as well as deep VCC product and storage stack knowledge, since the latter two issues appeared to be platform- and infrastructure-related. To address this, a dedicated cross-team taskforce was established.

PART III – STORAGE STACK PERFORMANCE

The VCC storage stack was validated once more, and it was reconfirmed that there are no limiting factors or shortcomings on the layers below the block device. The resulting conclusion was that the limitations had to be at the hypervisor, OS or application layer.

On the other hand, the Customer confirmed that the AWS deployment used exactly the same configuration and application versions as VCC. The only possible logical conclusion was that a setup and configuration optimal for AWS does not perform the same way on VCC. In other words, the VCC platform required its own optimal configuration.

Further efforts were aligned with the following objectives:
- Improve storage throughput and address I/O skews
- Identify the root cause of the low DB server performance
- Improve DB server performance and scalability
- Work with the Customer on improving overall VCC deployment performance
- Re-run the performance tests and demonstrate improved throughput and predictable performance levels

Originally the storage volumes were set up using the Customer's specifications and OS defaults for all other parameters. After performing research and a number of component performance tests, several interesting discoveries were made, in particular:
- Different Linux distributions (Ubuntu and CentOS) use different approaches to disk partitioning: Ubuntu aligned partitions to 4K block boundaries, while CentOS did not
- The default block device scheduler, CFQ, is not a good choice in environments using virtualized storage
- The MDADM and LVM volume managers use quite different algorithms for I/O batching and compaction
- The XFS and EXT4 file systems yield very different results depending on the number of concurrent threads performing I/O
- Due to all the Linux optimizations and multiple caching levels, it is hard enough to measure net storage throughput from within a VM, let alone through the entire application stack

After a number of trials and studying the platform behavior, the following was suggested for achieving optimal I/O performance on the VCC storage stack:
- Use raw block devices instead of partitions for RAID stripes, to circumvent any partition block alignment issues
- Use MDADM software RAID instead of LVM (the latter is more flexible and may be used in combination with MDADM; however, it performs a certain amount of "optimization" assuming spindle-based storage that may interfere with performance in VCC)
- Use proper stripe settings and block sizes for software RAID (don't let the system guess – specify!)
- Use the EXT4 file system instead of XFS. EXT4 provides journaling for metadata and data instead of metadata only, with negligible performance overhead for the load observed
- Use optimal (and safe) settings for EXT4 file system creation and mounts
- Ensure the NOOP block device scheduler is used (which lets the underlying storage stack, from the hypervisor down, optimize block I/O more effectively)
- Separate the different I/O profiles, e.g. sequential I/O (redo/bin-log files) and random I/O (data files) on the DB server, by writing the corresponding data to separate logical disks
- Use DIRECT_IO wherever possible and avoid OS/file-system caching (in certain situations the cache may give a false impression of high performance, which is then abruptly interrupted by the flushing of massive caches, during which the entire VM gets blocked)
- Avoid I/O bursts due to cache flushing and keep the device queue length close to 8. This corresponds to a hardware limitation on the chassis NPU. VCC storage is very low latency and quick, but if the storage queue locks up, the entire VM gets blocked. Writing early and often at a consistent rate performs dramatically better under load than caching in RAM as long as possible and then flooding the I/O queue once the cache has been exhausted
- Make sure the network device driver is not competing with the block device drivers and the application for CPU time, by relocating the associated interrupts to different vCPU cores inside the VM
- Use 4K blocks for I/O operations wherever possible for more optimal storage stack operation

After implementing these suggestions on a DB server, the storage subsystem yielded predictable and consistent performance. For example, data volumes set up with 10K IOPS reported ~39 MB/s throughput, which is the expected maximum assuming a 4K I/O block size:

  (4K * 10000 IOPS) / 1024 = 39.06 MB/s, the maximum possible throughput
  (4K * 15000 IOPS) / 1024 = 58.59 MB/s, the maximum possible throughput

With a 15K IOPS setup using 3 stripes (5K IOPS each), ~55-56 MB/s throughput was achieved, as shown on the screenshot below:
Figure 7: Optimized Storage Subsystem Throughput

Although some minor deviation in the I/O figures (+/- 5%) was still observed, this is typically considered acceptable and within the normal range.

While performing additional tests on the optimized systems, it was observed that all block device interrupts were being served by CPU0, which was becoming a hot spot even with the netdev interrupts moved off to different CPUs. The following method may be used to spread block device interrupts evenly across CPUs for the devices implementing the RAID stripes:

  # inspect the current interrupt distribution and affinity masks
  cat /proc/interrupts
  cat /proc/irq/183[3-6]/smp_affinity*

  # distribute block device interrupts between CPU4-CPU7
  # (smp_affinity is a hexadecimal CPU bitmask: 0x10=CPU4, 0x20=CPU5, 0x40=CPU6, 0x80=CPU7)
  echo 80 > /proc/irq/1836/smp_affinity
  echo 40 > /proc/irq/1835/smp_affinity
  echo 20 > /proc/irq/1834/smp_affinity
  echo 10 > /proc/irq/1833/smp_affinity
  # 0x08 = CPU3
  echo 8 > /proc/irq/1838/smp_affinity
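A quick way to confirm that the per-CPU interrupt counters actually move as intended is sketched below; the IRQ numbers are the example ones from above and will differ on other systems, and note that values written under /proc do not survive a reboot, so they are typically reapplied from an init script:

  # watch the per-CPU counters for the block device IRQs grow on the intended cores
  watch -n 1 "egrep 'CPU|183[3-8]' /proc/interrupts"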
Please note that the IRQ numbers and their assignment may differ on your system. Consult the /proc/interrupts table for the specific assignments pertinent to your system.

For additional details and theory, please refer to the following online materials:
http://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
https://www.kernel.org/doc/ols/2009/ols2009-pages-235-238.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt

PART IV – DATABASE OPTIMIZATION

Since the Customer had not yet shared the application and testing know-how, the only way to reproduce the abnormal DB behavior seen during the test was to replay the DB transaction log against a DB snapshot recovered from backup. This was a slow, cumbersome and not fully repeatable process. The Percona tools were really instrumental for this task, allowing a multithreaded transaction replay with delays inserted between transactions as recorded. A plain SQL script import would have been processed by a single thread only, with all requests processed as one stream.

Although the transaction replay did create some DB server load, the load type and its I/O patterns were quite different from the I/O patterns observed during the test. The transaction logs included only DML statements (insert, update, delete), but no data read (select) requests. Knowing that those "select" requests represented 75% of all requests, it quickly became apparent that such a testing approach was flawed and would not be able to recreate real-life conditions.

We reached a point where more advanced tools and techniques were required for iterating over various DB parameters in a repeatable fashion while measuring their impact on DB performance and the underlying subsystems. Moreover, it was not clear whether the unexpected DB behavior and performance issues were caused by the virtualization infrastructure, the DB engine settings, or the way the DB was used, i.e. the combination of application logic and the data stored in the DB tables.

To separate those concerns it was proposed to perform load tests using synthetic OLTP transactions generated by sysbench, a well-known load-testing toolkit. Such tests were executed on both the VCC and AWS platforms. The results spoke for themselves.
Figure 8: AWS i2.8xlarge CPU Load – Sysbench Test Completed in 64.42 sec

Figure 9: VCC 4C-28G CPU Load – Sysbench Test Completed in 283.51 sec

At this point it was clear that the DB server's performance issues had nothing to do with application logic and were not specific to the SQL workload, but were rather related to configuration and infrastructure. The OLTP test provided the capability to stress test the DB engine and optimize it independently, without having to rely on the Customer's application know-how and the solution-wide test harness.
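For reference, a synthetic OLTP run of the kind shown above looks roughly as follows with the legacy sysbench 0.4/0.5 command line; the host, credentials, table size and duration are placeholders for illustration, not the values used in this project:

  # create and populate the test table
  sysbench --test=oltp --mysql-host=db1 --mysql-user=sbtest --mysql-password=secret \
           --oltp-table-size=10000000 prepare

  # run a mixed read/write OLTP workload with 500 client threads for one hour
  sysbench --test=oltp --mysql-host=db1 --mysql-user=sbtest --mysql-password=secret \
           --oltp-table-size=10000000 --num-threads=500 --max-time=3600 \
           --max-requests=0 run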
Thorough research and study of the InnoDB engine began. Studying the source code, as well as consulting the following online resources, was key to a clear understanding of the DB engine internals and its behavior:

- http://www.mysqlperformanceblog.com
- http://www.percona.com
- http://dimitrik.free.fr/blog/
- https://blog.mariadb.org

The drawing below, published by Percona engineers, shows the key factors and settings impacting DB engine throughput and performance.

Figure 10: InnoDB Engine Internals

Obviously, there is no quick win and no single dial to turn in order to achieve the optimal result. It is easy to explain the main factors impacting InnoDB engine performance, though optimizing those factors in practice is quite a challenging task.
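The current values of these dials, and the counters showing how the engine reacts to them, can be pulled from a running server at any time; a minimal sketch (connection options omitted):

  # current InnoDB configuration variables (buffer pool, log file size, flushing, ...)
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb%'"

  # runtime counters, e.g. buffer pool usage and dirty pages
  mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%'"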
InnoDB Performance – Theory and Practice

The two most important parameters for InnoDB performance are innodb_buffer_pool_size and innodb_log_file_size. InnoDB works with data in memory, and all changes to data are performed in memory. In order to survive a crash or system failure, InnoDB logs changes into the InnoDB transaction logs. The size of the InnoDB transaction log defines how many changed blocks can be tolerated in memory at any given point in time. The obvious question is: "why can't we simply use a gigantic InnoDB transaction log?" The answer is that the size of the transaction log affects recovery time after a crash. The rule of thumb (until recently) was: the bigger the log, the longer the recovery time. So we have the innodb_log_file_size variable; let's imagine it as a distance on an imaginary axis.

Our current state is the checkpoint age, which is the age of the oldest modified non-flushed page. The checkpoint age lies somewhere between 0 and innodb_log_file_size. Point 0 means there are no modified pages. The checkpoint age cannot grow past innodb_log_file_size, as that would mean we would not be able to recover after a crash.

In fact, InnoDB has two safety nets, or protection points: "async" and "sync". When the checkpoint age reaches the "async" point, InnoDB tries to flush as many pages as possible while still allowing other queries; however, throughput drops through the floor. The "sync" stage is even worse: when we reach the "sync" point, InnoDB blocks other queries while trying to flush pages and return the checkpoint age to a point before "async". This is done to prevent the checkpoint age from exceeding innodb_log_file_size. Both are abnormal operational stages for InnoDB and should be avoided at all cost. In current versions of InnoDB, the "sync" point is at about 7/8 of innodb_log_file_size, and the "async" point is at about 6/8 = 3/4 of innodb_log_file_size.

So there is one critically important balancing act: on the one hand we want the checkpoint age to be as large as possible, as it defines performance and throughput; on the other hand, we should never reach the "async" point.
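The checkpoint age itself is easy to keep an eye on during a test; a minimal sketch, assuming a MySQL 5.5/5.6-style InnoDB status output (credentials omitted):

  # checkpoint age = current log sequence number minus the last checkpoint LSN,
  # both printed in the LOG section of the InnoDB status output
  mysql -e 'SHOW ENGINE INNODB STATUS\G' \
    | awk '/Log sequence number/ {lsn=$NF} /Last checkpoint at/ {cp=$NF}
           END {print "checkpoint age, bytes:", lsn - cp}'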
The idea is to define another point, T (target), located before "async", in order to leave some room for flexibility, and to try at all cost to keep the checkpoint age from going past T. We assume that if we can keep the checkpoint age in the range 0 – T, we will achieve stable throughput even for a more or less unpredictable workload.

Now, which factors affect the checkpoint age? When we execute DML queries that change data (insert/update/delete), we write to the log, we change pages, and the checkpoint age grows. When we flush changed pages, the checkpoint age goes down again. So the main way to keep the checkpoint age around point T is to change the number of pages flushed per second, or rather to make this number variable and suited to the specific workload. That way we can keep the checkpoint age down. If this does not help and the checkpoint age keeps growing beyond T towards "async", we have a second control mechanism: we can add a delay into insert/update/delete operations. This way we prevent the checkpoint age from growing and reaching "async".

To summarize, the idea behind the optimization algorithm is: under load we must keep the checkpoint age around point T by increasing or decreasing the number of pages flushed per second. If the checkpoint age continues to grow, we need to throttle throughput to prevent further growth. The amount of throttling depends on the position of the checkpoint age: the closer it gets to "async", the higher the level of throttling required.

From Theory to Practice – Test Framework

There is a saying: in theory, there is no difference between theory and practice, but in practice there is.

In practice, there are a lot more variables to bear in mind. Factors such as I/O limits, thread contention and locking come into play, and improving performance becomes more like solving an equation with a number of interdependent variables.

Obviously, to be able to iterate over various parameter and setting combinations, DB tests need to be executed in a repeatable and well-defined (read: automated) manner, while capturing test results for correlation and further analysis. Quick research showed that although many load-testing frameworks are available, some specifically tailored for testing MySQL DB performance, unfortunately none of them covered all requirements or provided the needed tools and automation.

Eventually, we developed our own fully automated and flexible load-testing framework. This framework was mainly used to stress test and analyze MySQL and InnoDB behavior; nonetheless, it is open enough to plug in other tools or to be used for testing different applications. The developed toolkit includes the following components:
- Test Runner
- Remote Test Agent (load generator)
- Data Collector (sampler)
- Data Processor
- Graphing facility

Using this framework it was possible to identify the optimal MySQL and InnoDB engine configuration. The goal was to deliver the best possible InnoDB engine performance in terms of transactions and queries served per second (TPS and QPS), while eliminating I/O spikes and achieving a consistent and predictable system load; in other words, fulfilling the critically important balancing act mentioned above: keeping the checkpoint age as large as possible while trying not to reach the "async" (or, even worse, "sync") point.

The graphs below show that an optimally configured DB server can easily deliver 1000+ OLTP transactions per second, translating to 20K+ queries per second, generated by 500 concurrent DB connections during a 6-hour test.

Figure 11: Optimized MySQL DB – QPS Graph (queries per second in green)

After a warm-up phase the system consistently delivered about 22K queries per second.

Figure 12: Optimized MySQL DB – TPS and RT Graph (transactions per second in green, response time in blue)
After ramping the load up to 500 concurrent users, the system consistently delivered 1200 TPS on average. The average response time of 1600 ms is measured end to end and includes both network and communication overhead (~1000 ms) and SQL processing time (~600 ms).

Figure 13: Optimized MySQL DB – RAID Stripe I/O Metrics (%util in red, await in green, avgqu-sz in blue)

It is easy to see that after the warm-up and stabilization phases the disk stripe performed consistently, with an average disk queue size of ~8, which was suggested by the storage team as the optimum value for the VCC storage stack. The "await" iostat metric, the average time for I/O requests to be issued to the device and served, stays constantly below 20 ms. Device utilization averages below 25%, showing that there is still plenty of spare capacity to serve I/O requests.

Figure 14: Optimized MySQL DB – CPU Metrics (%idle in red, %user in green, %system in blue, %iowait in yellow)

The CPU metrics show that on average 55% of CPU time was idle, 35% was spent in user space, i.e. executing applications, 5% was spent on kernel (system) tasks including interrupt processing, and just 5% was spent waiting for device I/O.
Figure 15: Optimized MySQL DB – Network Metrics (bytes sent in green, bytes received in blue)

The network traffic measurements suggest that the network capacity is fully consumed, or in other words the network is saturated, with ~48 MB/s sent and ~2 MB/s received. These 50 MB/s of cumulative traffic come very close to the practical maximum throughput achievable on a 500 Mbps network interface.

In plain English this means that the network is the limiting factor here, and with other resources still available the DB server could deliver much higher TPS and QPS figures if additional network capacity were provisioned. The ultimate system capacity limit was not established, due to time constraints and the fact that the Customer's application did not utilize more than 300 concurrent DB connections.

Optimal DB Configuration

Below is a summary of the major changes between the MySQL database configurations on the AWS and VCC platforms. As with the file-system configuration, the objective was to achieve consistent and predictable performance by avoiding resource usage surges and stalls.

The proposed optimizations may have a positive effect in general; however, they are specific to a certain workload and use case. Therefore these optimizations cannot be considered universally applicable in VCC environments and must be tailored to the specific workload. Settings marked with an asterisk (*) are defaults for the DB version used.

< … removed … >

Table 4: Optimized MySQL DB – Recommended Settings

Besides the parameter changes listed above, the binary logs (also known as transaction logs) were moved to a separate volume, where the EXT4 file system was set up with the following parameters:

< … removed … >
Further areas for DB improvement:
- Consider using the latest stable Percona XtraDB version, which is based on the MariaDB codebase and provides many improvements, including patches from Google and Facebook:
  o Redesigned locking subsystem with no reliance on kernel mutexes
  o Latest versions have removed a number of known contention points, resulting in fewer spins and lock waits and eventually in better overall performance
  o Buffer pool dump and pre-load features, allowing much quicker startup and warm-up phases
  o Online DDL – changing the schema does not require downtime
  o Better query analyzer and overall query performance
  o Better page compression support and performance
  o Better monitoring and integration with the performance schema
  o A more intelligent flushing algorithm that takes into consideration page change rates, I/O rates, and system load and capabilities, thus providing better out-of-the-box performance adjusted to the workload
  o Better suited for fast SSD-based storage (no added cost for random I/O), with adaptive algorithms that do not attempt to accommodate the shortcomings of spinning disks
  o Scales better on SMP (multi-core) systems and better utilizes a higher number of CPU threads
  o Provides fast checksums (hardware-assisted CRC32), lessening CPU overhead while retaining data consistency and security
  o New configuration options allowing the InnoDB engine to be tailored even better to a specific workload
- Consider using a more efficient memory allocator, e.g. jemalloc or tcmalloc:
  o The memory allocator provided as part of GLIBC is known to fall short under high concurrency
  o GLIBC malloc was not designed for multithreaded workloads and has a number of internal contention points
  o Using modern memory allocators suited for high concurrency can significantly improve throughput by reducing internal locking and contention
- Perform DB optimization. While optimizing the infrastructure may result in significant improvement, even better results may be achieved by tailoring the DB structure itself:
  o Consider clustered indexes to avoid locking and contention
  o Consider page compression. Besides a slight CPU penalty, this may significantly improve throughput while reducing on-disk storage several times, in turn resulting in quicker replication and backups
  o Monitor the performance schema to learn more about in-flight DB engine performance and adjust the required parameters
  o Monitor the performance and information schemas to learn more about index effectiveness and build better, more effective indexes
- Perform SQL optimization. No infrastructure optimization can compensate for badly written SQL requests. Caching and other optimization techniques often mask bad code. SQL queries joining multi-million-record tables may work just fine in development and completely break down on a production DB. Continuously analyze the most expensive SQL queries to avoid full table scans and on-disk temporary tables.

PART V – PEELING THE ONION

It is a common saying that performance improvement is like peeling an onion: after addressing one issue, the next one, previously masked, is uncovered, and so on. Likewise, in our case, after addressing the storage and DB layers and improving overall application throughput, it became apparent that something else was holding the application back from delivering the best possible performance. By this time the DB layer was understood very well; however, the overall application stack and the associated connection flows were not yet completely understood.

The Customer demonstrated willingness to cooperate and assisted by providing instructions for reproducing the JMeter load tests as well as on-site resources for an architecture workshop.

From this point on, the optimization project sped up tremendously. Not only was it possible to iterate reliably and run load tests against the complete application stack, but the understanding of the application architecture and access to the Application Performance Management (APM) tool Jennifer made a huge difference in terms of visibility into the application's internal operation and its major performance metrics.
Figure 16: Jennifer APM Console

Besides providing visual feedback and displaying a number of metrics, Jennifer revealed the next bottleneck: the network.

PART VI – PFSENSE

The original network design, replicating the network structure in AWS, was proposed and agreed with the Customer. Separate networks were created to replicate the functionality of an AWS VPC, and pfSense appliances were used to provide network segmentation, routing and load balancing.

< … removed … >

Figure 17: Initial Application Deployment – Network Diagram

pfSense is an open-source firewall/router software distribution based on FreeBSD. It is installed on a VM and turns that VM into a dedicated firewall/router for a network. It also provides additional important functions such as load balancing, VPN and DHCP. It is easy to manage through its web-based UI, even for users with little knowledge of the underlying FreeBSD system.

The FreeBSD network stack is known for its exceptional stability and performance. The pfSense appliances had been used many times before (and have been since), so nobody expected issues from that side.

Watching the Jennifer XView chart closely in real time is fun in itself, like watching a fire. It is also a powerful analysis tool that helps to understand the behavior of the application components.
Figure 18: Jennifer XView – Transaction Response Time Scatter Graph

On the graph above, the distance between the layers is exactly 10000 ms, pointing to the fact that one of the application services was timing out at a 10-second interval and repeating connection attempts several times.

Figure 19: Jennifer APM – Transaction Introspection

Network socket operations were taking a significant amount of time, resulting in multiple repeated attempts at 10-second intervals.

Following the old sysadmin adage, "...always blame the network...", the application flows were analyzed again and pfSense was suspected of losing or delaying packets. Interestingly enough, the web UI reported low to moderate VM load and did not show any reason for concern.
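Figures like these from the web UI can be cross-checked from the appliance shell; a minimal sketch of the kind of FreeBSD commands used for such a check, run on the pfSense console or over SSH:

  # per-thread CPU usage, including kernel threads that averaged UI graphs tend to hide
  top -SH

  # cumulative interrupt counts per device and mbuf (network buffer) usage
  vmstat -i
  netstat -m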
Nonetheless, console access revealed the truth: the load created by a number of short thread spins was not properly reported in the web UI, hidden by the averaging calculations. A closer look using advanced CPU and system metrics confirmed that the appliance was experiencing unexpectedly high CPU load, adding to latency and dropping network packets.

Adding more CPUs to the pfSense appliances doubled the network traffic they passed. However, even with the maximum CPU count the network was not yet saturated, suggesting that the pfSense appliances might still be limiting application performance.

Since the pfSense appliances were not an essential requirement, being used only to provide routing and load-balancing capability, it was decided to remove them from the application network flow and to access the subnets by adding additional network cards to the VMs, with each NIC connected to the corresponding subnet.

To summarize: it would be wrong to conclude that pfSense does not fit the purpose and is not a viable option for building virtual network deployments. Most definitely, additional research and tuning would help to overcome the observed issues. Due to time constraints this area was not fully researched and is still pending thorough investigation.

PART VII – JMETER

With pfSense removed and HAProxy used for load balancing, the overall application throughput definitely improved. Increasing the number of CPUs on the DB servers and the Cassandra nodes seemed to help as well. The collaborative effort with the Customer yielded great results and we were definitely on the right track.

With the floodgates wide open we were able to push more than 1000 concurrent users during our tests. About the same time we started seeing another anomaly: one out of three JMeter load agents (generators) was behaving quite strangely. After reaching the end of the test at the 3600-second mark, the Java threads belonging to two of the JMeter servers shut down quickly, while the third instance took a long time to shut down, effectively increasing the test window duration and, as a result, negatively impacting the average test metrics.
All three JMeter servers were reconfigured to use the same settings; for some reason they had been using slightly different configurations and were logging data to different paths. This did not resolve the underlying issue, though. Due to time constraints it was decided to build a replacement VM rather than troubleshoot the issues with one of the existing VMs.

Eventually, a fourth JMeter server was deployed. Besides fixing the issue with Java thread startup and shutdown, it allowed us to generate higher loads and provided additional flexibility in defining load patterns.

Lesson learned: for low to moderate loads JMeter works just fine. For high loads, JMeter may become the breaking point itself. In this case it is recommended to scale out rather than scale up, keeping the number of Java threads per server below a certain threshold.

PART VIII – ALMOST THERE

Although the AWS performance measurements were still better, we had already significantly improved performance compared to the figures captured during the first round of performance tests.

With pfSense removed, an average of 587 TPS at 800 VU was achieved. In this test the load was spread statically rather than balanced, by manually specifying different target application server IP addresses in the JMeter configuration files. With a HAProxy load-balancer put in place, the TPS figure initially went down to 544, and after some optimizations (disabling connection tracking in netfilter) it increased to 607 TPS at 800 VU – the maximum we had seen to date. This represents a 22% increase over the best previous result (498 TPS at 800 VU, still with pfSense) and a 100% increase over the initial performance test. Overall, the results were looking more than promising.

Figure 20: Iterative Optimization Progress Chart
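The connection-tracking change mentioned above was of the following kind; a minimal sketch with a placeholder subnet, not the project's actual rule set:

  # exempt the load-balanced traffic from netfilter connection tracking
  iptables -t raw -A PREROUTING -s 10.0.0.0/16 -j NOTRACK
  iptables -t raw -A OUTPUT -d 10.0.0.0/16 -j NOTRACK

  # confirm the conntrack table is no longer close to its limit
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max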
Despite the good progress, the following points still required further investigation:
- The disk I/O skew issues still remained
- Cassandra server disk I/O was uneven and quite high

Our enthusiasm rose more and more as we discovered that the VCC platform could serve more users than AWS. The AWS test results showed that past 600 VU performance started to decline, while we were able to push as high as 1600 VU with the application supporting the load and showing higher throughput numbers (~760-780 TPS), until...

The next day something happened which became another turning point in this project. The application became unstable and the application throughput we had seen just a couple of hours earlier decreased significantly. More importantly, it started to fluctuate, with the application freezing at random times. The TPS scatter landscape in Jennifer was showing a new anomaly.

Figure 21: Jennifer XView – Transaction Response Time Surges

Since the other known bottlenecks had been removed and the MySQL DB was no longer the weak link in the chain, basically sitting bored during the performance test, the Cassandra cluster became the next suspect.

PART IX – CASSANDRA
• 30. The tomcat logs were pointing to Cassandra as well: there were numerous warnings about one or another node being excluded from the connection pool due to connectivity timeouts.

A closer look at the Cassandra nodes drew our attention to several points:
- There was no consistency in the Cassandra ring load
- The amount of data stored on each Cassandra node varied significantly
- Memory usage and I/O profiles differed across the board

The common pattern was that, after a short period of normal operation, the average system load on a few seemingly random Cassandra nodes started growing exponentially, eventually making those nodes unresponsive. During this time the I/O subsystem was over-utilized as well, producing very high CPU %wait figures and long block-device queues.

Everything pointed to certain Cassandra nodes initiating compaction (an internal data-structure optimization) right in the middle of the load test and spiraling down in a deadly loop. Another quick conversation with the Customer's architect confirmed the same suspicion – it was most likely SSTable compaction causing the issue.

Figure 22: VCC Cassandra Cluster CPU Usage During the Test

As seen on the graph above, during the various test runs one or another Cassandra node maxed out its CPU. The same configuration in AWS had been working just fine – not perfectly even, but with a reasonably balanced load and no sustained spikes.
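For reference, this kind of ring imbalance and mid-test compaction activity can be confirmed directly from the command line with stock Cassandra and sysstat tooling; a quick sketch (the host name and JMX port follow the nodetool shortcuts shown later in this part, and exact output varies between Cassandra versions):

# ./nodetool -h node01 -p 9199 ring
# ./nodetool -h node01 -p 9199 compactionstats
# ./nodetool -h node01 -p 9199 tpstats
# iostat -dmx 5

The first command shows token ownership and the data load per node, the second lists active and pending compactions on a suspect node, the third exposes thread-pool backlogs (pending flush or mutation tasks point to flush/compaction pressure), and iostat confirms block-device saturation during a spike.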
• 31. Figure 23: AWS Cassandra Cluster CPU Usage During the Test

Comparing the VCC and AWS Cassandra deployments led to rather contradictory observations:
- VCC has more nodes – 12 vs. 8 in AWS – which should improve performance, right?
- AWS uses spinning disks for the Cassandra VMs while the VCC storage stack is SSD-based, which should improve performance too…

As with MySQL, it was clear that settings which are optimal, or at least "good enough", on AWS are not necessarily good – and at times are outright harmful – on the VCC platform.

For historical reasons the Customer's application uses both SQL and NoSQL databases. When mapping the AWS infrastructure to VCC, it was decided to build the Cassandra ring from 12 nodes instead of the 8 used in AWS, since the latter were a lot more powerful in terms of individual node specifications. As further tests revealed, the better approach would have been just the opposite – a smaller number of more powerful VMs for the Cassandra cluster. It is also worth mentioning that Cassandra was originally designed to run on a number of low-end systems with slow spinning disks.

Over the past couple of years SSDs have started to appear more and more often in data centers. While not yet a commodity, SSDs have become a heavily used component of modern infrastructures, and the Cassandra codebase has been adjusted to make its internal decisions and algorithms suitable for SSDs as well as spinning disks. Deploying the latest stable Cassandra version could therefore have provided additional benefits right away. Unfortunately, the specification required a specific version, so all optimizations were performed against the older release.

Let's have a quick look at Cassandra's architecture and some key definitions.
• 32. Figure 24: High-Level Cassandra Architecture

Cassandra is a distributed key-value store initially developed at Facebook. It was designed to handle large amounts of data spread across many commodity servers. Cassandra provides high availability through a symmetric architecture that contains no single point of failure and replicates data across nodes.

Cassandra's architecture is a combination of Google's BigTable and Amazon's Dynamo. As in Dynamo, all Cassandra nodes form a ring that partitions the key space using consistent hashing (see the figure above), also known as a distributed hash table (DHT). The data model and single-node architecture, including the terminology, are mainly based on BigTable. Cassandra can be classified as an extensible row store, since it can store a variable number of attributes per row. Each row is accessible through a globally unique key. Although columns can differ per row, they are grouped into more static column families, which are treated like tables in a relational database. Each column family is stored in separate files. To allow the flexibility of a different schema per row, Cassandra stores metadata with each value; the metadata contains the column name as well as a timestamp used for versioning.

Like BigTable, Cassandra has an in-memory storage structure called the Memtable, with one instance per column family. The Memtable acts as a write cache that allows for fast sequential writes to disk. Data on disk is stored in immutable Sorted String Tables (SSTables). An SSTable consists of three structures: a key index, a bloom filter and a data file. The key index points to the rows in the SSTable, while the bloom filter enables checking for the existence of keys in the table; thanks to its limited size, the bloom filter is also cached in memory. The data file is ordered for faster scanning and merging.

For consistency and fault tolerance, all updates are first written to a sequential log (the Commit Log), after which they can be acknowledged. In addition to the Memtable, Cassandra provides an optional row cache and key cache. The row cache stores a consolidated, up-to-date version of a row, while the key cache acts as an index into the SSTables. If these caches are used, write operations have to keep them updated.
• 33. It is worth mentioning that only previously accessed rows are cached in Cassandra, in both caches. As a result, new rows are written only to the Memtable, not to the caches.

In order to deliver the lowest possible latency and the best performance on low-end hardware, Cassandra writes data in a multi-step process: requests are first written to the commit log, then to a Memtable structure and eventually, when flushed, they are appended to disk as immutable SSTable files. Over time, as the number of SSTables grows, the data becomes fragmented, which hurts read performance.

Put simply, flushing and compaction are vitally important operations for Cassandra. However, if configured incorrectly or executed at the "wrong" time, they can decrease performance significantly, at times making an entire Cassandra node unresponsive. This is exactly what was happening during the test, when several nodes stopped responding, showed very high system load and performed huge amounts of I/O. Evidently, Cassandra's configuration had been tuned for the spinning disks in AWS, resulting in unexpected behavior on the SSD-based VCC storage stack.

As a first measure to gain better visibility into Cassandra's operation, the DataStax OpsCenter application was deployed. It allowed iterating over various parameters and executing a number of tests against the Cassandra cluster while measuring their impact and observing overall cluster behavior.

Applying all the lessons learned earlier and working with the VCC storage team, the following configuration changes were applied:

< … removed … >

Table 5: Optimized Cassandra - Recommended Settings

Similar to the MySQL optimization, the basic idea is to perform smaller, more frequent I/O so that the block-device queues saturate less and the storage stack resources are utilized more evenly.

Besides the recommended option changes, the commit log was moved to a separate volume. These changes led to predictable and consistent Cassandra performance, forcing in-memory data to disk evenly and continuously, avoiding I/O spikes and minimizing stalls due to compaction. Below is a summary of the volumes created for the Cassandra nodes:

xvda  600 IOPS – boot and root
xvdb  600 IOPS – lvm2 root extension
xvdc 4600 IOPS – data mdadm stripe disk 1 – no partitioning
xvde 4600 IOPS – data mdadm stripe disk 2 – no partitioning
xvdf 4600 IOPS – data mdadm stripe disk 3 – no partitioning
xvdg 5000 IOPS – commit log disk – no partitioning
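For completeness, here is a sketch of how such a layout might be assembled and wired into cassandra.yaml. The device names follow the table above, while the filesystem type, mount points and RAID parameters are illustrative assumptions rather than the exact commands used:

# mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdc /dev/xvde /dev/xvdf
# mkfs.ext4 /dev/md0 && mkfs.ext4 /dev/xvdg
# mkdir -p /var/lib/cassandra/data /var/lib/cassandra/commitlog
# mount /dev/md0 /var/lib/cassandra/data
# mount /dev/xvdg /var/lib/cassandra/commitlog

The corresponding cassandra.yaml entries would then point the data files at the stripe and the commit log at its dedicated volume:

data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog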
• 34. There are two more parameters worth mentioning, which control the streaming and compaction throughput limits within the Cassandra cluster. Both values were set to 50 MB/s, which is sufficient for normal cluster operation and in line with the storage sub-system throughput configured on the Cassandra nodes. Sometimes, however, those thresholds need to be changed. During cluster rebalancing, maintenance and similar operations, the following handy shortcuts may be used to control the thresholds cluster-wide:

# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 150 ; done
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 150 ; done

Obviously, once maintenance has completed, those thresholds should be set back to values appropriate for normal production use.

PART X – HAPROXY

With the DB layer fixed, application performance became stable across tests, although two points were still raising concerns:
- After an initial spike at the beginning of a load test, the number of concurrent connections abruptly dropped by almost a factor of two
- The number of Virtual User requests reaching each application server differed noticeably, at times approaching a 1:2 ratio
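As a side note, per-server distribution can also be checked directly on the load-balancer, independently of the APM, via HAProxy's built-in statistics page. A minimal sketch of enabling it – the listener port, URI and refresh interval are illustrative:

listen stats
    bind 0.0.0.0:8404
    mode http
    stats enable
    stats uri /haproxy?stats
    stats refresh 10s

The resulting page shows current and cumulative sessions per backend server, which makes uneven request distribution immediately visible.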
• 35. Figure 25: Jennifer APM - Concurrent Connections and Per-server Arrival Rate

It was time to take a closer look at the software load-balancers based on HAProxy. This application is known to be able to serve 100K+ concurrent connections, so a mere one thousand concurrent connections should not get anywhere close to the limit.

Additional research showed that the round-robin load-balancing scheme was not performing as expected and was concentrating requests on one or another system in an unpredictable manner. The most even request distribution was achieved with the least-connections algorithm (see the configuration sketch below). After implementing this change, the load eventually spread evenly across all systems.

Figure 26: Jennifer APM - Connection Statistics After Optimization

Furthermore, a number of SYN flood kernel warnings in the log files, as well as nf_conntrack complaints (the Linux connection tracking facility used by iptables) about overrun buffers and dropped connections, pointed to the next optimization steps.

Initially, it was decided to increase the size of the connection tracking tables and internal structures and to disable the SYN flood protection mechanisms.

< … removed … >

This did show some improvement; however, eventually it was decided to turn iptables off completely to remove any possible obstacles and latency introduced by this facility.
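Coming back to the balancing algorithm, the switch amounts to a single directive in the backend definition. A minimal sketch – the backend name, server names and addresses are illustrative:

backend tomcat_servers
    mode http
    balance leastconn
    server app01 10.0.1.11:8080 check
    server app02 10.0.1.12:8080 check

With leastconn, each new request goes to the server with the fewest established connections, which is what eventually evened out the load shown in Figure 26.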
• 36. During subsequent tests, when the generated load was increased further, HAProxy hit another issue, often referred to as "TCP socket exhaustion".

A quick reminder – there were two layers of HAProxies deployed. The first layer load-balanced the incoming http requests originating from the application clients across the java application server (tomcat) instances, and the second layer passed requests from the java application servers to the primary and stand-by MySQL DB servers.

HAProxy works as a reverse proxy and therefore uses its own IP address to establish connections to the server. Most operating systems implementing a TCP stack have around 64K (or fewer) TCP source ports available for connections to a remote IP:port. Once a combination of "source IP:port => destination IP:port" is in use, it cannot be re-used. As a consequence, there cannot be more than 64K open connections from a HAProxy box to a single remote IP:port pair.

On the front layer the http request rate was a few hundred per second, so we would never get anywhere near the limit of 64K simultaneously open connections to the remote service. On the backend layer there should not have been more than a couple of hundred persistent connections at peak time, since connection pooling was used on the application server. So this was not the problem either.

It turned out that there was an issue with the MySQL client implementation. When a client sends its "QUIT" sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump revealed this behavior. Note that this issue cannot be reproduced on a loopback interface or on the same system, because the server answers fast enough; but over a LAN connection between two different machines the latency rises past the threshold where the issue becomes apparent. Basically, here is the sequence performed by a MySQL client:

MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client ==>       FIN       ==> MySQL Server
MySQL Client <==     FIN ACK     <== MySQL Server
MySQL Client ==>       ACK       ==> MySQL Server

Since the client closes first, its side of the connection remains in TIME_WAIT and the source port stays unavailable for twice the MSL (Maximum Segment Lifetime), which defaults to 2 minutes. Note that this type of close has no negative impact when the MySQL connection is established over a UNIX socket.
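The pattern is easy to confirm on the load-balancer host itself; a sketch of the kind of checks involved, where the interface name and the default MySQL port 3306 are illustrative assumptions:

# tcpdump -nn -i eth0 'tcp port 3306 and (tcp[tcpflags] & (tcp-fin|tcp-rst)) != 0'
# ss -tn state time-wait '( dport = :3306 )' | wc -l

The first command shows which side sends the FIN first on the MySQL leg, and the second counts local sockets stuck in TIME_WAIT towards the DB backend – a steadily growing count during a test is the tell-tale sign of the source-port exhaustion described above.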