
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M writes/s on AWS

Presentation given in October 2011 at the High Performance Transaction Systems Workshop - describes how Netflix used AWS to run a set of highly scalable Cassandra benchmarks on hundreds of instances in only a few hours.



  1. Global Netflix
     Replacing Datacenter Oracle with Global Apache Cassandra on AWS
     October 24th, 2011
     Adrian Cockcroft
     @adrianco #netflixcloud
     http://
  2. Netflix Inc.
     With over 25 million members in the United States, Canada and Latin America, Netflix, Inc. is the world's leading Internet subscription service for enjoying movies and TV shows.
     International Expansion
     Netflix, Inc., the leading global Internet movie subscription service, today announced it will expand to the United Kingdom and Ireland in early 2012.
     Source: http://
  3. Building a Global Netflix Service
     Netflix Cloud Migration
     Highly Available and Globally Distributed Data
     Scalability and Performance
  4. Why Use Public Cloud?
  5. Things We Don’t Do
  6. Better Business Agility
  7. Data Center
     Netflix could not build new datacenters fast enough
     Capacity growth is accelerating, unpredictable
     Product launch spikes - iPhone, Wii, PS3, XBox
  8. Out-Growing Data Center
     http://-netflix-api.html
     37x Growth Jan 2010-Jan 2011
     Datacenter Capacity
  9. is now ~100% Cloud
     A few small back end data sources still in progress
     All international product is cloud based
     USA specific logistics remains in the Datacenter
     Working aggressively on billing, PCI compliance on AWS
  10. Netflix Choice was AWS with our own platform and tools
      Unique platform requirements and extreme agility and flexibility
  11. Leverage AWS Scale
      “the biggest public cloud”
      AWS investment in features and automation
      Use AWS zones and regions for high availability, scalability and global deployment
  12. We want to use clouds, we don’t have time to build them
      Public cloud for agility and scale
      AWS because they are big enough to allocate thousands of instances per hour when we need to
  13. Netflix Deployed on AWS
      Content: Video Masters, EC2, S3, CDN
      Logs: S3, EMR Hadoop, Hive, Business Intelligence
      Play: DRM, CDN routing, Bookmarks, Logging
      WWW: Sign-Up, Search, Movie Choosing, Ratings
      API: Metadata, Device Config, TV Movie Choosing, Mobile iPhone
  14. Datacenter Anti-Patterns
      What did we do in the datacenter that prevented us from meeting our goals?
  15. Old Datacenter vs. New Cloud Arch
      Central SQL Database -> Distributed Key/Value NoSQL
      Sticky In-Memory Session -> Shared Memcached Session
      Chatty Protocols -> Latency Tolerant Protocols
      Tangled Service Interfaces -> Layered Service Interfaces
      Instrumented Code -> Instrumented Service Patterns
      Fat Complex Objects -> Lightweight Serializable Objects
      Components as Jar Files -> Components as Services
  16. The Central SQL Database
      • Datacenter has central Oracle databases
        – Everything in one place is convenient until it fails
        – Customers, movies, history, configuration
      • Schema changes require downtime
      Anti-pattern impacts scalability, availability
  17. The Distributed Key-Value Store
      • Cloud has many key-value data stores
        – More complex to keep track of, do backups etc.
        – Each store is much simpler to administer
        – Joins take place in Java code
      DBA
      • No schema to change, no scheduled downtime
      • Latency for typical queries
        – Memcached is dominated by network latency <1ms
        – Cassandra replication takes a few milliseconds
        – Oracle for simple queries is a few milliseconds
        – SimpleDB replication and REST auth overheads >10ms
  18. Data Migration to Cassandra
  19. Transitional Steps
      • Bidirectional Replication
        – Oracle to SimpleDB
        – Queued reverse path using SQS
        – Backups remain in Datacenter via Oracle
      • New Cloud-Only Data Sources
        – Cassandra based
        – No replication to Datacenter
        – Backups performed in the cloud
  20. API
      [Architecture diagram. AWS EC2: Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, API etc., Component Services, SQS, Cassandra, memcached, EC2 Internal Disks, S3, SimpleDB. Netflix Data Center: Oracle, memcached. Replication links the two sides.]
  21. Cutting the Umbilical
      • Transition Oracle Data Sources to Cassandra
        – Offload Datacenter Oracle hardware
        – Free up capacity for growth of remaining services
      • Transition SimpleDB+Memcached to Cassandra
        – Primary data sources that need backup
        – Keep simplest small use cases for now
      • New challenges
        – Backup, restore, archive, business continuity
        – Business Intelligence integration
  22. API
      [Architecture diagram. AWS EC2: Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, Component Services, memcached, Cassandra, EC2 Internal Disks, Backup, S3, SimpleDB.]
  23. High Availability
      • Cassandra stores 3 local copies, 1 per zone
        – Synchronous access, durable, highly available
        – Read/Write One fastest, least consistent - ~1ms
        – Read/Write Quorum 2 of 3, consistent - ~3ms
      • AWS Availability Zones
        – Separate buildings
        – Separate power etc.
        – Close together
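The "2 of 3" quorum on this slide matches the usual majority rule, floor(RF/2) + 1. The sketch below works through that arithmetic; the function names are illustrative only and are not part of any Cassandra API.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def overlapping(read_acks: int, write_acks: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor


rf = 3  # one copy per availability zone, as on the slide
q = quorum_size(rf)
print(q)                      # 2
print(overlapping(q, q, rf))  # True: quorum reads see quorum writes
print(overlapping(1, 1, rf))  # False: level One can miss the latest write
```

This is why the slide calls Quorum "consistent" and One "least consistent": with one ack on each side, a read may land on a replica that the write has not reached yet.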
  24. Cassandra Write Data Flows
      Single Region, Multiple Availability Zone
      1. Client writes to any Cassandra node
      2. Coordinator node replicates to nodes and zones
      3. Nodes return ack to coordinator
      4. Coordinator returns ack to client
      5. Data written to internal commit log disk
      If a node goes offline, hinted handoff completes the write when the node comes back up.
      Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
      SSTable disk writes and compactions occur asynchronously.
      [Diagram: clients and six Cassandra nodes with disks, two each in Zones A, B and C, annotated with the step numbers above.]
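The write path on this slide can be modeled in a few lines. This is a toy, single-process sketch under stated assumptions: `Node`, `Coordinator` and `replay_hints` are hypothetical names, replicas are contacted in sequence, and nothing here reflects Cassandra's actual internals.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    zone: str
    up: bool = True
    commit_log: list = field(default_factory=list)

    def write(self, key, value) -> bool:
        """Append to the commit log and ack; an offline node never acks."""
        if not self.up:
            return False
        self.commit_log.append((key, value))
        return True


@dataclass
class Coordinator:
    replicas: list
    hints: list = field(default_factory=list)

    def required_acks(self, level: str) -> int:
        n = len(self.replicas)
        return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]

    def write(self, key, value, level: str = "QUORUM") -> bool:
        acks = 0
        for node in self.replicas:
            if node.write(key, value):
                acks += 1
            else:
                # Hinted handoff: remember the write for the offline node.
                self.hints.append((node, key, value))
        return acks >= self.required_acks(level)

    def replay_hints(self) -> None:
        """Deliver stored writes to nodes that have come back up."""
        remaining = []
        for node, key, value in self.hints:
            if not node.write(key, value):
                remaining.append((node, key, value))
        self.hints = remaining


nodes = [Node("a", "zone-a"), Node("b", "zone-b"), Node("c", "zone-c", up=False)]
coord = Coordinator(nodes)
quorum_ok = coord.write("k1", "v1", "QUORUM")  # 2 of 3 acks is enough
all_ok = coord.write("k2", "v2", "ALL")        # fails while zone-c is down
nodes[2].up = True
coord.replay_hints()
print(quorum_ok, all_ok, len(nodes[2].commit_log))  # True False 2
```

The caller chooses how many acks to wait for, while the stored hints bring the recovered node up to date afterwards.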
  25. Data Flows for Multi-Region Writes
      Consistency Level = Local Quorum
      1. Client writes to any Cassandra node
      2. Coordinator node replicates to other nodes, zones and regions
      3. Local write acks returned to coordinator
      4. Client gets ack when 2 of 3 local nodes are committed
      5. Data written to internal commit log disks
      6. When data arrives, remote node replicates data
      7. Ack direct to source region coordinator
      8. Remote copies written to commit log disks
      If a node or region goes offline, hinted handoff completes the write when the node comes back up.
      Nightly global compare and repair jobs ensure everything stays consistent.
      [Diagram: US and EU regions, each with clients and six Cassandra nodes with disks across Zones A, B and C, annotated with the step numbers above; 100+ms latency between regions.]
  26. Remote Copies
      • Cassandra duplicates across AWS regions
        – Asynchronous write, replicates at destination
        – Doesn’t directly affect local read/write latency
      • Global Coverage
        – Business agility
        – Follow AWS…
      • Local Access
        – Better latency
        – Fault Isolation
  27. Cassandra Backup
      • Full Backup
        – Time based snapshot
        – SSTable compress -> S3
      • Incremental
        – SSTable write triggers compressed copy to S3
      • Continuous Option
        – Scrape commit log
        – Write to EBS every 30s
      [Diagram: ring of Cassandra nodes writing to S3 Backup.]
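The incremental option ("SSTable write triggers compressed copy") can be sketched as a callback that compresses each newly written file. In this illustration a local directory stands in for S3, and the callback name and the file name are hypothetical. It is not the Netflix backup tool.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def on_sstable_written(sstable: Path, backup_root: Path) -> Path:
    """Compress a newly flushed SSTable file into the backup location."""
    backup_root.mkdir(parents=True, exist_ok=True)
    target = backup_root / (sstable.name + ".gz")
    with sstable.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    sstable = data_dir / "example-Data.db"  # hypothetical file name
    sstable.write_bytes(b"row-1\nrow-2\n")

    # Local directory used as a stand-in for an S3 bucket.
    copy = on_sstable_written(sstable, Path(tmp) / "s3-stand-in")
    with gzip.open(copy, "rb") as f:
        restored = f.read()
    print(restored == sstable.read_bytes())  # True
```

Because SSTables are immutable once written, copying each one as it appears is enough to keep the backup current between full snapshots.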
  28. Cassandra Restore
      • Full Restore
        – Replace previous data
      • New Ring from Backup
        – New name old data
      • Scripted
        – Create new instances
        – Parallel load - fast
      [Diagram: S3 Backup loading into a ring of Cassandra nodes.]
  29. Cassandra Online Analytics
      • Brisk = Hadoop + Cass
        – Use split Brisk ring
        – Size each separately
      • Direct Access
        – Keyspaces
        – Hive/Pig/Map-Reduce
        – Hdfs as a keyspace
        – Distributed namenode
      [Diagram: ring of Cassandra and Brisk nodes with S3 Backup.]
  30. Cassandra Archive
      Appropriate level of paranoia needed…
      • Archive could be un-readable
        – Restore S3 backups weekly from prod to test
      • Archive could be stolen
        – PGP Encrypt archive
      • AWS East Region could have a problem
        – Copy data to AWS West
      • Production AWS Account could have an issue
        – Separate Archive account with no-delete S3 ACL
      • AWS S3 could have a global problem
        – Create an extra copy on a different cloud vendor
  31. Tools and Automation
      • Developer and Build Tools
        – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
        – Builds, creates .war file, .rpm, bakes AMI and launches
      • Custom Netflix Application Console
        – AWS Features at Enterprise Scale (hide the AWS security keys!)
        – Auto Scaler Group is unit of deployment to production
      • Open Source + Support
        – Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS
        – Datastax support for Cassandra, AWS support for Hadoop via EMR
      • Monitoring Tools
        – Datastax Opscenter for monitoring Cassandra
        – AppDynamics – Developer focus for cloud http://
  32. Developer Migration
      • Detailed SQL to NoSQL Transition Advice
        – Sid Anand - QConSF Nov 5th – Netflix’ Transition to High Availability Storage Systems
        – Blog - http://
        – Download Paper PDF - http://
      • Mark Atwood, "Guide to NoSQL, redux"
        – YouTube http://
  33. Cloud Operations
      Cassandra Use Cases
      Model Driven Architecture
      Performance and Scalability
  34. Cassandra Use Cases
      • Key by Customer – Cross-region clusters
        – Many app specific Cassandra clusters, read-intensive
        – Keys+Rows in memory using m2.4xl Instances
      • Key by Customer:Movie – e.g. Viewing History
        – Growing fast, write intensive – m1.xl instances
        – Keys cached in memory, one cluster per region
      • Large scale data logging – lots of writes
        – Column data expires after time period
        – Distributed counters, one cluster per region
  35. Model Driven Architecture
      • Datacenter Practices
        – Lots of unique hand-tweaked systems
        – Hard to enforce patterns
      • Model Driven Cloud Architecture
        – Perforce/Ivy/Jenkins based builds for everything
        – Every production instance is a pre-baked AMI
        – Every application is managed by an Autoscaler
      Every change is a new AMI
  36. Netflix Platform Cassandra AMI
      • Tomcat server
        – Always running, registers with platform
        – Manages Cassandra state, tokens, backups
      • Removed Root Disk Dependency on EBS
        – Use S3 backed AMI for stateful services
        – Normally use EBS backed AMI for fast provisioning
  37. Chaos Monkey
      • Make sure systems are resilient
        – Allow any instance to fail without customer impact
      • Chaos Monkey hours
        – Monday-Thursday 9am-3pm random instance kill
      • Application configuration option
        – Apps now have to opt-out from Chaos Monkey
      • Computers (Datacenter or AWS) randomly die
        – Fact of life, but too infrequent to test resiliency
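The schedule and opt-out rule on this slide amount to a small selection function. The sketch below only illustrates that rule. The function names and instance ids are hypothetical, it does not terminate anything, and it is not the actual Chaos Monkey code.

```python
import random
from datetime import datetime


def in_chaos_window(now: datetime) -> bool:
    """Monday-Thursday, from 9am up to (not including) 3pm."""
    return now.weekday() <= 3 and 9 <= now.hour < 15


def pick_victim(instances, opted_out, now, rng=random):
    """Return one random eligible instance id, or None if nothing qualifies."""
    if not in_chaos_window(now):
        return None
    eligible = [i for i in instances if i not in opted_out]
    return rng.choice(eligible) if eligible else None


instances = ["i-001", "i-002", "i-003"]  # hypothetical instance ids
opted_out = {"i-002"}                    # apps must opt out explicitly

tuesday_10am = datetime(2011, 10, 25, 10, 0)
saturday_10am = datetime(2011, 10, 29, 10, 0)

victim = pick_victim(instances, opted_out, tuesday_10am, random.Random(0))
print(victim in {"i-001", "i-003"})                       # True
print(pick_victim(instances, opted_out, saturday_10am))   # None
```

Restricting kills to weekday working hours means failures are induced when engineers are around to observe and fix the fallout.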
  38. AppDynamics Monitoring of Cassandra – Automatic Discovery
  39. Netflix Contributions to Cassandra
      • Cassandra as a mutable toolkit
        – Cassandra is in Java, pluggable, well structured
        – Netflix has a building full of Java engineers….
      • Actual Contributions delivered in 0.8
        – First prototype of off-heap row cache
        – Incremental backup SSTable write callback
      • Work In Progress
        – AWS integration and backup using Tomcat helper
        – Astyanax re-write of Hector Java client library
  40. Performance Testing
      • Cloud Based Testing – frictionless, elastic
        – Create/destroy any sized cluster in minutes
        – Many test scenarios run in parallel
      • Test Scenarios
        – Internal app specific tests
        – Simple “stress” tool provided with Cassandra
      • Scale test, keep making the cluster bigger
        – Check that tooling and automation works…
        – How many ten column row writes/sec can we do?
  41. <DrEvil>ONE MILLION</DrEvil>
  42. Scale-Up Linearity
      Client Writes/s by node count – Replication Factor = 3
      [Chart: x-axis node count, 0 to 350; y-axis client writes/s, 0 to 1,200,000. Data labels: 174373, 366828, 537172, 1099837.]
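The linearity claim can be checked from the four data labels on this chart, paired with the node counts of 48, 96, 144 and 288 used in the tables on the following slides. The sketch below derives per-node rates and compares each run with perfect linear scaling from the 48-node run; the helper names are illustrative.

```python
# Node count -> client writes/s, as labeled on the chart.
results = {48: 174373, 96: 366828, 144: 537172, 288: 1099837}
REPLICATION_FACTOR = 3


def per_node_rates(results, replication_factor):
    """Map node count -> (client writes/s per node, replica writes/s per node)."""
    return {
        nodes: (writes / nodes, writes / nodes * replication_factor)
        for nodes, writes in results.items()
    }


def scaling_efficiency(results, base_nodes):
    """Throughput relative to perfect linear scaling from the base run."""
    base_rate = results[base_nodes] / base_nodes
    return {nodes: writes / (base_rate * nodes) for nodes, writes in results.items()}


rates = per_node_rates(results, REPLICATION_FACTOR)
efficiency = scaling_efficiency(results, 48)
for nodes in results:
    client, replica = rates[nodes]
    print(f"{nodes:>3} nodes: {client:6.0f} client w/s per node, "
          f"{replica:6.0f} replica w/s per node, x{efficiency[nodes]:.2f} of linear")
```

Multiplying by the replication factor gives the writes each server actually absorbs, which can be compared with the Per Server Writes/s row on the next slide.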
  43. Per Node Activity
      Per Node              48 Nodes      96 Nodes      144 Nodes     288 Nodes
      Per Server Writes/s   10,900 w/s    11,460 w/s    11,900 w/s    11,456 w/s
      Mean Server Latency   0.0117 ms     0.0134 ms     0.0148 ms     0.0139 ms
      Mean CPU %Busy        74.4 %        75.4 %        72.5 %        81.5 %
      Disk Read             5,600 KB/s    4,590 KB/s    4,060 KB/s    4,280 KB/s
      Disk Write            12,800 KB/s   11,590 KB/s   10,380 KB/s   10,080 KB/s
      Network Read          22,460 KB/s   23,610 KB/s   21,390 KB/s   23,640 KB/s
      Network Write         18,600 KB/s   19,600 KB/s   17,810 KB/s   19,770 KB/s
      Node specification – Xen Virtual Images, AWS US East, three zones
      • Cassandra 0.8.6, CentOS, SunJDK6
      • AWS EC2 m1 Extra Large – Standard price $ 0.68/Hour
      • 15 GB RAM, 4 Cores, 1Gbit network
      • 4 internal disks (total 1.6TB, striped together, md, XFS)
  44. Time is Money
                              48 nodes      96 nodes      144 nodes     288 nodes
      Writes Capacity         174373 w/s    366828 w/s    537172 w/s    1,099,837 w/s
      Storage Capacity        12.8 TB       25.6 TB       38.4 TB       76.8 TB
      Nodes Cost/hr           $32.64        $65.28        $97.92        $195.84
      Test Driver Instances   10            20            30            60
      Test Driver Cost/hr     $20.00        $40.00        $60.00        $120.00
      Cross AZ Traffic        5 TB/hr       10 TB/hr      15 TB/hr      30 TB/hr [1]
      Traffic Cost/10min      $8.33         $16.66        $25.00        $50.00
      Setup Duration          15 minutes    22 minutes    31 minutes    66 minutes [2]
      AWS Billed Duration     1 hr          1 hr          1 hr          2 hr
      Total Test Cost         $60.97        $121.94       $182.92       $561.68
      [1] Estimate two thirds of total network traffic
      [2] Workaround for a tooling bug slowed setup
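The Total Test Cost row can be reproduced from the other rows. The sketch below uses the $0.68/hour m1 Extra Large price from the node specification and a $2.00/hour driver price implied by 10 drivers costing $20.00/hr. The 288-node total only matches if the Cassandra nodes are billed for two hours and the test drivers for one. That split is an inference from the numbers and is not stated on the slide.

```python
NODE_PRICE_PER_HOUR = 0.68    # m1 Extra Large, from the node specification
DRIVER_PRICE_PER_HOUR = 2.00  # implied by 10 driver instances at $20.00/hr


def total_test_cost(nodes, drivers, traffic_cost, node_hours=1, driver_hours=1):
    """Sum of node, test driver and cross-AZ traffic cost for one run."""
    node_cost = nodes * NODE_PRICE_PER_HOUR * node_hours
    driver_cost = drivers * DRIVER_PRICE_PER_HOUR * driver_hours
    return round(node_cost + driver_cost + traffic_cost, 2)


print(total_test_cost(48, 10, 8.33))                  # 60.97
print(total_test_cost(96, 20, 16.66))                 # 121.94
print(total_test_cost(144, 30, 25.00))                # 182.92
# Assumption: nodes billed 2 hr, drivers 1 hr (inferred, see above).
print(total_test_cost(288, 60, 50.00, node_hours=2))  # 561.68
```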
  45. Takeaway
      Netflix is using Cassandra on AWS as a key infrastructure component of its globally distributed streaming product.
      Also, benchmarking in the cloud is fast, cheap and scalable
      http://
      @adrianco #netflixcloud
  46. Amazon Cloud Terminology Reference
      This is not a full list of Amazon Web Service features
      • AWS – Amazon Web Services (common name for Amazon cloud)
      • AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
      • EC2 – Elastic Compute Cloud
        – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
        – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
        – Reserved Instances – pre-paid to reduce cost for long term usage
        – Availability Zone – datacenter with own power and cooling hosting cloud instances
        – Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan
      • ASG – Auto Scaling Group (instances booting from the same AMI)
      • S3 – Simple Storage Service (http access)
      • EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
      • RDS – Relational Database Service (managed MySQL master and slaves)
      • SDB – Simple Data Base (hosted http based NoSQL data store)
      • SQS – Simple Queue Service (http based message queue)
      • SNS – Simple Notification Service (http and email based topics and messages)
      • EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
      • ELB – Elastic Load Balancer
      • EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
      • VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
      • IAM – Identity and Access Management (fine grain role based security keys)